Quantifying Seagrass Distribution in Coastal Water with Deep Learning Models

Abstract: Coastal ecosystems are critically affected by seagrass, both economically and ecologically. However, reliable seagrass distribution information is lacking in nearly all parts of the world because of the excessive costs associated with its assessment. In this paper, we develop two deep learning models for automatic seagrass distribution quantification based on 8-band satellite imagery. Specifically, we implement a deep capsule network (DCN) and a deep convolutional neural network (CNN) to assess seagrass distribution through regression. The DCN model first determines whether seagrass is present in the image through classification. Second, if seagrass is present in the image, it quantifies the seagrass through regression. During training, the regression and classification modules are jointly optimized to achieve end-to-end learning. The CNN model is trained strictly for regression on seagrass and non-seagrass patches. In addition, we propose a transfer learning approach to transfer knowledge from deep models trained at one location to perform seagrass quantification at a different location. We evaluate the proposed methods on three WorldView-2 satellite images taken from the coastal area of Florida. Experimental results show that the proposed DCN and CNN models perform similarly and achieve much better results than a linear regression model and a support vector machine. We also demonstrate that using transfer learning techniques for the quantification of seagrass significantly improves the results as compared to directly applying the deep models to new locations.


Introduction
Seagrass is an important economic, ecological, and social component of coastal ecosystems [1,2]. Economically, seagrass ecosystems are valued 33 and 23 times more than oceanic and terrestrial ecosystems, respectively. Ecologically, seagrass provides numerous benefits such as organic fertilization, sediment trapping, and pollution filtering [2]. However, trustworthy information about seagrass distribution is missing for most of the planet because of the excessive cost of mapping it [2]. This paper analyzes different deep learning approaches for quantification of seagrass in satellite images and compares them against traditional machine learning methods. Specifically, our methods quantify the leaf area index (LAI) of each pixel based on multispectral satellite images. LAI is defined as leaf area per unit ground area [3], and it is a critical biophysical parameter of seagrass [2]. LAI is expressed as a floating-point number ranging from 0 to 10, with 0 indicating no seagrass and 10 indicating the largest seagrass density per area.
Our ultimate goal is to automatically quantify LAI using satellite images with minimal workforce for field observations. To achieve this goal, we need to address the following two questions:
1. Can we train a deep learning model to successfully predict the level of LAI based on multispectral satellite images?
2. Can we generalize a deep model trained with images from one location to predict seagrass LAI levels at a different location?
To address the first question, we develop two deep learning models for seagrass quantification: (1) a convolutional neural network (CNN) for regression of LAI, and (2) a deep capsule network (DCN) that is jointly optimized for simultaneous classification and regression. To answer the second question, we train the deep learning models with one multispectral image and develop a transfer learning approach to generalize the models to the other two images, which were collected at different locations. The multispectral images utilized in this study are WorldView-2 satellite images. Each image has a resolution of 1.24 m and a total of 8 visible and near-infrared (VNIR) bands. An experienced operator labeled some pixels in the images as sea, sand, seagrass, or land and applied a physics model [4] to the seagrass pixels to obtain LAI. The physics model has a reported error rate of 10% [4], which makes its output not fully suitable as ground truth for training. To resolve this challenge, an experienced operator identified several regions where the LAI mappings are reliable, and those LAI mappings are used as ground truth. The major contributions of this paper are:
• Two deep learning models for regression of seagrass LAI that outperform traditional regression methods.
• A transfer learning approach that performs seagrass LAI mapping at a new location with minimal workforce for site observation.
The remainder of the paper is organized as follows. A review of the literature is provided in Section 2. The proposed methods are described in Section 3. Results of the proposed method are presented in Section 4. Finally, discussions and conclusions are given in Sections 5 and 6, respectively.

Deep Learning
Deep learning models have achieved superb results in different areas in recent years, such as object detection and tracking [5][6][7][8], image classification [9][10][11][12][13], remote sensing [12][14][15][16][17][18], speech recognition [19,20], autonomous driving [21,22], cybersecurity [23,24], and medical imaging [25]. Among various structures, CNNs are the most popular models for different applications. A CNN model consists of convolutional layers for feature learning. The convolutional layers are usually followed by fully connected layers to perform classification or regression. Many CNN-based image classification methods are able to perform end-to-end learning where feature learning and classification are jointly optimized, and it has been shown that this feature learning capability is a key contributing factor to the success of CNNs.
Though regression problems using CNNs are not as common as classification, some examples can be found in the literature. CNNs have been utilized for regression by adding a regression layer on top of the convolutional layers. Niu et al. [26] proposed a CNN model for age estimation which consists of K − 1 output layers, each performing a binary classification of whether the age of the input facial image is greater than rank k, thus transforming the classification model into an ordinal regression model. Yuan et al. used a generic 2D CNN regression model for simultaneous face detection and segmentation and achieved competitive results [27]. Girshick developed a fast region-based CNN method for object detection [28] where the CNN model produced two outputs: a softmax probability for predicting the presence of the object, and regression outputs defining the position of the detected object. A similar method was used by Gidaris and Komodakis in [29], where they developed a CNN model for bounding box regression, which allowed the authors to refine the location of the detected objects. CNN models can also be used for image synthesis. In [30], Li et al. developed a CNN model to generate Positron Emission Tomography (PET) images from Magnetic Resonance Imaging (MRI) images to improve Alzheimer's disease diagnosis.

Capsule Networks
Capsule networks were introduced in late 2017 by Sabour et al. [31]. In these models, neurons in filter maps are grouped to form a set of capsules, which represent instantiation parameters of an entity in a given image, and information between different capsule layers is communicated through routing. The first implementation of capsule networks achieved a 99.75% accuracy on the MNIST dataset, which still represents the state of the art. Capsule networks have two unique properties as compared to CNNs: they are able to identify overlapped objects in images, and they are capable of performing simultaneous classification and regression. The same research team recently proposed a modification of their capsule network and obtained state-of-the-art results on smallNORB [32].
The last capsule layer of a DCN model comprises a set of capsule vectors, where each vector corresponds to one class in the training dataset and the length of the vector is treated as the posterior probability of the class for classification. In addition, the model reconstructs each input image using the corresponding capsule vectors. These reconstructed images are used for regularization during the training process. The errors between input images and their reconstructions are then backpropagated to optimize all the weights in the network. The unique configuration of DCN makes it able to perform classification and regression simultaneously. Sabour et al. demonstrated that the reconstruction stage is also an important contributor to the superb results obtained by the model applied to MNIST [31].
Recently, DCN models have been applied to more complex data. The application of DCN models to the CIFAR-10 dataset was studied in [33], where the authors obtained an accuracy of 77.55%. This performance is significantly worse than the current state-of-the-art results (96.53%). In the medical image analysis field, it has been demonstrated that capsule networks outperform CNNs in different tasks such as classification of brain tumor type [34], diagnosis of thoracic disease [35], and reconstruction of image stimuli from functional MRI [36]. In [37], the authors showed how a capsule network could be successfully implemented in the deep reinforcement learning framework to create intelligent agents in games. Additionally, LaLonde and Bagci applied a capsule network to an object segmentation task [38] and showed that the number of parameters of the capsule network can be reduced by 94.5% as compared to the traditional design, while still improving its accuracy. In our recent study [39], a DCN model was implemented as a generative model to readjust a trained capsule network for classification of seagrass at different locations. To the best of our knowledge, our study is the first attempt to apply capsule networks to seagrass quantification.

Seagrass LAI Mapping
Besides our previous studies on seagrass identification [39] and LAI mapping [40], research in the literature on seagrass mapping is mostly focused on analyzing the performance of manual mapping approaches [41,42]. A remote sensing method was developed by Yang et al. [43]. Instead of quantifying the seagrass distribution in satellite images, they manually determined whether seagrass was present in a given region and achieved an accuracy slightly better than 80%. A few works proposed automatic methods for seagrass quantification. For example, Wicaksono et al. implemented an automatic algorithm for seagrass LAI mapping and achieved a mean square error (MSE) of 0.72 [2]. In [44], Pu et al. implemented a regression model for LAI quantification that achieved MSEs of 0.78 and 0.59 using data taken by the Hyperion (HYP) and Advanced Land Imager (ALI) satellite sensors, respectively. In [45], Dierssen et al. developed a remote sensing strategy to estimate LAI levels of seagrass with MSEs ranging from 0.88 to 0.98. To the best of our knowledge, our team has developed the first deep learning approaches for classification [39] and regression [40] of seagrass. In this paper, we focus on the regression part and analyze which deep learning model is more suitable for LAI regression.

Transfer Learning
Transfer learning aims to solve problems in a target domain where training data samples are limited by using a model trained in a similar source domain where rich training data is available [46]. DeCAF is among the first successful implementations of transfer learning, where Donahue et al. [47] generalized a CNN model trained on ImageNet [9] to different domains and achieved state-of-the-art performance. In [48], Yosinski et al. froze the first few layers in a model trained in the source domain and re-trained the other layers with labeled data from the target domain. They achieved better accuracies by fine tuning.
While remote sensing images may be widely available, reliable mapping and labeling are often missing due to the associated excessive cost [2,49,50]. For this reason, transfer learning is a common tool used in remote sensing applications. For instance, Hu et al. [51] transferred knowledge from existing pre-trained CNNs on the ImageNet dataset [9] to high-resolution remote sensing images for scene classification, improving accuracy by about 2.5-5% as compared to the state-of-the-art methods that only focus on exploring low-level hand-crafted features. Xie et al. were able to transfer knowledge from a pre-trained CNN model [9] to predict nighttime light intensity from daytime satellite imagery and to estimate poverty with an accuracy of 76.1% [49]. Jun et al. [50] demonstrated how an active learning model for characterizing land cover in satellite images can be trained in one region and then be applied to different regions. Their results showed that the active learning approach produced better results than applying the model directly to the regions with limited training samples.
Transfer learning has been applied to other deep models in the literature. In [52], the authors demonstrated that the performance of a deep belief network (DBN) for post-traumatic stress disorder (PTSD) diagnosis could be significantly improved using transfer learning. Chowdhury et al. [24] developed a few-shot deep learning approach for intrusion detection, in which they extracted features of a small dataset from some trained CNN and DBN models, and trained a simple classifier to improve the intrusion detection performances. Transfer learning with CNN for regression problems has been studied in [49,53]. However, transfer learning with capsule networks has not been investigated for seagrass quantification.

Datasets
Three multispectral images taken by the WorldView-2 satellite at three different coastal locations in Florida are utilized in this study. The images have eight spectral bands with a resolution of 1.24 m, as shown in Figure 1. The images have spatial sizes of 12,208 × 6717, 8962 × 7227, and 6143 × 9793 pixels, respectively. For each pixel, a patch of size 5 × 5 × 8 centered on that pixel is extracted. Each patch is then classified as sea, land, seagrass, or sand. Additionally, the physics model [4] computes an LAI value for each pixel.
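The patch extraction step can be illustrated with a minimal NumPy sketch. The function name `extract_patches`, the toy random image, and the reflect padding at the image border are assumptions made for illustration; the text does not state how border pixels are handled.

```python
import numpy as np

def extract_patches(image, rows, cols, size=5):
    """Return an (n, size, size, bands) array of patches centered on given pixels.

    `image` is an (H, W, bands) array. Border handling is an assumption of this
    sketch: the image is reflect-padded so that edge pixels also get full patches.
    """
    half = size // 2
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="reflect")
    patches = np.empty((len(rows), size, size, image.shape[2]), dtype=image.dtype)
    for i, (r, c) in enumerate(zip(rows, cols)):
        # In padded coordinates pixel (r, c) sits at (r + half, c + half), so the
        # window [r, r + size) x [c, c + size) is centered on it.
        patches[i] = padded[r:r + size, c:c + size, :]
    return patches

# Toy 8-band image standing in for a WorldView-2 scene (random values).
rng = np.random.default_rng(0)
image = rng.random((20, 30, 8)).astype(np.float32)
patches = extract_patches(image, rows=[0, 10, 19], cols=[0, 15, 29])
print(patches.shape)  # (3, 5, 5, 8)
```

The center element of each patch is the original pixel, which is the pixel whose class and LAI the patch represents.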

Data Labeling
The physics model reported a 10% error for LAI mapping [4]. Therefore, we do not use the whole mapped images as ground truth to train our models. However, some regions in the LAI mappings by the physics model are considered to be more reliable than others. In our study, several regions in the images where the LAI mappings are more accurate are selected by an experienced operator (co-author of the physics model in [4]). These regions are treated as ground truth for training the deep models. Figure 1 shows the selected regions where cyan, blue, red and green boxes represent sand, sea, land, and seagrass, respectively. Additionally, Figure 2 shows the LAI mappings of the whole images. When we train our models, we only use the selected regions highlighted in Figure 1. Table 1 shows the number of pixels in the selected regions per class in each of the satellite images used in the study. Note that the labeled pixels in the selected regions are unbalanced. To address this issue, we balance the training samples by randomly downsampling majority classes and upsampling minority classes to ensure that each class has roughly the same number of training examples.
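One possible way to implement the balancing step is sketched below. The helper `balance_classes` and the target count `n_per_class` are hypothetical; the text only states that majority classes are randomly downsampled and minority classes upsampled until the classes have roughly the same number of examples.

```python
import numpy as np

def balance_classes(labels, n_per_class, seed=0):
    """Return sample indices with exactly n_per_class entries for every class.

    Classes with more samples are downsampled without replacement; classes with
    fewer samples are upsampled by drawing with replacement.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    chosen = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        replace = len(idx) < n_per_class  # upsample only when the class is too small
        chosen.append(rng.choice(idx, size=n_per_class, replace=replace))
    return rng.permutation(np.concatenate(chosen))

# Toy unbalanced label set (the counts are made up for illustration).
labels = np.array(["sea"] * 50 + ["sand"] * 5 + ["seagrass"] * 20 + ["land"] * 8)
idx = balance_classes(labels, n_per_class=10)
values, counts = np.unique(labels[idx], return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))
```

Upsampling with replacement duplicates minority patches, so the balanced set should be built from the training folds only, never from held-out data.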

Joint Optimization of Classification and Regression in Capsule Networks for Seagrass Mapping
We design a DCN model for simultaneous classification (sea, sand, seagrass, land) and regression (LAI mapping) in multispectral satellite images, as shown in Figure 3. Inputs of the model are image patches of size 5 × 5 × 8. The first layer is a convolutional layer with 32 filters of size 2 × 2 × 8 and a stride of 1. This layer is followed by a second convolutional layer with 64 kernels of size 2 × 2 × 32 and a stride of 1. The output of this layer is organized as 8 blocks of capsules of size 3 × 3 × 8 in the PrimaryCaps layer. The reconstruction part of the original DCN model [31] is replaced by a linear regression layer for seagrass mapping. This layer quantifies the LAI level of seagrass based on the seagrass capsule vector from FeatureCaps. We define the LAI of an image patch as the LAI of its center pixel. This structure allows us to jointly optimize LAI regression and seagrass classification. The FeatureCaps layer performs classification for the four classes (sea, land, seagrass, and sand) with a separate margin loss for the kth class [31]:

$$L_k = T_k \max(0, m^+ - v_k)^2 + \lambda (1 - T_k) \max(0, v_k - m^-)^2$$

where $T_k = 1$ if class $k$ is present and 0 otherwise, $m^+ = 0.9$, $m^- = 0.1$, and $v_k$ is the magnitude of the kth vector in FeatureCaps, representing the posterior probability for the kth class. $\lambda$ is set to the default value of 0.5, and the total loss for the four classes is the sum of the individual losses. We set the number of routing iterations from the PrimaryCaps layer to the FeatureCaps layer in the DCN model to 3. During training, if a seagrass image patch is fed as input, the seagrass vector in the FeatureCaps layer is used to train the regression model for LAI quantification. Then, the error of the regression is used during back-propagation to update the weights of the DCN model, jointly optimizing classification and regression. For other types of image patches (sea, sand, land), the regression step is skipped, and only the classification model is optimized.

Figure 4 shows the CNN model implemented for regression of LAI. The CNN model has 2 convolutional layers for representation learning. The first convolutional layer has 32 kernels of size 2 × 2 × 8, and the second layer has 16 filters of size 4 × 4 × 32. The fully connected layer has a total of 16 hidden units, which matches the size of the vectors in the FeatureCaps layer in the DCN model. The last layer uses this representation to compute LAI through linear regression.
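For the DCN model, the margin loss of [31] and its combination with the regression error can be sketched as follows. The margin loss uses the stated values m+ = 0.9, m− = 0.1, and λ = 0.5. The function `joint_loss`, the squared-error form of the regression term, its `weight`, and the class index used for seagrass are assumptions for illustration, since the text does not specify how the two errors are combined.

```python
import numpy as np

def margin_loss(lengths, target, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Total margin loss over all classes for one patch.

    lengths: lengths of the class capsule vectors (one per class), in [0, 1].
    target:  index of the true class.
    """
    lengths = np.asarray(lengths, dtype=float)
    t = np.zeros_like(lengths)
    t[target] = 1.0
    present = t * np.maximum(0.0, m_plus - lengths) ** 2
    absent = lam * (1.0 - t) * np.maximum(0.0, lengths - m_minus) ** 2
    return float(np.sum(present + absent))

def joint_loss(lengths, target, lai_pred, lai_true, seagrass_index=2, weight=1.0):
    """Margin loss plus an assumed squared regression error for seagrass patches.

    For non-seagrass patches the regression term is skipped, mirroring the text.
    """
    loss = margin_loss(lengths, target)
    if target == seagrass_index:
        loss += weight * (lai_pred - lai_true) ** 2
    return loss

# A confident, correct classification incurs no margin loss.
print(margin_loss([0.95, 0.05, 0.05, 0.05], target=0))  # 0.0
```

In an actual training loop both terms would be differentiated with respect to the network weights; this sketch only evaluates the loss value for given capsule lengths.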

Convolutional Neural Network for Seagrass Mapping
Additionally, we develop an SVM model and a linear regression model to quantify LAI based on the image patches directly. These models offer baseline performances for comparison.

4. For every unlabeled patch that is classified as seagrass by the 1-NN rule, predict its LAI value using the linear regression model trained in the previous step. LAI for every non-seagrass patch is set to 0.
5. These procedures are repeated for the image taken at St. George Sound for LAI prediction.
The transfer learning approach is also applied to the CNN model. When performing transfer learning with CNN, we extract the features from the last fully connected layer (16 features) in both the classification and regression steps. The other parts of the transfer learning approach are identical to the method using the DCN model.

Model Structure Determination
We determine the hyper-parameters of the proposed models through 3-fold cross-validation (CV) on the patches from the selected regions in the image taken at St. Joseph Bay (Figure 1a). After several experiments, we conclude that the best patch size is 5 × 5 × 8 pixels and that the DCN model should have two convolutional layers with 32 and 64 filters of size 2 × 2, respectively. To ensure a fair comparison, we design the DCN and CNN models to have roughly the same number of parameters for representation learning. Specifically, each model has around 9000 parameters for feature learning and 17 parameters for LAI regression (including bias).
It is worth noting that although the numbers of parameters for representation learning are roughly the same in CNN and DCN, there are around 38,000 additional parameters in DCN's capsule layers for routing. In total, there are approximately 47K parameters in the DCN model, nearly four times more than in the CNN model.
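The parameter counts can be roughly checked with a short calculation. This is a sketch under stated assumptions: one bias per convolutional filter, and one 8 × 16 transformation matrix for every pair of primary capsule and class capsule. Neither detail is stated in the text, so the results match the reported figures only approximately.

```python
def conv_params(n_filters, height, width, in_channels, bias=True):
    """Weights (plus one bias per filter, if any) of a convolutional layer."""
    return n_filters * (height * width * in_channels + (1 if bias else 0))

# Feature-learning layers as described in the text.
dcn_conv = conv_params(32, 2, 2, 8) + conv_params(64, 2, 2, 32)
cnn_conv = conv_params(32, 2, 2, 8) + conv_params(16, 4, 4, 32)

# Linear regression on a 16-dimensional representation, including the bias.
regression = 16 + 1

# Assumed routing weights: 3 x 3 x 8 = 72 primary capsules of dimension 8, each
# mapped to 4 class capsules of dimension 16 by its own 8 x 16 matrix.
routing = (3 * 3 * 8) * 4 * (8 * 16)

print(dcn_conv, cnn_conv, regression, routing)  # 9312 9264 17 36864
```

Under these assumptions both models have about 9.3K feature-learning parameters, and the DCN routing weights add about 36.9K more, which is close to the reported figures of around 9000 and around 38,000.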

Cross-Validation in the Selected Regions
In our first experiment, we determine if the proposed models are able to quantify LAI for seagrass in the regions selected by the experienced operator. We perform 3-fold CV in the selected regions in each satellite image as shown in Figure 1. We train both deep learning models until their learning losses converge, which generally happens before 100 training epochs. We use root mean squared error (RMSE) as our metric to assess the performance of each model. Table 2 shows the results for each model. It can be seen that the deep learning models (CNN and DCN) outperform linear regression and SVM. The performances of CNN and DCN are similar, but generally CNN produces the best results, achieving an average RMSE of 0.19.
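The evaluation protocol (3-fold CV scored by RMSE) can be sketched as below. A least-squares linear model and synthetic data stand in for the actual models and patches; the function names and the data are illustrative assumptions only.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def kfold_rmse(X, y, k=3, seed=0):
    """Average RMSE of a least-squares linear model over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Append a column of ones so the model has a bias term.
        A = np.c_[X[train], np.ones(len(train))]
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        pred = np.c_[X[test], np.ones(len(test))] @ coef
        scores.append(rmse(y[test], pred))
    return float(np.mean(scores))

# Synthetic, noise-free linear data: the CV error should be numerically zero.
rng = np.random.default_rng(1)
X = rng.random((90, 16))
w = rng.normal(size=16)
y = X @ w + 0.5
score = kfold_rmse(X, y)
```

Each sample is used for testing exactly once, so the averaged score reflects performance on data the model did not see during fitting.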

End-to-End LAI Mapping
To perform end-to-end mapping, we train the deep learning models using image patches from the selected regions and predict LAI for the whole image. During training, the DCN model first classifies a given patch as sea, sand, seagrass, or land. Then, it sets LAI to '0' if the patch is classified as non-seagrass. If the patch is classified as seagrass, it is then mapped to the predicted LAI index. Using this method, the model is able to perform end-to-end mapping by jointly optimizing classification and regression.
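The gating between classification and regression described above can be written compactly. In this sketch the class ordering in `CLASSES` and the use of the largest class score as the predicted class are assumptions; the inputs are made-up numbers rather than model outputs.

```python
import numpy as np

CLASSES = ("sea", "sand", "seagrass", "land")  # assumed ordering

def end_to_end_lai(class_scores, lai_regression):
    """Combine per-patch class scores and regression outputs into LAI values.

    class_scores:   (n, 4) array, e.g. capsule vector lengths per class.
    lai_regression: (n,) array of regression outputs.
    """
    predicted = np.argmax(class_scores, axis=1)
    is_seagrass = predicted == CLASSES.index("seagrass")
    # Non-seagrass patches get LAI 0; seagrass patches keep the regression output.
    return np.where(is_seagrass, lai_regression, 0.0)

scores = np.array([[0.9, 0.1, 0.2, 0.1],   # classified as sea
                   [0.1, 0.1, 0.8, 0.2]])  # classified as seagrass
lai = end_to_end_lai(scores, np.array([3.2, 4.5]))
print(lai)  # [0.  4.5]
```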
We perform two experiments to illustrate the effect of end-to-end learning. In the first experiment, the direct linear regression model, SVM, and CNN are trained with seagrass patches only as pure regression models, but the produced LAI mappings are masked out by the classification results of the capsule network (Figures 5-7) as they would not be able to correctly quantify LAI for non-seagrass patches. In the second experiment, we use seagrass and non-seagrass (LAI = 0) patches from the selected regions to train our models and obtain 'raw' mappings of LAI with no mask applied to them (Figures 8-10). It is important to note that, since the LAI quantification obtained by the physics model is not considered as ground truth, these figures are not reliable indicators of the performance of our model and are shown in this paper only for visualization purposes. RMSEs obtained in the selected regions (Table 2) are true performance metrics for comparison.
Figure 7. LAI mapping at Saint George Sound produced by models trained with patches from the selected regions. The LAI maps are masked to show only those pixels that are classified as seagrass by the capsule network.
Figure 10. LAI mapping at Saint George Sound produced by models trained with patches from the selected regions. No mask was applied in this case.

Transfer Learning with Deep Models
We first train our deep learning models for LAI quantification with all the selected patches from the satellite image taken at St. Joseph Bay, and then we use the trained models as a feature extractor to transfer their knowledge to the other two locations (Keeton Beach and St. George Sound). Finally, we randomly select 50, 100, 500, and 1000 patches from the two new locations to train a linear regression model for LAI quantification at each new location. These selected patches are balanced among the four classes. To train a linear regression model for a new location, the selected labeled image patches are passed through the trained DCN/CNN model, and outputs from the FeatureCaps/FC layer are stored as training data. The outputs belonging to seagrass patches are then used to train a linear regression model for LAI quantification. For the remaining unlabeled image patches, we extract new representations at the FeatureCaps/FC layer and classify them to one of the four classes using the stored training data samples based on the 1-NN rule. If an image patch is classified as seagrass, we predict its LAI using the trained linear regression model. Otherwise, we set its LAI to 0.
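The procedure above can be sketched with NumPy, taking the FeatureCaps/FC outputs as given arrays. Everything in the demo is an assumption for illustration: the features are synthetic 16-dimensional clusters rather than outputs of a trained network, and the class index used for seagrass is hypothetical.

```python
import numpy as np

SEAGRASS = 2  # hypothetical class index: 0 sea, 1 sand, 2 seagrass, 3 land

def fit_linear(features, lai):
    """Least-squares linear regression with a bias term."""
    A = np.c_[features, np.ones(len(features))]
    coef, *_ = np.linalg.lstsq(A, lai, rcond=None)
    return coef

def transfer_predict(train_feats, train_labels, train_lai, new_feats):
    """1-NN classification, then linear regression for patches labeled seagrass."""
    seagrass = train_labels == SEAGRASS
    coef = fit_linear(train_feats[seagrass], train_lai[seagrass])
    # Squared Euclidean distance from every new patch to every stored sample.
    d2 = ((new_feats[:, None, :] - train_feats[None, :, :]) ** 2).sum(axis=2)
    labels = train_labels[np.argmin(d2, axis=1)]
    lai = np.c_[new_feats, np.ones(len(new_feats))] @ coef
    return labels, np.where(labels == SEAGRASS, lai, 0.0)

# Synthetic stand-in for extracted features: four well-separated clusters.
rng = np.random.default_rng(0)
centers = np.eye(4, 16) * 10.0
train_labels = np.repeat(np.arange(4), 25)
train_feats = centers[train_labels] + rng.normal(scale=0.5, size=(100, 16))
w = rng.normal(size=16)
train_lai = np.where(train_labels == SEAGRASS, train_feats @ w + 3.0, 0.0)

true_labels = np.repeat(np.arange(4), 5)
new_feats = centers[true_labels] + rng.normal(scale=0.5, size=(20, 16))
pred_labels, pred_lai = transfer_predict(train_feats, train_labels, train_lai, new_feats)
```

The distance matrix in this sketch grows with the product of the number of new patches and stored samples, so mapping a whole image would need to process the patches in batches.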
We perform each experiment five times and show the 1-NN classification accuracies for the labeled patches at Keeton Beach and St. George Sound in Table 3. It can be seen that while the classification results are very similar between both models, the DCN generally performs slightly better than the CNN. The RMSE results of LAI quantification by transfer learning are shown in Table 4. When the transfer learning approach is applied to the image taken at Keeton Beach, the DCN outperforms the CNN in cases with a small number of training samples (50 and 100). In cases with a larger number of training samples, there is no significant difference between CNN and DCN. At this location, fine tuning always degrades both the DCN and the CNN, indicating that over-fitting may occur. At St. George Sound, the DCN outperforms the CNN in transfer learning regardless of the number of training samples from St. George Sound. However, the best results at this location are always obtained when performing fine tuning with the CNN. In all cases, our transfer learning approach significantly outperforms direct mapping using linear regression and SVM. Additionally, it can be seen that the errors when using the networks without transfer learning (0 samples) are significantly larger. For visualization purposes, we show the transfer learning results after fine tuning with 500 labeled patches from the new locations in Figures 11 and 12. Note again that these two figures are for visualization only; the comparison of the models should be based on the accuracies computed in the selected regions (Tables 3 and 4), where LAI mappings by the physics model are more reliable.
Figure 11. LAI mapping at Keeton Beach produced by our transfer learning approach using 500 patches.
Figure 12. LAI mapping at Saint George Sound produced by our transfer learning approach using 500 patches.

Computational Complexity
We carry out the experiments using a computer with 64 GB of RAM and an Intel Xeon E5-2687W v3 @ 3.10 GHz (10 cores). On average, one epoch of training takes 85.39 s for the DCN model and 13.17 s for the CNN model. Testing takes 0.13 milliseconds/patch with the DCN model and 0.023 milliseconds/patch with the CNN model. In total, testing on one entire image takes about 1.5 h with DCN and 0.42 h with CNN. Table 5 lists the training and testing times of each model.

Discussions
Our experimental results show that seagrass quantification using deep learning models is feasible and that these models are better options than traditional machine learning methods. While all models use linear regression to predict LAI, DCN and CNN perform LAI regression using new representations learned from raw image patches. In contrast, the linear regression and SVM models use raw pixel values in image patches of size 5 × 5 × 8 to quantify LAI. Our results show that both DCN and CNN outperform linear regression and SVM, which demonstrates that the learned representations are more suitable than raw image patches for LAI quantification.
The CNN model performs better than DCN when quantifying seagrass within the selected regions in the same image (Section 4.2). Specifically, the RMSEs obtained by CNN/DCN are 0.04/0.07, 0.08/0.12, and 0.45/0.46 at Keeton Beach, St. George Sound, and St. Joseph Bay, respectively. While these performance differences are not significant, training the CNN model takes approximately 6.5 times less time than training DCN, which makes CNN the better option in this case. Figures 5-12 show that our models can effectively perform end-to-end seagrass LAI mapping. The results reported in these figures may differ significantly from those of the physics model. However, the physics model has a reported error of 10% and should not be treated as ground truth. To assess the performance of each model, we should only compare RMSEs achieved by the models within the selected regions, as reported in Tables 2-4. The end-to-end mappings shown in this paper are for visualization purposes only. New on-site validation is planned to generate more ground truth data for model evaluation.
Our results show that the new representations learned by the DCN and CNN models are much better than the raw image patches for seagrass identification and LAI quantification in transfer learning. The DCN model achieves slightly better classification accuracies (Table 3) than the CNN model at the two new locations. For LAI mapping, the DCN model generally achieves better results than CNN (without fine tuning), as shown in Table 4. If fine tuning is applied, the performance of both the CNN and DCN models drops at Keeton Beach, indicating that over-fitting may occur and degrade the models' performance. At St. George Sound, DCN always performs better than CNN, and fine tuning improves the performance of both models. Overall, transfer learning with DCN and CNN significantly improves seagrass quantification at different locations using as few as 50 samples from the new locations for training, as compared to the direct regression models including linear regression and SVM. With these experiments, we demonstrate that generalization of deep models using a limited number of samples from different locations for seagrass quantification is feasible.
Our study has limitations. First, the LAI mapping produced by the physics model has a regression error of about 10% [4]. While the mapping in the regions selected as ground truth for training by the experienced operator is more reliable, the exact mapping error in the selected regions has not been investigated. Results shown in Tables 2-4 are based on the selected regions where the accuracies are more reliable, while the end-to-end mappings shown in Figures 5-12 are for the whole images and should be viewed as being for visualization purposes only. Second, all regions considered in this study are from the coast of Florida. While the distribution of seagrass differs in each of the images, it is likely that the distribution difference will be even larger if the images are from different parts of the world. The ultimate goal of this work is to map seagrass distribution globally, and the models developed in this work need to be further validated with more images from different locations, ideally from different hemispheres.

Conclusions
In this paper, we studied the quantification of seagrass distribution using two deep learning models: a convolutional neural network and a deep capsule network. We evaluated the proposed models for seagrass quantification at three different locations in Florida. The proposed models achieved significantly better results than a linear regression model and a support vector machine. We demonstrated that using transfer learning techniques for the quantification of seagrass significantly improved the results as compared to directly applying the deep models to new locations. Using our technique, seagrass can be accurately quantified with minimal workforce for site observation.