Finer Resolution Mapping of Marine Aquaculture Areas Using WorldView-2 Imagery and a Hierarchical Cascade Convolutional Neural Network

Abstract: Marine aquaculture plays an important role in seafood supply, economic development, and coastal ecosystem service provision. The precise delineation of marine aquaculture areas from high spatial resolution (HSR) imagery is vital for the sustainable development and management of coastal marine resources. However, the various sizes and detailed structures of marine objects make accurate mapping from HSR images difficult with conventional methods. Therefore, this study extracts marine aquaculture areas using an automatic labeling method based on the convolutional neural network (CNN), i.e., an end-to-end hierarchical cascade network (HCNet). Specifically, for marine objects of various sizes, we propose to improve classification performance by utilizing multi-scale contextual information. Technically, based on the output of a CNN encoder, we employ atrous convolutions to capture multi-scale contextual information and aggregate it in a hierarchical cascade way. Meanwhile, for marine objects with detailed structures, we propose to refine the detailed information gradually by using a series of long-span connections with fine resolution features from the shallow layers. In addition, to decrease the semantic gaps between features at different levels, we propose to refine the feature space (i.e., channel and spatial dimensions) using an attention-based module. Experimental results show that our proposed HCNet can effectively identify and distinguish different kinds of marine aquaculture, with 98% overall accuracy. It also achieves better classification performance than an object-based support vector machine and state-of-the-art CNN-based methods, such as FCN-32s, U-Net, and DeeplabV2. Our method lays a solid foundation for the intelligent monitoring and management of coastal marine resources.


Introduction
Marine aquaculture, the farming of aquatic organisms such as marine fish, shellfish, and aquatic plants in the marine environment, provides great potential for meeting the increasing demand for seafood and for economic development in coastal areas [1][2][3]. Globally, the production of marine aquaculture increased to 28.7 million tons in 2016, roughly double the 14.2 million tons produced in 2000 [4,5]. This rapid growth faces limitations in the availability of suitable land space and the environmental carrying capacity of land-based sites. Therefore, marine aquaculture, especially the widely used raft culture and cage culture areas that are mainly cultivated with marine plants and fish, respectively, has been rapidly developed in inshore areas. However, extensive and disordered marine aquaculture might cause serious environmental problems and socio-economic losses [6][7][8]. Although the Chinese government has formulated a series of laws and regulations at local and national levels, such as the Marine Environmental Protection Law, overall marine functional zonation, and nature reserve schemes, comprehensive coastal management in China remains a big challenge. Thus, accurate mapping and monitoring of marine aquaculture are imperative for the management and sustainable development of coastal marine resources.
Faced with the various spatial and temporal scales of a complex marine environment, remote sensing technology has substantially improved our ability to observe remote and vast areas at a fraction of the cost of traditional surveys [9]. To extract marine aquaculture areas from remotely sensed images, previous studies have tried various methods, including visual interpretation, spatial structure enhancement analyses [10,11], object-based image analysis (OBIA) [12][13][14], and deep convolutional neural networks (CNNs) [15]. Visual interpretation is used less because it is labor-intensive and time-consuming. Spatial structure enhancement analysis (such as texture and neighborhood characteristics analyses) is frequently used in pixel-based classification methods. OBIA has been widely used in the past few decades. It first segments the image and then performs classification based on these segments [16]. Thus, it can achieve good classification performance by utilizing abundant features based on the representative segments.
In recent years, deep CNNs, which consist of multiple trainable layers that automatically learn representative and discriminative features [17,18], have achieved great success in the computer vision field [19,20]. In the remote sensing domain, deep CNNs have also been actively studied and have shown obvious improvements in object detection [21] and scene classification [22]. Recent studies have further explored the ability of deep CNNs for dense prediction on remotely sensed images. A straightforward method is to directly label a pixel by classifying it together with its adjacent area in a sliding-window fashion [23][24][25]. However, such methods have limited classification performance due to their fixed receptive field and huge time consumption [26]. Although some studies attempt to solve these problems by using segment-based patches as the basic classification unit [27][28][29], they can be largely influenced by the segmentation accuracy. Besides, most of them are not trained end-to-end. To solve these problems, most recent studies perform pixel-wise classification by exploiting fully convolutional networks (FCNs) [30], which replace the fully connected layers in classical CNN schemes with convolutional layers. The main advantage of FCNs is that they allow pixel-wise labeling while taking the whole image as input.
However, there are some critical limitations when FCNs are used to accurately label marine aquaculture areas in high spatial resolution (HSR) images. The first challenge is the coexistence of confusing objects of various sizes, such as large and continuous island areas versus a high diversity of small aquaculture areas at sea. To tackle such problems, many researchers have concentrated on the use of multi-scale features, in which objects at different scales can become prominent accordingly. One commonly used method is to feed multi-scale versions of the image into the deep CNNs [31][32][33]. However, such methods usually take more time because of the repetitive computation over the multi-scale versions of the input images. On the other hand, some studies also try to aggregate multi-scale features, which are created by atrous convolution [15,34] or pooling operations [35,36] at multiple scales, or by multi-kernel convolution [37]. However, as pooling with larger pooling sizes or convolution with larger atrous rates becomes less effective (i.e., more operations would be applied to the padded zeros instead of the valid filter weights), such methods are limited to certain ranges of receptive fields, which limits their classification performance.
Meanwhile, due to the consecutive down-sampling processes in FCNs, the final feature maps are much smaller than the original image, leading to coarser prediction results and a decrease in classification accuracy. It is therefore a tough problem to perform accurate semantic labeling with such coarse feature maps, especially for marine objects with detailed structures in HSR images. To solve this problem, researchers have tried to restore the detailed spatial information by combining fine resolution features from shallow layers, such as multi-level feature fusing [38][39][40] and up-pooling or deconvolution with recorded pooling indices [41,42]. However, most current methods directly stack these multi-level feature maps, ignoring the adverse noise from the shallow layers. Meanwhile, some studies have also attempted to refine the classification results by combining them with boundary detection results [43,44]. However, this requires extra modules and supervision for boundary detection.
In summary, although current FCN-based approaches have achieved great success in dense prediction, it is difficult to perform fine mapping of marine aquaculture areas that fully exploits the information in HSR images. First, most current approaches are less effective at acquiring multi-scale contextual information, making it difficult to detect various objects in the marine environment. Second, most existing strategies are less effective at utilizing the finer feature maps from shallow layers, making it difficult to restore the detailed structures of marine objects in HSR images.
To solve these problems, it is necessary to combine effective multi-scale contextual information with fine resolution features from shallow layers. Inspired by this idea, we propose a novel model, the hierarchical cascade convolutional neural network (HCNet), to address the fine mapping of marine aquaculture areas from HSR images. In addition, we also employ several attention-based modules throughout the network to refine the feature space. Finally, we compare our proposed HCNet with the conventional OBIA method and several state-of-the-art FCN-based methods, both of which have been widely used and have achieved great success in the classification of HSR or natural images.


Study Area
A typical marine aquaculture area of 110 km² around Sandu Island was selected as our study area, located in Ningde City, Fujian Province, China (Figure 1). It lies in the subtropical monsoon climate zone, with an annual average precipitation of 1631 mm and a mean temperature of 14.7-19.8 °C. Located in a semi-closed natural harbor, which helps weaken typhoons and accumulate nutrients in seawater, the area has developed extensive marine aquaculture of various sizes, mainly comprising cage culture areas (CCA, see a1 and a2 in Figure 1) and raft culture areas (RCA, see b1 and b2 in Figure 1). The CCA are composed of accommodations and a large number of fish cages, which are constructed of plastic foam floats and woodblocks. Since most of them are not standard factory products, each has a complex and unique structure, making their extraction from HSR images difficult.
The RCA are generally cultivated with kelp or agar, which are widely distributed in the study area. The cultivated plants are usually twined on belts that are linked to fixed styrofoam floats. Therefore, the cultivated areas are mainly influenced by the density of the plants and cultivated belts, making the RCA appear largely different from each other in HSR images. Meanwhile, as the cultivated belts are submerged in seawater, the features of RCA in HSR images may also be influenced by the unstable environment, such as waves or turbid seawater.
As there was no cloud or haze over the aquaculture areas, we did not perform atmospheric correction in the preprocessing steps [46]. The MSS images and PAN image were first orthorectified into the Universal Transverse Mercator (UTM) projection system, and fused using the Gram-Schmidt pan-sharpening method in ENVI (v5.3.1, Exelis Visual Information Solutions, Boulder, CO, USA, 2014). Eventually, we used the fused imagery, consisting of eight bands with a spatial resolution of 0.5 m, in the following classification process.

Hierarchical Cascade Convolutional Neural Network
As illustrated in Figure 2, the general workflow of the proposed HCNet consists of three main steps. Specifically, we first used a conventional CNN as an encoder to extract high-dimensional feature maps from the input imagery. Based on the output feature maps from the encoder, we used a hierarchical cascade structure to extract and aggregate semantic information gradually from the local to the global scale. With the extracted multi-scale contextual information, we applied a coarse-to-fine strategy to restore the detailed information of marine objects in HSR imagery. In the following sections, we describe four important parts of the proposed HCNet: (1) the encoder based on VGG-Net [47]; (2) hierarchical contextual information aggregation; (3) detailed structure refinement; and (4) feature space refinement.

Encoder Based on VGG-Net
As illustrated in Figure 2, we first used the encoder network to transform the input imagery into high-dimensional abstract feature maps. To this end, we employed the widely used VGG-16 network as the backbone of our proposed HCNet for its high performance. The VGG-16 network is structured as five blocks of convolutional layers followed by three fully connected layers. Detailed information about the model architecture can be found in [47]. To avoid the loss of spatial information and accelerate the training process, following encoder architectures similar to [38,41,42], we removed all the fully connected layers of the original model, which contain approximately 89% of the total 138 million parameters. As high-resolution feature maps are instrumental to our multi-scale context feature extractor, we avoided down-sampling after the last two max-pooling layers by setting these pooling layers with both stride and padding of one. As a result, our encoder obtains high-resolution feature maps that are 1/8 of the input size instead of 1/32 as in the original VGG-16 network.
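As a quick check on that parameter budget, the fraction of VGG-16 parameters held by the fully connected layers can be computed directly from the standard layer configuration (a sketch in plain Python; the layer sizes below are those of the original VGG-16, not of our modified encoder):

```python
# Parameter count of the standard VGG-16, verifying that the three fully
# connected layers hold roughly 89% of the ~138 million total parameters.
conv_cfg = [(3, 64), (64, 64), (64, 128), (128, 128),
            (128, 256), (256, 256), (256, 256),
            (256, 512), (512, 512), (512, 512),
            (512, 512), (512, 512), (512, 512)]  # (in_ch, out_ch), all 3x3

# Each 3x3 conv layer: 3*3*in*out weights plus `out` biases.
conv_params = sum(3 * 3 * cin * cout + cout for cin, cout in conv_cfg)

# fc6: 7x7x512 -> 4096, fc7: 4096 -> 4096, fc8: 4096 -> 1000 (+ biases).
fc_params = ((7 * 7 * 512 * 4096 + 4096)
             + (4096 * 4096 + 4096)
             + (4096 * 1000 + 1000))

total = conv_params + fc_params
print(total)               # 138,357,544 parameters in total
print(fc_params / total)   # ~0.89: dropping the FC layers removes ~89%
```

This is why removing the fully connected layers alone shrinks the encoder so dramatically while leaving all convolutional feature extraction intact.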

Hierarchical Contextual Information Aggregation
In a CNN, extensive and powerful semantic information can be obtained by increasing the depth of the architecture, gaining from a larger receptive field and more non-linear operations [48]. However, semantic information captured at a single scale may lose the hierarchical dependence of objects on their surrounding environment, decreasing the ability to recognize confusing objects of various sizes. Therefore, multi-scale semantic information, which captures the relationships between target objects and their surrounding environment, is important for the identification of confusing marine objects.
To obtain multi-scale feature maps with different receptive fields, we applied atrous convolution [49] in this study. As shown in Figure 3a, an atrous kernel can increase its receptive field without additional parameters by dilating the kernel with zeros [34]. However, since the number of valid weights decreases as the atrous rate increases, it is still difficult to obtain a larger receptive field using a larger atrous rate with the current fusing strategy (e.g., direct concatenation, as shown in Figure 3b). For example, when applying a 3 × 3 kernel with an atrous rate close to the size of the feature maps, only the central weight of the kernel is valid, so the kernel functions the same as a kernel of size one. To solve this problem, we developed a novel hierarchical cascade architecture, as illustrated in the middle part of Figure 2. This architecture is expected to enlarge the receptive fields and increase the sampling rate while acquiring multi-scale contextual information. For instance, as shown in Figure 3a, the receptive field of the original convolutional layer with an atrous rate of 4 is 9, with contributions from only three pixels. In a hierarchical cascade architecture, as a layer at a higher level calculates features based on feature maps from lower levels, the receptive field at the higher level increases to 13, as shown in Figure 3c. Meanwhile, the final calculation draws on the information of seven pixels instead of the original three.
Specifically, we acquired a series of feature maps with local to global contextual information by organizing the atrous convolutional layers in a hierarchical cascade fashion, in which the atrous rates increase layer by layer (2, 4, 6, and 8 in our experiment). A layer with a smaller atrous rate is placed in the upper part, while a layer with a larger atrous rate is placed in the lower part. The outputs of each atrous convolutional layer are concatenated with the input feature maps and all the outputs from the previous atrous convolutional layers. The concatenated feature maps are then fed into the following atrous convolutional layer. In this way, we obtain increasingly larger receptive fields in the following atrous convolutional layers. Meanwhile, each intermediate set of concatenated feature maps contains semantic information from different scales. Each atrous convolutional layer in the hierarchical structure can be formulated as:

F_l = C_k,D_l[L(ⓒ(F_o, F_1, . . ., F_(l−1)))], l = 1, . . ., n,

where F_o represents the feature maps output by our encoder network; C_k,D_l[·] represents an atrous convolution operation with kernel size k and atrous rate D_l at level l; F_l (l = 1, . . ., n) represents the feature maps at level l in the hierarchical cascade structure; 'ⓒ' represents the concatenation operation; L(·) is the feature space refinement process, which is used to refine the fused multi-scale features and is described in Section 3.2.4; and D_l represents the atrous rate for capturing the corresponding feature maps at level l.
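The receptive-field arithmetic above can be checked numerically. The sketch below (NumPy, reduced to one dimension for clarity) composes the binary footprints of a rate-2 and a rate-4 three-tap kernel, standing in for two cascaded 3 × 3 atrous layers, and counts the support width and the number of contributing positions:

```python
import numpy as np

def dilated_mask(rate):
    """1-D binary footprint of a 3-tap kernel dilated by `rate`."""
    mask = np.zeros(2 * rate + 1)
    mask[[0, rate, 2 * rate]] = 1
    return mask

# A single 3-tap kernel with atrous rate 4: receptive field 9, 3 valid taps.
single = dilated_mask(4)
print(len(single), int(np.count_nonzero(single)))    # 9 3

# Cascading rate 2 then rate 4: the composed footprint is the convolution
# of the two masks -- receptive field 13, with 7 contributing positions,
# matching the counts given for Figure 3c.
cascade = np.convolve(dilated_mask(2), dilated_mask(4))
print(len(cascade), int(np.count_nonzero(cascade)))  # 13 7
```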

Detailed Structure Refinement
Apart from the confusing marine objects of various sizes, objects with fine structures in HSR images also increase the difficulty of accurate mapping. In fact, with increased down-sampling (i.e., "striding") and pooling operations, a CNN progressively decreases the size of the feature maps. Taking the widely used VGG-Net [47] as an example, the last feature maps are only 1/32 of the original image size. Thus, it is difficult to restore the detailed information at the original resolution, especially for objects with detailed structures.
In CNNs, it has been found that fine resolution feature maps from shallow layers can help restore fine structures [38,50]. Based on such findings, we propose to combine the low-level feature maps from the encoder for detailed structure refinement with a coarse-to-fine strategy. However, due to the inherent semantic gaps between feature maps at different levels, which manifest as adverse noise from the shallow layers, directly stacking these feature maps might not be the best way to proceed. To solve this problem, we gradually concatenated the refined feature maps from the shallow layers and the up-sampled feature maps from the previous layers by using long-span connections.
After that, we fused them by using a convolution operation (with 512, 512, and 256 kernels of size 3 × 3 for the respective operations in our experiments), as illustrated in Figure 2. This can be formulated as:

F_r = C_k,m[ⓒ(L(F_i'), Υ(F_i))],

where F_i represents the feature maps produced by the previous layers; F_i' represents the reutilized feature maps from the corresponding shallow layers in the encoder network; L(·) is the feature space refinement process, described in Section 3.2.4; Υ(·) is the bilinear interpolation process; 'ⓒ' represents the concatenation operation; C_k,m[·] represents a convolution operation with m kernels of size k; and F_r represents the generated feature maps.
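As a shape-level sketch of this fusion step (NumPy; nearest-neighbor up-sampling stands in for the bilinear interpolation, and a random 1 × 1 projection stands in for the learned fusion convolution, so all sizes here are illustrative assumptions):

```python
import numpy as np

# Coarse feature maps from the previous (deeper) layer: 16x16, 512 channels.
f_prev = np.random.rand(16, 16, 512)
# Finer feature maps reused from the corresponding encoder layer: 32x32, 512 ch.
f_skip = np.random.rand(32, 32, 512)

# Up-sample the coarse maps by a factor of 2 (nearest-neighbor here as a
# stand-in for the bilinear interpolation used in the paper).
f_up = f_prev.repeat(2, axis=0).repeat(2, axis=1)

# Long-span connection: concatenate along the channel axis, then a learned
# convolution fuses the 1024 channels back down -- a random 1x1 projection
# illustrates the channel bookkeeping only.
f_cat = np.concatenate([f_skip, f_up], axis=-1)   # (32, 32, 1024)
w = np.random.rand(1024, 512)
f_r = f_cat @ w                                    # (32, 32, 512)
print(f_cat.shape, f_r.shape)
```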

Feature Space Refinement
To enhance the feature representation and decrease the semantic gaps between feature maps at different levels, we propose to refine the feature space by using the attention mechanism: focusing on the important parts and suppressing the adverse noise or unnecessary parts of the feature maps.
As shown in Figure 4, the proposed strategy for feature space refinement covers two aspects, channel and spatial refinement, using simple yet effective attention-based structures. Each single refining process can be formulated as:

F_c = Φ_c(F) ⊗ F,
F_c_s = Φ_s(F_c) ⊗ F_c,

where F represents the feature maps to be utilized from a shallow or previous layer; Φ_c is the channel attention module; ⊗ represents element-wise multiplication; F_c represents the channel-refined feature maps; Φ_s is the spatial attention module; and F_c_s represents the final channel- and spatial-refined feature maps. The following paragraphs describe the details of each attention module.
We produced the attention maps by exploiting the inter-channel or inter-spatial relationships of the feature maps. To produce the channel attention map, we first aggregated the global spatial information of a feature map by employing a global average pooling operation, generating a global spatial context descriptor. After that, the descriptor was fed into a multi-layer perceptron (MLP) with one hidden layer to produce the channel attention map. To control the capacity and computational cost, we reduced the size of the hidden layer to 1/r, where r is the reduction ratio (16 in our experiments). The process of acquiring channel attention can be formulated as:

Φ_c(F) = σ(W_2 δ(W_1 C_AvgPool(F))),

where C represents the channel number of the feature maps; W_1 and W_2 represent the weights of the MLP layers, with sizes of 1 × 1 × C/r and 1 × 1 × C, respectively; C_AvgPool(·) represents the global average pooling operation on each channel of the feature maps; δ is the ReLU activation function; and σ is the sigmoid function.
Unlike the channel attention map, the spatial attention map is expected to find the most informative regions of the feature maps. To compute the spatial attention map, we first applied a global average pooling operation along the channel axis, generating a global channel context descriptor. After that, similar to the process for acquiring the channel attention map, we fed the flattened descriptor to an MLP with one hidden layer at a reduction ratio of r (16 in our experiments). Finally, we reshaped the output to the two-dimensional spatial attention map. The process of acquiring our spatial attention map can be formulated as:

Φ_s(F) = σ(W_2 δ(W_1 S_AvgPool(F))),

where H and W represent the height and width of the feature maps, respectively; W_1 and W_2 represent the weights of the MLP layers, with sizes of (H × W)/r × 1 and H × W × 1, respectively; and S_AvgPool(·) represents the global average pooling operation along the channel axis of the feature maps.

Implementation Details
As shown in Figure 2, we employed the encoder, a 16-layer variant of VGG-Net, to produce high-dimensional abstract features from the input imagery. Based on the output of the encoder network, we captured the hierarchical contextual information by using a group of atrous convolution operations with atrous rates of 2, 4, 6, and 8. Meanwhile, to keep the network from growing too wide and to control the model's size, we applied kernels of size 1 × 1 after each concatenation in the hierarchical cascade structure, reducing all concatenated feature maps to 512 channels, the same as the output of the encoder. We also set the kernel number of all the atrous convolution layers to 512 so that the contextual information of all levels is weighted equally.
For the detailed structure refinement, we chose only three layers in the encoder for refinement, as illustrated in Figure 2, for the following reasons: (1) although shallow layers carry much detailed information, they also contain much noise that is adverse to restoring the detailed structures; and (2) it is hard to train a CNN well with more complex structures and more parameters, especially on the typically small datasets in remote sensing. Besides, we chose the last convolution layer in each block before the pooling layer for refinement, because these layers contain much detailed information. We then used a 1 × 1 kernel after each concatenation to control the model's size, reducing the feature maps to a specific number of channels (i.e., 512, 512, and 256, respectively), the same as the corresponding convolution layers in the encoder. Finally, a convolution layer with four 1 × 1 kernels was employed to predict the label maps, which were further up-sampled by a factor of eight and passed through a softmax activation; categorical cross entropy was employed to measure the error between the predicted and actual values.
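The prediction head (per-pixel softmax with categorical cross entropy) can be sketched in NumPy as follows; with uniform logits over the four classes the loss reduces to ln 4 ≈ 1.386, which serves as a quick sanity check (the tiny label map is an illustrative assumption):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    return e / e.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(y_true, logits):
    p = softmax(logits)
    # Mean over all pixels of -sum(y_true * log(p)) along the class axis.
    return float(-(y_true * np.log(p + 1e-12)).sum(axis=-1).mean())

H, W, K = 4, 4, 4                                  # a tiny 4-class label map
y_true = np.eye(K)[np.zeros((H, W), dtype=int)]    # every pixel labeled class 0
logits = np.zeros((H, W, K))                       # uniform logits -> p = 1/4

print(categorical_cross_entropy(y_true, logits))   # ~ln(4) ≈ 1.386
```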
In the training process, 6141 patches with a size of 256 × 256, cropped from the pre-processed imagery, were utilized as inputs to our proposed HCNet. The ground truth map of each patch was obtained by visual interpretation and corrected by ground survey (released at https://github.com/yyong-fu/HCNet). We randomly selected 70% of the dataset for training and the remaining 30% for testing.
In the experiments, we implemented HCNet using the high-level application programming interface Keras (version 2.2.4) with TensorFlow (version 1.8.0) as the computational backend. All the algorithms were programmed in Python 3.5.2. We trained HCNet for 20 epochs using a batch size of four and Adam optimization with a learning rate of 0.0001, β1 of 0.9, and β2 of 0.999. We carried out all the experiments on a computer with a 4.20-GHz Intel(R) Core i7-7700K CPU, 16 GB of memory, and an NVIDIA GeForce GTX 1070 graphics processing unit (GPU).
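For reference, a single Adam update with these hyper-parameters can be sketched in plain NumPy as follows (ε = 1e-8 is an assumed value, following the original Adam paper; it is not stated in the text):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for weights w given gradient g at step t."""
    m = b1 * m + (1 - b1) * g        # biased first-moment estimate
    v = b2 * v + (1 - b2) * g ** 2   # biased second-moment estimate
    m_hat = m / (1 - b1 ** t)        # bias-corrected moments
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

w, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
w, m, v = adam_step(w, np.array([0.5]), m, v, t=1)
print(w)  # the first step moves by ~lr regardless of the gradient scale
```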

Object-Based Support Vector Machine (SVM) Classification
To provide a baseline for our proposed approach, we compared HCNet with the widely used OBIA approach. Over the past decades, OBIA has achieved great success in the classification of remote sensing imagery, especially HSR images. Meanwhile, a wide range of remote sensing applications have proved that the support vector machine (SVM) is an effective and reliable classifier [51]. Thus, as a typical and reliable method for the classification of remote sensing images, the object-based SVM is a suitable method for our classification and comparison purposes.
The first important step is to obtain image objects via segmentation, which serve as the basic classification units. Here, we employed the widely used multi-resolution segmentation (MRS) algorithm, implemented in the eCognition software (version 9.0), to produce semantically meaningful image objects, each of which is expected to represent an instance of some type of geo-object. Three key parameters control the segmentation process: the scale parameter (SP), shape, and compactness. Instead of using the "trial and error" method, we employed the Estimation of Scale Parameter (ESP) 2 tool [52] to select the optimal SP. The ESP 2 tool iteratively segments the image with SPs increasing by a fixed step size and, at every step, calculates the local variance, i.e., the mean standard deviation of the objects. Figure 5 plots the local variance values against the corresponding scale parameters. The local maxima of this curve indicate candidates for the optimal SP. The graph shows that the scale of 112 marks the first sharp break after a continuous decrease; thus, we set 112 as the optimal SP. Because CCA and RCA take various shapes in the study area, we assigned the shape parameter a low weight of 0.1. We then set the compactness weight to 0.5 to treat compactness and smoothness equally.
Once the semantically meaningful image objects were obtained, we constructed the initial feature space with 45 commonly used features, covering the typical spectral, geometric, and textural aspects of the segments (Table 1). Detailed information about these features can be found in [53]. To select the most representative features from the initial feature space, we utilized a wrapper method implemented in the Weka software (v3.8, University of Waikato, New Zealand, 2016). The wrapper method evaluates attribute subsets by using a learning scheme, with cross-validation employed to estimate the accuracy of every attribute subset. Eventually, we selected 18 features for classification: spectral features (mean of bands 4, 5, 7, and 8; standard deviation of bands 3 and 6; max. diff.), geometric features (border length, width, border index, and roundness), and textural features computed over all directions (homogeneity, contrast, dissimilarity, entropy, mean, and correlation from the gray-level co-occurrence matrix, and entropy from the gray-level difference vector). For the configuration of the SVM classifier, we employed the Radial Basis Function as the kernel function. We then used a simple grid search based on LibSVM [54] to determine the optimal penalty factor and gamma parameter, which were 1.6 and 0.14, respectively.
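To make the tuning step concrete, the sketch below shows an exhaustive grid search over the penalty factor C and the RBF gamma parameter. It is a minimal, library-free illustration: the scoring surface and the `rbf_kernel`/`grid_search` helpers are hypothetical stand-ins for LibSVM's cross-validated accuracy, and the candidate grids are illustrative rather than the ones used in the paper.

```python
import math
from itertools import product

def rbf_kernel(x, y, gamma):
    """Radial Basis Function kernel: K(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def grid_search(score_fn, C_values, gamma_values):
    """Return the (C, gamma) pair that maximizes a cross-validated score.
    score_fn(C, gamma) stands in for LibSVM's cross-validation accuracy."""
    return max(product(C_values, gamma_values), key=lambda p: score_fn(*p))

# hypothetical scoring surface whose peak sits at the paper's reported optimum
score = lambda C, g: -((C - 1.6) ** 2 + (g - 0.14) ** 2)
best_C, best_gamma = grid_search(score, [0.1, 0.4, 1.6, 6.4], [0.01, 0.05, 0.14, 0.5])
```

In practice, the grids are usually exponentially spaced and the search is repeated at a finer resolution around the first-round optimum.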

FCN-Based Methods
Because of their high performance in recent remote sensing applications, we also selected several state-of-the-art FCN-based methods for comparison: FCN-32s [30], U-Net [38], and DeeplabV2 [34]. We selected these models because they are all based on the VGG-16 Net or similar architectures and use long-span connections or multi-scale contextual aggregation strategies, which makes them well suited for comparison with our proposed structures. FCN-32s is the first FCN-based method; it uses neither multi-scale contextual information nor long-span connections and therefore serves as a baseline for all FCN-based methods. U-Net has a U-shaped structure containing an encoder on the left side and a decoder on the right side; the up-sampled features in the decoder are combined with symmetric high-resolution features from the encoder to enable precise localization and high classification performance. Unlike U-Net, DeeplabV2 uses atrous spatial pyramid pooling to capture objects and image context at multiple scales, and then applies a fully connected conditional random field to improve the localization and classification performance. Detailed information about these model architectures can be found in [30,34,38].
We used the same patches employed in our proposed method to train and test these deep models. In the training phase, we modified the number of outputs to four for all of these models and then trained them from scratch. The training parameters and strategies adopted for these models are the same as ours.

Accuracy Assessment and Comparison
In this study, we compared our proposed HCNet with the widely used object-based SVM method and several FCN-based models. We conducted accuracy assessments on the final classification results of the testing dataset, which covers 30% of the whole study area. To construct the error matrix, we confirmed by visual interpretation whether the pixels were correctly labeled. Finally, we calculated the accuracy statistics based on the error matrix, including the producer accuracy (PA), user accuracy (UA), overall accuracy (OA), and kappa coefficient.
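The error-matrix statistics above can be sketched in a few lines; the snippet below is a minimal, library-free illustration, and the two-class counts in the example are hypothetical, not the paper's data.

```python
def accuracy_statistics(cm):
    """Per-class producer/user accuracy, overall accuracy, and kappa from a
    square error (confusion) matrix; rows are reference classes, columns
    are predicted classes."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    diag = [cm[i][i] for i in range(n)]
    row_sums = [sum(row) for row in cm]
    col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]
    pa = [diag[i] / row_sums[i] for i in range(n)]   # producer accuracy
    ua = [diag[j] / col_sums[j] for j in range(n)]   # user accuracy
    oa = sum(diag) / total                           # overall accuracy
    # expected chance agreement, used to correct OA into the kappa coefficient
    pe = sum(row_sums[i] * col_sums[i] for i in range(n)) / total ** 2
    kappa = (oa - pe) / (1.0 - pe)
    return pa, ua, oa, kappa

# hypothetical two-class example
pa, ua, oa, kappa = accuracy_statistics([[90, 10], [5, 95]])
```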
To quantitatively assess the classification performance of our proposed method and the other methods, two commonly used accuracy metrics, the F1 score (F1) and intersection over union (IoU), were calculated. F1 is computed from the precision and recall:

Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 = 2 × Precision × Recall / (Precision + Recall),

where TP, FP, and FN refer to true positives, false positives, and false negatives, respectively. IoU is calculated as

IoU = |A_p ∩ A_GT| / |A_p ∪ A_GT|,

where A_p is the set of predicted pixels, A_GT is the set of ground-truth pixels, '∪' and '∩' represent the union and intersection operations, respectively, and |·| represents the number of pixels in a set.
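For a single class, both metrics follow directly from the confusion counts, since for binary masks |A_p ∩ A_GT| = TP and |A_p ∪ A_GT| = TP + FP + FN. A minimal sketch:

```python
def f1_and_iou(tp, fp, fn):
    """F1 score and intersection over union from per-class pixel counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)  # |A_p ∩ A_GT| / |A_p ∪ A_GT|
    return f1, iou

# hypothetical counts: 80 correct pixels, 10 false alarms, 10 misses
f1, iou = f1_and_iou(80, 10, 10)
```

Note that F1 and IoU rank methods the same way for a single class (F1 = 2·IoU/(1+IoU)), but averaging them over classes can yield different orderings.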
To evaluate the classification performance for RCA and CCA under the different methods, we calculated these accuracy metrics for each category. In addition, we used the mean accuracy metrics of RCA and CCA to evaluate the average performance of the different methods.

Classification Results and Accuracy Assessment
The final classification results using the proposed HCNet are shown in Figure 6. A visual inspection shows that most of the RCA and CCA were identified successfully. We also noticed that some ponds in the land area were misclassified as sea area; this is because, within an image patch of limited size, such ponds appear entirely as seawater. To quantitatively assess the classification performance, we used a testing dataset with over 1200 randomly selected patches for accuracy assessment, which accounts for 30% of the whole area. Table 2 shows the confusion matrix of the classification results. The sea area and land area have the best classification performance, with PA and UA values over 98%. The RCA and CCA have relatively high UA values of 95.1% and 96.4%, respectively, indicating that over 95% of the pixels classified as RCA and CCA are indeed RCA and CCA. The CCA also have a relatively high PA value of 96.5%, indicating that over 96% of the CCA in the imagery are correctly labeled. Thus, the CCA and RCA are classified successfully, with an OA greater than 95% and a high kappa coefficient of 0.97.

Accuracy Comparison
In this study, we compared our proposed approach with the object-based SVM and several state-of-the-art FCN-based methods. Table 3 shows the experimental setup and time complexity of the different classification schemes. The time complexity was obtained by averaging the time to perform classification on the testing dataset, which contains over 1200 images of 256 × 256 pixels. As shown in Table 3, the object-based SVM takes the longest time for inference. With GPU acceleration, our proposed method and the other FCN-based methods take less time, and our proposed HCNet takes the least. This is mainly because we reduced the number of trainable parameters in our model: (1) we removed the last three fully connected layers of the original VGG-16 architecture from our encoder; (2) we used 1 × 1 convolution operations to control the model size; and (3) we used bilinear interpolation instead of deconvolution for the up-sampling operations. To provide a quantitative assessment of the performance of the different methods, the commonly used accuracy metrics F1 and IoU were calculated on the testing dataset for the CCA and RCA, respectively (Table 4). The mean F1 and IoU values of CCA and RCA were also calculated to assess the overall performance. As shown in Table 4, U-Net and DeeplabV2 achieve a similar accuracy level, with mean IoU values of approximately 88%. The object-based SVM achieves the lowest accuracy, with a mean IoU value of only approximately 80%. Our proposed method achieves the best performance, with the highest mean F1 value of 95.29% and the highest mean IoU value of 91.03%.

OBIA vs. Our Approach
In this study, we first compared our proposed HCNet with the object-based SVM method. Object-based methods have been widely used for classification in the past few years, especially for HSR images. Unlike traditional pixel-based methods, object-based methods use segments of an image as the basic units for classification. Classification based on segments has many benefits, including decreased spectral variability and increased spatial and contextual information, such as geometric features [55]. Thus, the uniform spectral character and abundant features of the image objects increase the classification accuracy and eliminate salt-and-pepper noise. However, when the spectral and geometric features of adjacent objects are similar, it is hard for segmentation algorithms to obtain high-quality image objects, and segmentation also takes extra time and computation. In the classification phase, it is likewise hard to design and choose discriminative features as input to the classifier. The uncertainties in both of these necessary procedures limit the classification performance.
To overcome such limitations, our proposed method mainly contributes in two aspects. First, unlike the standard OBIA procedure of "segmentation and then classification", our method is implemented in an "end-to-end" way, which avoids segmentation errors and is more efficient for large-scale HSR image classification. Second, the two methods differ in the feature design process. The feature space employed in OBIA generally consists of handcrafted features designed from statistical analyses of a local area in the HSR imagery, and there remains an inherent tradeoff for handcrafted features between high discrimination performance and robustness [56]. In contrast, our proposed HCNet automatically learns multi-level semantic information from local to scene scales. Thus, our proposed method achieves better classification performance, with a nearly 10% improvement in mean IoU at the pixel level.

Conventional FCN-Based Methods vs. Our Approach
The FCN, a fully convolutional version of the CNN, has become the state-of-the-art dense classification method in recent years [26,57,58]. However, conventional FCN-based methods face two main problems in precisely identifying the boundaries of marine aquaculture areas. First, it is difficult to capture the semantic information of confusing marine objects of various sizes through a single, fixed receptive field. Second, consecutive pooling operations largely reduce the resolution of the final feature maps, making it difficult for the predicted results to recover the original resolution of the input imagery by learning. As shown in Figure 7, the predicted boundaries of CCA and RCA from the FCN-32s model are largely smoothed, and some small objects are misclassified or neglected by the FCN-32s model.
Although some studies, e.g., U-Net and DeeplabV2, try to improve the classification results by incorporating information from shallow layers or a feature pyramid, it is still hard for them to identify objects at different scales while retaining detailed information. In our study, we fully combined the information from the shallow layers and the feature pyramid to improve the classification results. In addition, we enlarged the receptive field for feature maps from the feature pyramid and increased the representation power of feature maps from the shallow layers, which is helpful for the prediction. Thus, as shown in Figure 7, our method significantly improved the classification performance, with an improvement in the mean IoU value of nearly three percentage points.
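The difference between direct stacking and hierarchical cascading can be sketched with the one-dimensional atrous convolution of Figure 3. The toy below is a library-free illustration with an all-ones kernel; the branch wiring is our reading of the two aggregation schemes, not the exact HCNet implementation.

```python
def atrous_conv1d(x, kernel, rate):
    """1-D atrous (dilated) convolution with zero padding, output length equal
    to input length; a size-k kernel with rate d spans d*(k-1)+1 input samples."""
    k = len(kernel)
    half = rate * (k - 1) // 2
    out = []
    for i in range(len(x)):
        s = 0.0
        for j in range(k):
            idx = i + j * rate - half
            if 0 <= idx < len(x):  # zero padding outside the signal
                s += kernel[j] * x[idx]
        out.append(s)
    return out

x = [float(v) for v in range(8)]
k3 = [1.0, 1.0, 1.0]
# direct stacking (cf. Figure 3b): every branch reads the same encoder output
stacked = [sum(vals) for vals in zip(*(atrous_conv1d(x, k3, r) for r in (1, 2, 4)))]
# hierarchical cascade (cf. Figure 3c): each branch consumes the previous
# branch's output, so the effective receptive field accumulates across branches
h = x
for r in (1, 2, 4):
    h = atrous_conv1d(h, k3, r)
cascaded = h
```

In the cascade, the spans of 3, 5, and 9 samples compound into a much larger effective receptive field than any single branch in the stacked variant provides.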

Figure 7. The classification results of CCA and RCA using our proposed method and other comparison methods (columns: image, ground truth, FCN-32s, U-Net, DeeplabV2, and ours-HCNet). The black circles indicate where our proposed HCNet obtains the best performance.

Ablation Analysis
To explore the benefits brought by the different proposed structures, we conducted ablation experiments on our proposed HCNet. We used the simplest encoder-based model, mainly composed of the encoder (see Figure 2) followed by an eight-fold up-sampling and the classification layer, as the baseline method. Table 5 shows the accuracy assessment results for variants of HCNet obtained by adding the different structures gradually. As can be seen, the classification performance of each category improves as our proposed structures are added. As shown in Table 5, multi-scale contextual information fused in a parallel stacking way improves the classification performance only slightly. In contrast, our proposed hierarchical cascade structure substantially improves the classification performance, with an improvement in the mean IoU value of nearly 2.4 percentage points. Moreover, when the detailed structure refinement and feature space refinement strategies are applied, the classification performance improves even further.
Table 5. Quantitative comparison for the ablation experiments on our proposed HCNet. 'Mul' represents aggregating the multi-scale information in the commonly used direct stacking way. 'Mul+HCI' represents aggregating the multi-scale information in our proposed hierarchical cascade way. 'Mul+HCI+DIR' additionally applies the detailed structure refinement strategy. 'Mul+HCI+DIR+FSR' additionally applies both the detailed structure refinement and feature space refinement strategies, as shown in Figure 2.

Potential Applications and Limitations
Our proposed HCNet contains four carefully designed structures: the encoder, the hierarchical cascade architecture, the long-span connections, and the attention-based module. The encoder can be employed in most current CNN architectures thanks to its fast convergence and reduced memory consumption. The combination of the hierarchical cascade architecture and long-span connections helps capture large-scale contextual information while maintaining detailed information. Thus, it would be helpful for classifying objects or geographical landscapes with complex components from HSR imagery, such as buildings [59,60], urban functional zones [61], and forms of rural settlements [62]. Some analyses of natural imagery may also benefit from such structures, such as the identification of urban street scenes [63,64], agricultural trees [65], cells [66,67], and bacteria [68,69]. In addition, as the attention-based module is very helpful for neural networks in finding the most representative parts of abundant features, it can also aid feature space refinement in high spectral resolution image applications [70].
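As a concrete illustration of the channel-refinement idea (Figure 4), the sketch below applies global average pooling, a one-hidden-layer MLP, and a sigmoid to produce per-channel weights that rescale the feature maps. It is a simplified reading of the module: the weight matrices and toy feature maps are hypothetical, and the max-pooling path and spatial attention are omitted for brevity.

```python
import math

def channel_attention(F, w1, w2):
    """Simplified channel attention: per-channel global average pooling ->
    one-hidden-layer MLP (ReLU) -> sigmoid weights -> rescale each channel.
    F is a list of C channels (each an HxW list of lists); w1 is a hidden x C
    weight matrix and w2 a C x hidden weight matrix (biases omitted)."""
    C = len(F)
    # global average pooling: one scalar descriptor per channel
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in F]
    hidden = [max(0.0, sum(w1[h][c] * pooled[c] for c in range(C)))
              for h in range(len(w1))]
    logits = [sum(w2[c][h] * hidden[h] for h in range(len(hidden)))
              for c in range(C)]
    weights = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    # channel-refined feature maps: Fc' = weights (x) F, element-wise per channel
    return [[[weights[c] * v for v in row] for row in F[c]] for c in range(C)]

# toy input: two 2x2 channels with hypothetical MLP weights
F = [[[1.0, 1.0], [1.0, 1.0]],
     [[2.0, 2.0], [2.0, 2.0]]]
refined = channel_attention(F, w1=[[1.0, 1.0]], w2=[[1.0], [0.0]])
```

The second channel here receives a neutral weight of 0.5 (zero logit), while the first is weighted by sigmoid(3) ≈ 0.95, showing how the module emphasizes or suppresses whole channels.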
Meanwhile, our proposed HCNet has several limitations. First, although HCNet can successfully identify marine aquaculture areas from HSR imagery with a spatial resolution of 0.5 m, it is relatively time consuming for HCNet to perform classification on all the split patches, because some patches may not cover the targets; further research on detecting the presence of targets before classification may therefore be helpful. Second, future studies may accelerate the training and inference processes by using model compression methods, such as parameter pruning and sharing [71], low-rank factorization [72], network quantization [73], and knowledge distillation [74]. Third, our proposed method applies only to marine aquaculture areas that cover the water surface; however, there are still a few submersible cages in some aquaculture areas, such as those of Shandong Province in northeastern China.

Conclusions
In this study, we proposed a novel end-to-end hierarchical cascade neural network to identify and discriminate different types of marine aquaculture areas from HSR imagery. Our proposed HCNet achieves high classification performance by focusing on three aspects: (1) a hierarchical cascade structure is employed to capture multi-scale contextual information by enlarging the receptive field, which helps identify confusing objects of various sizes; (2) a coarse-to-fine refinement strategy is proposed to refine the target objects gradually, which helps restore the detailed information of marine objects with fine structures; and (3) an attention-based module is proposed to refine the feature space in both the channel and spatial dimensions.
Experimental results show that our proposed HCNet successfully identified the CCA and RCA, with an OA greater than 95% and a high kappa coefficient of 0.97. Compared with the conventional OBIA method and state-of-the-art FCN-based methods, our proposed HCNet achieves significant improvements in both visual and quantitative performance. In addition, our proposed method has lower time complexity than comparable methods.
Future studies may focus on testing our method on discriminating other types of confusing land cover and land use with detailed structures. Meanwhile, to speed up the mapping of aquaculture areas from HSR images and enhance its applicability, researchers may focus on applying image segmentation preprocessing to areas of interest and on accelerating the deep model. Additionally, as the training of deep models requires a large amount of precious manually labeled ground truth, it is necessary to investigate methods to train the models in a less supervised way.

Figure 1 .
Figure 1. The study area, Sandu Island, is a typical marine aquaculture area in Ningde City, Fujian Province, China. The image shows a WorldView-2 image of the study area in true color, with examples of cage culture areas (CCA) and raft culture areas (RCA) from the satellite (a1 and b1, respectively) and the ground (a2 and b2, respectively).


Figure 3 .
Figure 3. (a) One-dimensional atrous convolution with an atrous rate of 4; 'D' represents the atrous rate. (b) An illustration of the commonly used direct concatenation strategy. (c) One-dimensional atrous convolution with an atrous rate of 4 in the hierarchical cascade way.


Figure 4 .
Figure 4. Overview of the channel and spatial attention mechanism used in this study. F represents the original feature maps. H, W, and C represent the height, width, and number of channels of the feature maps, respectively. MLP represents the multi-layer perceptron with one hidden layer. Fc' represents the channel-refined feature maps. '⊗' represents element-wise multiplication. F'c_s represents the final channel- and spatially refined feature maps.


Figure 5 .
Figure 5. Scatter diagrams produced by the Estimation of Scale Parameter (ESP) 2 tool. The local variance (LV) and the rate of change of LV (ROC-LV) values are plotted against the corresponding scale parameters. The gray vertical dotted line shows our selected optimal SP.


Figure 6 .
Figure 6. Classification results of CCA and RCA using our proposed method.


Figure 7 .
Figure 7. The classification results of CCA and RCA using our proposed method and other comparison methods. The black circles indicate where our proposed HCNet obtains the best performance.

Table 1 .
Object features used for image analysis with the object-based SVM method. GLCM: gray-level co-occurrence matrix. GLDV: gray-level difference vector.

Table 2 .
Confusion matrix for the final classification results.

Table 4 .
Quantitative comparison between our method and other methods at the pixel level, where the best values are in bold and the second-best values are underlined.