ECNet: Efficient Convolutional Networks for Side Scan Sonar Image Segmentation

This paper presents a novel and practical convolutional neural network architecture for semantic segmentation of side scan sonar (SSS) images. As a widely used sensor in marine survey, SSS provides high-resolution images of the seafloor and underwater targets. However, because background pixels dominate SSS images, class imbalance remains an issue. Moreover, SSS images contain undesirable speckle noise and intensity inhomogeneity. We define and detail a network and training strategy that tackle these three important issues in SSS image segmentation. Our proposed method performs image-to-image prediction by leveraging fully convolutional networks and deeply-supervised nets. The architecture consists of an encoder network to capture context, a corresponding decoder network that restores full input-size resolution feature maps from low-resolution ones for pixel-wise classification, and a single-stream deep network with multiple side outputs to optimize edge segmentation. We measured the prediction time of our network on our dataset, implemented on an NVIDIA Jetson AGX Xavier, and compared it with other semantic segmentation networks. The experimental results show that the presented method brings clear advantages for SSS image segmentation and is applicable to real-time processing tasks.


Introduction
Side scan sonar (SSS), among the most common sensors used in ocean survey, provides images of the seafloor and underwater targets. Target detection based on SSS images has a great variety of applications in marine archaeological surveying [1], oceanic mapping [2], and underwater detection [3][4][5], in which the main task is SSS image segmentation.
Various methods for SSS image segmentation have been proposed, most of which are unsupervised, such as active contour models, clustering methods, and Markov random field (MRF) methods. Among these, clustering and MRF segmentation are the most common techniques. Specifically, Celik et al. [6] applied a clustering algorithm to SSS image segmentation. First, a multiresolution representation of the input image was constructed using the undecimated discrete wavelet transform (UDWT), from which feature vectors were extracted. Second, principal component analysis (PCA) was used to reduce the dimension of each feature vector. Finally, k-means clustering grouped the feature vectors into disjoint clusters. No extra prior assumptions were required by this algorithm; however, its performance degrades when there is serious class imbalance. Our network architecture instead draws on recent promising results of fully convolutional networks for image segmentation and edge detection [35,36].
One of the strengths of CNNs is that their multilayer structure can automatically learn multiple levels of features. These abstract features are used to decide the classes of objects within an image, which is useful for categorization. However, since some details of objects are lost, it is difficult to recover the specific outline of objects and the category of each pixel, and thus difficult for a plain CNN to achieve image segmentation. This situation improved when the idea of the fully convolutional network appeared in Shelhamer et al. [20]. Pixel-level prediction tasks achieved great progress by replacing fully connected layers with convolution layers and inserting differentiable interpolation layers and a spatial loss. Thus, fully convolutional networks, offering effective feature extraction and end-to-end training, became one of the most popular solutions for semantic segmentation tasks.
Most architectures designed specifically for segmentation have an encoder network and a corresponding decoder network [23,25,27]. The former, which learns rich hierarchical features of the input data, is composed of modules from classification networks such as VGG [16], GoogLeNet [17], and ResNet [18]. These classification networks achieve good results on recognition tasks. Among them, ResNet has proved its superiority in several challenging recognition benchmarks, such as the ImageNet [37] and MS COCO [38] competitions, and the number of semantic segmentation networks built from residual building blocks is increasing. He et al. [39] investigated and analyzed residual building blocks through a series of ablation experiments. These experiments, emphasizing the key role of skip connections and the importance of their identity mappings, supported the idea that keeping a "clean" information path makes optimization easier. Based on these findings, our encoder unit uses the "full pre-activation" unit, which was shown to perform better than the baseline counterparts.
Deep learning, especially the convolutional neural network (CNN), performs well in image preprocessing because of its self-learning ability. In recent years, image denoising methods based on deep learning have also been proposed and developed. In 2008, Jain et al. proposed a CNN-based denoising method for natural images [40], which can yield similar or even better results than conventional methods such as the wavelet transform and Markov random fields. In 2016, Mao et al. proposed a deep fully convolutional autoencoder network for image denoising [41]. In this network, the convolutional layers are responsible for feature extraction, capturing abstract information from the image content while eliminating image noise; correspondingly, the deconvolution layers restore the image details.
The aforementioned models relying on fully convolutional networks have significantly contributed to the state of the art, but all of them still lose some useful features from middle layers. In order to utilize the characteristics of the middle layers as fully as possible, we exploit the approach of holistically-nested edge detection (HED) [35]. HED mainly addresses two problems: end-to-end training and prediction, and multiscale, multilevel feature learning. Through their architecture, the authors not only perform image-to-image prediction but also carry out edge detection using the rich hierarchical features learned in middle layers. A single-stream deep network with multiple side outputs was adopted for multiscale learning, by which the authors highlight the importance of obtaining edge maps. The side-output layers, inserted after each encoder unit, encourage coherent contributions that improve accuracy. The network can generate predictions at multiple scales without significant redundancy in either representation or computational complexity.

Network Architecture
Our network architecture is illustrated in Figure 1. The main module of our network is shown in Figure 1b: from top to bottom, the side-output network, the encoder network consisting of four encoders, and the decoder network with the four corresponding decoders. Given an input image (Figure 1a), an encoder network made up of residual building blocks extracts the feature maps. Three side-output layers are connected to the last layers of encoder1-3 to obtain side outputs at different scales. In ECNet, the inputs of decoder1-3 are the sum of the corresponding encoder outputs and the previous decoder outputs. Each decoder takes advantage of the max-pooling indices stored by the corresponding encoder during the max-pooling operation; the decoders use these indices to up-sample their input feature maps. The inputs of decoder4 come only from encoder4. The main outputs are then obtained after decoder1. In this way, spatial information lost during the encoder operations is recovered and coarse high-level feature maps can be refined. Finally, the three side outputs and the main outputs are combined and averaged to produce the final prediction map (Figure 1d).
The architecture of an encoder unit is illustrated in Figure 2a. Each encoder is composed of a residual block and a convolutional block. In order to enhance the expressive ability of the encoder network and make the whole model easier to learn and optimize, each encoder unit uses a skip connection following the full pre-activation unit of [39]. Each residual block's outputs are then passed through one convolutional block to fit the layer dimensions. The layers within the encoder block are shown in Figure 2a. Here, conv means convolution with a kernel size of 3, and MaxPooling represents down-sampling by a factor of 2 by means of a max-pooling operation. Batch normalization (BN), followed by a ReLU nonlinearity [42,43], is placed ahead of each convolutional layer. Layer details for the decoder unit are shown in Figure 2b. MaxUnpooling denotes the nonlinear up-sampling operation driven by the max-pooling indices received from the corresponding encoder. The decoder1 outputs are fed to a classifier that independently estimates class probabilities for each pixel, as seen in Figure 2, and serve as the main output. Table 1 lists the feature maps used in each of these blocks.
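As an illustration, the encoder unit described above can be sketched in PyTorch as follows. This is a minimal sketch under our own assumptions: the channel widths, the number of convolutions inside the residual block, and the name EncoderUnit are illustrative choices, not the paper's exact specification.

```python
import torch
import torch.nn as nn

class EncoderUnit(nn.Module):
    """Hypothetical ECNet-style encoder: residual block + conv block + pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Residual block with "full pre-activation" ordering (BN -> ReLU -> conv)
        # and an identity skip connection, following He et al. [39].
        self.residual = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1),
        )
        # Convolutional block that fits the channel dimension of the next layer.
        self.fit = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        )
        # Down-sample by 2 and keep the indices for the matching decoder.
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)

    def forward(self, x):
        x = x + self.residual(x)          # identity path kept "clean"
        x = self.fit(x)
        x, indices = self.pool(x)
        return x, indices
```

The returned indices are what the corresponding decoder later uses for MaxUnpooling.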
In general, spatial information is lost in the encoder due to pooling and convolution operations, so the decoder must map low-resolution features back to input resolution for pixel-wise classification. The decoding method thus plays a key role in model performance. There are several existing decoding methods, such as those of U-Net [27], SegNet [23], and LinkNet [25]. The U-Net decoding method creates an up-sampling stage for each corresponding down-sampling stage in the original network; feature fusion is then performed by channel concatenation with the corresponding stage. This provides more perspective, but the increase in the number of channels affects computational efficiency. In SegNet, the decoder uses the max-pooling indices stored and transmitted from the corresponding encoder for up-sampling, obtaining a sparse feature map; this method has fewer parameters to fit. In LinkNet, the outputs of the encoder are directly added to the inputs of the corresponding decoder. This decoding method, bypassing spatial information directly, improves performance along with a significant decrease in processing time.
In this way, information that would have been lost using other decoding methods is retained, and no additional parameters or operations are required to learn this information.
Our decoding method combines the max-pooling indices of SegNet with the direct connections of LinkNet. The experiments in this paper show that this method provides more accurate segmentation results.
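The combined decoding method can be sketched as follows; this is a hypothetical PyTorch rendering, and the class name DecoderUnit, the channel widths, and the single conv stage are our own illustrative assumptions.

```python
import torch
import torch.nn as nn

class DecoderUnit(nn.Module):
    """Hypothetical ECNet-style decoder: SegNet-style max-unpooling using the
    stored encoder indices; the LinkNet-style addition of encoder output and
    previous decoder output is done by the caller before this unit."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.unpool = nn.MaxUnpool2d(2, stride=2)
        self.conv = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        )

    def forward(self, x, indices):
        x = self.unpool(x, indices)   # sparse up-sampling by a factor of 2
        return self.conv(x)
```

Caller-side fusion in ECNet would be element-wise, e.g. `dec3(enc3_out + dec4_out, enc3_indices)`, so no extra channels or parameters are introduced.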
Each of the last layers in encoder1-3 is followed by a 1 × 1 conv layer in the side-output module. A deconv layer then up-samples the feature maps back to the original image size. Finally, a sigmoid layer produces the outputs and loss of the side-output layers. As the encoder units have different receptive field sizes, our network can learn multilevel information, in particular object-level and low-level features, which is useful for semantic segmentation. Table 2 summarizes the receptive field and stride configuration. The outputs of the side-output layers are multiscale: the smaller the side-input size, the larger the receptive field.
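A side-output head of this kind can be sketched as below. This is a minimal sketch under assumptions of ours: the transposed-convolution kernel, stride, and padding are chosen so that the output is exactly up_factor times the input size, but the paper does not state these values.

```python
import torch
import torch.nn as nn

class SideOutput(nn.Module):
    """Hypothetical side-output head: 1x1 conv down to one score channel,
    then a transposed convolution (deconv) back to input resolution,
    followed by a sigmoid, as in HED-style deep supervision."""
    def __init__(self, in_ch, up_factor):
        super().__init__()
        self.score = nn.Conv2d(in_ch, 1, kernel_size=1)
        # kernel=2f, stride=f, padding=f//2 gives an exact f-times up-sampling
        # for even up_factor (an assumed, not paper-specified, choice).
        self.up = nn.ConvTranspose2d(1, 1, kernel_size=2 * up_factor,
                                     stride=up_factor, padding=up_factor // 2)

    def forward(self, x):
        return torch.sigmoid(self.up(self.score(x)))
```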

Formulation
We denote our input dataset by D = {(X_n, Y_n)}, n = 1, ..., N, where X_n = {x_ij^(n)}, i = 1, ..., w, j = 1, ..., h, denotes the raw input image of width w and height h, and Y_n = {y_ij^(n)}, with y_ij^(n) ∈ {0, 1}, represents the corresponding ground-truth binary map for image X_n.
Each encoder is composed of a residual block and a convolutional block. More concretely, the entire sets of standard network layer parameters in the residual block and the convolutional block are represented as w_L and w, respectively. The convolutional block computes outputs y_i by

y_i = F(x_i; w),

where F determines the layer type: a batch normalization layer, a ReLU nonlinearity as activation function, or a matrix multiplication for convolution. Each residual block can be expressed in the general form

X_{L+1} = X_L + F(X_L; w_L),

where X_L and X_{L+1} denote the inputs and outputs of the L-th residual block. We represent the entire set of network layer parameters as Θ. Suppose there are M side-output layers in the network; each is assigned a classifier, and the corresponding weights are denoted θ^(1), ..., θ^(M). Since most areas of the image are seabed reverberation and only a few are object highlights, the class imbalance issue cannot be avoided in SSS image datasets. To address this, the following class-balanced cross-entropy loss is used. The loss function of the main-output layer is

L_main(Θ) = (1/n) Σ_n [ -α Σ_{i ∈ Y+} log P(y_i = 1 | X; Θ) - (1 - α) Σ_{i ∈ Y-} log P(y_i = 0 | X; Θ) ] + λR,

where n is the number of training samples in each batch, α = |Y-| / (|Y-| + |Y+|) is the ratio of background pixels over all pixels, and 1 - α = |Y+| / (|Y-| + |Y+|). |Y+| and |Y-| are the numbers of target and background pixels in the ground truth, respectively. P(y_i = 1 | X; Θ) ∈ [0, 1] is computed with a sigmoid function on the activation value at pixel i, yielding the main-output map predictions Ŷ_main. R stands for the regularization term, and λ is the hyperparameter of the regularization. Similarly, the loss function of the m-th side-output layer is

L_side^(m)(Θ, θ^(m)) = (1/n) Σ_n [ -α Σ_{i ∈ Y+} log P(y_i = 1 | X; Θ, θ^(m)) - (1 - α) Σ_{i ∈ Y-} log P(y_i = 0 | X; Θ, θ^(m)) ].

The following objective function is therefore considered:

L(Θ, θ) = β L_main(Θ) + Σ_{m=1}^{M} β_m L_side^(m)(Θ, θ^(m)),

where (β, β_1, ..., β_M) are trade-off hyperparameters balancing the main-output and side-output losses. This objective is minimized by stochastic gradient descent:

(Θ, θ)* = argmin L(Θ, θ).

Given an input image X_n, predictions are obtained from both the main output Ŷ_main and the side outputs Ŷ_side^(m). During testing, we can select the main output Ŷ_main, or generate maps by further aggregation into Ŷ_f as the final output, where Ŷ_f is defined as

Ŷ_f = Average(Ŷ_main, Ŷ_side^(1), ..., Ŷ_side^(M)).
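The class-balanced cross-entropy above can be sketched numerically as follows. This is a minimal NumPy sketch of the weighting scheme only; the regularization term λR and the batch average are omitted, and the function name is our own.

```python
import numpy as np

def class_balanced_bce(pred, target, eps=1e-7):
    """Class-balanced cross-entropy for one image.

    pred   : predicted foreground probabilities in [0, 1]
    target : binary ground-truth map (1 = target, 0 = background)

    alpha = |Y-| / (|Y-| + |Y+|) is the background-pixel ratio, so the
    scarce target pixels receive the larger weight alpha.
    """
    target = target.astype(bool)
    n_neg = target.size - target.sum()
    alpha = n_neg / target.size
    loss_pos = -alpha * np.log(pred[target] + eps).sum()
    loss_neg = -(1 - alpha) * np.log(1 - pred[~target] + eps).sum()
    return (loss_pos + loss_neg) / target.size
```

With a near-perfect prediction the loss is close to zero, while mispredicting the rare target pixels is penalized more heavily than mispredicting background.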

Experiment
The experimental scheme of our SSS image segmentation process is provided in Figure 3. Accordingly, this section is arranged as follows: the collection method of the dataset images and the SSS image preprocessing stage are described in Section 4.1; the method used to obtain the ground truth of the dataset is introduced in Section 4.2; the implementation details and parameter setup of the experiments are described in Section 4.3.


SSS Image Preprocessing
SSS data were collected by a dual-frequency side scan sonar (Shark-S450D) in Fujian, China; some of the data are shown in Figure 4. For the original SSS image, the proposed method can achieve better performance with only some simple preprocessing, which also shortens the processing time. Therefore, we removed the middle waterfall area and performed bilinear interpolation on the raw SSS image. The results after these operations are provided in Figure 5a.
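This preprocessing step can be sketched roughly as below. It is a simplified stand-in, not the paper's implementation: the `gap` width and `out_w` are hypothetical parameters, and per-row linear resampling via `np.interp` replaces full bilinear interpolation for brevity.

```python
import numpy as np

def preprocess(waterfall, gap, out_w):
    """Drop the central water-column ('middle waterfall') band of width
    `gap` and linearly resample each row to out_w samples."""
    h, w = waterfall.shape
    # Keep the port and starboard halves, discarding the central band.
    keep = np.concatenate([waterfall[:, :(w - gap) // 2],
                           waterfall[:, (w + gap) // 2:]], axis=1)
    # Linear resampling along the across-track axis.
    xs = np.linspace(0, keep.shape[1] - 1, out_w)
    return np.stack([np.interp(xs, np.arange(keep.shape[1]), row)
                     for row in keep])
```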


Dataset
Ground truth for the images is essential in order to use a supervised method for SSS image segmentation. We used LabelMe, an image annotation tool developed at MIT, to manually label the images. In the corresponding ground-truth image (shown in Figure 5b), the blue and black colors denote the object and background areas, respectively.
The data image (shown in Figure 5a) and its corresponding labelled image (shown in Figure 5b) were then cut into patches of size 240 × 240 with a stride of 50. We selected samples with a target pixel ratio exceeding 5% to form the dataset. We used 70% of its images as our training set and the other 30% as our test set. Within the training set, about 20% of the data were randomly selected as a validation set. More precisely, the dataset consisted of 3528 training images, 882 validation images, and 959 testing images.
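The patch extraction and filtering step can be sketched as follows; the function name and return format are our own illustrative choices.

```python
import numpy as np

def extract_patches(image, label, size=240, stride=50, min_ratio=0.05):
    """Cut aligned (image, label) patches with the given size and stride,
    keeping only patches whose target-pixel ratio exceeds min_ratio."""
    patches = []
    h, w = label.shape
    for top in range(0, h - size + 1, stride):
        for left in range(0, w - size + 1, stride):
            lab = label[top:top + size, left:left + size]
            if lab.mean() > min_ratio:      # label is binary {0, 1}
                patches.append((image[top:top + size, left:left + size], lab))
    return patches
```

The kept patches would then be split 70/30 into training and test sets, with 20% of the training set held out for validation.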
Some challenging factors existed in the dataset, which are illustrated in Figure 6. Figure 6a illustrates a clear and standard SSS image. In Figure 6b, some target pixels are clear and others are weak. In Figure 6c, the images are dark. In Figure 6d, the target pixel areas are discontinuous. In Figure 6e, target pixels are weak and unclear. In Figure 6f, there is strong noise.

In addition to analyzing challenging factors in the dataset, the numbers of target and background pixels in the dataset were also compared, as shown in Figure 7. There is a serious class imbalance in the dataset, which poses a major challenge to model performance.

Experimental Parameters
All models were trained on an NVIDIA Quadro M5000 card, which provides sufficient computational power and resources to compute the model weights. Then, in order to ensure that these models can achieve real-time processing for on-board applications, the experimental tests concerning prediction time were performed on an embedded platform, the NVIDIA Jetson AGX Xavier.
To train our model on the NVIDIA Quadro M5000 card, stochastic gradient descent (SGD) was used with a global learning rate of 1e-3 and momentum of 0.9 and 0.999. We trained the variable parameters until the training loss converged, using PyTorch [44]. The hyperparameters were fixed as follows: mini-batch size (8), weight decay (0.0002), loss weight β_m (1) for each side-output layer, and loss weight β (1) for the main-output layer. Prior to each epoch, the training set was shuffled and mini-batches were drawn without replacement, ensuring that each image was used only once per epoch. Each image was then normalized. The model that performed best on the validation set was finally retained.
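The training setup above can be sketched as a minimal PyTorch loop. This is a hedged sketch, not the authors' code: `model`, `dataset`, and `criterion` are assumed to be defined elsewhere, the model is assumed to return (main output, list of side outputs), and validation-set selection and normalization are omitted.

```python
import torch
from torch.utils.data import DataLoader

def train(model, dataset, criterion, epochs=100):
    # shuffle=True reshuffles the training set at the start of each epoch,
    # so each image is used exactly once per epoch.
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    opt = torch.optim.SGD(model.parameters(), lr=1e-3,
                          momentum=0.9, weight_decay=0.0002)
    for _ in range(epochs):
        for images, labels in loader:
            opt.zero_grad()
            main_out, side_outs = model(images)
            # beta = beta_m = 1: main and side losses weighted equally.
            loss = criterion(main_out, labels)
            loss = loss + sum(criterion(s, labels) for s in side_outs)
            loss.backward()
            opt.step()
```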

Results and Discussion
We compared the performance of different models using four common metrics [20]: pixel accuracy (Pixel Acc.), measuring the proportion of correctly predicted pixels over all pixels; mean accuracy (Mean Acc.), the average of the prediction accuracy over all categories; mean intersection over union (mean IU); and frequency weighted IU (f.w. IU). The last two metrics are variations on region intersection over union (IU) used in target detection. IU is the overlap ratio between the candidate bound and the ground-truth bound, i.e., the ratio between their intersection and union:

IU = |C ∩ G| / |C ∪ G|,

where |C ∩ G| is the intersection of the candidate bound C and the ground-truth bound G, and |C ∪ G| is their union. Mean IU is calculated per class and then averaged, with both quantities computed from the ground truth and the segmentation results. Frequency weighted IU is an extension of mean IU that weights each class by its relative frequency. Let k be the number of classes, u_ij the number of pixels belonging to category i in the ground truth that are classified as class j in the segmentation results, and t_i = Σ_j u_ij the total number of pixels of class i in the ground truth. The four metrics are then:

Pixel Acc. = Σ_i u_ii / Σ_i t_i,

Mean Acc. = (1/k) Σ_i (u_ii / t_i),

mean IU = (1/k) Σ_i u_ii / (t_i + Σ_j u_ji - u_ii),

f.w. IU = (Σ_i t_i)^(-1) Σ_i t_i · u_ii / (t_i + Σ_j u_ji - u_ii).

Our decoder unit is shown in Figure 2b, and a detailed comparative study of different decoding methods is provided in this section. In order to verify the proposed decoding method, we carried out the following experiment: keeping the proposed network structure unchanged, we tested different decoding methods. Table 3 shows the comparison results. Our decoding method provides more detailed and accurate segmentation than the others within our architecture.
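The four metrics can be computed from the confusion matrix as sketched below; the function name and dictionary return format are our own choices.

```python
import numpy as np

def segmentation_metrics(pred, gt, k=2):
    """Pixel Acc., Mean Acc., mean IU, and f.w. IU from the confusion
    matrix u, where u[i, j] counts pixels of ground-truth class i
    predicted as class j."""
    u = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            u[i, j] = np.sum((gt == i) & (pred == j))
    t = u.sum(axis=1)                      # pixels per ground-truth class
    iu = np.diag(u) / (t + u.sum(axis=0) - np.diag(u))
    return {
        "pixel_acc": np.diag(u).sum() / u.sum(),
        "mean_acc": np.mean(np.diag(u) / t),
        "mean_iu": np.mean(iu),
        "fw_iu": np.sum(t * iu) / u.sum(),
    }
```

A perfect prediction yields 1.0 on all four metrics; predicting everything as background drives Mean Acc. and the IU measures down even when Pixel Acc. stays moderate, which is why these metrics matter under class imbalance.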
In the previous section, six challenging factors of SSS images were analyzed; Figure 8 shows the corresponding segmentation results. Generally speaking, our proposed model yields better results. However, when the SSS image is too dark or the target pixels are very unclear, the segmentation performs worse than ideal. ECNet is also compared with the overall architectures of U-Net, SegNet, and LinkNet on our dataset. Table 4 lists the comparison results.
Compared with the other networks, our proposed network reduces the receptive field. This could lead to a decrease in model performance, because the maximum size of detectable targets is thus reduced, potentially missing bigger targets. However, because the target areas in sonar data are small, this has little impact on sonar images, as confirmed by the experiments. As shown in Table 4, ECNet outperforms LinkNet and SegNet on all four measures, while U-Net performs slightly better than ECNet in pixel accuracy, mean IU, and f.w. IU. LinkNet, SegNet, and U-Net all have larger receptive fields than ECNet, confirming that reducing the receptive field has little impact on SSS images with small target areas. From this, we are inclined to the view that it is critical to adjust the model structure to make it easier to optimize, which may be one of the reasons why U-Net performs best in accuracy. Our approach, reducing the number of parameters and the model size, makes our model occupy fewer computing resources on an embedded platform, which is critical for real-time tasks. We also report the inference speed of ECNet and existing semantic segmentation networks on the NVIDIA Jetson AGX Xavier. Our default image resolution was 240 × 240, and Table 5 reports the number of parameters, the model size, and the inference time for a single input. As these numbers indicate, ECNet uses less memory and runs faster. Furthermore, it achieves real-time performance on the NVIDIA Jetson AGX Xavier and can therefore be employed for real-time inspection applications.
According to Tables 4 and 5, ECNet realizes the best trade-off between effectiveness and efficiency. Although U-Net has slightly higher accuracy than our network, ECNet is much faster, and our model is much smaller than the others. Since our model is simple and efficient, it can be easily applied to various tasks. The segmentation of the SSS image is less ideal when the image is too dark or the target pixels are very unclear. In a further study, we could add an image enhancement step for SSS images, such as increasing image brightness, adjusting image contrast, or histogram equalization. The network performance could also be improved through data augmentation, such as rotating the images to different angles and cropping the largest inscribed rectangle, or flipping the images at each angle. It is worth mentioning that our model and strategy, based on the HED approach, still do not explicitly consider neighboring-pixel information. In the future, we would like to explore how to optimize the model using the context between neighboring pixels.

Conclusions
In this paper, we presented a novel neural network architecture designed for semantic segmentation of SSS images, in which a novel encoder was designed to learn rich hierarchical features. Additionally, we took advantage of a single-stream deep network with side outputs after each encoder to learn useful features from the middle layers. To counter the class imbalance problem, the model used a weighted loss in which the target pixels receive a larger weight in the loss function. Finally, ablation studies were performed to compare different decoding methods; they show that the skip architecture in our decoding method provides the best compromise between computational efficiency and accuracy.
ECNet performs predictions much faster and more efficiently, which makes it possible to effectively utilize the limited resources of embedded platforms. We therefore expect ECNet to be broadly applicable to real-time tasks.