Deep Learning-Based Maritime Environment Segmentation for Unmanned Surface Vehicles Using Superpixel Algorithms

: Unmanned surface vehicles (USVs) are receiving increasing attention in recent years from both academia and industry. To make a high-level autonomy for USVs, the environment situational awareness is a key capability. However, due to the richness of the features in marine environments, as well as the complexity of the environment inﬂuenced by sun glare and sea fog, the development of a reliable situational awareness system remains a challenging problem that requires further studies. This paper, therefore, proposes a new deep semantic segmentation model together with a Simple Linear Iterative Clustering (SLIC) algorithm, for an accurate perception for various maritime environments. More speciﬁcally, powered by the SLIC algorithm, the new segmentation model can achieve reﬁned results around obstacle edges and improved accuracy for water surface obstacle segmentation. The overall structure of the new model employs an encoder–decoder layout, and a superpixel reﬁnement is embedded before ﬁnal outputs. Three publicly available maritime image datasets are used in this paper to train and validate the segmentation model. The ﬁnal output demonstrates that the proposed model can provide accurate results for obstacle segmentation.


Background and Motivation
Unmanned vehicles, especially drones and autonomous cars, have been widely applied in our daily lives due to the recent advances in robotics and artificial intelligence. Such a trend has attracted increasing attention from the maritime industry with unmanned surface vehicles (USVs) being rapidly developed. USVs can perform various tasks, such as scientific exploration, environmental information collection, search and rescue, and communication, in various scenarios [1]. More importantly, USVs have an inherent advantage in exploring hazardous areas because of their miniaturised design, which can greatly reduce the reliance on human operators and allows autonomous long-duration operations in harsh environments.
In general, a marine environment in which a USV operates is typically dynamic and unpredictable influenced by sudden high-speed incursions on the route, weakness in environmental awareness due to waves and sea fog, and signal disruptions, etc. Various sensors such as radar, LiDAR and inertial measurement units can be used for environment perception and localisation. However, these sensors may have several disadvantages, such as limited detection accuracy, impaired capability in detecting submerged obstacles, and high prices, especially for LiDAR sensing. Recently, due to the benefits of being able to provide rich texture information and relatively low price, visual cameras have been intensively used for object detection in maritime environments. Among several vision processing algorithms, the classical background subtraction method presents a high false negative rate under the harsh marine environment, which makes it unsuitable to navigate USVs [2]. Target tracking and object detection techniques based on stereo-vision [3] can be generally implemented to cope with dynamic environment perception. However, the accuracy of 3D point cloud map-based algorithms is sensitive to the texture features in the environment. In particular, when the water surface environment has very few features, low-cost cameras can be compromised in providing a reliable detection and consequently generate challenges for environment sensing [4]. For example, a fast semantic segmentation method was proposed in 2006 based on model adaptation using low-cost cameras. The method can achieve an outstanding fast target detection, but it shows a limited accuracy in detecting sunlight reflections and floating objects [5].

Contributions
Deep learning has raised a lot of attention in recent years due to its powerful feature extraction ability. Semantic segmentation, as one of the key areas using deep learning, is a computer vision task that entails taking raw data (such as images) as input and turning them into masks with highlighted regions of interest. Semantic segmentationbased obstacle detection can assist USVs with identifying potential collision risks when they are conducting tasks. However, existing studies on obstacle detection are typically associated with high false-positive rates and some semantic segmentation algorithms cannot obtain a good obstacle contour from a complex backgrounds. Therefore, to improve the obstacle detection accuracy, especially in environments with complex background, this paper aims to develop a deep learning-based marine environments segmentation algorithm integrated with a superpixel segmentation for a refined object detection result. The contributions of this paper are summarised as follows: • A superpixel segmentation model called Simple Linear Iterative Clustering (SLIC), has been innovatively integrated with a deep neural network in this paper to improve the segmentation accuracy, especially for obstacle edge detection in maritime environments. • Enriched cross validations based on three different maritime datasets are conducted with the results proving that the proposed SLIC enabled model has a strong capability in understanding the semantics of the environment. • Obstacle detection performances are validated using a series of practical maritime datasets with the results showing that a high obstacle detection accuracy can be achieved when using the segmentation generated by the proposed network.
The rest of paper is organised as follows: Section 2 presents a list of related works based on deep learning networks. Section 3 is a description of the Simple Linear Iterative Clustering (SLIC) segmentation algorithm and the construction of the SLIC module is also presented. Section 4 presents all the datasets used in this paper, as well as the results. Section 5 concludes the paper, acknowledges the shortcomings and highlights possible future research directions.

Related Work
This section provides a literature review of using computer visions together with machine learning techniques to support autonomous navigation missions of USVs. Given that most USVs are still remotely or semi-autonomously controlled, sensors such as LiDAR [6], radar [7], and visual systems [3], have been intensively used for scene perception. As stated previously, visual systems, as a relatively cheap solution that can provide enriched sensing information, have been recently adopted. Conventional machine learning approaches, such as Non-parametric Bayesian methods [8] have been applied to process visual signals for semantic labelling. In this work, we limit the review scope to deep convolutional neural network-based methods which are typically implemented to process visual information to assist with USV navigation, for their strong learning capabilities of rich features.

Image Semantic Segmentation Networks
As a common computer vision issue, semantic segmentation or full pixel semantic segmentation [9] includes image classification, object recognition detection and semantic segmentation. Image semantic segmentation network can provide a pixel-level image interpretation while deferring to human perception, and hence has a broad range of applications. In this section, typical deep networks proposed for semantic segmentation are reviewed with their features highlighted.
Convolutional Neural Networks (CNN) [10] have several fully connected layers after the convolutional layers, which project the feature map generated by the convolutional layers into a fixed-length feature vector. The classical CNN structure represented by AlexNet [11] is suitable for image-level classification and regression tasks since they export a probabilistic description of the whole input image as an output.
Different to CNNs, fully convolutional networks (FCN) can accept input images of any sizes, and use a deconvolutional layer to upsample the feature map. The semantic segmentation network based on FCN using a U-Net structure was proposed for segmentation of medical images in 2015 [12]. U-Net, similar to FCN, has a downsampling and an up-sampling phase and can solve: (1) a pixel localisation problem with shallow information and (2) a pixel classification problem with deep information to accomplish a semantic segmentation. In contrast to the FCN network, the downsampling and up-sampling phases within the U-Net structure employ the same number of convolutional layers. Furthermore, U-Net uses a skip connection structure to send the results of distinct downsampling layers to the appropriate up-sampling layers, allowing the network to obtain more precise pixel space information and increase segmentation accuracy.
SegNet [13] was proposed to improve the scene perception capability within autonomous navigation. The methods within the SegNet and the FCN are rather similar, with the exception that the encoder and decoder within the SegNet (downsampling and up-sampling) use different technologies. As shown in the SegNet architecture in Figure 1, the left side shows a convolutional extraction of features by pooling the perceptual field and shrinking the image (a process known as encoding). Convolution is used here to extract features from the encoder. The deconvolution and up-sampling layers are located on the architecture's right side. After image classification, the features are repeated by deconvolution, then the up-sampling process restores the feature map to the original size of the image (referred to as decoding). The DeepLab family [14][15][16][17] is a series of semantic segmentation networks proposed by Google. DeepLab v1 was launched in 2014 and achieved the second place in the segmentation task on the PASCAL VOC2012 dataset, followed by DeepLab v2, DeepLab v3 and DeepLab v3+ from 2017 to 2018. Two innovations of DeepLab v1 are the Atrous Convolution and the Fully Connected conditional random field (CRF). DeepLab v2 was thereafter improved by proposing an Atrous Spatial Pyramid Pooling (ASPP). DeepLab v3 further optimises the ASPP by adding a 1 × 1 convolution, batch normalization operation, etc.
DeepLab v3+ adds an up-sampling decoder module to optimise the accuracy of edges. Considering that the network features of DeepLab v3 do not encompass excessive highlevel features, DeepLab v3+ [17] adapts the encoder-decoder structure of Feature Pyramid Networks (FPN) to achieve the fusion of feature maps across blocks. Another contribution of DeepLab V3+ is the adaptation of Aligned Xception network as backbone. As shown in Figure 2, the DCNN part is considered to be encoder, and the part of DCNN output that is upsampled is considered to be decoder, forming an encoder-decoder structure. DeepLab v3+ concatenates the result of the up-sampling of the DeepLab v3 model output with the result of the downsampling of the DCNN output by a factor of 0.25. More refined results are then obtained using a 3 × 3 convolutional layers and bilinear interpolation of 4-fold up-sampling layers.

Superpixel Algorithms
Current image processing is largely pixel-based, using a two-dimensional matrix to describe an image without taking into account the spatial organisation relationship between pixels, making the algorithm less efficient. Ren et al. [18] initially presented the notion of superpixels in 2003, which are image blocks made up of surrounding pixels with identical texture, colour, brightness, and other characteristics. It groups pixels based on their similarity in features, allowing it to gather redundant information from the image and greatly minimise the complexity of subsequent image processing operations.
Graph theory-based approaches and gradient descent-based methods are the two most common types of superpixel creation algorithms. The graph-based approach proposed by Felzenswalb et al. [19], the N cut method proposed by Shi et al. [20], and the superpixel lattice method proposed by Moore et al. [21] are the main graph theory-based superpixel segmentation methods. The N cut method uses contour and texture properties to globally minimise the cost function, resulting in regular superpixels, while picture convenience is not fully preserved and is computationally expensive. The graph-based approach, which uses the concept of the least spanning tree to segment the image, can better keep the image border and is faster. However, the size and collision of the created superpixels are uneven. The superpixel lattice technique preserves the image's topological information, but its performance is heavily reliant on the image's pre-extracted bounds.
For gradient descent-based algorithms, important methods include the watershed method [22], the meanshift method [23], the quick-shift method [24] and the Simple Linear Iterative Clustering (SLIC) method [25]. They are all based on the core concept of clustering. In particular, the SLIC approach is based on colour and distance similarity, and it can produce superpixels of uniform size and shape. More specifically, Achanta et al. [25] proposed a single-minded and easy-to-implement algorithm to convert a colour image into a 5-dimensional feature vector in CIELAB colour space and XY coordinates, and then construct a metric for this feature vector to perform the process of local clustering of image pixels. By initialising the seed points and using a similarity measure, compact, approximately uniform superpixels can be generated relatively quickly. Ren et al. [26] implemented the SLIC algorithm using a GPU and the NVIDIA CUDA framework was able to increase the speed by a factor of 10. Lucchi et al. [27] used the SLIC segmentation algorithm for preprocessing and then built a graph model with superpixels as nodes and spatially adjacent nodes as edges, gave random field definitions for the corresponding conditions and proposed a structural image segmentation using coregulated features.

Deep Learning-Based Segmentation for USVs
In the past, processing for the detection task of marine semantics data was usually implemented through hardware methods such as LIDAR [28,29]. However, due to the application of machine learning to the navigation of USVs, CNN [30][31][32][33], Haar [34], and HOG [35] classifiers are gradually implemented for detection-based tasks. The WaSR [36] network proposed by Borja Bovcon and Matej Kristan achieves target recognition by building fusion blocks in the decoder structure using ASPP modules and FFM modules. Chen et al. [37] achieve accurate target recognition by embedding attention mechanism module to achieve accurate marine semantic segmentation results. In general, most of the existing semantic segmentation networks have been developed and redeveloped based on encoder-decoder structures.
When dealing with datasets containing a large number of samples with diverse features, current deep convolutional neural network approaches are still ineffective, and they typically require additional algorithmic modules, such as an attention mechanism module. In general, larger datasets and more training are required to allow the network model to learn the features of various environments. Nonetheless, realistic circumstances in which the network model has not learned will always exist, and the inference results are often not ideal at this level.
Although superpixel segmentation [38][39][40][41] techniques are frequently used in image preprocessing to discover edge characteristics in images, they are rarely used in marine semantic segmentation tasks due to their inability to segment semantically. However, the capability of storing target edge information of superpixel techniques can be used to help with addressing the problem of object edge information loss that occurs in deep convolutional neural networks.
In summary, most of the existing semantic segmentation networks have been developed based on encoder-decoder structures, but the following research gaps can be generated: • When dealing with datasets containing a large number of samples with diverse features, current deep convolutional neural network approaches are still ineffective. • Although superpixel segmentation techniques are frequently used in image preprocessing to discover edge characteristics in images, they are rarely used in marine semantic segmentation tasks due to their inability to segment semantically. • Despite the fact that many studies have used deep learning approaches to solve the segmentation problem of marine semantic datasets, there has been little discussion of the differences between deep convolutional neural networks with different depths in learning marine semantic features.
Therefore, in contrast to literature that investigates new effective deep convolutional network structures for marine semantic segmentation, this paper has proposed a new SLIC enabled segmentation model for refined obstacle detection in maritime environments. It can be demonstrated that by integrating the superpixel algorithm with existing deep neural convolutional networks, the detection accuracy especially for obstacle edges can be well improved. This can potentially benefit USVs' navigation in a confined area with complex obstacle structures.

Method
The proposed semantic segmentation architecture ( Figure 3) is assisted by a Simple Linear Iterative Clustering (SLIC) [25] module that refines the segmentation outputs from the deep neural network, with the goal of improving the inference accuracy. Two different types of backbones in DeepLab v3+ the network are used to assess the impacts on semantic segmentation performance.

Simple Linear Iterative Clustering Algorithm
The SLIC algorithm [25] performs an iterative categorisation operation on each pixel by converting the colour image into a LAB colour space [42] and combining it with X-Y coordinates to construct a distance function on a five-dimensional vector For pixel i, l i is the luminance of the pixel; a i is the colour vector for green to red; b i is the colour vector for blue to yellow; x i and y i indicate the coordinates of the pixel. Please note that differing from RGB colour space, LAB colour space is implemented here as it contains spatial information. Because of the simplicity of the constructed 5D vector C i , the SLIC algorithm is faster and more efficient than other superpixel algorithms, and can produce a set of superpixels that are compact and have uniform shapes, which facilitate the handling of colour images.
The SLIC algorithm starts from initialising the starting positions of superpixels. When there are N pixels in an image and K superpixels to be assigned, the size of each superpixel is N/K and the distance between each superpixel is S = √ N/K. For each pixel in an image, the degree of similarity between the pixel and its nearest superpixel is calculated in an iterative way and pixels are grouped to have the same label of the most similar superpixel. To calculate the similarity between each pixel and the superpixel, the distance measure variable D is introduced and calculated as shown in Equation (1).
where S is the distance between each pair of superpixels, m indicates the relative importance of space and pixel colour, d c is the distance of colour information between pixels, d s is the distance of spatial information between pixels. Figure 4 shows that as the value of m increases, the weight of spatial relations decreases and the weight of colour relations increases, resulting in that the generated superpixels are more irregular in shape but fit closely to the edges in the original image, while at the same time there are more unclassified pixels. To improve the algorithm's computational speed, when clustering each pixel, rather than in a whole image, similar pixels are searched only in a 2S × 2S region (as shown in Figure 5) and all pixels within this area will be grouped belonging to the same superpixel.

Structure of the SLIC Enabled Segmentation Model
As explained in the previous section, DeepLab v3+ adds a decoding module to recover some of the detailed information. More specifically, the DeepLab v3+ network employs two bilinear interpolation upsamplings to increase the feature resolution but without the capability of fully restoring the lost information during the up-sampling process. The lost image information is especially centred around edges areas, which is not ideal for an accurate object edge segmentation.
Hence, to address this issue, a new segmentation model with superpixel optimisation is proposed in this paper with the network structure shown in Figure 5. The model adopts a DeepLab v3+ network for an initial segmentation with the encoder module of DeepLab v3+ shown in the blue bounding box and the decoder module shown in the red bounding box. Using such an encoding-decoding structure, semantic segmentation features of images can be extracted to generate a relatively coarse segmentation result. In the meantime, the superpixel optimisation module is used as a subsequent processing part to optimise the semantic segmentation results, especially around edge areas. In this paper, we assume that a maritime environment only contains three semantic categories including sky, ocean and obstacles, the SLIC algorithm sets various criteria for each category to preserve the sky and ocean while trimming the edges of the obstacle. Last, by fusing the high-level semantic feature information from DeepLab v3+ and the refined object edges information from the SLIC, the final output can be generated. Please note that DeepLab v3+ network is adopted because it is reliable and relatively accurate. Other semantic segmentation networks, such as SegNet [13], BiSeNet [43] and DFANet [44], can also be selected according to specific requirements, which will be discussed in this paper.
The details of the fusion process within the proposed model ( Figure 2) is explained here. The superpixel segmentation information from the SLIC algorithm will be projected onto a new image first. Although the SLIC superpixel segmentation information is being mapped, the superpixels are reclassified based on the semantic segmentation results from the DeepLab v3+. Such a process is equivalent to "recolouring" the superpixel segmentation results, i.e., reassigning semantic information to each superpixel. By following such a process, the whole image can be rescanned pixel by pixel until no empty pixels exist.

Network Implementation
The network implementation includes the selection of backbone and the configuration of the loss function to achieve optimised learning results. To investigate the impact of the SLIC algorithm on the inference time and accuracy, this paper will discuss the adaptability of the proposed model using different types of backbones in the DeepLab v3+ network.
More specifically, Xception [45] and ResNet101 [46] are selected as they are the most commonly used ones in image segmentation. Between them, Xception, as a network structure using a depthwise separable convolution operation, has the feature of reduced network parameters making it computationally efficient. On the contrary, ResNet101 is constructed by combining a forward neural network with shortcut links, to address the problem of degradation caused by deep layers of networks and subsequently increase the learning capacity.
With regards to the loss function construction, as stated in [37], we consider a maritime environment consists of three categories including sky, ocean and obstacles. Training of the model is undertaken using public available datasets such as the MaSTr1325 dataset [47], where there is an issue of label unbalance between obstacles and the other two categories. To accommodate such an issue, L f inal , a mixture of cross-entropy (Equation (3)) and focalentropy (Equation (2)) loss functions is adopted in this paper as shown in Equation (4): where L f l is the focal-entropy loss function; p t i is the estimated probability of all three types of targets; γ is the hyperparameter; L ce is the cross-entropy loss function; y t i is the label of the three different categories of targets in the image;ŷ t i is the estimated output of the three categories of targets after passing through the network. Cross-entropy loss is a widely implemented classical loss function for machine learning, and focal-entropy loss can solve unbalanced classes issue by assigning weights. Such a loss function is appropriate for our semantic segmentation task with large-size networks as backbones. It is not useful in practice to find the global optimum for large-size networks when many local optimums exist with good performance; excessive training may lead to overfitting [48].

Trade-Off: Compactness and Accuracy of Superpixel Segmentation
It should be noted that there is a trade-off in optimising the output of the DeepLab v3+ network using the SLIC algorithm, where a good superpixel segmentation result is usually deemed to have good compactness and connectivity. In practice, however, when the value of m becomes small, some "holes" might be generated in places that are not classified as any cluster. To eliminate these "holes", the output of DeepLab v3+ must be marked and coloured in the end, which may affect the effectiveness and efficiency of the SLIC optimisation. As the value of m becomes larger, the problem of generating "holes" pixels can be significantly improved, but under-segmentation issue might occur and the superpixels do not adhere well to the edges, especially in those irregularly shaped targets such as masts, mountains, birds etc., resulting in a reduced optimisation.
To resolve the conflicting relationship between the compactness of superpixels and the segmentation accuracy generated by the SLIC algorithm, the number of superpixels usually must be increased when performing a segmentation task. This means that densely clustered regions are generated on the image using a larger number of small-sized superpixels in order to increase the sensitivity of the segmentation algorithm, i.e., using a high value of m to generate high compactness, while still having a good sensitivity of the superpixels to the target edges. Such a procedure will be specifically discussed in the Experiments and Results section, but it is without a doubt that as the number of superpixels increases, the computational time of the SLIC algorithm will increase accordingly, which may limit the on-line computing performance of the SLIC algorithm.

Experiments and Results
This section discusses the implementation results of the proposed model. Evaluation metrics will be first discussed followed by data augmentation methods and training setups. The performances of the SLIC algorithm on the maritime datasets will be presented together with the comparative analysis of the proposed model with the original DeepLab v3+ network.

Dataset and Evaluation Metrics
In this paper, the proposed model is trained using the Marine Semantic Segmentation Training Dataset (MaSTr1325) [47], a large scale marine semantic segmentation training dataset designed for small USVs for target detection. The MaSTr1325 dataset contains 1325 fully annotated high-resolution images of the real marine environments. Groundtruth masks are labelled with different values depending on the categories, for example, obstacles are labelled with 0, sea or water with 1, sky with 2 and unknown areas with 4. A total number of 1221 images in the MaSTr1325 dataset were used for the training, and the remaining images were included in the validation set. Another used dataset is the Multi-modal Marine Obstacle Detection Dataset 2 (MODD2) [49], which is the most comprehensive and voluminous marine obstacle detection dataset available. It contains images of targets at different scales in different weather and extreme conditions, such as sunny days, hazy weather, reflections on the sea surface, and varying sea states. Marine Image Dataset (MID) [50] was also selected to be included in the validation set to test the performance of the model because the MID dataset contains enriched open sea area images. Sample images from different datasets are shown in Figure 6.
Accuracy, recall, F1-measure, and mIoU, are used to quantitatively evaluate the performance of the proposed deep neural network model [51]. Precision is calculated by dividing the number of genuine positive samples by the total number of samples. Recall is calculated by dividing the number of true positive samples by the total number of positive samples that should be identified. The F1 score is a mixture of the two indications, where a model with a higher F1 score is more resilient. mIoU is used to compute the ratio of the true and predicted sets' intersection. For semantic segmentation tasks, the mIoU value provides an intuitive representation of how closely the inference results match the ground truth. The greater the value of mIoU (the closer it is to 1), the better the model performs in theory. The definition of these four categories of metrics is shown in Equations (5)- (8).
where TP indicates the number of samples that are true positive. FP indicates the number of samples that are false positive. FN represents false negative. Thus, TP + FP represents the number of samples that are predicted to be positive and TP + FN is the number of samples that are actually positive. As shown in Equation (8), k is the number of semantic types in the dataset, P is the predicted value for each semantic type, and G is the groundtruth value for each semantic type. Again, in this paper, the value of k is 3, representing the three semantic types, i.e., obstacle, ocean and sky.

Data Augmentation and Training Setups
Data augmentation is a useful technique for boosting data variety by systematically creating extra training samples. The 1221 high resolution (512 × 384) images used for training in MaSTr1325 are exposed to random data augmentation such as vertical symmetry, brightness change, contrast change, saturation change, etc. The proposed network model was trained using Adam optimiser and the learning rate was initialised to be 10 −4 . The Resnet101 network used in DeepLab v3+ was pre-trained on a subset of COCO train2017 [52], and the Xception network used was pre-trained on the ImageNet dataset [11]. Each type of network was trained for 50 epochs with a batch size of 2. All training and validation work was done on a node equipped with a Nvidia Tesla V100 at UCL's High Performance Computing platform. All inference work was done on an Intel Core i7-10875H 2.3 GHz workstation with 16 GB of RAM. All parameters are shown in Table 1.  Figure 7 shows the output images of the SLIC algorithm under different iterations. The black dots in the figure are the clustering centres of each superpixel and the output is Gaussian blurred for each superpixel for better visual discrimination. When configuring the number of superpixels to be 500 and m = 10, the image is segmented in an iterative way. It can be observed in Figure 7 that when the number of iterations reaches 10, the segmentation result is close to the original image showing that the superpixels can provide edges detection. However, it can also be seen that some of the superpixels in the marine area are irregularly shaped and jagged, suggesting that the superpixels are more sensitive to colours than spatial relationships, which can lead to "holes" in the segmentation results.   Figure 7, it can be seen that as the number of superpixels increases, the size becomes smaller, and the output segmented image is more detailed. For example, in the left in Figure 4, the white stripe on the hull of the ship can be well preserved, which is not the case when K = 2000. In addition, as the number of superpixels increases, each superpixel is more compact with fewer "holes" in the image. As can be seen from the above results, the application of the SLIC algorithm achieves a good accuracy when processing images with marine semantic types. Most of the superpixels can detect the edges of objects. One of the foreseeable drawbacks is that the iterative segmentation results using small m-values are too sensitive to some features leading to "holes" due to unworked superpixel edges. Overall, however, the SLIC algorithm achieves fast segmentation of images through a simple logical structure, which has potential and feasibility for refining inferred images from the DeepLab V3+ network. The results from the output of the SLIC optimisation module and the impact of carrying the SLIC module on the overall convolutional neural network model will be shown and discussed in the next section. Figure 9 shows the output of the DeepLab v3+ network model using Xception as the backbone (labelled as initial results) and the output optimised using the SLIC algorithm (labelled as refined results). It can be seen that the SLIC algorithm can finely trim features with detailed textures. For example, features such as masts and streetlights in the first and third rows of figures have sharper edges after being processed by the SLIC module. The SLIC algorithm module can also rectify several portions in the sky that are mistakenly recognised as barriers in the third row of figures. Furthermore, the railings of the boats on the water in the second row are trimmed finer to get closer to the ground truth, and the trimming is better than the other two sample figures, implying that the SLIC algorithm module has a more robust trimming strategy for complex obstacles such as hole-filled or thinly shaped targets. However, the improvement of the SLIC algorithm is relatively small due to the limited learning capability of Xception. Figure 10 depicts the DeepLab v3+ network model output using ResNet101 as the backbone (labelled as initial results), as well as the output optimised using the SLIC method (labelled as refined results). The SLIC algorithm module also has a stronger refining impact on the ResNet output mask's edges, such as the railings of the ship in the second row in Figure 10, while the other two rows also show results that are closer to the ground truth. Compared with Xception, ResNet101 has a more powerful learning capability, so the results obtained are more refined. In addition, with the SLIC algorithm, the edges of objects are clearer. For example, some errors (islands on the left of the horizon) in the second row of figures that could affect USVs navigation can be well corrected.

Quantitative Results
The quantitative results of the SLIC algorithm module's effect on the DeepLab v3+ network model is shown in Table 2. The mIoU metric is used to visualise the output of the proposed model. The evaluation metrics used to segment the images using the DeepLab v3+ network model with different backbones, are all derived based on the 324 images in the validation dataset.
In Table 2, according to the inference time of four network models, the DeepLab v3+ network model with ResNet as the backbone outperformed the one with Xception, regardless of whether the SLIC algorithm module was employed. When the SLIC algorithm is adopted, the mIoU of the DeepLab v3+_ResNet101 model increased from 89.1% to 90.1% and the mIoU of the DeepLab v3+_Xception model increased from 85.5% to 85.9%. Although such an increase is relatively small, it is mainly generated around edge areas to further improve the high accuracy produced by DeepLab v3+. Additionally, all currently available datasets are taken from open sea areas with limited edge features, to better reveal the superiority of the proposed model, new dataset, especially containing complex features, need to be recorded. Table 3 compares the mIoU of our DeepLabv3+_ResNet101+SLIC with that of the WODIS [37] on the Singapore Maritime dataset (SMD) [53], MID and MODD2 validation datasets. The WODIS network is a state-of-the-art deep network that has been specifically designed for the maritime environment, which has better performances than other standard networks such as SegNet [13]. Table 3 demonstrates that our proposed DeepLabv3+_ResNet101+SLIC framework is on par with the WODIS, i.e., achieving 89.2% mIoU on the MID dataset (88.1% for WODIS). On the SMD and MODD2 datasets, similar mIoU values (0.2% and 0.1% differences on SMD and MODD2 respectively) can be observed for the two models. Please note that high mIoU values for WODIS are mainly the results of the inclusion of attention modules that are focusing on the objects on water; whereas our proposed network achieves the high mIoU values by having refined results of obstacle edges, which can be better used to ensure navigational safety.
In terms of inference time, the DeepLab v3+_ResNet101 model was about 0.5 seconds quicker than the DeepLab v3+_Xception model. The difference in mIoU between these two network models in inferring the complete validation set was around 3.6%. This is because ResNet101 has a greater depth and learning power, as well as stronger generalisation capability. However, because of its more complicated structure, ResNet takes longer to train and is more prone to overfitting than the Xception network model, thus it is critical to monitor the training process while stated parameters are used.
From a runtime perspective, although the DeepLab v3+_ResNet101 model is faster and more accurate than the DeepLab v3+_Xception network model, the capacity of processing images in real time is slightly limited due to the inherent network architecture limitations. However, we argue that the main focus of this paper is on refining edges detection and the inference time can be improved by implementing light weight structures such as MobileNets [54] into the proposed model.

Conclusions
This paper proposes a novel deep semantic segmentation model for unmanned surface vehicles operating in the complex maritime environment. The proposed model employs a deep neural network with an auto-encoder layout and a superpixel refinement module which refines the reconstruction output from the deep neural network. Three open-source maritime image datasets are applied to train and validate the developed semantic segmentation model. The results imply that the deep neural network assisted by the superpixel method has achieved improved semantic segmentation performance with slightly increased inference time. The improved segmentation performance is attributed to the SLIC superpixel method, which is good at distinguishing edges in images. Such a unique feature compensates for the imperfect edge information provided by the deep neural network, consequently leading to an improved mean IoU of 90.1% when the combination of DeepLab v3+_ResNet101 + SLIC is applied. Additionally, we have also observed that the model is less effective in images with a low fraction of obstacles, which is possibly due to the insufficient connectivity between the SLIC and the deep neural network. The SLIC processing also increases the time complexity. In future work, the SLIC will be better integrated with the deep learning processes.