AlgaeMask: An Instance Segmentation Network for Floating Algae Detection

Abstract: Video surveillance on offshore booster stations and along the coast is an effective way to monitor floating macroalgae. Previous studies on floating algae detection are mainly based on traditional image segmentation methods. However, these algorithms cannot effectively extract Ulva prolifera and Sargassum at different sizes and views. Recently, instance segmentation methods have achieved great success in computer vision applications. In this paper, based on the CenterMask network, a novel instance segmentation architecture named AlgaeMask is proposed for floating algae detection from surveillance videos. To strengthen the network's ability to extract position and channel inter-dependencies, we introduce a new OSA-V3 module with a dual-attention block, which consists of a position attention mechanism and a channel attention mechanism. Meanwhile, scale-equalizing pyramid convolution is introduced to solve the problem of scale difference. Finally, we introduce a feature decoder module based on the FCOS head and a segmentation head to obtain the segmentation area of floating algae in each bounding box. Extensive experimental results show that the average precision of AlgaeMask in mask segmentation and box detection reaches 44.22% and 48.13%, respectively, an improvement of 15.09% and 8.24% over CenterMask. In addition, AlgaeMask can meet the real-time requirements of floating algae detection.


Introduction
In recent years, disaster events involving floating macroalgae have occurred frequently. These events have caused deterioration of the marine ecological environment, as well as serious economic damage to fisheries, marine transportation, and marine tourism in the coastal areas of China [1,2]. The large-scale accumulation of floating algae on the sea surface blocks sunlight and, as the algae die off, exhausts the oxygen in the water, which seriously affects the survival of marine life. Floating macroalgae disasters in the coastal areas of China mainly include green tides of Ulva prolifera and golden tides of Sargassum. Ulva prolifera is bright or dark green and appears in the Yellow Sea, while Sargassum is brownish-yellow or dark brown and mainly blooms in the East China Sea [3,4].
It is necessary to detect the location and distribution range of these floating algae (Ulva prolifera and Sargassum) accurately in the sea waters of China over a long period of time. Real-time monitoring of floating algae could provide a reliable basis for the analysis, prevention, and control of disasters to reduce economic and ecological costs. Therefore, many floating algae detection algorithms and methodologies have been researched thus far [5][6][7][8].
Satellite remote sensing technology is one of the effective ways to capture the distribution of floating algae in the ocean due to its advantages of broad spatiotemporal coverage and frequent data acquisition [9][10][11]. Moderate Resolution Imaging Spectroradiometer (MODIS) and Geostationary Ocean Color Imager (GOCI) data are commonly used in related research. Xing et al. captured the spatiotemporal features of floating Sargassum in the Yellow Sea to calculate their distribution and drifting path via high-spatial-resolution satellite images [4]. Wang et al. proposed a novel method to quantify Sargassum distribution and coverage by the MODIS alternative floating algae index (AFAI) over the Central West Atlantic region [12]. Xu et al. compared MODIS survey data with UAV images to verify the detection efficiency and accuracy for the green tides in the Yellow Sea [13]. Shin et al. monitored Sargassum distribution on the coast of Jeju Island using GOCI-II imagery captured in 2020 and adopted the GentleBoost model as the detection model [14]. To improve image quality, Cui et al. proposed a super-resolution detection model that reconstructs a high-resolution image of a region from GOCI images in order to distinguish floating macroalgae patches from the water area more precisely [15]. Liang et al. proposed an extreme learning machine (ELM) method to detect floating macroalgae based on GOCI data, addressing the difficulty of determining a suitable threshold value in traditional methods [16]. Qiu et al. used a multi-layer perceptron (MLP) to monitor floating macroalgae automatically from GOCI imagery in the Yellow Sea, with robustness to different environmental conditions [17].
Synthetic aperture radar (SAR) can image the earth in all weather conditions and at high spatial resolution [18]. Shen et al. proposed an unsupervised recognition method for green tide from RADARSAT-2 SAR images, paying attention to the polarimetric characteristics of green macroalgae blooms in both the amplitude and phase domains [19]. Ma et al. integrated MODIS with SAR to jointly detect green tide accurately in the Yellow Sea in 2021 and showed the spatiotemporal changes of the green tide in more detail than a single data source [5].
In traditional image processing methods, image transformation and threshold segmentation are adopted to segment floating algae. These methods have the advantages of simple feature extraction, fast computing speed, and low deployment cost. However, they require hand-crafted feature design and predefined templates, and the process of parameter adjustment is very complex. They are also very sensitive to environmental changes and difficult to apply to accurate floating algae monitoring in practice. With the development of deep learning technology in recent years, convolutional neural networks (CNNs) have been successfully applied in object recognition, image segmentation, video analysis, and so on, as they are able to automatically extract useful and rich features. At present, CNN-based methods are among the most popular methods in the field of floating algae detection [20]. Valentini et al. proposed a smartphone-camera-based Sargassum monitoring system in the French Antilles. The work adopted a pre-trained MobileNet-V2 model for image patch classification and a fully connected CRF to extract a detailed semantic segmentation [21]. Arellano-Verdejo et al. designed the ERISNet model based on CNN and RNN to detect floating and accumulated Sargassum in MODIS data along the Mexican Caribbean coastline [22]. Wan et al. introduced a novel Enteromorpha prolifera (EP) extraction framework for GOCI images. Firstly, a strategy for handling the sample imbalance between EP and the background was adopted. Then, a network based on 1D-CNN and Bi-LSTM was proposed to make use of the spectral features and context dependencies of each pixel [23]. For high-resolution aerial images captured by UAV, Wang et al. introduced an Ulva prolifera region detection method, using a superpixel segmentation algorithm to generate multi-scale patches and a binary CNN model to determine whether the patches are Ulva prolifera or not [24].
Building on CNNs, U-Net [25] proposes a symmetric structure composed of encoders and decoders to concatenate low-level and high-level features, so that the overall network presents a U-shaped structure. Methods based on U-Net and its variants have achieved great success in the field of image segmentation [26]. Therefore, many U-Net-based methods have been applied to floating algae monitoring. Kim et al. introduced the U-Net framework to detect red tide surrounding the Korean peninsula, which consists of five U-shaped encoder and decoder layers to capture the spectral features of red tide from GOCI images [27]. Guo et al. constructed an automatic SAR image detection method for green algae in the Yellow Sea based on the deep convolutional U-Net architecture [28]. Cui et al. proposed SRSe-Net to extract large-scale green tides based on the U-Net structure and a dense connection mechanism. SRSe-Net is able to extract green tides from low-resolution MODIS images by using the feature mapping learned from the GF1-WFV image domain [29]. Gao et al. proposed the AlgaeNet model based on U-Net to extract floating Ulva prolifera from MODIS and SAR images [30].
In computer vision, object detection is the task of locating and classifying objects of interest in images. Semantic segmentation is a form of pixel-level prediction that classifies each pixel by category; it only separates targets of different categories and cannot distinguish individual targets within the same category. Instance segmentation methods can not only locate the bounding box of each target in different categories, but also classify each object at pixel level within the same category. Therefore, the meaning of 'instance' is that the network has the ability to distinguish each individual target in the same category. Instance segmentation is more challenging, as it includes the tasks of both object detection and semantic segmentation [31]. Instance segmentation has been widely applied in the fields of autonomous driving, medical image analysis, and video surveillance [32][33][34]. Mask R-CNN [35] is one of the most widely applied instance segmentation algorithms today, developed from the object detection network Faster R-CNN [36]. Mask R-CNN adds a semantic segmentation branch for predicting each region of interest (ROI) to the object classification and regression branches, effectively detecting target objects and generating high-quality segmentation masks for each instance. In the Mask R-CNN framework, the final output masks are determined by the highest confidence of the object classification branch. However, these predicted masks are not optimal, as the correlation between the masks and the confidence is very low. To solve this problem, Mask Scoring R-CNN [37] designed Mask IoU, a mask evaluation strategy, to measure the distance between the real mask and the predicted mask. CenterMask [38] is an anchor-free instance segmentation framework that achieves real-time speed and high accuracy simultaneously. CenterMask introduced a new spatial attention-guided mask (SAG-Mask) branch to FCOS [39], a one-stage object detection method. The SAG-Mask branch takes the detected object bounding boxes and predicts a segmentation mask on each detected area.

The existing floating macroalgae detection and segmentation algorithms have poor portability and strict requirements on the observation environment, so it is difficult to apply them over a large range and for a long time. Video surveillance aboard ships and around the coastline has the advantages of high-definition resolution, real-time image transmission, and low cost, so it can be regarded as a useful supplement to remote sensing satellites and SAR, as shown in Figure 1.

In this paper, inspired by the successful application of CenterMask in image recognition and segmentation, we propose a new instance segmentation framework named AlgaeMask for floating algae detection, using surveillance images captured by on-site imaging, such as aboard ships and around the coastline. AlgaeMask integrates the bounding box detection and edge area segmentation of floating algae (Ulva prolifera and Sargassum) into a unified architecture, which can be applied effectively to practical scenarios.
The main contributions of our proposed AlgaeMask can be summarized as follows:
(1) A new feature extraction module based on One-Shot Aggregation (OSA) and a dual-attention mechanism is proposed. By integrating position attention and channel attention into the OSA architecture, the long-range position and contextual information of floating algae can be effectively extracted.
(2) Considering the features of floating algae at different scales, a multi-scale fusion module is introduced to capture the inter-scale correlation of the feature pyramid, which can effectively capture the scale-invariant features of floating algae.
(3) We evaluate the performance of AlgaeMask and other instance segmentation methods on different scenes. The results show that AlgaeMask achieves state-of-the-art performance in floating algae detection.
The rest of the paper is organized as follows. Section 2 describes the AlgaeMask network in detail. Section 3 introduces our experimental results and analysis, including the related dataset, evaluation metrics, qualitative and quantitative performance comparisons, and ablation study. Finally, the conclusions are summarized in Section 4.

Methods
As shown in Figure 2, AlgaeMask consists of a feature extraction module, a multi-scale fusion module, and a feature decoder module. In the feature extraction module, based on OSA-V2 in CenterMask, OSA-V3 is proposed to better capture the spatial and channel interdependencies of floating algae features by introducing a dual-attention mechanism. In addition, we replace the OSA-V2 block with the original OSA block at Stage 1 and Stage 2. The multi-scale fusion module extracts the scale-invariant features of floating algae using a Scale-Equalizing Pyramid Convolution (SEPC) block and a Feature Pyramid Network (FPN) block. In the feature decoder module, the Fully Convolutional One-Stage Object Detection (FCOS) head takes the output of the multi-scale fusion module and detects object bounding boxes at different scales. Finally, the segmentation head is applied to obtain the segmentation area of the objects in each bounding box.

Feature Extraction Module
The environment in which floating algae detection is applied is complex and changeable. Therefore, the feature extraction module needs strong feature extraction and anti-interference abilities. Meanwhile, as the area monitored by one camera is limited, real-time detection on multiple cameras is required for floating algae detection. To deal with this practical situation, the computation cost of our detection model should be as low as possible.
Compared to traditional backbone frameworks such as ResNet, DenseNet, or HRNet, OSA is a computation- and energy-efficient backbone network that can capture different receptive fields efficiently. However, due to the lack of an attention mechanism, OSA cannot extract long-range dependencies during the feature extraction phase. In order to enhance the performance of OSA, CenterMask proposes the OSA-V2 block, which introduces a channel attention block called effective squeeze-excitation (eSE) [38].
In floating algae detection, we find that eSE focuses only on the channel dependencies and ignores the position dependencies between different targets. To handle this insufficiency, we propose a new OSA-V3 block that improves detection accuracy by introducing a dual-attention mechanism composed of channel attention and position attention. The channel attention block can extract the feature interdependency between different channels. The position attention block captures the spatial location interdependency at the current scale, which helps the OSA-V3 block to restrict the location of floating algae regions to the sea surface and to reduce false detections. The architecture of the OSA-V3 block is shown in Figure 3.

In the detection of floating algae, it is necessary to input high-resolution images, as small targets account for the highest proportion. However, the computation cost of the attention mechanism mainly depends on the resolution of the images. Therefore, unlike the CenterMask architecture, we integrate the OSA-V3 block only in Stage 3. In addition, we replace the OSA-V2 block with the original OSA block in Stage 1 and Stage 2. We will further discuss how using the original OSA block in Stage 1 and Stage 2 not only reduces model complexity and computation cost, but also demonstrates better performance than the OSA-V2 block.

(1) OSA Block
Given the input feature map $f_{in} \in \mathbb{R}^{C \times H \times W}$, we first apply convolutional operations with kernel size 3 in turn to get $f_1 \sim f_n \in \mathbb{R}^{C \times H \times W}$, where $n$ is the number of 3 × 3 convolution layers in the block. Then we perform the convolutional operation with kernel size 1 on the concatenation of $f_1 \sim f_n$ to obtain the fusion result $F_{OSA}$. The calculation of this process is given in (1) and (2):

$$f_i = \mathrm{conv2d}_{3 \times 3}(f_{i-1}), \quad i = 1, \dots, n, \quad f_0 = f_{in} \tag{1}$$

$$F_{OSA} = \mathrm{conv2d}_{1 \times 1}\big(\mathrm{cat}(f_1, f_2, \dots, f_n)\big) \tag{2}$$

where conv2d is the convolutional operation and cat denotes the concatenation operation along the channel dimension. The output of OSA, $F_{OSA}$, is fed into the position attention block and the channel attention block in parallel.
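(2) Channel Attention Block
In the channel attention block, firstly, we reshape $F_{OSA}$ to $f_C \in \mathbb{R}^{C \times N}$, $N = H \times W$, and transpose $f_C$ to $f_C^{\top} \in \mathbb{R}^{N \times C}$. Secondly, we perform a matrix multiplication between $f_C$ and $f_C^{\top}$ and normalize the result to obtain the channel attention map $X \in \mathbb{R}^{C \times C}$:

$$x_{ji} = \frac{\exp(f_{C,i} \cdot f_{C,j})}{\sum_{i=1}^{C} \exp(f_{C,i} \cdot f_{C,j})} \tag{3}$$

where $x_{ji}$ represents the $i$-th channel's impact on the $j$-th channel, $f_{C,i}$ is the $i$-th row of $f_C$, and exp is the exponential operation.

Thirdly, we multiply the attention map $X$ and $f_C$, reshape the product to $\mathbb{R}^{C \times H \times W}$, and then multiply it by a scale parameter $\beta$ to obtain the result $f'_C \in \mathbb{R}^{C \times H \times W}$:

$$f'_C = \beta \cdot \mathrm{reshape}(X f_C) \tag{4}$$

Finally, an element-wise sum operation is performed between $f'_C$ and $F_{OSA}$ to obtain the channel attention result $C_{out}$:

$$C_{out} = f'_C + F_{OSA} \tag{5}$$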
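The channel attention steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the feature map is assumed to be already reshaped to a C × N list of lists, and the helper names `matmul`, `softmax_rows`, and `channel_attention` are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Apply a numerically stable softmax to every row of a matrix."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def channel_attention(f_c, beta):
    """f_c is a C x N matrix (flattened feature map); returns a C x N matrix."""
    f_c_t = [list(col) for col in zip(*f_c)]   # N x C transpose
    x = softmax_rows(matmul(f_c, f_c_t))       # C x C attention map, x[j][i]
    weighted = matmul(x, f_c)                  # C x N, row j = sum_i x[j][i] * f_c[i]
    # Scale by beta and add the input back (element-wise sum).
    return [[beta * w + f for w, f in zip(w_row, f_row)]
            for w_row, f_row in zip(weighted, f_c)]


feature = [[1.0, 0.0], [0.0, 1.0]]             # toy input: C = 2, N = 2
result = channel_attention(feature, beta=1.0)
```

With `beta = 0` the block reduces to the identity, which is a convenient sanity check when the scale parameter is learned starting from zero (an assumption borrowed from common dual-attention implementations, not stated in the text).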
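(3) Position Attention Block
In the position attention block, firstly, we reshape $F_{OSA}$ to $f_P \in \mathbb{R}^{C \times N}$, $N = H \times W$. Secondly, we transpose $f_P$ to $f_P^{\top} \in \mathbb{R}^{N \times C}$, perform a matrix multiplication between $f_P^{\top}$ and $f_P$, and use the softmax operation to calculate the position attention map $P \in \mathbb{R}^{N \times N}$:

$$p_{ji} = \frac{\exp(f_{P,i} \cdot f_{P,j})}{\sum_{i=1}^{N} \exp(f_{P,i} \cdot f_{P,j})} \tag{6}$$

where $p_{ji}$ represents the $i$-th position's impact on the $j$-th position and $f_{P,i}$ is the feature vector of $f_P$ at position $i$.

Thirdly, we perform a matrix multiplication between $f_P$ and $P$, so that each position $j$ receives the weighted sum $\sum_{i} p_{ji} f_{P,i}$. Then, we reshape the result to $\mathbb{R}^{C \times H \times W}$ and multiply it by a scale parameter $\alpha$ to obtain the result $f'_P \in \mathbb{R}^{C \times H \times W}$:

$$f'_{P,j} = \alpha \sum_{i=1}^{N} p_{ji} f_{P,i} \tag{7}$$

Finally, we perform an element-wise sum operation between $f'_P$ and $F_{OSA}$ to obtain the final position attention result $P_{out}$:

$$P_{out} = f'_P + F_{OSA} \tag{8}$$

After the calculation of the position and channel attention, we perform an element-wise sum operation between $P_{out}$ and $C_{out}$ to obtain the total attention result $S_{out}$. Then, we use a residual connection between the input feature map $f_{in}$ and $S_{out}$ to obtain the output of the feature extraction module:

$$S_{out} = P_{out} + C_{out} \tag{9}$$

$$F_{out} = f_{in} + S_{out} \tag{10}$$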
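To make the data flow of the dual-attention block concrete, the following plain-Python sketch combines a position attention branch and a channel attention branch and then applies the residual connection. It is a sketch under stated assumptions rather than the authors' code: feature maps are flattened C × N lists of lists, no extra convolutions are applied inside the branches, and all function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Apply a numerically stable softmax to every row of a matrix."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def add(a, b):
    """Element-wise sum of two matrices with the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def channel_attention(f, beta):
    """C x C attention over channels, followed by a scaled skip connection."""
    x = softmax_rows(matmul(f, [list(c) for c in zip(*f)]))
    weighted = matmul(x, f)
    return [[beta * w + v for w, v in zip(wr, fr)] for wr, fr in zip(weighted, f)]


def position_attention(f, alpha):
    """N x N attention over positions, followed by a scaled skip connection."""
    n = len(f[0])
    f_t = [list(c) for c in zip(*f)]            # N x C
    p = softmax_rows(matmul(f_t, f))            # N x N attention map, p[j][i]
    # Position j receives alpha * sum_i p[j][i] * f[:, i] plus its own feature.
    return [[alpha * sum(p[j][i] * row[i] for i in range(n)) + row[j]
             for j in range(n)] for row in f]


def dual_attention(f_in, f_osa, alpha, beta):
    """Sum both attention branches, then add the block input (residual)."""
    s_out = add(position_attention(f_osa, alpha), channel_attention(f_osa, beta))
    return add(f_in, s_out)


toy = [[1.0, 2.0], [3.0, 4.0]]                  # C = 2 channels, N = 2 positions
out = dual_attention(toy, toy, alpha=0.0, beta=0.0)
```

Because the position map is N × N, its memory and compute grow quadratically with the spatial resolution, which is consistent with the decision in the text to use the attention block only at a lower-resolution stage.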

Multi-Scale Fusion Module
Our investigation of floating algae detection applications shows that Ulva prolifera and Sargassum appear at different sizes in the video due to differences in the viewing angle, focal length, and distance of the cameras. Therefore, it is very important for our model to be capable of extracting floating algae at different scales, as shown in Figure 4. The feature pyramid network (FPN) is commonly adopted by instance segmentation models such as Mask R-CNN and CenterMask to deal with object detection at different scales. However, FPN cannot utilize the inter-level correlation in the feature pyramid efficiently. In this paper, we present a multi-scale fusion module (MSF) consisting of the SEPC block and the FPN block. The SEPC block helps the MSF module to improve scale-invariant feature extraction for floating algae in both the spatial and scale dimensions. The architecture of the MSF module is shown in Figure 5. Comparing the extracted features in Figure 5, it can be seen that the SEPC block improves the extraction of robust scale-invariant features of floating algae. By introducing deformable convolution, the SEPC block can mitigate the blurring effect of features at different scales.
In the feature extraction module, we obtain feature maps at three different scales: $p_3 \in \mathbb{R}^{C_3 \times H_3 \times W_3}$, $p_4 \in \mathbb{R}^{C_4 \times H_4 \times W_4}$, and $p_5 \in \mathbb{R}^{C_5 \times H_5 \times W_5}$. We take $p_4$ as an example to illustrate the calculation process of the SEPC block. The processes of $p_3$ and $p_5$ are the same as that of $p_4$; the only difference in the calculation of $p_3$ is that we use the conv2d operation instead of the Deform conv2d.
The calculation process of $p_4$ can be summarized in (11)~(14), where Deform2d represents the deformable convolutional operation. The three terms that are combined for $p_4$ are the output of Deform2d with stride 2 on $p_3$, the output of the up-sample operation on $p_5$, and the result of Deform2d on $p_4$. Then, the outputs of the SEPC block for $p_3 \sim p_5$ are used as the input of FPN to obtain the feature maps at five different scales, $P_3$, $P_4$, $P_5$, $P_6$, and $P_7$, with $P_k \in \mathbb{R}^{C \times H_k \times W_k}$. The calculations of the above process are formulated in (15)~(19).

Detection Block
Traditional detection networks use pre-defined anchor boxes and predict the class category, center point offset, and width and height scaling of these anchors. However, the definition of anchor boxes depends on a lot of prior knowledge and may not be reasonable.
In [39], an anchor-free framework named fully convolutional one-stage (FCOS) object detection is proposed. By directly predicting, for each pixel, the distances to the top, bottom, left, and right sides of the object, the FCOS network can not only greatly reduce the time and space complexity in the training phase, but can also improve detection accuracy in the testing phase.
We take the outputs of the FPN module, $P_3 \sim P_7$, as the inputs of the FCOS head, which consists of a classification head and a regression head. These two heads include convolution, group normalization, and ReLU operations, as shown in Figure 6. In the classification head, the corresponding classification label is predicted for each position in the current feature map. In this paper, our model predicts the following three categories: Ulva prolifera, Sargassum, and disturbances. In order to reduce the interference of irrelevant objects in the environment (such as ships, sea surface, or seabirds), we define them as a category called 'Disturbances', as shown in Figure 7. In the regression head, two sub-branches are defined to predict the four boundary distance parameters and one center distance parameter. Assume that the coordinates of a point on the original image are $(o_x, o_y)$, and the scale between the current feature map and the original image is defined as $s$. Then, the relation between the regression branch prediction and the original image position can be summarized in (20)~(23):

$$x_{min} = o_x - l \times s \tag{20}$$

$$y_{min} = o_y - t \times s \tag{21}$$

$$x_{max} = o_x + r \times s \tag{22}$$

$$y_{max} = o_y + b \times s \tag{23}$$

where $(x_{min}, y_{min})$ and $(x_{max}, y_{max})$ are the coordinates of the upper-left and lower-right corners of the object bounding box, and $l$, $t$, $r$, and $b$ represent the distances to the left, upper, right, and bottom sides of the object, respectively. In the regression head, a parameter named centerness ∈ (0,1) is also predicted; it measures the distance to the object center, and a higher value means higher proximity to the object center.
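The relation between the regression output and the box corners can be illustrated with a short Python sketch. The box decoding follows the description above; the centerness formula is the one defined in the FCOS paper [39] and is included here as an assumption, since the text does not spell it out. Function names are hypothetical.

```python
import math


def decode_box(o_x, o_y, l, t, r, b, s):
    """Map per-location distances (in feature-map units) to image-space corners.

    (o_x, o_y) is the location on the original image and s is the scale
    between the current feature map and the original image.
    """
    return (o_x - l * s, o_y - t * s, o_x + r * s, o_y + b * s)


def centerness(l, t, r, b):
    """FCOS-style centerness: 1.0 at the box center, approaching 0 near edges."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))


box = decode_box(o_x=100, o_y=80, l=2, t=1, r=3, b=4, s=8)
center_score = centerness(2, 2, 2, 2)
edge_score = centerness(1, 1, 3, 3)
```

In FCOS-style inference, the centerness is typically multiplied with the classification score so that boxes predicted from locations far from the object center are down-weighted before non-maximum suppression.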

Segmentation Block
The Mask R-CNN network [35] uses the ROIAlign method to align bounding boxes at different scales, and this method is improved in CenterMask to enhance the detection accuracy of small targets. The calculations of the ROIAlign level in Mask R-CNN and CenterMask are summarized in (24) and (25), respectively:

$$k = \left\lfloor k_0 + \log_2\!\left(\sqrt{wh} / 224\right) \right\rfloor \tag{24}$$

$$k = \left\lceil k_{max} - \log_2\!\left(F_{input} / F_{box}\right) \right\rceil \tag{25}$$

where the values of $k_0$ and $k_{max}$ are assigned as 4 and 5. The width and height of each bounding box are denoted as $w$ and $h$. $F_{input}$ represents the pixel area of the input image, and $F_{box}$ represents the pixel area of the bounding box. Instead of using the constant value 224 as in Mask R-CNN, CenterMask assigns the ROIAlign pooling scale adaptively through the ratio $F_{input}/F_{box}$, and thus can improve the detection accuracy of floating algae at different scales.
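A small Python sketch can show how the two level-assignment rules behave. The values k_0 = 4 and k_max = 5 come from the text; the clamping bounds (`k_min` and the upper bound for the Mask R-CNN rule) are assumptions taken from the usual FPN and CenterMask settings, and the function names are hypothetical.

```python
import math


def level_mask_rcnn(w, h, k0=4, k_min=2, k_max=5):
    """Mask R-CNN / FPN rule: the level depends on the box size relative to 224."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
    return max(k_min, min(k_max, k))          # clamping range is an assumption


def level_centermask(area_input, area_box, k_max=5, k_min=3):
    """CenterMask rule: the level depends on the ratio of image area to box area."""
    k = math.ceil(k_max - math.log2(area_input / area_box))
    return max(k_min, min(k_max, k))          # k_min = 3 is an assumption


image_area = 960 * 512                        # training resolution used in the paper
full_image_level = level_centermask(image_area, image_area)
quarter_image_level = level_centermask(image_area, image_area / 4)
small_box_level = level_centermask(image_area, 32 * 32)
```

Under these assumptions, a box covering the whole image is pooled from the coarsest level, while small boxes fall back to the finest available level, independent of the absolute input resolution.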
After the ROIAlign operation, we obtain feature maps with the same resolution for inputs at different scales. These feature maps are then fed into the segmentation block to obtain the mask area within the ROI bounding boxes.
As shown in Figure 8, the ROI bounding boxes in the feature maps $P_3 \sim P_5$ with different resolutions are unified to a fixed size of 14 × 14 after the ROIAlign operation. We feed these ROI features into four convolution layers sequentially. Then, a 2 × 2 de-convolution operation is performed to upsample the feature map to a resolution of 28 × 28. After that, a 1 × 1 convolution is used to predict the class-specific output.
Floating algae detection is a multi-class instance segmentation task. However, if the mask score is shared with the box-level classification result of the FCOS head, it can hardly measure the quality and completeness of the instance mask.
We therefore introduce the MaskIoU block in our segmentation pipeline to learn a score value for each mask output instead of sharing the box classification confidence. The process of the MaskIoU block can be summarized as follows, where $C_m$ and $C_r$ denote the channel numbers of the mask feature map and the ROI feature map, and $N_{cls}$ is the number of classes.
(1) A convolutional operation is performed on the mask output $Out_{mask}$ to get the prediction mask feature map $f_{mask} \in \mathbb{R}^{C_m \times 28 \times 28}$. $f_{mask}$ is fed into a max-pooling block to get a downsampled result $f_{pool} \in \mathbb{R}^{C_m \times 14 \times 14}$.
(2) The input ROI feature map $f_{roi} \in \mathbb{R}^{C_r \times 14 \times 14}$ and $f_{pool}$ are concatenated to obtain the fusion result $f_{cat} \in \mathbb{R}^{(C_r + C_m) \times 14 \times 14}$.
(3) Four convolution layers (kernel = 3 and stride = 1, except that the stride of the final convolution is 2 to downsample the feature map to 7 × 7) and two fully connected layers (outputs with 1024 channels) are applied sequentially to $f_{cat}$ to obtain the result $f_{fc} \in \mathbb{R}^{1024 \times 1 \times 1}$.
(4) $f_{fc}$ is fed into task-specific fully connected layers to get the classification score of the current mask, $f_{score} \in \mathbb{R}^{N_{cls}}$.
During the training phase, a binarization operation with threshold 0.5 is performed on the predicted mask $Out_{mask}$ to convert the two-dimensional probability image into the binary image $f_{bin}$. Then, we use the L2 loss between $f_{bin}$ and the ground truth label image to calculate the mask score loss. During the testing phase, we multiply the classification score from the FCOS classification head by the mask classification score from the segmentation head to obtain the final object confidence value.
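The scoring logic of the MaskIoU block can be illustrated with a minimal Python sketch. It follows the Mask Scoring R-CNN formulation [37], in which the regression target is the IoU between the binarized predicted mask and the ground truth mask; treating this as the exact form of the L2 loss used here is an assumption, and the function names are hypothetical.

```python
def binarize(prob_mask, threshold=0.5):
    """Convert a 2-D probability mask into a binary mask."""
    return [[1 if p > threshold else 0 for p in row] for row in prob_mask]


def mask_iou(pred, gt):
    """Intersection over union of two binary masks with the same shape."""
    inter = sum(p & g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    union = sum(p | g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    return inter / union if union else 0.0


def mask_score_loss(predicted_score, prob_mask, gt_mask):
    """L2 loss between the predicted mask score and the measured mask IoU."""
    target = mask_iou(binarize(prob_mask), gt_mask)
    return (predicted_score - target) ** 2


def final_confidence(cls_score, mask_score):
    """Test-time confidence: box classification score times mask score."""
    return cls_score * mask_score


probs = [[0.9, 0.2], [0.6, 0.4]]      # toy 2 x 2 predicted mask probabilities
truth = [[1, 0], [0, 0]]              # toy ground truth mask
iou = mask_iou(binarize(probs), truth)
loss = mask_score_loss(0.7, probs, truth)
confidence = final_confidence(0.8, 0.5)
```

The multiplication at test time lowers the rank of instances whose box is classified confidently but whose mask is predicted to overlap poorly with the object.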

Loss Function
Our loss function consists of the following five parts:

$$loss = w_1 \times loss_{cls} + w_2 \times loss_{center} + w_3 \times loss_{box} + w_4 \times loss_{mask} + w_5 \times loss_{maskiou} \tag{26}$$

where $loss_{cls}$ is the classification loss in the FCOS classification head, $loss_{center}$ and $loss_{box}$ are the centerness loss and box regression loss in the FCOS regression head, $loss_{mask}$ is the average binary cross-entropy loss of the segmentation mask in the segmentation head, and $loss_{maskiou}$ is the L2 loss in the MaskIoU head. $w_1 \sim w_5$ represent the weights of each loss, and their values in this paper are 0.5, 1.0, 1.0, 1.0, and 0.5, respectively.
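As a small worked example, the weighted sum can be written in Python as follows. The weights are the values stated in the text; the dictionary keys are hypothetical labels for the five terms, assigned in the order in which the text lists them.

```python
# Weights from the text, in the order: classification, centerness,
# box regression, mask, MaskIoU. The key names are illustrative only.
LOSS_WEIGHTS = {
    "cls": 0.5,
    "center": 1.0,
    "box": 1.0,
    "mask": 1.0,
    "maskiou": 0.5,
}


def total_loss(losses):
    """Weighted sum of the five loss terms; `losses` maps term name to value."""
    return sum(weight * losses[name] for name, weight in LOSS_WEIGHTS.items())


example = total_loss({"cls": 1.0, "center": 1.0, "box": 1.0, "mask": 1.0, "maskiou": 1.0})
```

With these weights, the classification and MaskIoU terms contribute half as much per unit of loss as the centerness, box regression, and mask terms.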

Dataset
According to the location of the floating algae outbreak areas in the East China Sea in recent years, the data in this paper were collected from surveillance video captured by seven cameras in Nantong and Yancheng in the Jiangsu sea area, from 2020 to 2022. The camera positions are shown in Table 1. We construct our dataset from 3600 images with a resolution of 1920 × 1080 taken from the videos mentioned above. The dataset is divided into three categories: Ulva prolifera, Sargassum, and Disturbances. We adopt 3000 images as the training set and 600 images as the testing set.

Evaluation Metrics
Considering that AlgaeMask is an instance segmentation network, we choose the following metrics to evaluate our model: (1) the mask average precision ($AP_{mask}$); (2) the box average precision ($AP_{box}$); (3) the mask average recall ($AR_{mask}$); (4) the box average recall ($AR_{box}$). The average precision and average recall can be formulated as in (27) and (28):

$$AP = \frac{1}{N} \sum_{i=1}^{N} P_i \tag{27}$$

$$AR = \frac{1}{N} \sum_{i=1}^{N} R_i \tag{28}$$

where $N$ represents the number of samples, $P_i$ denotes the precision value of the $i$-th sample, and $R_i$ represents the recall value of the $i$-th sample. The calculations of $P$ and $R$ are as follows:

$$P = \frac{TP}{TP + FP} \tag{29}$$

$$R = \frac{TP}{TP + FN} \tag{30}$$

where $TP$, $FP$, and $FN$ denote the number of true positives, false positives, and false negatives, respectively.
For segmentation evaluation, we compute $TP$, $FP$, and $FN$ by comparing the predicted mask image with the ground truth label image. For bounding box evaluation, we judge whether the $IoU$ value between the predicted box and the label box is greater than the threshold and compute $TP$, $FP$, and $FN$ accordingly. The $IoU$ is calculated as in (31):

$$IoU = \frac{Area(B_p \cap B_{gt})}{Area(B_p \cup B_{gt})} \tag{31}$$

where $B_p$ and $B_{gt}$ correspond to the predicted box and the ground truth box, $Area(B_p \cap B_{gt})$ represents the area of intersection between $B_p$ and $B_{gt}$, and $Area(B_p \cup B_{gt})$ denotes the area of their union. In order to further evaluate the performance of different methods on targets of different sizes, we also provide the following metrics: $AP_S$, $AR_S$, $AP_M$, $AR_M$, $AP_L$, and $AR_L$. For all of the metrics above, a larger value indicates better network performance. The meanings of the size-specific metrics are as follows:
- $AP_S$, $AR_S$: the average precision or recall for small objects, whose pixel area is less than $32^2$.
- $AP_M$, $AR_M$: the average precision or recall for medium objects, whose pixel area is between $32^2$ and $96^2$.
- $AP_L$, $AR_L$: the average precision or recall for large objects, whose pixel area is greater than $96^2$.
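The evaluation quantities above can be computed directly from counts and box coordinates, as in the following minimal Python sketch. Boxes are assumed to be given as `(x_min, y_min, x_max, y_max)` tuples, the boundary handling of the size buckets follows the common COCO convention, and the function names are hypothetical.

```python
def precision(tp, fp):
    """Fraction of predictions that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Fraction of ground truth objects that are found."""
    return tp / (tp + fn) if tp + fn else 0.0


def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def size_bucket(area):
    """Assign an object to the small / medium / large group by pixel area."""
    if area < 32 ** 2:
        return "small"
    if area <= 96 ** 2:
        return "medium"
    return "large"


def mean(values):
    """Average of per-sample precision or recall values (AP or AR)."""
    return sum(values) / len(values)


iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
ap = mean([precision(8, 2), precision(6, 4)])
```

A predicted box would then count as a true positive when its IoU with a ground truth box of the same category exceeds the chosen threshold.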
In the training phase, in order to enhance robustness against interference in the outdoor detection environment, we adopt data augmentation methods such as random cropping, random brightness, random occlusion, and contrast variation. The batch size and the number of iterations are 8 and 60,000, respectively. The resolution of the images captured by the surveillance cameras is 1920 × 1080, but we use a resolution of 960 × 512 for training and testing in this paper.

Evaluation of Model Performance
In this section, we compare our proposed AlgaeMask with other instance segmentation methods, including Mask R-CNN, Mask Scoring R-CNN, and CenterMask. The performance comparison is presented in Tables 2 and 3. The results show that our proposed network reaches the best precision in both box detection and mask segmentation. Specifically, compared with Mask R-CNN, Mask Scoring R-CNN, and CenterMask, the improvements of AlgaeMask on the two metrics considered in Table 2 are 28.59%, 22.26%, and 15.13% and 43.96%, 37.52%, and 24.24%, respectively; on the two metrics considered in Table 3, they are 26.15%, 24.35%, and 15.65% and 32.59%, 25.68%, and 3.26%, respectively. Additionally, we visualize the predictions under different scenes in Figure 9. In testing samples (1), (2), and (3), it is clear that the networks without the SEPC block, such as Mask R-CNN, Mask Scoring R-CNN, and CenterMask, miss detections more readily. Meanwhile, by introducing the SEPC block into our AlgaeMask network, the detection success rate is greatly improved.
In testing samples (4), (5), and (6), we find that our proposed model has the lowest false detection rate on these complex scenes. This means that by combining the channel attention and position attention blocks in our feature extraction phase, the features of floating algae and of interference in the environment can be effectively distinguished, which enhances the feature extraction ability and anti-interference ability of our AlgaeMask network.
For objects of different sizes in Tables 2 and 3, AlgaeMask also exhibits better AP and AR performance in detecting floating algae. In testing sample (2), many small Ulva prolifera patches are missed by CenterMask, Mask Scoring R-CNN, and Mask R-CNN. For medium algae, all models except ours produce a large number of false detections in samples (1), (3), and (5). In sample (4), from the perspective of the segmentation accuracy and integrity of large algal blooms, our AlgaeMask achieves the best performance.
Additionally, the performance comparison of different networks for all categories is shown in Tables 4 and 5. Generally, the Disturbances category obtains the best performance and the Ulva prolifera category the worst performance in all networks. According to testing samples (2), (4), (5), and (6), due to the lack of a position attention mechanism and SEPC module, CenterMask falsely detects many land objects as Ulva prolifera or Sargassum and misses many obvious small Ulva prolifera patches; Mask R-CNN and Mask Scoring R-CNN have the same problems. Compared with CenterMask on AP, our proposed network achieves 14.06% and 16.35% improvement for the categories of Ulva prolifera and Sargassum, respectively. Meanwhile, on the metric of AR, compared to CenterMask, our proposed AlgaeMask also shows 13.79% and 6.91% improvement.

Ablation Study
In order to evaluate the performance of the proposed AlgaeMask under different factors and settings, the following ablation studies are conducted.

Impact of Input Resolution
During the training and testing phase, we conduct some experiments on our proposed method with 480 × 256, 960 × 512, and 1920 × 1080 as the input resolution in Table 6.With the increase in input resolution, the value of recall increases obviously because the target floating algaes in small blooms account for the majority ratio in our dataset.When the input resolution is increased from 480 × 256 to 960 × 512, our model could achieve 11.39%, 12.66%, 33.98%, and 43.66% improvement on , , , and respectively.It is obvious that the model performance is affected by the resolution of input images greatly.However, the value of precision tends to increase firstly and then decrease with the input resolution increases.When the input resolution is increased from 960 × 512 to 1920 × 1080, the AlgaeMask have 2.09% and 5.27% reduction in and , and have 9.07% and 7.65% increase in the and .This is mainly due to the increase in input resolution helping our model to strengthen its ability to detect small floating algal blooms.However, use of high-resolution images also introduces interference factors in complex environments, which will lead to an increase in false detections.In addition, it will cost more GPU memory resources, and the inference time with an input resolution of 960 × 512 is 2.6× faster than the resolution of 1920 × 1080.
As shown in Table 1, the application of AlgaeMask must process seven channels of video simultaneously. Therefore, a resolution of 960 × 512 is adopted as the input scale for the experiments in this paper, which meets the real-time requirements of floating algae detection.
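The link between per-frame inference time and the seven-channel requirement can be sketched as a simple throughput budget. Only the number of channels (seven) and the 2.6× speed ratio come from the text. The per-frame latency and the required frame rate per channel are hypothetical values, and the sketch assumes one GPU serving the channels in turn without batching.

```python
NUM_STREAMS = 7   # seven video channels, as stated in the text
SPEEDUP = 2.6     # 960x512 vs. 1920x1080, as stated in the text

def fps_per_stream(latency_ms, num_streams):
    """Frame rate each stream receives when frames are processed one by one."""
    total_fps = 1000.0 / latency_ms
    return total_fps / num_streams

def meets_budget(latency_ms, num_streams, required_fps):
    """True if every stream can be served at the required frame rate."""
    return fps_per_stream(latency_ms, num_streams) >= required_fps

latency_mid_ms = 50.0                         # hypothetical latency at 960x512
latency_high_ms = latency_mid_ms * SPEEDUP    # implied latency at 1920x1080
required_fps = 2.0                            # hypothetical per-stream requirement

print(meets_budget(latency_mid_ms, NUM_STREAMS, required_fps))
print(meets_budget(latency_high_ms, NUM_STREAMS, required_fps))
```

Under these assumed numbers the lower resolution fits the budget and the higher one does not. The actual margin depends on the measured latency of the deployed model.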

Impact of SEPC Block
In this subsection, we discuss the impact of the SEPC block in AlgaeMask. The results are shown in Figure 10. For the Sargassum category, the network with SEPC shows improvements of 6.56% and 5.36% on the two reported metrics. For the Ulva prolifera category, the network with SEPC shows improvements of 3.84% and 4.35%. The network with SEPC clearly attains better performance in all categories. In practical applications, the camera must be installed on land or on other supports, and it takes pictures from different viewpoints. This leads to large scale changes in the image: floating algae at long distance are usually small and blurry, while floating algae at close range are large and clear. Therefore, the SEPC block, which effectively utilizes the invariant features of floating algae at different sizes, is very important for our network.
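The idea of fusing neighbouring pyramid levels, on which pyramid convolution is based, can be sketched on a toy example. This is a simplified illustration under strong assumptions: feature maps are 1-D lists, scalar weights stand in for learned convolution kernels, and the deformable convolution used by SEPC is omitted. The function name `pyramid_conv` and its parameters are hypothetical.

```python
def pyramid_conv(levels, w_fine=1.0, w_same=1.0, w_coarse=1.0):
    """Fuse each 1-D pyramid level with its finer and coarser neighbours.

    levels[0] is the finest map; every following level has half the length.
    """
    fused = []
    for i, x in enumerate(levels):
        out = [w_same * v for v in x]
        if i > 0:
            # Stride-2 sampling brings the finer level down to this resolution.
            finer = levels[i - 1][::2]
            out = [o + w_fine * f for o, f in zip(out, finer)]
        if i + 1 < len(levels):
            # Nearest-neighbour upsampling brings the coarser level up.
            coarser = [v for v in levels[i + 1] for _ in range(2)]
            out = [o + w_coarse * c for o, c in zip(out, coarser)]
        fused.append(out)
    return fused

# Three levels with constant activations 1, 2, and 4.
pyramid = [[1.0] * 8, [2.0] * 4, [4.0] * 2]
fused = pyramid_conv(pyramid)
for level in fused:
    print(level)
```

Each output level keeps its own resolution but now also carries information from the adjacent scales, which is the property that helps when the same kind of target appears both small (far away) and large (close to the camera).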

Impact of Dual-Attention Block
Figure 11 shows the impact of removing the dual-attention block. For the Ulva prolifera category, the network with the dual-attention block shows improvements of 16.09% and 13.93% on the two reported metrics. For the Sargassum category, the model with the dual-attention block shows improvements of 13.13% and 11.52%. In contrast to the SEPC block, which only considers the invariant features of floating algae within the same category, the dual-attention block learns both the correlation of features at different scales, through the channel attention mechanism, and the context of features at different spatial positions, through the position attention mechanism. These abilities are important because the surveillance environment is complex and diverse. With the channel attention mechanism, the features of Ulva prolifera or Sargassum can be extracted effectively in the feature extraction phase. With the position attention mechanism, similar-looking targets that do not float on the sea surface are eliminated.
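The two attention branches can be illustrated with a generic residual self-attention applied once over spatial positions and once over channels. This is a minimal sketch and not necessarily the exact block used in AlgaeMask: the learned projection layers are omitted, the two branch outputs are assumed to be summed, and the scale factors `alpha` and `beta` are hypothetical fixed values rather than learned parameters.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(rows, scale):
    """Residual self-attention over the rows of a matrix (list of lists)."""
    out = []
    for q in rows:
        # Dot-product affinity between this row and every row.
        weights = softmax([sum(a * b for a, b in zip(q, k)) for k in rows])
        # Attention-weighted sum of all rows.
        context = [sum(w * k[d] for w, k in zip(weights, rows))
                   for d in range(len(q))]
        out.append([x + scale * c for x, c in zip(q, context)])
    return out

def transpose(m):
    return [list(col) for col in zip(*m)]

def dual_attention(features, alpha=0.5, beta=0.5):
    """features: N positions x C channels. Returns the sum of both branches."""
    position = self_attention(features, alpha)                      # rows = positions
    channel = transpose(self_attention(transpose(features), beta))  # rows = channels
    return [[p + c for p, c in zip(pr, cr)]
            for pr, cr in zip(position, channel)]

features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # 3 positions, 2 channels
fused = dual_attention(features)
print(fused)
```

In the position branch every location is updated with a weighted sum over all other locations, so similar regions reinforce each other regardless of distance. In the channel branch the same operation runs on the transposed matrix, so correlated channels reinforce each other.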

Impact of OSA Block
In this subsection, we discuss the impact of using OSA and OSA-V2 in Stage 1 and Stage 2 of the feature extraction module.
Starting from CenterMask and AlgaeMask, we replace the OSA-V2 block with the original OSA block in Stage 1 and Stage 2 of CenterMask (OSA-V2) to generate the results of CenterMask (OSA). Meanwhile, we replace the OSA block with the OSA-V2 block in Stage 1 and Stage 2 of our proposed AlgaeMask (OSA+OSA-V3) to generate the results of AlgaeMask (OSA-V2).
Table 7 shows that CenterMask (OSA-V2) and AlgaeMask (OSA-V2) perform worse on the AP metric than CenterMask (OSA) and AlgaeMask (OSA+OSA-V3). Compared with AlgaeMask (OSA-V2) on small targets, AlgaeMask (OSA+OSA-V3) shows increases of 2.17% and 1.25% in the two corresponding AP metrics, respectively. In Figure 12, compared to CenterMask (OSA-V2), false detections of small targets in CenterMask (OSA) are reduced. Meanwhile, compared to our AlgaeMask (OSA+OSA-V3), false detections of small targets in AlgaeMask (OSA-V2) are increased. In floating algae detection, tiny targets carry less and simpler feature information, so their features can be effectively extracted in the first few layers. However, due to the average pooling operation in the eSE block, the features of tiny algae in different channels may interfere with each other, leading to false detections. Meanwhile, when some algae are present at large scale, the features of small algae may be missed or submerged, resulting in missed detections.
In summary, we choose the original OSA block instead of the OSA-V2 as the backbone of AlgaeMask in Stage 1 and Stage 2 of the feature extraction module.
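The argument that average pooling can submerge tiny targets can be made concrete with a small numeric sketch. It models only the squeeze (global average pooling) step of a squeeze-and-excitation style block on a single channel. The map size, target sizes, and the scalar gate parameters `weight` and `bias` are hypothetical, and the channel-mixing layer of the eSE block is replaced here by a per-channel scalar for simplicity.

```python
import math

def global_average_pool(channel_map):
    """Mean activation of one H x W channel (the squeeze step)."""
    total = sum(sum(row) for row in channel_map)
    return total / (len(channel_map) * len(channel_map[0]))

def make_map(size, target_size):
    """size x size map containing a target_size x target_size block of ones."""
    return [[1.0 if r < target_size and c < target_size else 0.0
             for c in range(size)] for r in range(size)]

def channel_gate(descriptor, weight=1.0, bias=0.0):
    """Sigmoid gate on the pooled descriptor; weight and bias are hypothetical."""
    return 1.0 / (1.0 + math.exp(-(weight * descriptor + bias)))

tiny = global_average_pool(make_map(64, 2))     # 2x2 target on a 64x64 map
large = global_average_pool(make_map(64, 32))   # 32x32 target on a 64x64 map
empty = global_average_pool(make_map(64, 0))    # no target at all

print(tiny, large, empty)
print(channel_gate(tiny) - channel_gate(empty))
print(channel_gate(large) - channel_gate(empty))
```

The pooled descriptor of the tiny target is almost indistinguishable from that of an empty map, so the resulting channel gate barely reacts to it, whereas a large target shifts the gate noticeably. This is consistent with keeping the plain OSA block, without the pooled gate, in the early stages where tiny targets are extracted.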

Conclusions
Floating algae detection plays an important role in marine environment monitoring. This study is the first to apply an instance segmentation method to floating algae detection. A dataset consisting of multiple marine scenes was built to compare the performance of different instance segmentation networks. The detection precision and time consumption under different input resolutions were also discussed to further verify the practical applicability of our proposed network. In this work, we propose a new instance segmentation framework named AlgaeMask for floating algae detection.
A new feature extraction module based on OSA and a dual-attention mechanism is proposed. The dual-attention block integrates position attention and channel attention to effectively capture the long-range position and contextual information of floating algae. In floating algae detection, a strong correlation was found between floating algae and interference factors in the environment. Therefore, it is very important that the network can learn the spatial position correlation of targets, and the dual-attention mechanism meets this requirement well. Meanwhile, to reduce the computation cost of the attention block, we only applied the dual-attention block to the last layer of the feature extraction module.
In addition, considering the features of floating algae at different distances from the camera, a multi-scale fusion module was introduced to capture the inter-scale correlation of the feature pyramid. In the feature decoder module, the FCOS head and segmentation head were introduced to accurately obtain the segmentation area of the algae in every detection bounding box. Extensive experimental results show that AlgaeMask achieves better detection accuracy at a lower time cost than all compared instance segmentation methods, satisfying the real-time needs of floating algae detection.
Due to the limited amount of marine environment data, our model did not take the interference of bad weather, reflections, and shadows on the ocean surface into account. In future studies, we will further analyze the performance of deep learning methods in different complex marine scenes (e.g., rain, fog, and reflections) to enhance the robustness of the floating algae detection network.

Figure 1. Video surveillance on the ship. The non-English text is overlaid by the cameras and shows the date, the camera manufacturer logo, and the location name of each camera.

Figure 4. Samples of floating algae at different scales.

Figure 5. Architecture of the multi-scale fusion module.

Figure 6. Overview of the detection block based on FCOS head.

Figure 7. Samples of the Disturbances category.

Figure 8. Flowchart of the segmentation block.

Figure 9. Visualization of different models on testing samples. (a) Source image. (b) Ground truth. (c) AlgaeMask model. (d) CenterMask model. (e) Mask Scoring R-CNN model. (f) Mask R-CNN model. Light green represents the Ulva prolifera category, yellow represents the Sargassum category, and pink represents the Disturbances category.

Figure 10. Box and segmentation results with/without SEPC block. (a) Box results of AP and AR with/without SEPC block. (b) Segmentation results of AP and AR with/without SEPC block.

Figure 11. Box and segmentation results with/without dual-attention block. (a) Box results of AP and AR with/without dual-attention block. (b) Segmentation results of AP and AR with/without dual-attention block.

Table 1. Camera position for surveillance adopted in our paper.

Table 2. Comparison of experimental results on average precision (AP) by different methods.

Table 3. Comparison of experimental results on average recall (AR) by different methods.

Table 4. Comparison of experimental results on AP by different methods for each category.

Table 5. Comparison of experimental results on AR by different methods for each category.

Table 6. Ablation study on the impact of input resolution on AP and AR.

Table 7. Ablation study on the impact of OSA block on AP.