1. Introduction
Large-scale outbreaks of floating algae have recently become a worldwide ecological problem for the marine environment. The uncontrolled growth of floating algae blocks sunlight at the sea surface and competes with other marine life for nutrients, which affects the marine ecological structure. The decay of large quantities of seaweed severely degrades seawater quality, and seaweed piled up along the coast damages the natural environment of beaches [1]. Disaster events involving floating macroalgae therefore hinder the protection and development of marine resources, and have a significant impact on economic sectors such as aquaculture, fishery, maritime transportation and coastal tourism [2,3,4].
In recent years, the frequent outbreaks of green tide and gold tide in the East China Sea have been caused by two categories of floating algae, Ulva prolifera and Sargassum. These outbreaks are characterized by short, explosive cycles and a large area of influence [5,6,7]. It is therefore necessary to conduct real-time monitoring and early warning for floating algae. Furthermore, accurately extracting and distinguishing floating algae could provide a reliable basis for the analysis, prevention and comprehensive management of such disasters, reducing both economic costs and environmental damage. Accordingly, a variety of platforms, such as remote-sensing satellites [8,9,10,11,12,13,14,15,16], synthetic-aperture radar (SAR) [17,18,19], unmanned aerial vehicles (UAVs) [20,21,22,23,24], and surveillance cameras on ships and shores [25,26,27], have been applied to monitor floating algae.
In the past decades, many floating-algae detection methods based on traditional image-processing algorithms have been studied. These methods, which rely on threshold segmentation and image transformation, offer effective feature extraction at low computing cost. Pan et al. proposed a two-step method, based on spectral unmixing, to reduce the computational complexity of N-FINDR; this model could estimate the green-algae area efficiently from Geostationary Ocean Color Imager (GOCI) data [28]. Wang et al. quantified the area coverage of Sargassum from Moderate Resolution Imaging Spectroradiometer (MODIS) data in the western central Atlantic, using the alternative floating algae index (AFAI) to detect the red-edge reflectance of floating algae [29]. Wang et al. introduced a trainable, nonlinear reaction-diffusion denoising framework to handle non-algae interference, and used the Floating Algae Index (FAI) to detect Sargassum from MSI images [30]. Ody et al. introduced a standard to describe Sargassum aggregations of various sizes, and compared the ability to detect Sargassum between in situ and remote-sensing observations (MODIS, VIIRS, OLCI and MSI) in the northern Tropical Atlantic [31]. Shen et al. proposed new index factors based on the polarimetric characteristics of green tides in both the amplitude and phase domains, to detect green-macroalgae blooms from quad-pol RADARSAT-2 SAR images [32]. Ma et al. monitored the spatiotemporal trend of Ulva prolifera in the Yellow Sea in 2021, based on SAR and MODIS data [33]. Using the RGB cameras on ship-borne UAVs, Jiang et al. proposed a new floating algae index, which calculates the difference between the green reflectance and a baseline drawn between the red and blue bands, to extract green tide in the Yellow Sea [34]. Xu et al. introduced the normalized green-red difference index to detect the initial biomass of green tide from UAV and Sentinel-2A images [35].
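The index-based methods above share a simple per-pixel form, which can be sketched as follows. This is a minimal illustration, not the implementation used in any of the cited works. The function names are hypothetical, the normalized green-red difference index is written in its usual form (G - R) / (G + R), and the band-centre wavelengths in the usage lines are assumptions chosen for illustration (645, 859 and 1240 nm are the MODIS bands commonly quoted for FAI; the RGB values are nominal placeholders).

```python
def ngrdi(red, green):
    """Normalized green-red difference index: (G - R) / (G + R)."""
    denom = green + red
    # Guard against division by zero for completely dark pixels.
    return 0.0 if denom == 0 else (green - red) / denom


def baseline_index(left, mid, right, wl_left, wl_mid, wl_right):
    """Height of the middle band above the straight line that joins
    the two outer bands, evaluated at the middle wavelength."""
    slope = (right - left) / (wl_right - wl_left)
    baseline = left + slope * (wl_mid - wl_left)
    return mid - baseline


# FAI-style use: NIR reflectance above the red-to-SWIR baseline
# (assumed MODIS band centres in nm).
fai = baseline_index(0.02, 0.10, 0.01, 645.0, 859.0, 1240.0)

# RGB use: green reflectance above the blue-to-red baseline
# (assumed nominal camera band centres in nm).
rgb_index = baseline_index(0.10, 0.50, 0.30, 450.0, 550.0, 650.0)
```

A pixel would then be labeled as algae when its index exceeds a threshold; the threshold depends on the sensor and scene and is not specified here.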
The traditional image-processing methods for floating-algae segmentation require a complex process of hand-crafted feature design, and cannot segment accurately under different weather conditions (such as sunny, cloudy or windy). With the recent progress of deep learning, convolutional neural networks (CNNs), generative adversarial networks (GANs) and transformers have been successfully applied to image segmentation [36] and object detection [37], as they can extract useful features automatically. Consequently, extensive research has been conducted on floating-algae detection using deep-learning methods in recent years [38,39,40,41,42,43]. Working with UAV hyperspectral imagery, Hong et al. used four different CNN frameworks (ResNet-18, ResNet-101, GoogLeNet, and Inception v3) to estimate the vertical distributions of harmful algae, and used gradient-weighted class-activation mapping to identify the representative features [44]. Wang et al. proposed a green-tide detection framework for high-resolution UAV images, based on a binary CNN model and superpixels extracted via energy-driven sampling (SEEDS) [45]. To monitor the distribution of Sargassum in the French Caribbean, Valentini et al. applied the MobileNet-V2 architecture to super-pixel regions extracted from images captured by smartphone cameras [26]. A Sargassum-detection method based on a CNN and an RNN was introduced for MODIS imagery collected along the Mexican Caribbean coastline [46]. To overcome the lower detection limit of different sensors and the complex environment, Wang et al. introduced the VGGUnet network to extract Sargassum macroalgae from high-resolution satellite images acquired by multiple sensors such as MSI, OLI, WV and DOVE [47]. Cui et al. combined a U-Net-based semantic-segmentation network with super-resolution reconstruction to extract green tides accurately from low-resolution MODIS images, with the network pre-trained on high-resolution GF1-WFV data [48]. Gao et al. used U-Net to design a green-algae detection framework for both MODIS and SAR data in the Yellow Sea [49]. Jin et al. proposed a GAN with a squeeze-and-excitation attention mechanism to detect green tide at different scales automatically from MODIS images in the Yellow Sea [50]. Arellano-Verdejo et al. introduced the Pix2Pix GAN semantic-segmentation architecture to map Sargassum coverage from photographs acquired by mobile devices along beaches [27]. Song et al. proposed a cyanobacteria-detection model based on a transformer network, to extract the boundary of cyanobacterial blooms accurately in complex environments from UAV multispectral images [51].
Object-detection methods locate and classify objects in images, but do not classify individual pixels. Faster R-CNN, a classic object-detection model, has achieved good results in high-precision, multi-scale and small-target detection [52]. YOLOv7 is a current state-of-the-art network that surpasses many known object detectors in both speed and accuracy [53]. Nevertheless, these object-detection methods cannot obtain the accurate boundary of each object. Semantic-segmentation methods classify each pixel by category and thus obtain an accurate mask boundary. The sparse-token transformer (STT), a high-quality building-extraction method, represents buildings as a set of 'sparse' feature vectors in their feature space by introducing a 'sparse-token sampler', which reduces the computational complexity of the transformer [54]. A simple segmentation method that works even when only a few training images are provided has also been proposed, and can serve as an instrument for semantic segmentation when labeled data are scarce [55]. However, semantic-segmentation methods still have shortcomings when dealing with objects of the same category: they separate targets of different categories but cannot distinguish individual algae patches within the same category. Instance-segmentation methods integrate object detection and semantic segmentation into a unified architecture, which can simultaneously detect and segment individual objects of the same category [56]. It is therefore meaningful to detect floating algae with instance-segmentation algorithms.
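The difference between a semantic mask and an instance labeling can be made concrete with a small sketch. The code below is only an illustration of the two output types, not how Mask R-CNN or any cited network separates instances: it assumes a binary "algae" mask and splits it into individual patches by 4-connected component labeling, so two patches that touch would be merged into one. The function name `label_instances` is hypothetical.

```python
from collections import deque


def label_instances(mask):
    """Assign a distinct positive id to each 4-connected foreground
    region of a binary mask given as a list of rows of 0/1."""
    height, width = len(mask), len(mask[0])
    labels = [[0] * width for _ in range(height)]
    count = 0
    for r in range(height):
        for c in range(width):
            if mask[r][c] and not labels[r][c]:
                # Unvisited foreground pixel: start a new instance.
                count += 1
                labels[r][c] = count
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        inside = 0 <= ny < height and 0 <= nx < width
                        if inside and mask[ny][nx] and not labels[ny][nx]:
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count


# Semantic view: every algae pixel carries the same class value 1.
semantic_mask = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
]
# Instance view: the two separate patches receive ids 1 and 2.
instance_labels, num_instances = label_instances(semantic_mask)
```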
As a pioneering instance-segmentation method, Mask R-CNN [57] has achieved great success in the field of image recognition. However, because the resolution of the feature maps fed into the mask-segmentation stage is very low, it is difficult for Mask R-CNN to obtain accurate mask boundaries of floating algae. Mask Transfiner detects and refines potentially mis-segmented pixels by building a multi-level hierarchical point quadtree, and applies a multi-head self-attention block to predict highly accurate segmentation masks [58]. However, the ResNet backbone in Mask Transfiner struggles to cope with the complex and changeable environment on the sea surface.
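The idea of a hierarchical quadtree that concentrates effort on uncertain pixels can be sketched as follows. This is a simplified stand-in, not Mask Transfiner itself: the real model detects error-prone points with a learned detector on feature maps, whereas this sketch assumes a square binary mask whose side is a power of two and simply marks a block as needing refinement when it contains both foreground and background. The function names are hypothetical.

```python
def build_quadtree(mask, row=0, col=0, size=None):
    """Recursively split blocks that mix foreground and background."""
    if size is None:
        size = len(mask)
    values = {mask[r][c]
              for r in range(row, row + size)
              for c in range(col, col + size)}
    node = {"row": row, "col": col, "size": size,
            "mixed": len(values) > 1, "children": []}
    if node["mixed"] and size > 1:
        half = size // 2
        # Only mixed blocks are subdivided into four quadrants.
        for dr in (0, half):
            for dc in (0, half):
                node["children"].append(
                    build_quadtree(mask, row + dr, col + dc, half))
    return node


def mixed_blocks_by_size(node, found=None):
    """Collect the top-left corners of mixed blocks, grouped by block size."""
    found = {} if found is None else found
    if node["mixed"]:
        found.setdefault(node["size"], []).append((node["row"], node["col"]))
    for child in node["children"]:
        mixed_blocks_by_size(child, found)
    return found


coarse_mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
tree = build_quadtree(coarse_mask)
to_refine = mixed_blocks_by_size(tree)
```

In this example only the lower-left quadrant straddles the mask boundary, so it is the only block subdivided further; uniform regions are left alone, which is what keeps a quadtree-based refinement cheap.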
In this paper, we propose a high-quality instance-segmentation network for floating-algae detection named AlgaeFiner, which is based on the Mask Transfiner model. To improve the anti-interference ability of ResNet in complex marine scenes, the coordinate attention (CA) [59] mechanism is introduced into the ResNet structure, to capture long-range dependencies in both the channel and position dimensions. CA is an embeddable block: it cannot extract features directly, but it can learn the internal relationships between features. The CA mechanism is therefore an auxiliary module that only works together with other feature extractors. Compared with other attention mechanisms, CA improves model performance at lower computational complexity, which makes it well suited to the real-time requirements of the floating-algae detection task. In addition, considering the huge scale differences among floating algae, the Multi-scale Bi-directional Feature-Pyramid Network (Ms-BiFPN) is proposed, based on the Bi-directional Feature-Pyramid Network (BiFPN) [60]. The Ms-BiFPN makes full use of the information at different scales and reduces the information loss in the highest-level feature maps, by integrating adaptive spatial fusion (ASF) [61] before the max-pooling operation in BiFPN. Finally, we adopt surveillance RGB images captured by cameras on ships and shores as the input to our model, which offers the advantages of a lower deployment cost and a wide monitoring range for floating-algae detection. The main contributions of AlgaeFiner are as follows:
(1) A novel backbone named CA-ResNet is proposed, to enhance the robustness of the model by integrating the coordinate-attention mechanism into the ResNet structure.
(2) The Ms-BiFPN is proposed, to efficiently utilize the responses at different levels in the feature pyramid by introducing the BiFPN structure; the feature information-loss can be reduced by integrating the ASF module before the max-pooling operation.
(3) A transformer-based method named Mask Transfiner is introduced to improve the segmentation quality of floating algae.
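To make the coordinate-attention step in contribution (1) more concrete, the following sketch follows the published CA block structure: average pooling along each spatial direction, a shared 1x1 reduction with a nonlinearity, one 1x1 convolution per direction, and sigmoid gates that re-weight the input. It is a simplified illustration rather than the CA-ResNet implementation: it uses plain nested lists, omits batch normalization, and takes the weight matrices `w_reduce`, `w_h` and `w_w` as given inputs instead of learned parameters. All names are hypothetical.

```python
import math


def hswish(v):
    """Hard-swish nonlinearity: v * ReLU6(v + 3) / 6."""
    return v * min(max(v + 3.0, 0.0), 6.0) / 6.0


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def conv1x1(weights, feats):
    """1x1 convolution as channel mixing at each position.
    weights: [C_out][C_in], feats: [C_in][L] -> [C_out][L]."""
    length = len(feats[0])
    return [[sum(wt * feats[ci][p] for ci, wt in enumerate(row))
             for p in range(length)]
            for row in weights]


def coordinate_attention(x, w_reduce, w_h, w_w):
    """x: [C][H][W] nested lists; returns a re-weighted tensor of the same shape."""
    chans, height, width = len(x), len(x[0]), len(x[0][0])
    # Directional pooling: mean over the width, then mean over the height.
    z_h = [[sum(x[c][i]) / width for i in range(height)] for c in range(chans)]
    z_w = [[sum(x[c][i][j] for i in range(height)) / height
            for j in range(width)] for c in range(chans)]
    # Concatenate both directions and apply the shared reduction.
    z = [z_h[c] + z_w[c] for c in range(chans)]
    f = [[hswish(v) for v in row] for row in conv1x1(w_reduce, z)]
    # Split back into the two directions and compute one gate per direction.
    f_h = [row[:height] for row in f]
    f_w = [row[height:] for row in f]
    g_h = [[sigmoid(v) for v in row] for row in conv1x1(w_h, f_h)]
    g_w = [[sigmoid(v) for v in row] for row in conv1x1(w_w, f_w)]
    # Each position is scaled by its row gate and its column gate.
    return [[[x[c][i][j] * g_h[c][i] * g_w[c][j] for j in range(width)]
             for i in range(height)] for c in range(chans)]
```

Because the two gates are indexed by row and by column respectively, each output position is modulated by statistics gathered along its whole row and its whole column, which is how the block encodes position information alongside channel information.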
The rest of the paper is organized as follows. Section 2 describes the implementation details of AlgaeFiner, consisting of the encoder module, the region-proposal network, the Mask Transfiner module, the decoder module, the loss function and the dataset. Section 3 presents extensive experimental results and the analysis of different methods, including evaluation metrics, experimental setups, the main results and the ablation study. The discussion is presented in Section 4 and the conclusion is summarized in Section 5.
4. Discussion
AlgaeFiner, our deep-learning model for floating-algae detection, integrates the object-detection task and the mask-segmentation task into a unified network, which not only outputs the bounding box of each target but also obtains an accurate segmentation area for each target within the same category. To reduce the interference of the complex marine environment, a feature-extraction backbone named CA-ResNet is proposed, which integrates the coordinate-attention mechanism into the conventional ResNet structure. The long-range dependencies between different channels and positions can be modeled simultaneously, which effectively reduces false segmentation. Meanwhile, the computational performance of CA-ResNet also satisfies the real-time requirements of marine monitoring. The Ms-BiFPN module is proposed to handle large variations in target scale, by embedding a multi-scale block into the BiFPN structure. The ASF block is integrated into the multi-scale module to extract more spatial-context information from the highest-level feature maps. The detection performance for large algae is greatly improved by replacing the conventional BiFPN structure with the Ms-BiFPN structure. The transformer-based module named Mask Transfiner is introduced to improve the boundary-segmentation quality of AlgaeFiner. A multi-level hierarchical point quadtree is built to detect potentially mis-segmented pixels in the coarse masks at different scales, and a multi-head self-attention block is applied to predict the final refined mask labels.
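As a small illustration of how a BiFPN-style layer combines feature maps from different levels, the sketch below implements the fast normalized fusion rule described for BiFPN in the EfficientDet work: each input map receives a non-negative scalar weight, and the weighted sum is divided by the sum of the weights plus a small epsilon. Whether Ms-BiFPN keeps exactly this rule is an assumption, and the ASF block and the resizing of maps to a common resolution are not shown. The function name is hypothetical.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse equally sized 2D feature maps with non-negative scalar weights.
    features: list of maps (lists of rows); weights: one scalar per map."""
    # ReLU keeps every weight non-negative so the result stays a
    # bounded weighted average of the inputs.
    w = [max(0.0, v) for v in weights]
    total = sum(w) + eps
    height, width = len(features[0]), len(features[0][0])
    return [[sum(wi * f[r][c] for wi, f in zip(w, features)) / total
             for c in range(width)]
            for r in range(height)]


# Two maps assumed to be already resized to the same resolution.
top_down = [[1.0, 2.0], [3.0, 4.0]]
lateral = [[3.0, 2.0], [1.0, 0.0]]
fused = fast_normalized_fusion([top_down, lateral], [1.0, 1.0])
```

Compared with a softmax over the weights, this normalization avoids the exponential and is therefore cheaper, which matters when the fusion is repeated at every pyramid level.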
The experimental results show that AlgaeFiner achieves relatively good performance for floating-algae segmentation under different real application scenarios at sea. We also take the interference of bad weather into account in evaluating our model, although this is not the focus of our work. By collecting and analyzing surveillance-video data from ships and shores in recent years, we identify the following situations in the floating-algae detection task: rain, sun reflection, cloud, fog and waves. The results of AlgaeFiner in these environments are shown in Figure 22 and Figure 23. In Figure 22d and Figure 23f,h, it is clear that AlgaeFiner performs well in some bad-weather scenes in which the floating algae are clearly visible.
However, in Figure 22b, the detection performance of AlgaeFiner declines (missed detections) because raindrops blur the camera lens. On the one hand, the rain occlusion leads to information loss in the feature-extraction phase, which cannot be solved by our proposed feature-extraction structure (CA-ResNet). On the other hand, although the Ms-BiFPN improves the detection ability for small targets, AlgaeFiner will still miss floating algae when the target features are indistinct and strong interference is present in the scene at the same time.
In Figure 22f,h and Figure 23b,d, although AlgaeFiner detects the floating-algae targets, the segmented areas are still incomplete and inaccurate. In these scenes, the feature responses extracted by our network are consistently weak, and the Mask Transfiner module cannot effectively improve the accuracy and completeness of the mask-segmentation results under such weak feature responses.
In summary, although our method considers the influence of different environmental factors, it still has shortcomings in dealing with bad weather, owing to the limited amount of marine-environment data. In future work, we will first explore the robustness of our model in marine scenes under different weather conditions. We will also investigate the possibility of introducing spectral data into marine-pollution monitoring.