Article

Transformer-Driven Algal Target Detection in Real Water Samples: From Dataset Construction and Augmentation to Model Optimization

1
School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
2
SANJIANG ZHAOYUAN (SHANGHAI) BIOTECHNOLOGY Co., Ltd., Shanghai 200240, China
3
Goujian Eco-Science Technology (Nanjing) Co., Ltd., Nanjing 211113, China
*
Author to whom correspondence should be addressed.
Water 2025, 17(3), 430; https://doi.org/10.3390/w17030430
Submission received: 18 December 2024 / Revised: 27 January 2025 / Accepted: 29 January 2025 / Published: 4 February 2025

Abstract
Algae are vital to aquatic ecosystems, with their structure and abundance influencing ecological health. However, automated detection in real water samples is hindered by complex backgrounds, species diversity, and size variations. Traditional methods are costly and species-specific, which has motivated the adoption of deep learning; however, current studies rely on CNN-based models and limited datasets. To improve the detection accuracy of multiple algal species against real, complex backgrounds, this study collected multi-species algae samples from actual water environments and implemented an integrated Transformer-based framework for automated localization and recognition of small, medium, and large algae species. Specifically, algae samples from five different regions were collected to construct a comprehensive dataset containing 25 algal species with diverse backgrounds and rich category diversity. To address dataset imbalance among minority species, a segmentation-fusion data augmentation method was proposed, which enhanced performance across the YOLO, Faster R-CNN, and Deformable DETR models, with YOLO achieving a 7.1% precision increase and a 1.5% mAP improvement. Model optimization focused on an improved Deformable DETR, incorporating multi-scale feature extraction, deformable attention mechanisms, and the normalized Wasserstein distance loss function. These improvements enhanced small-target and overlapping-object detection, achieving a 10.4% mAP increase at an intersection over union (IoU) threshold of 0.5 and outperforming the unmodified Deformable DETR.


1. Introduction

Algae are crucial to aquatic ecosystems, with changes in their community structure and abundance directly affecting the ecological health of water bodies. As primary producers, they provide energy through photosynthesis and play key roles in biogeochemical cycles [1]. However, harmful algal growth, particularly cyanobacteria, can trigger eutrophication of water bodies, leading to the death of aquatic animals due to lack of oxygen and deterioration of water quality, and even jeopardizing drinking-water supply, thus posing a serious threat to aquatic organisms and human health [2]. Thus, timely identification and monitoring of algal dynamics are vital for preventing harmful algae blooms and maintaining water ecological balance.
In natural aquatic ecosystems, algae distribution is highly heterogeneous, and their biomass and spatial distribution are influenced by factors like light, temperature, and nutrient concentrations [3]. Algae vary significantly in size, from small Chlorophyta (<10 microns) to larger diatoms (up to 200 microns), posing challenges for detection. The diversity of species and sizes requires detection algorithms with high resolution and generalization capabilities [4]. Additionally, algae’s overlapping and aggregation in microscopic images complicate detection, demanding sophisticated feature extraction and differentiation to accurately identify algae in complex backgrounds [5].
Traditional algae detection methods involve physical and chemical approaches that use microscopes and flow cytometers to detect intracellular substances (e.g., cytoplasm, DNA, chlorophyll) or gather indirect information through secretions [5]. These methods are often cumbersome and costly [5]. Hou et al. [6] developed a smartphone-integrated microfluidic chip for detecting live algae via chlorophyll fluorescence, using algorithms to count algae and distinguish their viability. However, such methods are expensive and limited in the species they can detect. To reduce costs, recent studies have adopted deep-learning approaches for algae detection [7].
Guo et al. proposed a texture-enhanced GA-Net model that uses Sentinel-1 SAR images to detect algae-inhabited water bodies from a macroscopic perspective, but it cannot recognize individual algal species [8]. Ruiz-Santaquiteria et al. applied the deep learning segmentation method Mask R-CNN to segment diatom images, achieving an average precision of 85% for instance segmentation [9]. Liu used principal component analysis to classify visually similar marine microalgae [10]; however, this method did not provide an integrated process for both localization and recognition of algal targets.
Qian et al. designed a deep learning framework based on an improved Faster R-CNN model and 1859 microscopic algae images, employing data augmentation to address dataset imbalance and ultimately detecting nine algae categories [11]. The YOLO model has also been applied to algae detection: Abdullah et al. compared different YOLO versions on a dataset of 400 algal micrographs, classifying four algae species [12]. Lin et al. applied an enhanced YOLOv5 for algae detection in the IEEE UV 2022 challenge, using 967 microscopic images of eight algae species [7]. However, the dataset used in their study was relatively small and did not include complex backgrounds or cases of highly overlapping algae.
Table 1 lists the datasets documented in detail in current research on algal target detection. As Table 1 shows, the datasets used in existing studies are relatively small, none exceeding 2000 images. The first five studies cover no more than ten algal species each; the last covers 30 species, but with no more than 40 images per species, and the image backgrounds are clean and free of impurities, unlike the complex backgrounds of real water environments. Target detection models trained on these datasets generalize poorly and cannot meet the requirements of real water sample detection. Therefore, this study collects algae images from real water samples and constructs datasets with rich, complex species and backgrounds to improve model generalization from the dataset perspective.
In natural environments, less dominant species are difficult to capture, leading to dataset imbalance. Consequently, models tend to focus on the abundant species, reducing accuracy for the rarer ones. Data augmentation is a direct approach to mitigate this problem and can be categorized into basic and advanced augmentation techniques. Basic augmentation includes geometric transformations, such as flipping, rotation, scaling, and cropping [16]. However, scaling and cropping may cause loss of original size information, making them unsuitable for algal target detection. Advanced techniques include augmentation based on GANs (Generative Adversarial Networks), Mixup, and CutMix, which generate new synthetic samples or mix original samples to enhance dataset diversity [17,18]. Kisantal et al. proposed a method involving oversampling and multiple copy-pasting of small objects to increase their presence, which significantly improved the model performance for small targets [19]. However, this approach often leaves noticeable bounding-box artifacts, making the synthetic targets unrealistic. Inspired by this, we propose a segmentation-fusion-based data augmentation method that automatically segments the target’s contour and naturally integrates it into the background through a fusion module, creating highly realistic data samples.
YOLO and Faster R-CNN are mainstream models for algal target detection and recognition. However, they are limited by a fixed receptive field, leading to suboptimal performance when detecting small or morphologically complex algae, particularly in scenarios with complex backgrounds and significant overlap among algae [20]. In contrast, the Deformable DETR model shows notable advantages in handling complex backgrounds, overlapping algae, and small-object detection. Firstly, Deformable DETR utilizes a Transformer architecture and introduces a deformable attention mechanism, allowing it to adaptively focus on important local regions without being constrained by a fixed receptive field size, thereby accurately capturing the boundaries and detailed features of algae. This deformable attention mechanism enables the model to flexibly select the most informative regions, making it particularly well-suited for algae images with rich details and overlapping targets [21,22]. Additionally, Deformable DETR overcomes the slow convergence issue inherent in the original DETR model, resulting in a more efficient training process [23].
In real water samples, the size of algal species varies significantly; for example, Spirulina can be up to 50 times larger than Cyclotella. While large-scale algae are easy to detect and count, small-scale algae are often harder to recognize accurately because of their limited pixel representation. Therefore, improving the recognition accuracy of small algae while maintaining the accuracy for large algae is crucial. To achieve this, this paper incorporates the normalized Wasserstein distance (NWD) loss function into the original loss function of the Deformable DETR model. The NWD loss effectively alleviates the sensitivity of the IoU loss to small objects, giving it a significant advantage in handling targets of varying sizes and allowing a better balance between detecting large and small objects [23].
In real water ecosystems, the background of the water environment is complex and the algal species are rich and diverse, so the target recognition task must simultaneously handle large, medium, and small targets as well as stacked targets. To improve the accuracy of localization and recognition of multiple algal targets against complex real water backgrounds, this research makes improvements in dataset construction, data augmentation, and model design.
  • First, we constructed a comprehensive dataset of various algal microscopic images, collected from different real water environments to enhance the generalization capability and applicability of the trained model.
  • Second, to address the issue of limited representation of disadvantaged algal species, we proposed an automated segmentation-fusion-based data augmentation method.
  • Third, for the target recognition model, we adopted the Transformer-based Deformable DETR and optimized it with the NWD loss function to better adapt it to the algae dataset.

2. Dataset Construction

2.1. Dataset Collection

Constructing a dataset is the primary step for achieving algal target detection based on deep learning. Existing research datasets are typically constructed using microscopic images, with the most representative being the Vision Meets Algae dataset [24], which comprises 1000 microscopic images of algae, covering eight types of microalgae. The targets in individual images of this dataset are relatively uniform, and most of the algae species are lab-cultured, which limits the dataset’s ability to reflect the distribution of algae in natural aquatic environments.
The ultimate goal of algal target recognition is to enable water environment monitoring, which demands both high recognition accuracy and a high level of automation in water quality assessment. To meet these goals, this study employed a Sanjiang Zhaoyuan Algae Digital Scanner as the microscopic image collection device. The scanner, developed by SANJIANG ZHAOYUAN (SHANGHAI) BIOTECHNOLOGY Co., Ltd., Shanghai, China, is a microscopic imaging device designed for algae analysis. It supports high-speed panoramic scanning, achieving an imaging resolution of 0.15 μm/pixel at 400× magnification, and provides high resolution, full-field scanning, automated operation, and digital analysis capabilities, making it suitable for large-scale, automated data collection. Data samples were collected from water bodies in five regions: Daming Lake, Yinshan Lake, Dianchi Lake, Taihu Lake, and various ponds at the Shanghai Jiao Tong University Minhang Campus. The collected water samples were scanned at 400× magnification, and clear images containing algae were selected from the scans to form the dataset. All algae in this dataset are larger than 2 μm.

2.2. Dataset Augmentation

For object detection tasks, a dataset should offer background diversity, class balance, and instance diversity to ensure effective and robust model training [25]. In this study, the original dataset comprises 4279 water sample images covering 25 algae species, as detailed in Figure 1 and illustrated in Table 2. The collected images cover different morphological stages across the algal growth cycle, and the varied collection sites significantly enrich the background characteristics, giving the dataset excellent background and instance diversity.
Regarding class balance, some rare algal species are underrepresented in the dataset. To improve class instance balance, we designed a segmentation-fusion-based data augmentation optimization approach. This method is based on the U2NET (U-square net) model, which extracts morphological contour features of the algae. Subsequently, the segmented rare algal species are seamlessly “pasted” into suitable backgrounds from other water samples using a feature fusion technique, creating new, synthetic, yet highly realistic data samples.
In object detection, Kisantal et al. proposed a method to address the low detection performance of small objects, which stems from their insufficient representation in datasets relative to larger objects; their approach increases the frequency of small objects by repeatedly copying and pasting them [19]. However, their method relies on instance segmentation masks to obtain small-object images, which imposes high annotation requirements on the dataset, and conventional bounding-box annotations cannot capture precise pixel-level contours. To address this, we propose a segmentation-fusion-based data augmentation optimization scheme, as illustrated in Figure 2. The proposed scheme consists of two parts:
The first part involves obtaining the target contour information through a segmentation model. Initially, the segmentation model is used to generate a segmented image that contains the target’s contour. However, directly fusing this segmented image with the background often results in jagged black pixels around the edges, leading to an unnatural fusion effect. Therefore, careful adjustment and refinement are necessary for the fusion.
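As an illustration of this first part, the following minimal sketch post-processes a saliency map (assumed to come from the U2NET model, with values in [0, 1]) into a paste-ready target whose segmentation mask is attached as an alpha channel. The function name, threshold, and RGBA convention are our assumptions, not the paper's published code.

```python
import cv2
import numpy as np

def extract_target_rgba(image, saliency):
    """Turn a saliency map into a croppable RGBA target: threshold the map
    into a binary mask, crop the tightest bounding box around it, and attach
    the mask as an alpha channel for later pasting (illustrative sketch)."""
    mask = (saliency > 0.5).astype(np.uint8) * 255          # binary contour mask
    x, y, w, h = cv2.boundingRect(mask)                     # tightest box around mask
    crop = image[y:y + h, x:x + w]                          # BGR target crop
    alpha = mask[y:y + h, x:x + w]                          # matching alpha channel
    return np.dstack([crop, alpha])                         # RGBA, ready for fusion
```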
The second part involves using a fusion module to integrate the segmented target image into the background seamlessly. This fusion module includes two steps: coarse fusion and fine fusion. The fine fusion process begins by converting the coarsely fused image to grayscale, followed by edge detection using the Canny operator. The edges are then optimized through a series of operations including dilation, erosion, enlargement, blurring, and shrinking. Based on the optimized edges, image restoration is performed on the edge regions of the fused image to obtain the final result. Figure 3 illustrates examples of the algae samples after data augmentation, showing that the augmented samples are realistically blended with the background.
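To make the two-step fusion module concrete, the following OpenCV sketch consumes the RGBA target from the previous sketch: a coarse alpha-blend paste, then fine fusion via Canny edge detection, dilation/erosion/blurring of the seam, and inpainting-based image restoration over the seam band. All function names, kernel sizes, and thresholds are our assumptions; the paper does not publish its implementation.

```python
import cv2
import numpy as np

def fuse_target(background, target_rgba, x, y):
    """Sketch of the fusion module: coarse fusion (alpha-blend paste),
    then fine fusion that repairs the jagged seam around the pasted target."""
    h, w = target_rgba.shape[:2]
    roi = background[y:y + h, x:x + w]
    alpha = target_rgba[:, :, 3:4].astype(np.float32) / 255.0

    # Coarse fusion: alpha-blend the segmented target onto the background.
    roi[:] = (alpha * target_rgba[:, :, :3] + (1.0 - alpha) * roi).astype(np.uint8)

    # Fine fusion: grayscale + Canny on the coarsely fused region, then
    # dilate/erode/blur to turn the jagged edge into a smooth band.
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    kernel = np.ones((3, 3), np.uint8)
    band = cv2.dilate(edges, kernel, iterations=2)
    band = cv2.erode(band, kernel, iterations=1)
    band = cv2.GaussianBlur(band, (5, 5), 0)

    # Keep only the band near the paste contour, so the target interior
    # and the untouched background are not repainted.
    contour = cv2.morphologyEx((alpha[:, :, 0] > 0.5).astype(np.uint8) * 255,
                               cv2.MORPH_GRADIENT, kernel)
    seam = cv2.bitwise_and((band > 0).astype(np.uint8) * 255,
                           cv2.dilate(contour, kernel, iterations=2))

    # Image restoration (inpainting) over the seam band gives the final result.
    mask = np.zeros(background.shape[:2], np.uint8)
    mask[y:y + h, x:x + w] = seam
    return cv2.inpaint(background, mask, 3, cv2.INPAINT_TELEA)
```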

3. Improved Deformable DETR Detection Algorithm

Traditional two-stage object detection algorithms, such as Faster R-CNN and Spatial Pyramid Pooling Networks (SPPNet), rely on a Region Proposal Network (RPN) to generate candidate object boxes, which are then classified and regressed. These methods involve multiple stages and complex post-processing steps, such as Non-Maximum Suppression (NMS), to handle overlapping candidate boxes, resulting in slower computation speeds [20,26]. In current research, YOLO has become a mainstream deep learning method for algal target recognition tasks. The YOLO model uses a single convolutional neural network to directly predict object bounding boxes and class probabilities. Although YOLO is praised for its speed and simplified design [27], it still requires predefined anchor boxes and employs Non-Maximum Suppression during the post-processing stage to handle overlapping detection boxes.
To further simplify the object detection process and achieve true end-to-end detection, the DETR model introduced a Transformer architecture, thereby eliminating the need for anchor boxes [28]. Deformable DETR improves upon the DETR model by addressing its slow convergence and limited feature space resolution issues. By incorporating a multi-scale feature extraction backbone and a deformable attention mechanism, Deformable DETR significantly accelerates the convergence rate of the DETR model and also achieves better performance in small object detection [21].

3.1. Deformable DETR Network Architecture

The original DETR model is primarily composed of three parts: the Backbone network, the Transformer, and Feed-Forward Networks (FFNs). The Backbone network typically uses ResNet as the feature extraction network, extracting features from the input image, which are then added with positional encodings and fed into the Transformer module. The Transformer module uses the standard Transformer structure, which includes self-attention mechanisms and feed-forward networks. The Transformer structure was first introduced in the paper “Attention is All You Need” [29] and uses an encoder-decoder architecture. In DETR, it is used to process image features and learn global dependencies between object queries without requiring any prior knowledge, such as anchor boxes [21]. At the end of the model, the FFN outputs the class probability and bounding-box coordinates for each object.
In the DETR model, the self-attention mechanism requires calculating attention weights between all elements in a sequence, resulting in substantial computational complexity. Deformable DETR addresses this by introducing Multi-Scale Deformable Attention (MSDeformAttn), which focuses on a small yet representative set of key points rather than performing dense sampling over the entire feature map. The deformable attention mechanism cleverly integrates the principles of sparse attention with innovative dynamic offset techniques. Initially, the deformable attention mechanism reduces computational complexity through sparse selection, thereby decreasing the number of key points that require attention calculations. Dynamic offsets are then applied at these sampled points to obtain more accurate attention weights.
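The following PyTorch sketch illustrates the core idea of deformable attention in a deliberately simplified single-head, single-scale form: offsets and weights are predicted from each query, and only a few points are sampled from the feature map rather than attending densely. The offset bound (tanh scaled by 0.1) and all shapes are illustrative assumptions; the actual MSDeformAttn in Deformable DETR is multi-scale and multi-head [21].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Single-head, single-scale sketch: each query attends to n_points
    locations sampled at learned offsets around its reference point."""

    def __init__(self, dim: int, n_points: int = 4):
        super().__init__()
        self.n_points = n_points
        self.offset_proj = nn.Linear(dim, n_points * 2)  # (dx, dy) per sample point
        self.weight_proj = nn.Linear(dim, n_points)      # attention weight per point
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, feat_map):
        # queries: (B, Nq, C); ref_points: (B, Nq, 2) normalized to [0, 1];
        # feat_map: (B, C, H, W)
        B, Nq, C = queries.shape
        value = self.value_proj(feat_map.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        # dynamic offsets predicted from the query (tanh bound is our choice)
        offsets = self.offset_proj(queries).reshape(B, Nq, self.n_points, 2)
        locs = (ref_points[:, :, None, :] + 0.1 * offsets.tanh()).clamp(0, 1)
        # sparse attention: weights over the few sampled points only
        weights = self.weight_proj(queries).softmax(dim=-1)         # (B, Nq, K)
        grid = locs * 2.0 - 1.0                                     # grid_sample coords
        sampled = F.grid_sample(value, grid, align_corners=False)   # (B, C, Nq, K)
        out = (sampled * weights[:, None, :, :]).sum(dim=-1)        # (B, C, Nq)
        return self.out_proj(out.transpose(1, 2))                   # (B, Nq, C)
```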

3.2. Improved Loss Function for Deformable DETR

The loss function is a crucial component in model optimization, directly impacting both training outcomes and final performance. The loss function for Deformable DETR mainly consists of two parts: matching loss and bounding-box regression loss.
In Deformable DETR, the model outputs a set of N predictions in one pass, and the Hungarian algorithm is used to perform optimal matching. This algorithm finds the best one-to-one correspondence between the predictions and ground-truth annotations to minimize the total matching loss. The matching loss is typically calculated using cross-entropy loss, which takes into account the loss between each predicted category and its corresponding ground-truth category. The matching loss formula is as follows:
$$\mathcal{L}_{match} = -\sum_{i=1}^{N} \log \hat{p}_{\sigma(i)}(y_i),$$
Here, $N$ is the number of ground-truth objects, $\hat{p}_{\sigma(i)}(y_i)$ is the predicted probability that the $\sigma(i)$-th prediction belongs to the true category $y_i$, and $\sigma$ is the optimal matching obtained through the Hungarian algorithm.
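A minimal sketch of this matching step, using SciPy's implementation of the Hungarian algorithm and a classification-only cost (the full DETR matching cost also includes bounding-box terms), might look as follows; the function name and shapes are illustrative.

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, tgt_labels):
    """Optimal one-to-one assignment between predictions and ground truth.
    pred_logits: (N_pred, num_classes); tgt_labels: (N_tgt,) class indices."""
    probs = pred_logits.softmax(-1)        # class probabilities per prediction
    cost = -probs[:, tgt_labels]           # (N_pred, N_tgt): -p of the true class
    pred_idx, tgt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_idx, tgt_idx               # matched prediction/target indices
```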
Once matching is completed, for each pair of matched predicted and actual boxes, the generalized intersection over union (GIoU) loss is used to optimize the position and size of the predicted boxes. GIoU loss, an improvement on IoU loss, accounts for the overlap and encasement of bounding boxes, providing better gradient properties.
$$\mathcal{L}_{GIoU} = 1 - IoU + \frac{|C \setminus U|}{|C|},$$
where $IoU$ is the intersection over union of the predicted and ground-truth bounding boxes, $U$ is their union, and $C$ is the smallest box enclosing both. $|C \setminus U|$ is the area of $C$ not covered by the union, i.e., the non-overlapping area between the predicted and actual boxes.
Loss metrics based on IoU are highly sensitive to the positioning of small objects; minor positional deviations in these objects can result in significant changes in IoU, impacting the detection performance of small objects. Therefore, to reduce the model’s sensitivity to the positioning of small objects, a normalized Wasserstein distance is introduced into the existing loss function to measure the similarity between the predicted and target boxes.
The normalized Gaussian Wasserstein distance models each bounding box as a Gaussian distribution and uses the Wasserstein distance to measure the similarity between the predicted and target boxes [23]. Assuming bounding boxes $A = (cx_a, cy_a, w_a, h_a)$ and $B = (cx_b, cy_b, w_b, h_b)$ are modeled as two-dimensional Gaussian distributions $\mathcal{N}_a$ and $\mathcal{N}_b$, respectively, the second-order Wasserstein distance between the two boxes simplifies to:
$$W_2^2(\mathcal{N}_a, \mathcal{N}_b) = \left\| \left( cx_a,\ cy_a,\ \tfrac{w_a}{2},\ \tfrac{h_a}{2} \right)^{\mathrm{T}} - \left( cx_b,\ cy_b,\ \tfrac{w_b}{2},\ \tfrac{h_b}{2} \right)^{\mathrm{T}} \right\|_2^2,$$
where $cx$, $cy$, $w$, and $h$ denote the box center coordinates, width, and height, respectively. To satisfy the value-range characteristics of a similarity measure, the second-order Wasserstein distance is normalized in exponential form, yielding the normalized Wasserstein distance similarity measure and a regression loss based on it:
$$NWD(\mathcal{N}_a, \mathcal{N}_b) = \exp\left( -\frac{\sqrt{W_2^2(\mathcal{N}_a, \mathcal{N}_b)}}{C} \right),$$
$$\mathcal{L}_{nwd} = 1 - NWD(\mathcal{N}_a, \mathcal{N}_b),$$
where $C$ is a constant closely related to the dataset; in this study it is set to 12.8. The modified total loss function $\mathcal{L}$ is given below, where each $\lambda$ is a weighting coefficient that balances the corresponding loss term and can be tuned to the task and data characteristics to achieve optimal training results.
$$\mathcal{L} = \lambda_1 \mathcal{L}_{match} + \lambda_2 \mathcal{L}_{GIoU} + \lambda_3 \mathcal{L}_{nwd},$$
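As an illustration, a minimal PyTorch sketch of the NWD regression loss for boxes in (cx, cy, w, h) format, following the formulas above, could look as follows; the function name and the epsilon guard are our additions.

```python
import torch

def nwd_loss(pred, target, C=12.8, eps=1e-7):
    """NWD loss for boxes shaped (..., 4) in (cx, cy, w, h) format.
    Each box is treated as a 2D Gaussian; the W2 distance compares the
    (cx, cy, w/2, h/2) vectors, and exp(-sqrt(W2^2)/C) maps it to (0, 1]."""
    pa = torch.stack([pred[..., 0], pred[..., 1],
                      pred[..., 2] / 2, pred[..., 3] / 2], dim=-1)
    pb = torch.stack([target[..., 0], target[..., 1],
                      target[..., 2] / 2, target[..., 3] / 2], dim=-1)
    w2 = ((pa - pb) ** 2).sum(dim=-1)           # squared 2nd-order Wasserstein distance
    nwd = torch.exp(-torch.sqrt(w2 + eps) / C)  # normalized similarity in (0, 1]
    return 1.0 - nwd
```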
Finally, the architecture of the improved Deformable DETR model utilized in our study is presented in Figure 4.

4. Experiments and Analysis

4.1. Evaluation Metrics

The primary evaluation metrics for model recognition performance include precision (P), recall (R), and mean average precision (mAP). The formulas for calculating precision and recall are as follows:
$$Precision = \frac{TP}{TP + FP},$$
$$Recall = \frac{TP}{TP + FN},$$
The values of true positives (TP), false positives (FP), and false negatives (FN) are determined by comparing the predicted results of the algae target detection model with the manually annotated ground truth. Specifically, TP represents the number of correctly detected targets: a prediction is classified as TP when the predicted bounding box matches the ground-truth bounding box in class label and their IoU exceeds a predefined threshold. FP refers to the number of falsely detected targets that do not exist in the dataset, occurring when a predicted bounding box fails to match any ground-truth object (e.g., the IoU is below the threshold or there is no corresponding ground-truth object with the same class label). FN represents the number of real algae targets that the model fails to detect, meaning the objects are present in the image but are missed by the model. Precision evaluates the proportion of accurate predictions among all predictions made by the model, while recall assesses the proportion of actual targets successfully detected. Combining precision and recall gives a comprehensive evaluation of the object detection model's performance.
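For illustration, a simplified greedy counting routine under these definitions might look as follows; real evaluators additionally sort predictions by confidence and keep per-class bookkeeping, and all names here are illustrative.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def count_tp_fp_fn(preds, gts, iou_thr=0.5):
    """Greedy matching: a prediction is a TP if it matches an unused
    ground-truth box of the same class with IoU above the threshold.
    preds and gts are lists of (class_label, box) pairs."""
    matched, tp, fp = set(), 0, 0
    for cls, box in preds:                      # ideally sorted by confidence
        hit = next((j for j, (gcls, gbox) in enumerate(gts)
                    if j not in matched and gcls == cls
                    and iou(box, gbox) >= iou_thr), None)
        if hit is None:
            fp += 1                             # no matching ground truth
        else:
            matched.add(hit)
            tp += 1                             # correct detection
    fn = len(gts) - len(matched)                # missed ground-truth targets
    return tp, fp, fn
```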
The mAP (mean average precision) metric measures the model's average precision at various recall thresholds. 'AP' stands for average precision, and 'm' denotes the mean across categories, i.e., the average of the AP values over all classes. The higher the AP, the better the model maintains high precision at elevated recall rates. Plotting precision against recall at different thresholds produces the precision–recall curve, and the area under this curve is the AP. The formula for AP is shown below, where $P_k$ and $R_k$ denote the precision and recall at the $k$-th threshold.
$$AP = \sum_{k=1}^{n} \left( R_k - R_{k-1} \right) \frac{P_k + P_{k-1}}{2},$$
The mAP (0.5–0.95) metric shows the average performance of the model as the IoU varies from 0.5 to 0.95. This range of assessments demonstrates the robustness of the model under different stringencies.
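A small NumPy sketch of the AP computation above, using the trapezoidal sum over precision–recall points, might look as follows; prepending the starting point $(R_0, P_0) = (0, 1)$ is a common convention and an assumption here.

```python
import numpy as np

def average_precision(recalls, precisions):
    """AP as the area under the precision-recall curve via the trapezoidal
    rule, matching the formula above. Inputs are sorted by increasing recall."""
    r = np.concatenate([[0.0], recalls])    # assumed start at recall 0
    p = np.concatenate([[1.0], precisions]) # assumed start at precision 1
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))

# Example: mAP is then the mean of per-class AP values.
# ap_per_class = [average_precision(r_c, p_c) for r_c, p_c in per_class_curves]
# mAP = sum(ap_per_class) / len(ap_per_class)
```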

4.2. Experimental Results and Discussion

4.2.1. Comparative Experiment on Dataset Augmentation Effectiveness

Using the segmentation-fusion-based data augmentation optimization approach proposed in Section 2, we performed data augmentation on the original dataset from Figure 1 to construct an augmented dataset. Augmentation mainly targeted the training set, raising the representation of infrequently appearing algal species to a specified threshold (set to 400) while keeping the validation set unchanged to ensure consistent comparison standards. Table 4 reports the resulting performance improvements of the augmented dataset over the original dataset under the same model and validation set conditions. The model-training environment is listed in Table 3.
The YOLOv5 model, Faster R-CNN model, and Deformable DETR model were each selected and trained on both the original dataset and the augmented dataset. The backbone module of the Deformable DETR model was set to ResNet-50. To maintain recognition accuracy while speeding up model training, the number of encoder and decoder layers was reduced from six to four, denoted as Deformable_DETR_NWD(4).
The comparison of recognition performance between original and augmented datasets is presented in Table 4. In terms of overall recognition precision across the datasets, the YOLOv5 model trained on the augmented dataset demonstrated a 1.8% increase in mean average precision at an IoU threshold of 0.5, a 7.1% improvement in precision at an IoU of 0.65, and a 1.5% increase in mAP (0.5–0.95) compared to the original dataset. Similarly, the Faster R-CNN model showed a 0.7% increase in mAP (0.5), a 4.7% improvement in precision at IoU = 0.65, and a 1.3% increase in mAP (0.5–0.95) when trained on the augmented dataset. The Deformable DETR model also benefited from the augmented dataset, exhibiting a 0.1% increase in mAP (0.5–0.95). These findings indicate that the segmentation-fusion-based data augmentation method proposed in this study effectively improves the recognition performance of a variety of models.
Table 5 shows the improvement in recognition precision for underrepresented algae species in the dataset. The segmentation-fusion-based data augmentation approach not only significantly increased the frequency of underrepresented algae species in the training set but also preserved the morphological features and edge information of the original targets, thereby enhancing their detection performance. The method involves automatically segmenting the original target’s contours in the segmentation module, retaining detailed features of the target, and then reasonably integrating the segmented target with the background during the fusion stage. This ensures a natural fusion effect between the target and the background, avoiding artifacts around the edges. As a result, this data augmentation approach effectively improves the model’s detection precision for underrepresented algae species.

4.2.2. Comparative Experiments Based on the Improved Deformable DETR

After applying data augmentation, the improved Deformable DETR model was trained; the overall metrics of the trained model are presented in Table 6. In terms of precision, recall, and mAP, the improved Deformable DETR demonstrated significant advantages, especially at higher IoU thresholds. Its high precision and recall highlight its robust capability to handle diverse and complex scenarios, making it well suited for multi-target algae recognition in real-world, complex water sample environments.
Figure 5 shows the recognition results for stacked algae species against complex backgrounds, where green boxes represent ground-truth annotations and red boxes represent predicted bounding boxes. Various environmental interference factors are present in the water samples, such as suspended particles, bacteria, varying lighting conditions, and water flow, all of which blur algal boundaries. Additionally, the algal species in the water samples vary in shape and size, often resulting in overlapping targets. The improved Deformable DETR model leverages adaptive attention mechanisms and long-range feature capture, enabling it to effectively distinguish algal targets in overlapping and complex backgrounds. This ability is particularly prominent in multi-target algae recognition tasks in real water sample environments, significantly enhancing the model's robustness and applicability.

5. Conclusions

In real water samples, the background of the water environment is complex and algal species are numerous and morphologically diverse, so target recognition must simultaneously handle algal targets of different sizes as well as stacked targets. This study makes improvements in dataset construction, data augmentation, and model design to increase the detection accuracy of multiple algal species in real water samples.
The dataset constructed in this study was derived from real water samples collected from five different regions, containing 25 algal species, and exhibited good background and species diversity. Additionally, we proposed a segmentation-fusion-based data augmentation approach to address class imbalance issues, thereby improving the recognition precision of underrepresented algal species. To verify the effectiveness of the proposed data augmentation approach, comparative experiments were conducted using the YOLO model, the Faster R-CNN model, and the improved Deformable DETR model. The experiments demonstrated that the proposed data augmentation method effectively improved the performance metrics of all three models: for the YOLO model, the augmented dataset increased precision by 7.1% while maintaining a high recall rate, with mAP (0.5–0.95) improving by 1.5%. For the Faster R-CNN model, precision improved by 4.7%, mAP (0.5) increased by 0.7%, and mAP (0.5–0.95) improved by 1.3%. The improved Deformable DETR model also showed a 0.1% enhancement in the mAP (0.5–0.95) metric.
Regarding model optimization, the Deformable DETR model introduced multi-scale feature extraction and deformable attention mechanisms to effectively extract long-range dependency features in images, enhancing its ability to recognize overlapping targets in complex backgrounds. By incorporating the normalized Wasserstein distance loss function, the model more effectively measures the similarity between predicted and ground truth boxes, regardless of target size, thereby improving overall model performance in handling targets of varying sizes. The improvements made to the Deformable DETR model resulted in a 10.4% increase in mAP (IoU = 0.5) compared to the YOLO model, and the mAP metrics under different IoU thresholds were all improved compared to the original Deformable DETR model.

Author Contributions

Conceptualization, L.L.; Methodology, L.L.; Software, L.L.; Validation, L.L. and Z.L.; Formal analysis, L.L. and Y.Q.; Investigation, T.L.; Resources, Q.Y.; Data curation, Z.L., T.L. and Q.Y.; Writing – original draft, L.L.; Writing – review & editing, L.L. and Y.Q.; Project administration, C.L.; Funding acquisition, C.L. All authors have read and agreed to the published version of the manuscript.

Funding

The authors declare that this study received funding from Shanghai Jiao Tong University and Ministry of Land and Resources of the People’s Republic of China. The funder was not involved in the study design, collection, analysis, interpretation of data, the writing of this article, or the decision to submit it for publication.

Data Availability Statement

The dataset in this study is currently unavailable as the data is part of an ongoing research project. Requests for access to the data can be directed to Cunyue Lu.

Conflicts of Interest

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Ramanan, R.; Kim, B.-H.; Cho, D.-H.; Oh, H.-M.; Kim, H.-S. Algae–Bacteria Interactions: Evolution, Ecology and Emerging Applications. Biotechnol. Adv. 2016, 34, 14–29. [Google Scholar] [CrossRef] [PubMed]
  2. Paerl, H.W.; Otten, T.G. Harmful Cyanobacterial Blooms: Causes, Consequences, and Controls. Microb. Ecol. 2013, 65, 995–1010. [Google Scholar] [CrossRef] [PubMed]
  3. Reynolds, C.S. The Ecology of Phytoplankton; Ecology, Biodiversity and Conservation; Cambridge University Press: Cambridge, UK, 2006; ISBN 978-0-521-60519-9. [Google Scholar]
  4. Thessen, A. Adoption of Machine Learning Techniques in Ecology and Earth Science. One Ecosyst. 2016, 1, e8621. [Google Scholar] [CrossRef]
  5. Sellner, K.G.; Doucette, G.J.; Kirkpatrick, G.J. Harmful Algal Blooms: Causes, Impacts and Detection. J. Ind. Microbiol. Biotechnol. 2003, 30, 383–406. [Google Scholar] [CrossRef] [PubMed]
  6. Hou, T.; Chang, H.; Jiang, H.; Wang, P.; Li, N.; Song, Y.; Li, D. Smartphone Based Microfluidic Lab-on-Chip Device for Real-Time Detection, Counting and Sizing of Living Algae. Measurement 2022, 187, 110304. [Google Scholar] [CrossRef]
  7. Lin, K.; Tang, Y.; Tang, J.; Huang, H.; Qin, Z. Algae Object Detection Algorithm Based on Improved YOLOv5. In Proceedings of the 2023 6th International Conference on Software Engineering and Computer Science (CSECS), Chengdu, China, 22–24 December 2023; pp. 1–5. [Google Scholar]
  8. Guo, Y.; Gao, L.; Li, X. A Deep Learning Model for Green Algae Detection on SAR Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4210914. [Google Scholar] [CrossRef]
  9. Ruiz-Santaquiteria, J.; Bueno, G.; Deniz, O.; Vallez, N.; Cristobal, G. Semantic versus Instance Segmentation in Microscopic Algae Detection. Eng. Appl. Artif. Intell. 2020, 87, 103271. [Google Scholar] [CrossRef]
  10. Liu, T. Research on Marine Microalgae Recognition Algorithm for Few-Shot Scenarios. Master’s Thesis, Dalian Ocean University, Dalian, China, 2024. [Google Scholar]
  11. Qian, P.; Zhao, Z.; Liu, H.; Wang, Y.; Peng, Y.; Hu, S.; Zhang, J.; Deng, Y.; Zeng, Z. Multi-Target Deep Learning for Algal Detection and Classification. In Proceedings of the 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Montreal, QC, Canada, 20–24 July 2020; pp. 1954–1957. [Google Scholar]
  12. Abdullah; Ali, S.; Khan, Z.; Hussain, A.; Athar, A.; Kim, H.-C. Computer Vision Based Deep Learning Approach for the Detection and Classification of Algae Species Using Microscopic Images. Water 2022, 14, 2219. [Google Scholar] [CrossRef]
  13. Wu, Z.; Chen, M. Lightweight detection method for microalgae based on improved YOLO v7. J. Dalian Ocean. Univ. 2023, 38, 129–139. [Google Scholar] [CrossRef]
  14. Chu, Z.; Zhang, X.; Ying, G.; Jia, R.; Qi, Y.; Xu, M.; Hu, X.; Huang, P.; Ma, M.; Yang, R. Detection Algorithm of Planktonic Algae Based on Improved YOLOv3. Laser Optoelectron. Prog. 2023, 60, 257–264. [Google Scholar]
  15. Park, J.; Baek, J.; Kim, J.; You, K.; Kim, K. Deep Learning-Based Algal Detection Model Development Considering Field Application. Water 2022, 14, 1275. [Google Scholar] [CrossRef]
  16. Shorten, C.; Khoshgoftaar, T.M. A Survey on Image Data Augmentation for Deep Learning. J. Big Data 2019, 6, 60. [Google Scholar] [CrossRef]
  17. Yun, S.; Han, D.; Chun, S.; Oh, S.J.; Yoo, Y.; Choe, J. CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6022–6031. [Google Scholar]
  18. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. Mixup: Beyond Empirical Risk Minimization. arXiv 2018, arXiv:1710.09412. [Google Scholar]
  19. Kisantal, M.; Wojna, Z.; Murawski, J.; Naruniec, J.; Cho, K. Augmentation for Small Object Detection. arXiv 2019, arXiv:1902.07296. [Google Scholar]
  20. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  21. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv 2021, arXiv:2010.04159. [Google Scholar]
  22. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  23. Xu, C.; Wang, J.; Yang, W.; Yu, H.; Yu, L.; Xia, G.-S. Detecting Tiny Objects in Aerial Images: A Normalized Wasserstein Distance and a New Benchmark. ISPRS J. Photogramm. Remote Sens. 2022, 190, 79–93. [Google Scholar] [CrossRef]
  24. Zhou, S.; Jiang, J.; Hong, X.; Fu, P.; Yan, H. Vision Meets Algae: A Novel Way for Microalgae Recognization and Health Monitor. Front. Mar. Sci. 2023, 10, 1105545. [Google Scholar] [CrossRef]
  25. Salari, A.; Djavadifar, A.; Liu, X.; Najjaran, H. Object Recognition Datasets and Challenges: A Review. Neurocomputing 2022, 495, 129–152. [Google Scholar] [CrossRef]
  26. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object Detection in 20 Years: A Survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
  27. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  28. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
  29. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar] [CrossRef]
Figure 1. Distribution of algal species in the original dataset.
Figure 2. The segmentation-fusion-based data augmentation optimization scheme.
Figure 3. Example results of the segmentation-fusion-based data augmentation optimization approach. Note: the regions marked by red boxes are samples added through data augmentation.
Figure 4. Structure of the improved Deformable DETR model.
Figure 5. Illustration of detection results based on the improved Deformable DETR model.
Table 1. Statistics on datasets used in existing research on algal target localization and recognition.

Article | Dataset Quantity | Algal Species
Wu and Chen [13] | 1512 microscopic images of microalgae | 14 species, including Fibrocystis
Chu et al. [14] | 635 images of planktonic algae | 5 species, including Dunaliella salina
Qian et al. [11] | 1859 microscopic images of algae | 9 categories
Ruiz-Santaquiteria et al. [9] | 635 microscopic images of planktonic algae | 5 species, including Dunaliella salina
Abdullah et al. [12] | 400 microscopic images of microalgae | 4 species
Park et al. [15] | 437 microscopic images of microalgae | 30 species
Table 2. Illustrations of various algal species in the original dataset (common morphology images omitted).

Algal Species | ID
Planktothrix sp. | 1
Aulacoseira granulata | 2
Aphanizomenon flosaquae | 3
Microcystis sp. | 4
Cyclotella sp. | 5
Peridinium bipes | 6
Nitzschia sp. | 7
Chlorella sp. | 8
Spirulina-like | 9
Cryptomonas sp. | 10
Pediastrum sp. | 11
Scenedesmus quadricauda | 12
Anabaena circinalis | 13
Mougeotia sp. | 14
Actinastrum sp. | 15
Anabaena sp. | 16
Chlamydomonas sp. | 17
Planctonema lauterbornii | 18
Cosmarium Corda | 19
Scenedesmus acuminatus | 20
Ulothrix sp. | 21
Staurastrum sp. | 22
Spirogyra sp. | 23
Dolichospermum spiroides | 24
Euglena sp. | 25
Table 3. The computer environment used for model training.

Item | Configuration
System | Windows 11
CPU | Intel(R) Core(TM) i7-10700K @ 3.80 GHz
GPU | NVIDIA GeForce RTX 2070
RAM | 32 GB
Table 4. Comparison of recognition performance metrics between the original dataset and the augmented dataset (change after augmentation).

Model | ΔP (IoU = 0.65) | ΔR (IoU = 0.65) | ΔmAP (0.5) | ΔmAP (0.5–0.95)
YOLOv5 | +7.1% | −1.2% | +1.8% | +1.5%
Faster R-CNN | +4.7% | −0.1% | +0.7% | +1.3%
Deformable_DETR_NWD(4) | +0.6% | +0.9% | 0 | +0.1%
Table 5. Effectiveness of data augmentation for disadvantaged algal species (YOLO model).

Disadvantaged Algal Species | ΔP (IoU = 0.65) (%)
Chlamydomonas sp. | 3.8
Cosmarium Corda | 0.9
Scenedesmus acuminatus | 2.5
Staurastrum sp. | 5.1
Spirogyra sp. | 27.6
Euglena sp. | 14.7
Table 6. Comparison of recognition performance across different models.

Model | P (IoU = 0.65) | R (IoU = 0.65) | mAP (0.5) | mAP (0.5–0.95)
YOLOv5 | 0.695 | 0.659 | 0.684 | 0.397
Faster R-CNN | 0.656 | 0.735 | 0.731 | 0.352
Deformable_DETR | 0.800 | 0.880 | 0.790 | 0.486
Deformable_DETR_NWD(4) | 0.810 | 0.907 | 0.799 | 0.488