CA-YOLO: An Efficient YOLO-Based Algorithm with Context-Awareness and Attention Mechanism for Clue Cell Detection in Fluorescence Microscopy Images

Cui, Can; Chen, Xi; He, Lijun; Li, Fan

doi:10.3390/s25196001

Open AccessArticle

CA-YOLO: An Efficient YOLO-Based Algorithm with Context-Awareness and Attention Mechanism for Clue Cell Detection in Fluorescence Microscopy Images

¹

School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China

²

School of Information and Communications Engineering, Xi’an Jiaotong University, Xi’an 710049, China

^*

Author to whom correspondence should be addressed.

Sensors 2025, 25(19), 6001; https://doi.org/10.3390/s25196001

Submission received: 31 August 2025 / Revised: 14 September 2025 / Accepted: 22 September 2025 / Published: 29 September 2025

(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications—2nd Edition)

Download

Browse Figures

Versions Notes

Abstract

Automatic detection of clue cells is crucial for rapid diagnosis of bacterial vaginosis (BV), but existing algorithms suffer from low sensitivity. This is because clue cells are highly similar to normal epithelial cells in terms of macroscopic size and shape. The key difference between clue cells and normal epithelial cells lies in the surface texture and edge morphology. To address this specific problem, we propose an clue cell detection algorithm named CA-YOLO. The contributions of our approach lie in two synergistic and custom-designed feature extraction modules: the context-aware module (CAM) extracts and captures bacterial distribution patterns on the surface of clue cells; and the shuffle global attention mechanism (SGAM) enhances cell edge features and suppresses irrelevant information. In addition, we integrate focal loss into the classification loss to alleviate the severe class imbalance problem inherent in clinical samples. Experimental results show that the proposed CA-YOLO achieves a sensitivity of 0.778, which is 9.2% higher than the baseline model, making the automated BV detection more reliable and feasible.

Keywords:

clue cells; cell detection; context-awareness; attention mechanism; bacterial vaginosis

1. Introduction

Bacterial vaginosis (BV) is the most common reproductive tract infection in women, affecting up to 23–29% of the global population [1] and posing a significant health risk to millions [2]. Timely diagnosis and treatment are crucial, as untreated BV can lead to serious complications. These include an increased risk of HPV infection, which is associated with cervical cancer [3,4], and adverse pregnancy outcomes such as miscarriage and premature birth [5]. Therefore, the accurate and efficient detection of clue cells, the core biomarker for BV, is a prerequisite for effective clinical intervention [6].

However, automated clue cell detection faces several challenges due to several intrinsic factors. As illustrated in Figure 1, clue cells are Gardnerella-infected epithelial cells. They are thus highly similar to their normal counterparts in macroscopic size and shape. The truly discriminative information lies in subtle microscopic features [7,8]. For instance, the edges of clue cells exhibit a fuzzy, irregular granularity, whereas normal epithelial cells have smooth, clear boundaries (Figure 1a). This edge degradation often presents as high-frequency noise and is easily lost during standard convolution operations. Furthermore, the surface of clue cells is covered with a dense layer of Gardnerella, creating complex texture information. In contrast, normal epithelial cells have a uniform and smooth surface (Figure 1b). Capturing this bacterial distribution texture is fundamental for accurate classification. A final challenge is the severe class imbalance in clinical samples, where normal epithelial cells vastly outnumber clue cells. This can cause detection models to overlook the minority clue cell class. Consequently, an effective algorithm must excel at capturing fine-grained edge and texture details while robustly handling data imbalances.

While deep learning has advanced medical image analysis [9,10,11], existing object detection methods remain ill-suited for the specific demands of clue cell detection. Two-stage detectors like Faster R-CNN, though accurate in localization, are too computationally intensive for real-time clinical use and were not designed to capture microscopic features [12,13,14]. Mainstream one-stage detectors, such as the YOLO series, are faster, but their standard backbones tend to lose critical high-frequency information during downsampling, resulting in low sensitivity [15]. Even specialized cell detection models like CAF-YOLO [16] and YOLO-TCT [17] fall short. This is because they are optimized for different pathological features, such as cell nuclei characteristics or nucleus–cytoplasm ratios [18], and are not tailored to the unique edge and texture signatures of clue cells. As a result, there is an urgent need for an end-to-end model designed specifically for this task.

To overcome these limitations, we propose CA-YOLO, a deep learning model specifically engineered for high-sensitivity clue cell detection. Our model employs a synergistic strategy of targeted feature extraction and loss optimization. The main contributions are as follows: A novel context-aware module (CAM) is designed to capture the complex surface texture of clue cells. By modeling long-range dependencies across the feature map, CAM integrates local bacterial patterns into a global, semantic representation, effectively distinguishing them from the smooth surfaces of normal cells. A shuffle global attention mechanism (SGAM) is proposed to address the challenge of fuzzy cell edges. By combining spatial and channel attention with a large receptive field, SGAM accurately locates and enhances weak edge features while suppressing background noise. The integration of focal loss directly tackles the class imbalance problem, working in concert with our feature enhancement modules to create a complete end-to-end solution. Experimental results show that our CA-YOLO model achieves a sensitivity of 0.778, significantly improving upon baseline models and effectively reducing the risk of missed diagnoses in a clinical setting.

2. Related Work

2.1. Cell Detection

Due to its efficiency and real-time capabilities, the YOLO framework has been extensively applied to cell detection tasks, covering multiple cell types such as blood cells [19], cancer cells, and clue cells [20]. For instance, Wu et al. proposed the EB-YOLO model, which incorporated ShuffleNet as a lightweight backbone alongside the Convolutional Block Attention Module (CBAM) and the Adaptive Spatial Feature Fusion (ASFF) module. This architecture maintained high detection accuracy while reducing computational costs [21]. Similarly, Haq et al. integrated YOLOv2 with ResNet-18 and ResNet-50 for the detection of colorectal cancer cells, demonstrating particular effectiveness in identifying overlapping cells [22].

In the domain of clue cell detection, several approaches have also been introduced. Guo et al. developed a classification method based on multi-scale texture energy descriptors to extract cellular features, employing an SVM classifier to differentiate between cell types [23]. Yu et al. presented a Convexified Coupled Active Contour (CCAC) model that exploited structural relationships in immunofluorescence images to jointly segment clue cells, nuclei, and bacteria [24]. More recently, Chen et al. introduced the Multi-Scale Perceptual YOLO (MSP-YOLO), which integrated a super-resolution branch and an attention-based feature fusion module. This was designed to enhance the network’s sensitivity to the subtle features of clue cells and trichomonas [25]. Despite these advances, existing methods for clue cell detection still face limitations in sensitivity, primarily due to indistinct morphological features and significant class imbalance.

2.2. Context Information Enhancement

Traditional object detectors primarily rely on local features for object localization. However, in scenes with complex backgrounds or blurry targets, this reliance on local features can result in false detections or missed targets [26]. Incorporating a context information enhancement module enables a model to capture the global image context. This provides a more comprehensive understanding of the scene and improves detection accuracy and robustness [27].

Context-based modules have achieved promising results in object detection. For example, CGNet integrates multi-scale features through context-guided blocks, improving the recognition of slender structures [28]. Non-local networks efficiently model global information by capturing long-range dependencies, which particularly benefits small-object detection [29]. Additionally, the fusion of multi-scale features with global context has become a key research focus. Cui et al. proposed the Multi-scale Global Feature Aggregation Network (MGA-Net), which aggregates global features across different scales to improve adaptability to small objects and complex backgrounds [30]. In summary, context-enhancement modules are essential for optimizing detection performance in complex scenarios.

2.3. Attention Mechanisms

The attention mechanism dynamically assigns weights to emphasize key features, enhancing performance while reducing computational complexity in image processing [31]. It has been widely applied in various visual tasks, including image classification [32], object detection [33], semantic segmentation [34], 3D vision [35], etc. Common attention mechanisms are categorized into three types: channel attention [36,37], spatial attention [38,39], and a combination of both [40,41]. Channel attention determines which features to emphasize by adjusting the weights of each channel. Mo et al. proposed a U-Net-based network that incorporated the Efficient Channel Attention (ECA) mechanism to address blurred-edge segmentation [42]. Separately, Lu et al. introduced an enhanced channel attention residual block in their ECARN model to further improve performance [43]. Spatial attention, on the other hand, dynamically adjusts the weights of different spatial positions. This allows the model to emphasize key areas and suppress background noise. Verma et al. proposed a spatial attention mechanism to dynamically highlight key anatomical structures in fetal ultrasound images [44]. Liu et al. introduced the MCSA-HRNet, which employed dedicated attention blocks to improve focus on target-related regions [45]. However, both mechanisms have limitations. Channel attention can be less sensitive to fine image details, while standard spatial attention may struggle to capture the fine-grained features necessary for recognizing the indistinct edge morphology of clue cells.

3. Materials

The fluorescence microscopy image dataset used in this study was collected by Suzhou Tumo Medical Technology Co., Ltd. (Suzhou, China). It comprised 2314 images from 480 patients, with each patient contributing 2 to 8 images. For each patient, gynecological vaginal discharge samples were collected, and slides were prepared through a standardized process of dilution and staining. The prepared slides were then imaged using a fully automated immunofluorescence detection device (model YG02). The original image resolution was 2736 × 1824 pixels. Due to computational constraints and the need for efficient slide analysis, a standard preprocessing pipeline was initially applied, where original images were down-sampled to 1368 × 912 pixels and further cropped into four 684 × 456-pixel images. However, for the specific purposes of this research, we adopted a different strategy. Each original image was randomly cropped to 1024 × 1024 pixels and then down-sampled to a final input size of 512 × 512 pixels. Ultimately, our final dataset consisted of 2314 images. Among these, 455 images contained clue cells, while 1859 did not. The dataset was partitioned into a validation set of 576 images (containing 231 clue cell instances) and a test set of 477 images (containing 144 clue cell instances). The primary goal of this study was to detect these clue cell targets. As visually detailed in Figure 1, the main challenge lay in distinguishing clue cells from normal epithelial cells based on subtle differences in their edge characteristics (Figure 1a) and surface texture (Figure 1b).

4. Theory

4.1. The YOLOv8 Architecture

The YOLOv8 algorithm comprises three main components: the backbone, neck, and detection head [46]. The backbone network is responsible for initial feature extraction from the input image. It primarily utilizes CBS and C2f modules, progressively down-sampling the image to generate a hierarchy of feature maps at different spatial resolutions. The neck component, implemented as a Path Aggregation Network (PANet), is designed to fuse feature maps from different levels of the backbone. By aggregating features from both top-down and bottom-up pathways, the neck produces feature representations that are rich in both semantic and spatial information, enhancing the model’s ability to detect objects at various scales. The final predictions are generated by the three detection heads with different scales. YOLOv8 employs a decoupled head structure, where classification and regression tasks are handled by separate branches. This design allows each task to be optimized independently, improving overall detection accuracy.

4.2. Standard Loss Function of YOLOv8

The overall loss function of the YOLOv8 model is a sum of two components: classification loss

L_{c l s}

and bounding box regression loss

L_{b b o x}

. It can be expressed as

L_{Y O L O v 8} = L_{c l s} + L_{b b o x} .

(1)

The classification loss is typically calculated using binary cross-entropy (BCE) loss for each potential object. The BCE loss for a single prediction is given by

L_{B C E} = - y log \hat{y} - (1 - y) log (1 - \hat{y}),

(2)

where

y \in {0, 1}

is the ground-truth label, with 1 indicating a positive sample and 0 indicating a negative sample.

\hat{y}

is the model’s predicted probability.

The bounding box regression loss is a combination of the Complete Intersection over Union (CIoU) loss and the Distribution Focal Loss (DFL). This composite loss aims to accurately regress the coordinates of the bounding box. It is formulated as

L_{b b o x} = L_{C I o U} + L_{d f l} .

(3)

The CIoU loss accounts for the overlap area, central point distance, and aspect ratio consistency between the predicted and ground-truth boxes. DFL treats the continuous box coordinate regression as a discrete classification problem over a range of values, which helps in learning a more accurate and flexible bounding box representation.

5. Method

The overall workflow of our research is illustrated in Figure 2. The process begins with data acquisition and preprocessing, followed by patient-level partition of the dataset. The core of our approach is the training and optimization of our proposed CA-YOLO model, followed by a rigorous performance evaluation to generate the final detection results. This section details the key components of our model. The CA-YOLO model, as shown in Figure 3, is built on the YOLOv8 framework and takes a 512 × 512 pixel image as input. Our key innovation lies in the introduction of two collaborative modules within the neck structure to detect clue cells: the CAM is designed to capture the complex distribution patterns of bacteria attached to the cell surface. The SGAM enhances information interactions across channels and spatial locations, significantly improving the model’s ability to detect fuzzy cell boundaries. After being enhanced by these modules, the features are passed to the detection head for final prediction. The model’s loss function combines CIoU and distributional focal loss for bounding box regression, and binary cross-entropy and focal loss for classification. The following subsections describe each innovative component in detail.

5.1. Context-Aware Module

Distinguishing between clue cells and epithelial cells is challenging due to their morphological similarities and variations in bacterial distribution on their surfaces. The context-aware module (CAM) captures global contextual information from image features, enhancing the model’s ability to differentiate between these cell types by analyzing the spatial distribution of attached bacteria. Additionally, CAM is lightweight and can be integrated into multiple network layers, significantly improving detection accuracy while imposing minimal computational overhead.

CAM processes image feature maps through a series of structured operations. A

1 \times 1

convolution adjusts the number of channels without altering spatial dimensions, followed by Softmax normalization to convert outputs into probability distributions. The Softmax output is used to generate a weighted feature representation by performing element-wise multiplication with the input feature map. A second convolution refines the channel dimensions, followed by layer normalization (LN) to stabilize training. A third convolution processes the feature map, after which the output is transformed into probability values between 0 and 1 using the Sigmoid activation function. The probability values are applied element-wise to the weighted input feature map, enhancing important channel features. Finally, the weighted results are added to the original input feature map, generating an output feature map rich in global contextual information (Figure 4).

5.2. Shuffle Global Attention Mechanism

We propose the shuffle global attention mechanism (SGAM) to guide the network in focusing on cell contour features. The module structure is illustrated in Figure 5. SGAM enhances the model’s ability to focus on key features by integrating channel attention and spatial attention, and incorporates channel shuffle at the end to promote more effective channel mixing and improve the network’s expressive capacity.

The input feature map of SGAM is denoted as

X \in R^{C \times H \times W}

, where

C

,

H

, and

W

represent the number of channels, height, and width, respectively. The channel attention first transposes the input tensor from

C \times H \times W

to

H \times W \times C

to facilitate the calculation of attention weights. Then, two fully connected layers reduce the number of channels and restore to

C

, allowing the model to learn channel importance. Next, the channel attention weights are rearranged into a

C \times H \times W

shape and multiplied element-wise with the original input. The channel attention weights are computed as

A_{c} = σ (W_{2} \cdot δ (W_{1} \cdot P (X))),

(4)

where

P (\cdot)

represents permutation operation,

W_{1}

and

W_{2}

represent fully connected layers, and

δ (\cdot)

and

σ (\cdot)

denote the ReLU and Sigmoid activation functions, respectively. The output of the channel attention is obtained as

X_{C A} = X ⊙ P (A_{c}

), where ⊙ denotes element-wise multiplication.

Given that clue cells are relatively large targets in images, the use of

7 \times 7

convolution instead of smaller kernels in spatial attention can have a wider receptive field and better capture regional cell contour features, which is crucial for distinguishing clue cells and surrounding epithelial cells. The second

7 \times 7

convolution further refines the extracted features, enhances spatial information representation, and generates the spatial attention map. In addition, we have designed channel shuffle for spatial attention maps that can promote cross-channel information exchange, effectively utilize the dependency relationships between channels, avoid information redundancy or loss, and make it easier to focus attention on areas with strong pathological features, thereby effectively improving sensitivity. The spatial attention weights are computed as

A_{s} = σ (f_{2} (BN (δ (f_{1} (X_{C A}))))),

(5)

where

f_{1} (\cdot)

and

f_{2} (\cdot)

denote

7 \times 7

convolutions,

δ (\cdot)

is the ReLU function, BN is batch normalization, and

σ (\cdot)

is the Sigmoid function. The final output features after spatial enhancement are computed by

X_{S A} = X_{C A} ⊙ S (A_{s})

, where

S (\cdot)

denotes channel shuffle.

5.3. Loss Function

While the standard YOLOv8 loss function is effective for general object detection, it is not optimized for the severe class imbalance inherent in clinical clue cell samples, where normal epithelial cells vastly outnumber clue cells. To address this, we enhance the classification loss component of the total loss function.

The total loss of our CA-YOLO model is defined as

L_{t o t a l} = L_{c l s_e n h a n c e d} + L_{b b o x},

(6)

where the regression loss

L_{b b o x}

remains the same as in the standard YOLOv8 (Equation (3)), but the classification loss is specifically enhanced.

Our key modification is the integration of focal loss into the classification loss, resulting in an enhanced loss term

L_{c l s_e n h a n c e d}

:

L_{c l s_e n h a n c e d} = L_{B C E} + L_{F L},

(7)

where

L_{B C E}

is binary cross-entropy loss (Equation (2)). Focal loss adjusts the weights of minority samples that are difficult to classify, forcing the model to pay more attention to clue cells. Focal loss is defined as

L_{F L} (p_{t}) = - α_{t} {(1 - p_{t})}^{γ} log p_{t},

(8)

where

p_{t}

is the model’s estimated probability for the ground-truth class. The focusing parameter

γ

down-weights the loss assigned to well-classified examples (e.g., easy-to-identify epithelial cells), while the balancing parameter

α_{t}

directly addresses class imbalance by assigning a higher weight to the minority class. By incorporating focal loss, our model becomes significantly more sensitive to the challenging clue cells, which is critical for achieving reliable automated BV detection.

6. Experiments

6.1. Setup

This subsection outlines the experimental setup for evaluating the performance of the proposed method, including the dataset partitioning strategy, preprocessing steps, and training and testing configurations, as well as the experimental design and evaluation metrics.

6.1.1. Implementation Details

In order to conduct a robust and unbiased evaluation of the model, we adopted a two-step data splitting strategy. First, the 2405 images were randomly divided at the patient level into a training set (80%) and a test set (20%), ensuring that the test set remained isolated throughout the development process. The training set was then further divided into a training set (70%) and a validation set (30%). In the end, 1352 images were used for training, 576 for validation, and 477 for testing. All subsequent experiments used the same training and test sets.

To address the limited training data and class imbalance, we employed data augmentation strategy during training. This included geometric transformations (random rotation, scaling, flipping, translation, and mild shearing) and compound augmentations (mosaic, MixUp, and Copy-paste). Data augmentation was applied exclusively to the training set, leaving the validation and test sets unaltered, which only increases the training cost but not the inference cost [47].

In the training stage, we applied random search to optimize three key hyper-parameters: learning rate, batch size, and weight decay [48]. The hyper-parameter search space we defined is shown in Table 1, and we conducted 20 independent training experiments by randomly sampling combinations from this space. The model’s F1 score on the validation set was used as the primary evaluation metric. The experimental results indicate that the best performance was achieved with a learning rate of 0.001, a batch size of 64, and a weight decay of 0.0005. This configuration was used as the default setting in all subsequent experiments. The model was trained using the Adam optimizer for 500 epochs. To mitigate overfitting, we implemented an early stopping strategy.

To ensure the stability and reliability of findings, all experiments on the hold-out test set were repeated five times with different random seeds, and the mean and standard deviation were calculated as the final performance metrics. Furthermore, to validate that the observed improvements are statistically significant, we conducted statistical analysis, including Bootstrap resampling and McNemar’s test, which is presented in Section 6.5.2.

All experiments were conducted on a machine equipped with a Tesla V100 GPU. The training framework used YOLOv8-s and was implemented using PyTorch 1.11.0 and Python 3.11.5.

6.1.2. Experimental Design

We first conducted a model analysis, focusing on several key factors: the impact of the hyper-parameter

γ

in focal loss, the optimal insertion position of the SGAM, and the comparison of the effects of SGAM with other attention mechanisms (such as Squeeze-and-Excitation block (SE) [36], Convolutional Block Attention Module (CBAM) [40], Efficient Channel Attention module (ECA) [49], and Simple Attention Module (SimAM) [50]). Then, we analyzed the relationship between the loss function, epochs and metrics during training. To further verify the performance of the model, we conducted ablation studies to evaluate the contribution of each component of CAM, SGAM, and focal loss.

Subsequently, we conducted comparison studies of CA-YOLO with various object detection methods. The comparison models include YOLOv11-s [51]; CA-YOLO (v11), which replaces the baseline of CA-YOLO with YOLOv11-s; and cell detection models such as MSP-YOLO [25], CAF-YOLO [16], and YOLO-TCT [17]. In order to demonstrate the performance of the model, we selected representative samples to visualize the detection results, plotted the ROC curves and PR curves of the model, and performed statistical validation with the baseline to prove the effectiveness of CA-YOLO.

Finally, we tested the generalization ability of the model on the public blood cell count and detection dataset (BCCD) [52]. The BCCD dataset contains 364 microscope images of 640 × 480 pixels, covering cell types such as red blood cells (RBCs), white blood cells (WBCs), and platelets, with a total of 4888 labels. The dataset is divided into a training set (255 images), a validation set (73 images), and a test set (36 images). On this dataset, we use object-level evaluation metrics such as precision, recall, and F1 score for evaluation.

6.1.3. Evaluation Metrics

We selected four metrics to evaluate clue cell detection in fluorescence microscopy images: Sensitivity (SEN), Specificity (SPE), Accuracy (ACC), and F1 score (F1). SEN reflects the model’s ability to correctly identify positive instances, with a higher SEN indicating a reduced rate of missed detections. SPE measures the proportion of true negatives accurately identified, where a higher SPE corresponds to fewer false positives. ACC assesses the overall classification performance of the model across both positive and negative classes. The F1 score is the harmonic mean of precision and recall, taking into account both missed detections and false detections, and is particularly useful for dealing with class imbalance. SEN, SPE, and ACC assess detection performance at the image level, while the F1 score evaluates performance at the object level. We diagnosed a true positive (TP) when the clue cells occupied more than 20% of the area of both epithelial cells and clue cells in a single image. The formulas for SEN, SPE, ACC, and F1 score are defined as follows:

S E N = \frac{T P}{T P + F N},

(9)

S P E = \frac{T N}{T N + F P},

(10)

A C C = \frac{T P + T N}{T P + T N + F P + F N},

(11)

F 1 s c o r e = 2 \times \frac{P r e c i s i o n \times R e c a l l}{P r e c i s i o n + R e c a l l},

(12)

P r e c i s i o n = \frac{T P}{T P + F P},

(13)

R e c a l l = \frac{T P}{T P + F N},

(14)

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.

6.2. Model Analysis

6.2.1. Hyper-Parameter $γ$

In the CA-YOLO model, the focal loss function plays a key role in addressing the class imbalance problem in the clue cell detection. The performance of focal loss is mainly modulated by two hyper-parameters: the balancing factor

α

and the focusing parameter

γ

. We introduce the setting strategy of

α

and analyze the impact of

γ

on the model performance.

The

α

is used to statically balance the importance of different classes. We determine the weight of each class based on the inverse frequency of the corresponding class in the training set. This data-driven approach assigns higher weights to rarer clue cells, allowing the model to pay more attention to them during training. To ensure fair and consistent experimental evaluation, the pre-computed

α

remains unchanged in all subsequent experiments. The

γ

enables the model to focus on misclassified samples by dynamically adjusting the loss contribution of easy and difficult samples. To explore its optimal value, we conducted a sensitivity analysis. We trained the full CA-YOLO model with different

γ

values ranging from 0.1 to 3.0, keeping all other settings and

α

unchanged. Figure 6 shows the changes in SEN and F1 score on clue cells under different

γ

values of the model. As shown in Figure 6, as the

γ

increases from 0.1 to 1.5, the SEN and F1 score increase steadily and reach a peak at

γ = 1.5

. This shows that a moderate focusing effect is beneficial to our task, which can effectively reduce the loss from easy-to-classify background samples and help the model prioritize more difficult samples, thereby improving its precision without significantly affecting the recall. However, when the

γ

increases further, the model performance begins to decline. This is because a large

γ

overly suppresses the loss of classified samples, which may hinder the overall learning process and stability of the model.

Based on this comprehensive analysis, we choose

γ = 1.5

as the optimal setting for the CA-YOLO model to ensure the best performance in comparative studies and ablation experiments.

6.2.2. Insertion Position of the SGAM

We conducted experiments to determine where to insert SGAM in the backbone network to obtain the best detection performance. “3_layers_P345” means applying the SGAM after P3, P4, and P5 in Figure 3; “1_layer_P5” means applying the SGAM solely after P5. The experiments in Table 2 show that placing SGAM only on the deepest feature map (P5) is better than applying it to all levels (P3, P4, P5). This is because global structural information such as cell contours should be captured from high-level semantic features, which have larger receptive fields. Applying SGAM to shallow high-resolution feature maps may introduce noise, forcing the model to search in irrelevant long-range dependencies, thereby degrading performance.

6.2.3. Different Attention Mechanisms

To validate the effectiveness of the proposed SGAM, we conducted comparative experiments between SGAM and four representative attention mechanisms: the SE [36] is the earliest and most widely adopted channel attention structure; the CBAM [40] models both channel and spatial attention sequentially and is commonly used in medical image tasks; the ECA [49] is a lightweight attention design that eliminates fully connected layers and reduces parameters; and the SimAM [50] is a parameter-free spatial attention approach inspired by neuroscience energy functions, known for its simplicity and efficiency. The results in Table 3 verify the superiority of SGAM. SGAM can more effectively simulate long-range dependencies through its permutation and enhanced global receptive field and accurately capture the subtle differences in cell edge morphology, while the channel shuffle component enhances information flow and avoids information bottlenecks.

6.3. Training Process Analysis

We explain the relationship between loss functions, evaluation metrics, and training epochs to demonstrate the stability and effectiveness of CA-YOLO. Figure 7 shows a visualization of the training process. The performance of CA-YOLO is evaluated at the image and object levels, and the image-level metrics SEN, SPE, and ACC are calculated once on the test set through an independent evaluation script after the final model is determined. Therefore, during the training process, we focus on the object-level evaluation metric the F1 score.

As shown in Figure 7a, both the training loss and the validation loss drop rapidly in the initial stage of training, indicating that the model effectively learns the features of the clue cells from the training data. After the stage, while the training loss continues to gradually decrease, the validation loss decreases slowly and gradually flattens out. This difference between the training and validation loss curves indicates that the model has reached its best generalization point and begins to show signs of slight overfitting on the training data. The validation performance curve in Figure 7b further confirms this observation. In the initial stage of training, the F1 score grows rapidly, corresponding to the initial sharp drop in loss, and gradually stabilizes after about 40–50 epochs. By analyzing the training process, we select the model corresponding to the highest F1 score on the validation set as the optimal model for clue cell detection, which ensures that the model used for the final evaluation has the best generalization ability on the test set.

6.4. Ablation Study

To verify the effectiveness of each module in the proposed CA-YOLO network, the following comparison experiments were conducted: removing the SGAM, the CAM, and the focal loss, respectively. The results are shown in Table 4. After removing SGAM, SEN decreased by 1.2%, which shows that the SGAM is effective in improving the detection ability of the model. After removing CAM, SEN decreased by 1.7% and F1 score decreased by 1.5%. This not only proves the effectiveness of CAM, but the extent of the SEN decrease suggests that CAM contributes most to the improvement of model performance in clue cell detection. After removing focal loss, SEN decreased by 1.3% and F1 score decreased by 2.3%. This proves that increasing the attention to a few samples can improve the sensitivity of clue cell detection. The ablation results prove that the SGAM, CAM, and focal loss strategies are indispensable. Each component contributes to the final performance, among which CAM is particularly prominent in improving sensitivity. The optimal detection performance is achieved under the synergy of these components.

6.5. Comparison Study

6.5.1. Comparison with Other Detection Methods

In Table 5, we compare the performance of CA-YOLO with various object detection methods in clue cell detection. The comparison models include YOLOv11-s [51]; CA-YOLO (v11), with YOLOv11-s replacing the CA-YOLO baseline; and cell detection models such as MSP-YOLO [25], CAF-YOLO [16], and YOLO-TCT [17]. All models are trained and tested on our clue cell dataset under the same conditions. MSP-YOLO guides the network to learn small feature differences through super-resolution reconstruction, which improves the performance of small object detection. CAF-YOLO introduces the channel attention and spatial attention fusion module to effectively enhance the accuracy of blood cell detection. YOLO-TCT solves the class imbalance problem through an improved loss function, improving the performance of long-tail cervical cell detection. From the results, it can be seen that after replacing the CA-YOLO baseline with YOLOv11-s, the algorithms effect is not much different, which shows that the effectiveness of the module designed for clue cell features is not limited to the specific architecture of YOLOv8-s, but shows good generalization ability and consistent performance improvement on different backbone networks. Compared with other cell detection algorithms, CA-YOLO performs best in terms of SEN, reaching 0.778, while maintaining high SPE and ACC. This proves that the model is effective in identifying clue cells and reducing missed detections and is also superior in comprehensive performance. It is a detection model that is more suitable for this task. The visualization results of different detection models are shown on Figure 8.

In order to conduct a more comprehensive and threshold-independent evaluation of the CA-YOLO model, we plotted the Receiver Operating Characteristic (ROC) curves and Precision–Recall (PR) curves on the test set, as shown in Figure 9. As can be seen from the ROC curve, the Area Under Curve (AUC) values of all models are high and very close together, indicating that these detection models can effectively distinguish clue cells from the background. However, since the ROC curve is insensitive to a large number of true negative samples (i.e., correctly identified background areas) in the dataset, its discrimination in class imbalance scenarios is relatively limited. Considering that our task is to detect clue cells, a sparse target that accounts for a small proportion of the image and is easily confused with the background or other cells, the PR curve, as the gold standard for measuring the performance of the model in identifying positive samples, is more advantageous in this scenario. As can be seen from the PR curve, the CA-YOLO model achieved the highest AUPRC value of 0.724. More importantly, the PR curve of the CA-YOLO model is above other curves in most recall intervals, which shows that CA-YOLO can complete the task with a lower false detection rate, whether under a conservative detection strategy (low recall) that requires high precision or a comprehensive search strategy (high recall) that pursues a high detection rate. This feature is crucial for clinical auxiliary diagnosis application scenarios that require high reliability, because it directly affects whether accurate and reliable reference information can be provided to doctors.

In summary, CA-YOLO performs well in clue cell detection, especially in terms of PR curve performance, indicating that it can better handle the challenges of class imbalance and improve recall while ensuring high precision. This proves the reliability of CA-YOLO in the clinic.

6.5.2. Statistical Validation

To verify that the performance advantage of the CA-YOLO model over the baseline model YOLOv8-s is not due to random factors, we introduced two statistical test methods for analysis. First, we applied the Bootstrap resampling test to estimate the 95% confidence interval (CI) of the difference in accuracy between the two models. This method can quantify the credibility of the observed performance improvement. The CA-YOLO model has an accuracy of 90.20%, while the baseline model has an accuracy of 85.61%, with an average difference of +4.59%. Second, we used McNemar’s test. This test determines whether there is a significant difference in the error rates of the two models by analyzing samples where the prediction results of the two models are inconsistent (i.e., one model is correct and the other is wrong). In 477 test images, the CA-YOLO model corrected 34 errors of the baseline model and only introduced 13 new errors. This significant asymmetry indicates that the CA-YOLO model performs better. The key results of the two tests are summarized in Table 6.

As shown in Table 6, both statistical test methods provide strong evidence. The results of the Bootstrap test show that the 95% confidence interval of the CA-YOLO model’s accuracy improvement is [+1.75%, +7.43%]. Since this interval is completely above zero, it can be concluded that the performance advantage of the CA-YOLO model is statistically significant. The p-value obtained by McNemar’s test is 0.0035 (i.e.,

p < 0.01

), which enables us to reject the null hypothesis that “the error rates of the two models are the same”, further confirming that the performance of the CA-YOLO model is significantly better than that of the baseline model.

In summary, these statistical analysis results jointly prove that the superiority of the CA-YOLO model in the clinical clue cell detection task is real and robust, rather than an accidental experimental result.

6.6. Cross-Dataset Generalization Validation

In order to verify the generalization ability and robustness of the proposed CA-YOLO model under different tasks and data distributions, we selected the commonly used public BCCD dataset [52] for experiments and compared it with the mainstream detection model YOLOv8-s and the blood cell detection model BC-YOLO [53].

The detection results are shown in Table 7. The CA-YOLO model has a precision of 0.553, a recall of 0.926, and an F1 score of 0.693 on the BCCD dataset, all of which are higher than YOLOv8-s, indicating that its overall detection performance is better. In contrast, as a model optimized specifically for blood cell detection, BC-YOLO performs best in all three metrics. However, CA-YOLO can achieve performance close to that of a dedicated model in cross-dataset testing, which proves that our proposed model can effectively extract common cell features rather than just fitting the patterns of a specific dataset, fully demonstrating its good generalization ability and robustness.

In summary, the CA-YOLO model is not only competitive in the clue cell detection task, but also can achieve better results than mainstream models in blood cell detection tasks with different data distributions, thus verifying its practical value in cell detection tasks.

7. Discussions

We compare the proposed CA-YOLO model with the baseline model YOLOv11-s [51]. Although YOLOv11-s performs well in general tasks, it is less effective for the specific detection of clue cells. MSP-YOLO [25] focuses on super-resolution and is more suitable for detecting small targets; CAF-YOLO [16] adopts a CNN–Transformer hybrid method but is mainly used for blood cell detection; YOLO-TCT [17] combines Hard Polarized Self Attention (HPSA) and an improved loss function to perform better in long-tail neck cell detection. CA-YOLO performs particularly well in clue cell detection by simultaneously modeling cell surface contextual features and edge morphological features. CA-YOLO has the highest SEN but slightly lower SPE and ACC. In contrast, YOLOv11-s adopts a conservative strategy, leading to a higher SPE but a lower SEN. This trade-off is not a defect of the model, but meets the needs of clinical practice. In medical diagnosis, the risk of missed diagnosis is usually greater than that of misdiagnosis. Therefore, CA-YOLO could be a more effective auxiliary diagnostic tool in clinical screening while ensuring high sensitivity. Figure 8 shows the detection results of these models. As shown in Figure 8, in confusing scenes (row 1), most of the comparison models made mistakes for clue cells that are easily misjudged as epithelial cells, while CA-YOLO can identify them with a high confidence of 0.90. In scenes with dense cells and unclear features (rows 2–3), other models generally encounter problems involving missed detection, false detection, or low confidence, while CA-YOLO can provide correct and high-confidence detection results.

A key finding of our study is that while SEN improved significantly, the F1 score was only 0.503. This is due to a trade-off between precision and recall. This trade-off is not a flaw in the model, but rather a response to the clinical need to prioritize reducing missed diagnoses. In diagnostic screening, missed diagnoses are far more severe than false positives, leading to higher false positive rates (lower precision) because some epithelial cells with ambiguous features may be misclassified as clue cells. The high sensitivity of the CA-YOLO model allows the majority of potential clue cells to be flagged for clinician review. While this may include some false positives, this aligns with the primary goal of this screening tool: minimizing the risk of missing potential bacterial vaginosis cases.

The practical application of CA-YOLO lies in its integration into clinical diagnostic workflows. Regarding efficiency, our model achieved an inference speed of 88 FPS on a Tesla V100 GPU. Efficient inference, enabling rapid analysis of entire digital slides, which may contain hundreds of cells, without noticeable latency, is crucial for clinical applications. Potential workflow integration could enable CA-YOLO to serve as an automated pre-screening assistant. The system first scans the digital slide image and uses the CA-YOLO model to automatically detect and highlight all potential clue cells. These marked regions of interest are then presented to the expert, reducing manual search time and allowing them to focus on confirming diagnostically relevant cells, thereby improving the efficiency and reliability of bacterial vaginosis diagnosis.

8. Conclusions and Future Work

This paper proposes the CA-YOLO model, designed to address the significant challenge of automatically detecting clue cells in bacterial vaginosis diagnosis. By enhancing the YOLOv8 framework, we introduce three key innovations: the CAM captures the surface texture of bacterially adhered cells, the SGAM focuses on the morphology of ambiguous cell edges, and focal loss addresses the severe class imbalance. Experimental results demonstrate that CA-YOLO achieves a significant improvement in sensitivity, thereby providing a reliable tool for clinical screening by reducing missed diagnoses.

Despite its strong performance in sensitivity, we acknowledge key limitations. As discussed, the modest F1 score reflects a necessary trade-off between precision and recall, and the model’s robustness could be further enhanced by training on more diverse, multi-center datasets.

To address these points, our future work will proceed in two main directions. First, we will investigate advanced loss functions and post-processing techniques to improve the F1 score by achieving a better balance between precision and recall, without compromising sensitivity. Second, we will focus on expanding our dataset to enhance the model’s generalization capabilities. These efforts will be directed towards developing a more balanced, robust, and clinically deployable automated system for bacterial vaginosis diagnosis.

Author Contributions

Methodology, C.C. and X.C.; Validation, C.C.; Formal analysis, F.L.; Writing—original draft, C.C.; Writing—review and editing, X.C. and L.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Key Science and Technology Program of Shaanxi Province, China (2024GXYBXM-036).

Institutional Review Board Statement

This study was conducted using fully anonymized vaginal fluorescence microscopy images without any patient-identifiable information. All data utilized in this research were obtained in compliance with relevant legal and ethical guidelines, and informed consent was obtained from all participants. The research was carried out in accordance with the principles of the Declaration of Helsinki and adhered to all applicable local regulatory requirements.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

Data will be made available upon reasonable request to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Swidsinski, S.; Moll, W.M.; Swidsinski, A. Bacterial vaginosis—Vaginal polymicrobial biofilms and dysbiosis. Dtsch. Ärzteblatt Int. 2023, 120, 347. [Google Scholar] [CrossRef] [PubMed]
National Bureau of Statistics of China. China Health Statistical Year Book. Technical Report. 2021. Available online: https://www.zgtjnj.org/navibooklist-n3020013080-1.html (accessed on 14 December 2021).
Romero-Morelos, P.; Bandala, C.; Jiménez-Tenorio, J.; Valdespino-Zavala, M.; Rodríguez-Esquivel, M.; Gama-Ríos, R.A.; Bandera, A.; Mendoza-Rodríguez, M.; Taniguchi, K.; Marrero-Rodríguez, D.; et al. Vaginosis-associated bacteria and its association with HPV infection. Med. Clínica (Engl. Ed.) 2019, 152, 1–5. [Google Scholar] [CrossRef]
Meng, L.; Xue, Y.; Yue, T.; Yang, L.; Gao, L.; An, R. Relationship of HPV infection and BV, VVC, TV: A clinical study based on 1261 cases of gynecologic outpatients. Zhonghua Fu Chan Ke Za Zhi 2016, 51, 730–733. [Google Scholar]
Li, H.; Dong, M.; Xie, W.; Qi, W.; Teng, F.; Li, H.; Yan, Y.; Wang, C.; Han, C.; Xue, F. Mixed vaginitis in the third trimester of pregnancy is associated with adverse pregnancy outcomes: A cross-sectional study. Front. Cell. Infect. Microbiol. 2022, 12, 798738. [Google Scholar] [CrossRef]
Demirezen, Ş.; Mutlu, A.B.S. Bacterial Vaginosis Bacterial and Epithelial Cell Adhesion Molecules. Gynecol. Obstet. Reprod. Med. 2013, 19, 58–62. [Google Scholar]
Sachdeva, S. Clue cell. Indian J. Dermatol. Venereol. Leprol. 2006, 72, 392. [Google Scholar] [CrossRef]
Martínez-Figueroa, C.; Estrada-Moreno, A.K.; Vences-Velázquez, A.; Cortés-Sarabia, K. One-Step Staining Method for the Identification of Clue Cells and Bacterial Morphotypes Associated with Bacterial Vaginosis. Microbiol. Spectr. 2022, 10, e01927-21. [Google Scholar] [CrossRef]
Song, X.; Xie, H.; Gao, T.; Cheng, N.; Gou, J. Improved YOLO-Based Pulmonary Nodule Detection with Spatial-SE Attention and an Aspect Ratio Penalty. Sensors 2025, 25, 4245. [Google Scholar] [CrossRef]
Xiong, M.; Wu, A.; Yang, Y.; Fu, Q. Efficient Brain Tumor Segmentation for MRI Images Using YOLO-BT. Sensors 2025, 25, 3645. [Google Scholar] [CrossRef] [PubMed]
Pan, Y.; Zhang, Z.; Zhang, X.; Zeng, Z.; Tian, Y. YOLO-TARC: YOLOv10 with Token Attention and Residual Convolution for Small Void Detection in Root Canal X-Ray Images. Sensors 2025, 25, 3036. [Google Scholar] [CrossRef] [PubMed]
Xie, Y.; Xing, F.; Shi, X.; Kong, X.; Su, H.; Yang, L. Efficient and robust cell detection: A structured regression approach. Med. Image Anal. 2018, 44, 245–254. [Google Scholar] [CrossRef]
Tofighi, M.; Guo, T.; Vanamala, J.K.; Monga, V. Prior information guided regularized deep learning for cell nucleus detection. IEEE Trans. Med. Imaging 2019, 38, 2047–2058. [Google Scholar] [CrossRef]
Wen, T.; Wu, H.; Du, Y.; Huang, C. Faster R-CNN with improved anchor box for cell recognition. Math. Biosci. Eng. 2020, 17, 7772–7786. [Google Scholar] [CrossRef]
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
Chen, Z.; Lu, S. Caf-yolo: A robust framework for multi-scale lesion detection in biomedical imagery. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5. [Google Scholar]
Lv, D.; Yi, L.; Liu, L.; Chen, Y.; Chen, X.; Liu, R. YOLO-TCT: An Effective Network For Long-Tailed Cervical Cell Detection. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5. [Google Scholar] [CrossRef]
Gu, W.; Sun, K. Improved YOLOv5 based on attention mechanism for blood cell detection. Biomed. Signal Process. Control 2024, 88, 105034. [Google Scholar] [CrossRef]
Zhang, D.; Bu, Y.; Chen, Q.; Cai, S.; Zhang, Y. TW-YOLO: An innovative blood cell detection model based on multi-scale feature fusion. Sensors 2024, 24, 6168. [Google Scholar] [CrossRef]
Al-Hamadani, M.N.; Poroszlay, R.; Szeman-Nagy, G.; Hajdu, A.; Hadjidemetriou, S.; Ferrarini, L.; Harangi, B. Improving Cell Detection and Tracking in Microscopy Images Using YOLO and an Enhanced DeepSORT Algorithm. Sensors 2025, 25, 4361. [Google Scholar] [CrossRef]
Wu, B.; Feng, S.; Jiang, S.; Luo, S.; Zhao, X.; Zhao, J. EB-YOLO: An efficient and lightweight blood cell detector based on the YOLO algorithm. Comput. Biol. Med. 2025, 192, 110288. [Google Scholar] [CrossRef] [PubMed]
Haq, I.; Mazhar, T.; Asif, R.N.; Ghadi, Y.Y.; Ullah, N.; Khan, M.A.; Al-Rasheed, A. YOLO and residual network for colorectal cancer cell detection and counting. Heliyon 2024, 10, e24403. [Google Scholar] [CrossRef]
Guo, Y.; Yu, X.; Li, J. A classification method of epithelial cells and clue cells based on multi-scale texture analysis. In Proceedings of the 2016 9th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Datong, China, 15–17 October 2016; pp. 432–436. [Google Scholar]
Yu, Y.; Wang, J. Convexified coupled active contour segmentation of clue cells for assessing bacterial vaginosis with immunofluorescence microscopy. In Proceedings of the 2022 56th Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 31 October–2 November 2022; pp. 307–311. [Google Scholar]
Chen, X.; Zheng, H.; Tang, H.; Li, F. Multi-scale perceptual YOLO for automatic detection of clue cells and trichomonas in fluorescence microscopic images. Comput. Biol. Med. 2024, 175, 108500. [Google Scholar] [CrossRef] [PubMed]
Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
Liu, C.; Wu, Z.; Wang, B.; Zhu, M. Pulmonary nodule detection in x-ray images by feature augmentation and context aggregation. Phys. Med. Biol. 2024, 69, 045002. [Google Scholar] [CrossRef]
Wu, T.; Tang, S.; Zhang, R.; Cao, J.; Zhang, Y. CGNet: A light-weight context guided network for semantic segmentation. IEEE Trans. Image Process. 2020, 30, 1169–1179. [Google Scholar] [CrossRef]
Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
Cui, Y.; Zhu, J.; Chen, L.; Zhang, G.; Gao, S. MGA-Net: Multiscale global feature aggregation network for arteriovenous classification. Signal, Image Video Process. 2024, 18, 5563–5577. [Google Scholar] [CrossRef]
Itti, L.; Koch, C.; Niebur, E. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 20, 1254–1259. [Google Scholar] [CrossRef]
Jia, Z.; Huang, T.; Li, X.; Bian, Y.; Wang, F.; Yuan, J.; Xu, G.; Yang, J. DBAII-Net with multiscale feature aggregation and cross-modal attention for enhancing infant brain injury classification in MRI. Phys. Med. Biol. 2024, 69, 205017. [Google Scholar] [CrossRef] [PubMed]
Gong, H.; Mu, T.; Li, Q.; Dai, H.; Li, C.; He, Z.; Wang, W.; Han, F.; Tuniyazi, A.; Li, H.; et al. Swin-transformer-enabled YOLOv5 with attention mechanism for small object detection on satellite images. Remote Sens. 2022, 14, 2861. [Google Scholar] [CrossRef]
Li, H.; Qiu, K.; Chen, L.; Mei, X.; Hong, L.; Tao, C. SCAttNet: Semantic segmentation network with spatial and channel attention mechanism for high-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2020, 18, 905–909. [Google Scholar] [CrossRef]
Dong, M.; Fang, Z.; Li, Y.; Bi, S.; Chen, J. AR3D: Attention residual 3D network for human action recognition. Sensors 2021, 21, 1656. [Google Scholar] [CrossRef]
Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
Zubair, M.; Owais, M.; Hassan, T.; Bendechache, M.; Hussain, M.; Hussain, I.; Werghi, N. An interpretable framework for gastric cancer classification using multi-channel attention mechanisms and transfer learning approach on histopathology images. Sci. Rep. 2025, 15, 13087. [Google Scholar] [CrossRef]
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Computer Vision—ECCV 2020; Springer: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
Yang, L.; Zhang, W.; Wang, S.; Yu, X.; Jing, B.; Sun, N.; Sun, T.; Wang, W. FED-UNet++: An Improved Nested UNet for Hippocampus Segmentation in Alzheimer’s Disease Diagnosis. Sensors 2025, 25, 5155. [Google Scholar] [CrossRef]
Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
Fang, C.; Liu, S.; Gao, B. MB-MSTFNet: A Multi-Band Spatio-Temporal Attention Network for EEG Sensor-Based Emotion Recognition. Sensors 2025, 25, 4819. [Google Scholar] [CrossRef]
Mo, H.; Yang, Z.; Li, P.; Wang, Q. The Method of Portrait Segmentation Based on Efficient Channel Attention Mechanism. In Proceedings of the 2024 4th International Conference on Computer Science and Blockchain (CCSB), Shenzhen, China, 6–8 September 2024; pp. 380–385. [Google Scholar]
Lu, J.; Luo, X. Image Super-resolution with Enhanced Channel Attention Residual Network. In Proceedings of the 2023 8th International Conference on Intelligent Computing and Signal Processing (ICSP), Xi’an, China, 21–23 April 2023; pp. 1650–1654. [Google Scholar]
Verma, H.; Patro, B. Enhanced Fetal Ultrasound Image Segmentation using Spatial Attention Mechanisms with UNet: SAUnet. In Proceedings of the 2024 IEEE International Conference on Computer Vision and Machine Intelligence (CVMI), Prayagraj, India, 19–20 October 2024; pp. 1–6. [Google Scholar]
Liu, Y. Study on human pose estimation based on channel and spatial attention. In Proceedings of the 2023 3rd International Conference on Consumer Electronics and Computer Engineering (ICCECE), Guangzhou, China, 6–8 January 2023; pp. 47–50. [Google Scholar]
Xiao, B.; Nguyen, M.; Yan, W.Q. Fruit ripeness identification using YOLOv8 model. Multimed. Tools Appl. 2024, 83, 28039–28056. [Google Scholar] [CrossRef]
Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
Bergstra, J.; Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 2012, 13, 281–305. [Google Scholar]
Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
Yang, L.; Zhang, R.Y.; Li, L.; Xie, X. Simam: A simple, parameter-free attention module for convolutional neural networks. Proc. Mach. Learn. Res. 2021, 139, 11863–11874. [Google Scholar]
Ultralytics. YOLOv11. 2024. Available online: https://docs.ultralytics.com/models/yolo11/ (accessed on 11 June 2025).
Roboflow. BCCD: Blood Cell Count and Detection Dataset. 2025. Available online: https://public.roboflow.com/object-detection/bccd (accessed on 29 June 2025).
Yurdakul, M.; Taşdemir, Ş. BC-YOLO: MBConv-ECA based YOLO framework for blood cell detection. Signal Image Video Process. 2025, 19, 712. [Google Scholar] [CrossRef]
Ultralytics. YOLOv8. 2023. Available online: https://docs.ultralytics.com/models/yolov8/ (accessed on 5 February 2024).

Figure 1. Visual comparison of a clue cell and an epithelial cell in a fluorescence microscopy image. (a) Magnified view focusing on the edge difference. (b) Magnified view focusing on the surface texture difference. The red boxes indicate the clue cell, while the blue boxes indicate the epithelial cell.

Figure 2. The overall workflow of the proposed CA-YOLO model.

Figure 3. Overview of the proposed CA-YOLO framework.

Figure 4. The details of the context-aware module.

Figure 5. The details of shuffle global attention mechanism.

Figure 6. Impact of

γ

on F1 score and SEN.

Figure 6. Impact of

γ

on F1 score and SEN.

Figure 7. CA-YOLO training process analysis.

Figure 8. The visualization results of different detection models.

Figure 9. Performance comparison of different models on ROC and PR curves.

Table 1. Hyper-parameter search space and selected best values for CA-YOLO training.

Hyper-Parameters	Search Space	Best Value
learning rate	[0.0001, 0.0005, 0.001, 0.005, 0.01]	0.001
batch size	[16, 32, 64]	64
weight decay	[0.0, 0.0005, $1 \times 10^{- 5}$ , $1 \times 10^{- 4}$ , $1 \times 10^{- 3}$ ]	0.0005

Table 2. Performance comparison of placing the SGAM module at different locations.

Model Structures	SEN	SPE	ACC	F1
3_layers_P345	0.765 ± 0.020	0.929 ± 0.025	0.903 ± 0.018	0.442 ± 0.015
1_layer_P5	0.778 ± 0.036	0.941 ± 0.016	0.902 ± 0.016	0.503 ± 0.025

Table 3. Comparison of different attention mechanisms integrated into the CA-YOLO model.

Attention Modules	SEN	SPE	ACC	F1
SE [36]	0.762 ± 0.051	0.923 ± 0.015	0.894 ± 0.017	0.445 ± 0.023
CBAM [40]	0.728 ± 0.051	0.939 ± 0.005	0.901 ± 0.008	0.455 ± 0.019
ECA [49]	0.709 ± 0.071	0.930 ± 0.013	0.891 ± 0.005	0.458 ± 0.022
SimAM [50]	0.714 ± 0.049	0.926 ± 0.022	0.888 ± 0.013	0.444 ± 0.024
SGAM (proposed)	0.778 ± 0.036	0.941 ± 0.016	0.902 ± 0.016	0.503 ± 0.025

Table 4. Ablation study of the proposed CA-YOLO model components.

Model Structures	SEN	SPE	ACC	F1
w/o SGAM	0.766 ± 0.036	0.940 ± 0.014	0.908 ± 0.011	0.493 ± 0.011
w/o CAM	0.761 ± 0.031	0.936 ± 0.008	0.904 ± 0.009	0.488 ± 0.014
w/o focal loss	0.765 ± 0.045	0.939 ± 0.022	0.908 ± 0.014	0.480 ± 0.025
CA-YOLO	0.778 ± 0.036	0.941 ± 0.016	0.902 ± 0.016	0.503 ± 0.025

Table 5. Performance comparison with other object detection methods.

Models	SEN	SPE	ACC
YOLOv11-s [51]	0.686 ± 0.052	0.963 ± 0.009	0.904 ± 0.010
MSP-YOLO [25]	0.706 ± 0.029	0.961 ± 0.012	0.900 ± 0.006
CAF-YOLO [16]	0.531 ± 0.146	0.936 ± 0.013	0.860 ± 0.024
YOLO-TCT [17]	0.543 ± 0.037	0.936 ± 0.012	0.867 ± 0.010
CA-YOLO (v11)	0.765 ± 0.042	0.941 ± 0.005	0.909 ± 0.003
CA-YOLO	0.778 ± 0.036	0.941 ± 0.016	0.902 ± 0.016

Table 6. Statistical tests for comparing CA-YOLO performance with baseline models.

Statistical Test	Metric	Result	Conclusion
Bootstrap Test	95% CI for Accuracy Difference	[+1.75%, +7.43%]	Significant (Interval does not contain 0)
McNemar’s Test	p-value	p = 0.0035 (<0.01)	Significant (p < 0.05)

Table 7. Comparison of detection performance on the BCCD dataset.

Models	Precision	Recall	F1
YOLOv8-s [54]	0.537	0.898	0.672
BC-YOLO [53]	0.901	0.941	0.920
CA-YOLO	0.553	0.926	0.693

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Cui, C.; Chen, X.; He, L.; Li, F. CA-YOLO: An Efficient YOLO-Based Algorithm with Context-Awareness and Attention Mechanism for Clue Cell Detection in Fluorescence Microscopy Images. Sensors 2025, 25, 6001. https://doi.org/10.3390/s25196001

AMA Style

Cui C, Chen X, He L, Li F. CA-YOLO: An Efficient YOLO-Based Algorithm with Context-Awareness and Attention Mechanism for Clue Cell Detection in Fluorescence Microscopy Images. Sensors. 2025; 25(19):6001. https://doi.org/10.3390/s25196001

Chicago/Turabian Style

Cui, Can, Xi Chen, Lijun He, and Fan Li. 2025. "CA-YOLO: An Efficient YOLO-Based Algorithm with Context-Awareness and Attention Mechanism for Clue Cell Detection in Fluorescence Microscopy Images" Sensors 25, no. 19: 6001. https://doi.org/10.3390/s25196001

APA Style

Cui, C., Chen, X., He, L., & Li, F. (2025). CA-YOLO: An Efficient YOLO-Based Algorithm with Context-Awareness and Attention Mechanism for Clue Cell Detection in Fluorescence Microscopy Images. Sensors, 25(19), 6001. https://doi.org/10.3390/s25196001

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

CA-YOLO: An Efficient YOLO-Based Algorithm with Context-Awareness and Attention Mechanism for Clue Cell Detection in Fluorescence Microscopy Images

Abstract

1. Introduction

2. Related Work

2.1. Cell Detection

2.2. Context Information Enhancement

2.3. Attention Mechanisms

3. Materials

4. Theory

4.1. The YOLOv8 Architecture

4.2. Standard Loss Function of YOLOv8

5. Method

5.1. Context-Aware Module

5.2. Shuffle Global Attention Mechanism

5.3. Loss Function

6. Experiments

6.1. Setup

6.1.1. Implementation Details

6.1.2. Experimental Design

6.1.3. Evaluation Metrics

6.2. Model Analysis

6.2.1. Hyper-Parameter γ

6.2.2. Insertion Position of the SGAM

6.2.3. Different Attention Mechanisms

6.3. Training Process Analysis

6.4. Ablation Study

6.5. Comparison Study

6.5.1. Comparison with Other Detection Methods

6.5.2. Statistical Validation

6.6. Cross-Dataset Generalization Validation

7. Discussions

8. Conclusions and Future Work

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

6.2.1. Hyper-Parameter $γ$