3.1. Performance Indicators
When evaluating network models, specific assessment criteria are typically required to quantify and compare the model’s performance. In this study, accuracy, precision, recall, and F1 score were selected as metrics to evaluate the model’s performance. The confusion matrix serves as an analytical tool that visualizes the model’s classification of the test samples, enabling the calculation of evaluation metrics based on these visualized data.
Accuracy is a key metric that measures how correctly a model performs its classification tasks. The specific formula for accuracy is given as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Precision reflects the proportion of truly positive samples among those that the model predicts as positive. The specific formula for precision is given as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Recall reflects how many actual positive samples are predicted as positive. The specific formula for recall is given as follows:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
where $TP$ (true positives) is the number of positive samples correctly identified by the model; $TN$ (true negatives) is the number of negative samples correctly identified; $FP$ (false positives) is the number of negative samples incorrectly identified as positive; and $FN$ (false negatives) is the number of positive samples incorrectly identified as negative.
The F1 score balances the model’s precision and recall, taking into account both the completeness (recall) and the correctness (precision) of the model’s predictions, providing a more comprehensive evaluation metric. The specific formula for the F1 score is given as follows:

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
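For illustration, the minimal sketch below (Python/NumPy) shows how all four metrics follow directly from the entries of a multi-class confusion matrix. The matrix values here are purely hypothetical and do not reproduce the study’s actual results.

```python
import numpy as np

def confusion_metrics(C):
    """Accuracy plus per-class precision, recall, and F1 from matrix C,
    where C[i, j] counts samples of true class i predicted as class j."""
    C = C.astype(float)
    tp = np.diag(C)             # correct predictions per class
    fp = C.sum(axis=0) - tp     # other classes predicted as this class
    fn = C.sum(axis=1) - tp     # this class predicted as another class
    accuracy = tp.sum() / C.sum()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# A purely hypothetical 4x4 matrix for the P, E, M, D stages.
C = np.array([[49, 0, 0, 1],
              [0, 50, 0, 0],
              [0, 1, 38, 11],
              [0, 0, 6, 44]])
acc, prec, rec, f1 = confusion_metrics(C)
print(f"accuracy = {acc:.4f}, macro F1 = {f1.mean():.4f}")
```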
3.2. Experimental Results
For the established training and validation sets, comparative experiments were conducted using nine models: EfficientNet, ResNet18, ResNet34, ResNet50, VGG16, MobileNetV2, GoogLeNet, DenseNet, and the proposed SLENet. Additionally, we introduced the Vision Transformer (ViT) as a comparative model to explore the applicability of the transformer architecture to this task. Although transformer-based models have achieved excellent performance in various computer vision tasks, in our study ViT showed a higher training loss and significantly lower validation accuracy than the CNN-based models, as shown in Figure 5 and Figure 6. This may be because the transformer architecture generally lacks the inductive biases inherent to CNNs, such as local receptive fields and translation invariance, which may limit its ability to extract effective features from relatively small datasets. In our study, due to the limited number of experimental animal samples, the dataset was insufficient to support the effective training of transformer-based models. Furthermore, the self-attention mechanism employed in transformers has quadratic computational complexity ($O(n^2)$ in the number of input tokens), leading to higher computational requirements during training. In contrast, convolutional operations are more efficient, making CNNs more suitable for conditions with limited computational resources. Therefore, the following analysis focuses on the performance of the CNN-based models on this task.
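The quadratic-versus-linear scaling argument can be made concrete with a back-of-the-envelope sketch. The sizes below (16 × 16 patches, embedding dimension 768, 64 channels) are illustrative assumptions, not the configurations used in this study.

```python
# Rough operation counts for one self-attention layer vs. one 3x3 conv layer.
def attention_ops(num_tokens: int, dim: int) -> int:
    # QK^T scores plus the attention-weighted sum of values:
    # two n x n x d products, hence quadratic in the token count n.
    return 2 * num_tokens * num_tokens * dim

def conv3x3_ops(side: int, cin: int, cout: int) -> int:
    # A 3x3 convolution visits each output pixel once: linear in pixels.
    return side * side * cin * cout * 9

for side in (224, 448):
    n = (side // 16) ** 2  # number of 16x16 patches (tokens)
    print(f"{side}px: attention {attention_ops(n, 768):.2e} ops, "
          f"conv3x3 {conv3x3_ops(side, 64, 64):.2e} ops")
# Doubling the image side quadruples the token count, so the attention
# term grows ~16x while the convolution term grows only ~4x.
```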
Figure 5 indicates that the fluctuations of EfficientNet and ResNet50 in the early training stages are more pronounced than those of the other models. After approximately 100 epochs, the validation accuracy of all models fluctuates less and begins to converge. Notably, SLENet demonstrates a clear advantage after about 60 epochs, achieving a validation accuracy exceeding 96%. This indicates an improvement in classification performance over commonly used convolutional neural networks. Importantly, SLENet exhibits smaller fluctuations in its accuracy curve throughout training, suggesting better generalization ability and stability than the other models.
Figure 6 presents the loss curves for each model on the training set. EfficientNet has the highest loss in the first 50 training epochs, while ResNet18 and ResNet34 perform well, with lower loss values than the other models. After about 120 epochs, the loss values of all models change minimally, indicating convergence. In the 120–130 epoch range, SLENet displays lower loss values than the other models.
To provide a more comprehensive evaluation of the effectiveness of SLENet compared to other convolutional neural networks in the classification task of the rat estrous cycle, we present the prediction results of each model on the test set using a confusion matrix, as shown in Figure 7.
The confusion matrix indicates that SLENet demonstrates the highest accuracy in identifying the estrus (E) stage, correctly classifying all images from this phase. It also shows strong performance in recognizing the P phase, with only one image misclassified as the D phase. Additionally, there were 6 and 12 misclassifications for the D and M phases, respectively. Compared with the other classification models, whose total misclassifications exceeded 20, SLENet shows a clear advantage.
Figure 8 shows the normalized confusion matrix data in the form of a bar chart. As can be seen, the proposed model achieves the highest accuracy across the E, D, and P stages, demonstrating comparatively superior generalization and robustness. Notably, SLENet does not achieve the top accuracy in the M stage, though it still maintains competitive performance, closely following the best-performing model. This slight drop can be attributed both to the transitional nature of the M stage and to its mixed cytological complexity: SECA emphasizes local details while the non-local module captures global distributions, and these mechanisms are not fully optimal for transitional phases in which local and global features are inconsistent, explaining the relative difficulty in accurately classifying the M stage.
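The per-stage accuracies behind such a bar chart are simply the diagonal of a row-normalized confusion matrix; a minimal sketch is given below, where the labels are random placeholders rather than the study’s test data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
y_true = rng.integers(0, 4, size=200)                     # stand-in labels
noise = rng.integers(0, 4, size=200)
y_pred = np.where(rng.random(200) < 0.9, y_true, noise)   # stand-in predictions

stages = ["P", "E", "M", "D"]
cm = confusion_matrix(y_true, y_pred, labels=range(4), normalize="true")
for stage, acc in zip(stages, np.diag(cm)):   # diagonal = per-stage accuracy
    print(f"stage {stage}: {acc:.3f}")
```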
To ensure the statistical reliability of our experimental results, we conducted each experiment five times with different random seeds under the same environment, reporting the mean and 95% confidence interval for each evaluation metric, and performed statistical significance testing between SLENet and the other models, as shown in Table 2, Table 3 and Table 4. Considering that precision and recall are often correlated in value and that the F1 score provides a balanced measure between them, we applied paired t-tests to calculate the p-values for each class based only on the F1 score and report the average value as an overall indicator.
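This protocol can be sketched with SciPy as follows; the five per-seed F1 scores below are invented purely for illustration.

```python
import numpy as np
from scipy import stats

slenet_f1 = np.array([0.961, 0.965, 0.958, 0.964, 0.966])
other_f1 = np.array([0.948, 0.953, 0.941, 0.950, 0.955])

# Mean ± 95% confidence interval via the Student's t distribution (n = 5).
mean = slenet_f1.mean()
half = stats.t.ppf(0.975, df=len(slenet_f1) - 1) * stats.sem(slenet_f1)
print(f"SLENet F1: {mean:.4f} ± {half:.4f}")

# Paired t-test: runs share random seeds, so the samples are paired.
t_stat, p_value = stats.ttest_rel(slenet_f1, other_f1)
print(f"paired t-test: t = {t_stat:.3f}, p = {p_value:.4f}")
```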
Based on the results, as shown in Figure 9, the average accuracy of SLENet is 96.31%, which is the highest among these models.
Table 2, Table 3 and Table 4 show that SLENet’s overall average precision, recall, and F1 score reached 96.27%, 96.30%, and 96.26%, respectively, with narrower confidence intervals (3.65%, 5.76%, and 4.18%), indicating excellent classification accuracy and robustness. Notably, in some cases SLENet shows slightly lower average precision and recall than the best-performing model, yet it consistently exhibits the smallest confidence intervals, indicating that its predictions are more stable and reliable. More importantly, SLENet achieves the highest F1 score in all four classes, meaning it balances precision and recall more effectively and demonstrates better overall performance in this task. Additionally, the results show that SLENet achieves statistical significance (p < 0.05) when compared with most of these models. Although the comparison with EfficientNet yields a p-value of 0.13, which does not meet the significance threshold, the proposed model still outperformed EfficientNet numerically in all classes, showing an overall advantageous performance trend.
To further evaluate the performance of SLENet in this multi-class classification task, Receiver Operating Characteristic (ROC) and precision–recall (PR) curves were generated, with the ROC curves constructed using the one-vs-rest (OvR) strategy. As shown in Figure 10 and Figure 11, the Area Under the Curve (AUC) and average precision (AP) for all classes exceed 0.99, showing that although the overall accuracy implies a few prediction errors, the model has excellent ranking capability and robust probability outputs. The few misclassifications did not significantly impact the model’s ability to distinguish between classes or to correctly identify positive samples.
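A minimal sketch of the OvR computation of AUC and AP is shown below; the labels and class probabilities are random placeholders standing in for the model’s actual outputs.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(0)
y_true = rng.integers(0, 4, size=200)
probs = rng.dirichlet(np.ones(4), size=200)           # stand-in softmax outputs

y_bin = label_binarize(y_true, classes=[0, 1, 2, 3])  # one column per stage
for k, stage in enumerate(["P", "E", "M", "D"]):
    auc = roc_auc_score(y_bin[:, k], probs[:, k])           # OvR ROC-AUC
    ap = average_precision_score(y_bin[:, k], probs[:, k])  # area under PR
    print(f"stage {stage}: AUC = {auc:.3f}, AP = {ap:.3f}")
```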
Analyzing the results from a biological perspective, the estrous cycle in rats is a dynamic process; therefore, during the construction of the dataset, transitional phases are inevitable. During these periods, vaginal cytology often contains a diverse range of cell types in large quantities, producing images rich in detailed textures. Consequently, issues such as cell overlap, blurred edges, and uneven staining may occur in the collected images. For these atypical images, experts performing manual classification may incorporate multidimensional information to make flexible judgments, labeling such images as a "transitional phase" or "suspected stage", and multiple experts may review the samples to improve accuracy when necessary. However, network models are trained on fixed labels; these factors present challenges for relatively simple models (e.g., those without integrated attention mechanisms), which may have limitations in effectively extracting such complex features, ultimately leading to performance differences.
According to the results, it is evident that the highest misclassification rates occur between the M and D stages. This is because the cytological composition during stages M and D is quite similar, with both containing a large number of leukocytes. For the model, subtle differences in leukocyte proportions are difficult to distinguish accurately. Additionally, we observed that the P stage is often misclassified as E. This is due to the gradual keratinization of nucleated epithelial cells on the vaginal smears during the late P stage, making the image features increasingly resemble those of the E stage and thus confusing the model. Lastly, it is noteworthy that stages E and D are generally well distinguished. This is because stage E is characterized by densely packed and orderly arranged keratinized cells, whereas stage D is dominated by small, round leukocytes. The distinct morphological features between these two stages make them relatively easier for the model to differentiate.
Overall, the experimental results suggest that SLENet is better suited than the compared models to this type of specialized medical image classification.
3.3. Ablation Study
This section uses ablation experiments to demonstrate the effectiveness of the modules introduced in the SLENet network, with EfficientNet serving as the baseline control against SLENet. All other parameters and conditions are kept the same.
To ensure the reliability of the results, we repeated each experiment five times using the same strategy and report the F1 score and overall accuracy as mean ± 95% confidence interval. Additionally, to evaluate whether SECA provides superior enhancement, we incorporated attention modules that likewise emphasize joint modeling of spatial and channel features, namely the Convolutional Block Attention Module (CBAM) and Coordinate Attention (CA), and measured their performance. The specific results are shown in Table 5.
As the results show, incorporating SECA alone increases the mean F1 score and accuracy and narrows the confidence interval, but the overall improvement is quite limited. In contrast, when the non-local module is introduced alone, both the F1 score and accuracy decrease. However, when SECA and the non-local module are combined, the model achieves the best performance on both metrics (F1 score = 96.26%, accuracy = 96.31%), with a reduction in confidence intervals, meaning the combination not only enhances the model’s predictive performance but also improves its stability. This result can be explained as follows: in this task, vaginal smear microscopic images present both prominent local features (such as clusters of keratinized cells or leukocytes) and global features (such as the proportion and spatial arrangement of different cell types), and relying solely on local details or on the global distribution can easily lead to misclassification. For example, a local region might already show keratinized cells while the overall distribution still resembles the previous stage, or certain areas may be dominated by leukocytes while the global proportion has not yet fully changed. The SECA module enhances the model’s capability to capture critical local features, improving its sensitivity to image details, while the non-local module strengthens the model’s ability to capture long-range dependencies, which is crucial for recognizing specific distribution patterns. However, when applied alone, the non-local module may overemphasize global context and suppress subtle but critical local features; moreover, the dataset size may be insufficient for it to learn stable long-range dependencies, which can explain the observed performance degradation. Therefore, compared with introducing a single module, integrating both improves the model’s classification performance more effectively.
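For reference, the sketch below shows a generic embedded-Gaussian non-local block in the style of Wang et al. (2018), illustrating how the (hw × hw) attention map captures long-range dependencies across the whole feature map. It is only an illustration of the mechanism, not SLENet’s exact SECA/non-local implementation.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        inter = channels // 2
        self.theta = nn.Conv2d(channels, inter, 1)   # query embedding
        self.phi = nn.Conv2d(channels, inter, 1)     # key embedding
        self.g = nn.Conv2d(channels, inter, 1)       # value embedding
        self.out = nn.Conv2d(inter, channels, 1)     # restore channel dim

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (b, hw, c')
        k = self.phi(x).flatten(2)                     # (b, c', hw)
        v = self.g(x).flatten(2).transpose(1, 2)       # (b, hw, c')
        attn = torch.softmax(q @ k, dim=-1)            # (b, hw, hw): quadratic
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual connection

x = torch.randn(2, 64, 28, 28)
print(NonLocalBlock(64)(x).shape)   # torch.Size([2, 64, 28, 28])
```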
The results also show that substituting the SECA module with the CA module yields only minimal performance improvement, while substituting it with CBAM even degrades the model’s performance. This indicates that the fusion strategies of these modules are less suited to the discriminative features of this task; incorporating SECA is therefore the better choice for classifying the rat estrous cycle.
3.4. Complexity Analysis
To assess computational efficiency, we compared the inference time, number of parameters, and floating-point operations (FLOPs) of the baseline (EfficientNet) and SLENet. The inference time was measured over five runs in the same environment, and the average result is reported to ensure consistency.
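A sketch of how such measurements are commonly made in PyTorch is given below; the model, input size, and exact protocol here are assumptions for illustration, not the authors’ actual setup.

```python
import time
import torch
import torchvision

model = torchvision.models.efficientnet_b0(weights=None).eval()
x = torch.randn(1, 3, 224, 224)

# Parameter count in millions.
print(f"params: {sum(p.numel() for p in model.parameters()) / 1e6:.2f} M")

# Average CPU inference time over five timed runs after one warm-up pass.
with torch.no_grad():
    model(x)  # warm-up
    times = []
    for _ in range(5):
        t0 = time.perf_counter()
        model(x)
        times.append(time.perf_counter() - t0)
print(f"inference: {1e3 * sum(times) / len(times):.2f} ms")
# FLOPs are usually estimated with a profiler such as fvcore or thop.
```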
As shown in Table 6, compared with the baseline, SLENet shows an increase in model complexity: the number of parameters grows from 4.01 M to 14.19 M, about 3.5 times larger, and the FLOPs increase from 6.58 G to 9.35 G, an increase of 42%, both reflecting higher computational complexity. This is primarily due to the introduction of the SECA and non-local modules, which enhance the model’s feature extraction capability but inevitably add extra parameters and computational cost.
However, despite the increase in both parameters and FLOPs, the inference time rises only from 32.32 ms to 34.58 ms, an increase of about 7%. This suggests that, although SLENet introduces higher theoretical complexity, the practical computational overhead remains limited and the added operations are handled efficiently. The design thus achieves a favorable balance between improved accuracy and computational efficiency.