4.2. Comparison of Experimental Parameters
Over the course of model training, as summarized in Table 4, the study systematically monitored several critical parameters to ensure a comprehensive evaluation of performance [29].
These metrics collectively offer insights into the model’s learning behavior, convergence stability, and detection accuracy, thereby serving as essential indicators for performance assessment throughout the training procedure. Specifically, Loss_rpn_cls and Loss_rpn_bbox are used to evaluate the classification accuracy and bounding box regression performance of the Region Proposal Network (RPN). A reduction in these losses implies that the RPN is generating more precise object proposals and achieving better localization performance, which directly influences the quality of the subsequent detection stage.
Furthermore, Loss_cls and Loss_bbox focus on the final detection stage, where they measure classification accuracy and localization precision. Lower values of these losses suggest that the model is not only correctly identifying object categories but also accurately predicting their spatial positions, thereby confirming the overall robustness and reliability of the detection framework.
Acc represents the overall classification accuracy across all test samples. To adapt to few-shot learning tasks, Loss_meta_cls and Meta_acc are introduced to measure the meta-classification loss and meta-level accuracy, respectively. Lower Loss_meta_cls values and higher Meta_acc scores indicate better recognition of novel classes with limited labeled data. Loss_vae evaluates the data-modeling capability of the Variational Autoencoder; a lower value indicates a more precise model of the latent feature distribution.
Finally, the overall loss aggregates all individual loss terms, providing a unified indicator of the model’s training performance. A lower total loss implies more effective learning across object detection, meta-learning, and latent feature modeling. These parameters collectively offer a detailed assessment of the model’s capability, particularly in few-shot learning scenarios.
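As a minimal sketch of the aggregation described above (not the authors' exact implementation; the function name and any per-term weights are assumptions for illustration), the overall loss can be formed by summing the individual loss terms:

```python
# Illustrative sketch: combine the named loss terms into one scalar,
# optionally weighting each term. Weights default to 1.0.

def total_loss(losses, weights=None):
    """Aggregate named loss terms into a single scalar.

    losses  -- dict mapping loss name to its current value
    weights -- optional dict of per-term weights (default: 1.0 each)
    """
    weights = weights or {}
    return sum(weights.get(name, 1.0) * value for name, value in losses.items())

# Hypothetical values for the loss terms named in the text:
losses = {
    "loss_rpn_cls": 0.12, "loss_rpn_bbox": 0.08,
    "loss_cls": 0.21, "loss_bbox": 0.15,
    "loss_meta_cls": 0.05, "loss_vae": 0.03,
}
print(total_loss(losses))  # unweighted sum of all terms
```

A decrease in this scalar over training corresponds to the unified convergence indicator discussed in the text.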
4.7. Performance and Convergence Analysis
To comprehensively evaluate the model’s overall performance, we track both the total loss and Overall Accuracy during training. The total loss curve (Loss) reflects the combined loss from all model components, offering a holistic view of the model’s convergence. A lower total loss indicates improved performance across all tasks.
As shown in Figure 10, the Overall Accuracy and loss curves demonstrate the superior performance of the proposed model compared to ResNet. In the Accuracy curve, the proposed model ultimately achieves higher accuracy: although ResNet performs better in the early stages, the proposed model quickly surpasses it and stabilizes at a higher level. This reflects the model’s more efficient learning and faster convergence.
Similarly, in the loss curve, the proposed model shows a steady decline in loss, ultimately reaching a lower and more stable value than ResNet. While ResNet fluctuates and struggles to reduce its loss in later iterations, the proposed model converges more smoothly, demonstrating better optimization in both classification and localization. These results emphasize the model’s robustness and superior generalization ability. This experiment adopts the default evaluation settings of the MMFewShot framework: the IoU threshold for determining a correct detection box is 0.5 (mAP@0.5), the AP of each category is computed as the area under its precision–recall (P–R) curve, and the final mAP is obtained by averaging across all categories. The study uses mAP as a comprehensive evaluation metric, as shown in Table 5. To ensure reproducibility, key experiments (especially the 10-shot setting) were repeated with five different random seeds; the results are reported as mean ± standard deviation, and t-tests indicate that certain improvements are statistically significant at the p < 0.05 level. In addition, to substantiate the experimental results, precision, recall, and F1-score were also recorded to assess the model’s optimization effect from multiple perspectives, enabling a more comprehensive evaluation and easier comparison with the cited and other related work.
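The supplementary metrics and the mean ± standard deviation reporting can be sketched as follows (the counts and per-seed mAP values are hypothetical placeholders, not the study's data):

```python
# Sketch: precision, recall, and F1-score from detection counts, plus
# mean +/- std aggregation across random seeds, as reported in the text.
from statistics import mean, stdev

def prf1(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive,
    and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical per-seed mAP values for a 10-shot run over five seeds:
seed_maps = [0.262, 0.268, 0.265, 0.259, 0.271]
print(f"mAP = {mean(seed_maps):.3f} \u00b1 {stdev(seed_maps):.3f}")

p, r, f1 = prf1(tp=80, fp=20, fn=20)  # all three equal 0.8 here
```

A paired t-test over such per-seed results (e.g., via `scipy.stats.ttest_rel`) is what supports the p < 0.05 significance claim above.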
Because mAP is the core indicator in object detection and is closely related to the other metrics, averaging the per-category AP values to reflect the model’s overall detection performance across categories, the study focuses on comparing mAP values. The mAP comparison for the two fish groups, red fish and black fish, shows that the proposed model consistently outperforms the VFA method. For red fish, the proposed model achieves a mAP of 0.775 in base training and 0.265 in 10-shot fine-tuning, slightly higher than VFA’s 0.763 and 0.258. For black fish, the improvements are more evident, with the proposed model reaching 0.833 and 0.286, compared to VFA’s 0.804 and 0.271. These results demonstrate the proposed model’s better generalization and adaptability under few-shot settings.
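The per-category AP and its average can be sketched as follows. This is a minimal illustration assuming the standard VOC-style all-point interpolation (which MMFewShot's default evaluator follows); the sample P–R points are hypothetical:

```python
# Sketch: AP as the area under the precision-recall curve, and mAP as the
# average of per-category AP values, as described in the evaluation setup.

def average_precision(recalls, precisions):
    """Area under the P-R curve with all-point interpolation: precision at
    each recall level is replaced by the maximum precision to its right."""
    r = [0.0] + list(recalls) + [1.0]   # add sentinel endpoints
    p = [0.0] + list(precisions) + [0.0]
    for i in range(len(p) - 2, -1, -1):  # make precision non-increasing
        p[i] = max(p[i], p[i + 1])
    # Sum precision * recall-step over the curve.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

def mean_ap(per_class_ap):
    """mAP = mean of per-category AP values."""
    return sum(per_class_ap.values()) / len(per_class_ap)

ap = average_precision([0.1, 0.4, 0.8], [1.0, 0.9, 0.6])  # hypothetical curve
print(f"{mean_ap({'red_fish': 0.775, 'black_fish': 0.833}):.3f}")  # 0.804
```

The base-training mAP values quoted above are exactly such per-category averages.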
To evaluate the performance of different models in fish detection tasks, we conducted a series of experiments using the following models: TFA [30], Meta-RCNN [31], VFA, and the module improved in this study. Additionally, we explored the models’ performance under different few-shot learning conditions (1-shot, 5-shot, 10-shot). These experiments assess the models’ ability to adapt to new classes with limited labeled data, simulating the real-world challenge of having only a few samples.
This study compares the backbone networks in terms of their training accuracy on the same dataset, as shown in Table 6. Bold values indicate the best performance among the compared methods.
The proposed model consistently outperforms the other approaches across all fine-tuning conditions for both red-fish and black-fish detection tasks. For red fish, under the 10-shot fine-tuning setting, the proposed model achieves a mAP of 0.265, exceeding VFA (0.258), Meta-RCNN (0.224), and TFA (0.125). Similarly, for black fish in the same setting, the proposed model attains a mAP of 0.286, surpassing VFA (0.271), Meta-RCNN (0.244), and TFA (0.129). The performance advantage is particularly pronounced in the low-shot scenarios. In the 1-shot condition, the proposed model records mAPs of 0.152 for red fish and 0.169 for black fish, both of which represent notable improvements over competing methods, indicating superior rapid learning and adaptability to novel categories. In the 5-shot setting, the proposed model continues to outperform, reaching mAP values of 0.247 for red fish and 0.253 for black fish, which further confirms its effectiveness in few-shot detection scenarios.
Based on the reported mAP values, detection of novel-class fish targets under low-shot settings yields markedly low mAP, exemplified by the value of 0.152 for red fish under the 1-shot condition. This study identifies three primary contributing factors. First, inadequate feature discriminability arises from the limited number of training samples, hindering the model’s ability to learn subtle yet distinguishing features and leading to a high incidence of both false positives and false negatives. This limitation is corroborated by a substantial discrepancy between predicted bounding boxes and ground-truth annotations: the area of the predicted boxes is approximately one-tenth that of the actual annotations. Second, a pronounced domain shift is evident. The annotated dataset and the public dataset used for base-class learning were acquired under different lighting conditions (in-air vs. underwater), potentially impairing the model’s adaptability to the novel domain. Third, overfitting to the support set is observed: the model performs well on the limited support samples but fails to generalize to the query set, as indicated by a lower recall rate. From a statistical perspective, the low mAP indicates that the detection proposals exhibit low precision across all recall levels, implying that a considerable number of predictions are either incorrect or assigned low confidence. Despite the modest absolute performance, reporting this result remains critically valuable. It establishes a rigorous performance baseline for a highly challenging task, quantitatively characterizing the difficulty imposed by the combined constraints of low-shot learning and domain difference. Furthermore, it clearly delineates the limitations of current methodologies, thereby providing a clear benchmark for comparison and directing meaningful pathways for future research and improvement.
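The IoU criterion behind these failures can be sketched as follows (the box coordinates are hypothetical, not the study's data). It also shows why predicted boxes with roughly one-tenth of the ground-truth area can never pass the mAP@0.5 threshold:

```python
# Sketch: intersection-over-union (IoU) between two axis-aligned boxes,
# as used by the mAP@0.5 criterion. Boxes are (x1, y1, x2, y2).

def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

gt = (0, 0, 100, 100)        # hypothetical ground-truth box, area 10000
pred = (0, 0, 31.6, 31.6)    # predicted box with roughly 1/10 of the GT area
print(iou(gt, pred) >= 0.5)  # False: even fully contained, IoU tops out near 0.1
```

This makes the reported area discrepancy a direct statistical explanation for the low 1-shot mAP: such predictions are scored as false positives at IoU 0.5 regardless of how well they are centered.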
These results demonstrate that the proposed model can effectively detect new categories with very limited samples while maintaining high detection accuracy. Compared to traditional models, the proposed approach adapts better to new fish species, reduces misclassifications, and exhibits strong generalization and learning efficiency, which also reflects the robustness of the model during optimization, especially when new data samples are scarce, as it effectively prevents overfitting while continuing to improve.
Table 7 shows the results when treating the shot count as a hyperparameter, tested with shot ∈ {1, 5, 10}. For each setting, we recorded the detection accuracy (mAP) and the average time per training epoch. The results clearly show that increasing the number of shots improves detection accuracy: mAP rises from 0.160 under the 1-shot setting to 0.275 under the 10-shot setting. Although the 10-shot model requires more time per epoch (0.030) than the 1-shot (0.012) and 5-shot (0.021) models, the accuracy improvement is significant, indicating that the additional training time is worthwhile and delivers the best balance between performance and efficiency in our experiments. Based on these results, we selected shot = 10 for the remaining experiments.
In summary, increasing K improves validation mAP but incurs a higher time per epoch (and label cost): mAP rises from 0.160 (K = 1) to 0.275 (K = 10), while time per epoch increases from 0.012 to 0.030. The marginal mAP gain per additional labeled image exhibits diminishing returns. Thus, K = 10 achieves the highest accuracy (0.275), but K = 5 attains ~91% of the mAP of K = 10 (0.250/0.275) at ~70% of the training time per epoch (0.021/0.030) and 50% of the labels per class, representing a strong Pareto point when resources are constrained. In the main experiments, we adopt K = 10 to report the best attainable accuracy under our setting. For deployments with tight labeling or compute budgets, K = 5 is recommended as a balanced choice.
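The trade-off arithmetic above can be reproduced directly from the Table 7 numbers (a small sketch; the dictionary layout is an illustrative assumption):

```python
# Sketch: relative accuracy and cost of each shot setting K, normalized
# against the best setting (K = 10), using the values reported in Table 7.
K_SETTINGS = {1: (0.160, 0.012), 5: (0.250, 0.021), 10: (0.275, 0.030)}  # K -> (mAP, time/epoch)

best_map, best_time = K_SETTINGS[10]
for k, (m, t) in K_SETTINGS.items():
    print(f"K={k:>2}: {m / best_map:.0%} of best mAP at {t / best_time:.0%} of the time/epoch")
```

Running this recovers the figures cited in the summary: K = 5 reaches 91% of the best mAP at 70% of the per-epoch cost, which is the basis for recommending it under constrained budgets.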
Following the ablation study, the study additionally provides a visual comparison between predicted bounding boxes and ground-truth annotations on single test images, as shown in Table 8, offering a clear and intuitive demonstration of the model’s capability in both object counting and localization precision.
To further investigate the model’s performance in complex underwater environments, three representative cases of failure or suboptimal detection were selected from the validation set (see Figure 11a–c). The main issues include partial occlusion, instance merging that leads to undercounting, and missed detections caused by turbid water containing settled excreta.
These examples indicate that although the proposed model generally performs robustly, certain limitations remain under extreme lighting, heavy occlusion, and highly turbid water with settled excreta. Future work may focus on enhancing data augmentation strategies and feature extraction mechanisms to improve robustness and generalization in such challenging scenarios.
Based on the research content outlined above, this study proposes a targeted technical framework for few-shot object detection and deploys it within an operational RAS, as illustrated in Figure 12. Empirical validation was conducted across successive breeding cycles, demonstrating the method’s efficacy in accurately distinguishing between different categories of fish species (e.g., red fish and black fish). Beyond recognizing these two types of fish schools, the model also shows promising detection performance for multi-class object detection in subsequent research. The few-shot learning approach aims to learn general features of fish schools and to improve robustness against environmental interference factors. The proposed approach effectively addresses the limitations of conventional single-class dataset recognition, which typically demands large-scale annotated data and exhibits poor robustness.