1. Introduction
The Earth’s surface is dominated by oceans, which constitute over 70% of the total area and contain vast reserves of natural resources. However, due to technical limitations and environmental constraints, the exploitation and utilization of marine resources remain significantly underdeveloped. As land-based resources become increasingly scarce amid rapid population growth and rising standards of living, the strategic importance of marine resource development has never been more prominent. Particularly in China, with its extensive coastline and rich marine ecosystems, the potential for ocean-based economic and scientific advancement is substantial. Yet, the dynamic and complex nature of underwater environments imposes considerable challenges for human intervention and exploration.
To overcome these limitations, underwater robots have emerged as a viable solution for performing tasks in hazardous or inaccessible underwater settings. These systems are widely applied in fields such as marine oil and gas extraction, underwater archaeology, environmental monitoring, and marine scientific research. However, most conventional underwater robots rely on electric propeller-based propulsion systems, which are often structurally complex, energetically inefficient, and limited in maneuverability. In contrast, fish have evolved over millions of years to exhibit exceptional hydrodynamic performance and adaptability, making them an ideal model for biomimetic underwater robotic design. By emulating fish locomotion strategies, engineers aim to develop next-generation underwater robots that offer superior efficiency, agility, and environmental adaptability.
A key technological enabler for such biomimetic systems is pose estimation, a computer vision technique that identifies the spatial configuration of objects or organisms in images or video frames. Although significant progress has been made in human pose estimation driven by deep learning advancements, animal pose estimation—particularly for aquatic species—remains comparatively underexplored. This gap stems from several intrinsic challenges, including the anatomical diversity across species, the high variability and speed of animal motion, the labor-intensive nature of data annotation, and environmental complexities such as occlusions and dynamic lighting in natural habitats. Moreover, existing datasets and algorithmic toolchains tailored for animal pose estimation are still in early stages of development.
To address these challenges, a series of foundational studies have investigated fish motion mechanisms through image processing, 3D reconstruction, and hydrodynamic analysis. Early works, such as Jing et al. [1], examined crucian carp motion dynamics, identifying distinct behavioral stages via hydrodynamic modeling. Li et al. [2] highlighted the importance of dorsal fin oscillation in propulsion for the Nile electric eel, while Yan et al. [3] and Wu et al. [4] developed multi-angle video systems to analyze swimming trajectories and kinematics of cyprinids and koi carp. Lai et al. [5] and Stern et al. [6] advanced behavioral detection systems and landmark recognition methods based on image processing and machine learning, respectively.
Subsequent studies introduced 3D pose reconstruction techniques based on stereo vision [7,8,9,10], and deep learning approaches for robust fish tracking and classification [11,12,13,14,15]. Lin et al. [16] developed the first fish pose dataset with 1000 annotated images and proposed a two-stage estimation framework, achieving over 90% detection accuracy. Further innovations have emerged in pose recovery under occlusion [17], monocular 3D reconstruction [18], and spatiotemporal modeling using LSTM networks [19]. Lightweight deep learning models, such as MFLD-net [20], and graph-based behavior detection frameworks [14], have improved computational efficiency and detection robustness in aquaculture scenarios.
Meanwhile, several researchers have explored skeleton-based 3D tracking [21], stereoscopic keypoint detection [22], and motion parameter estimation [23], enabling more precise modeling of fish dynamics. Pose estimation methods leveraging DeepLabCut [24] and other deep learning architectures have shown promise in extracting motion parameters such as velocity, acceleration, and angular displacement with high accuracy. These studies collectively reflect a growing trend toward integrating visual perception, behavior modeling, and biomechanical analysis for intelligent underwater systems.
Beyond fish-focused studies, related advances in representation learning are also relevant to this work. For instance, Zhang et al. [25] investigated strategies for balancing multi-task learning objectives, and their insights on task interaction informed our consideration of how joint feature optimization can improve pose estimation performance. In addition, Li et al. [26] addressed representation learning under incomplete multi-view scenarios, highlighting the importance of leveraging structural consistency across multiple feature views, which is conceptually related to the spatiotemporal consistency emphasized in our method.
Despite these advancements, limitations still exist in data quality, generalization under occlusion, and temporal modeling of continuous fish motion. Moreover, there remains a lack of unified frameworks that can bridge fine-grained pose estimation with real-time feedback control for robotic applications. In response to these gaps, this paper proposes a high-precision fish motion capture and analysis framework combining a custom-built dual-camera acquisition system with a deep learning-based semi-supervised pose estimation network. The platform captures multi-view sequences of carp performing natural swimming behaviors and annotates 21 anatomical keypoints for high-fidelity motion reconstruction. Building on this dataset, the proposed Semi-supervised Temporal Context-Aware Network (STC-Net) fuses spatial and temporal information across frames and integrates novel unsupervised loss functions to enhance performance under limited supervision.
This study contributes not only a valuable dataset and technical framework for fish pose estimation but also provides a scalable solution for real-time pose tracking in underwater biomimetic robotic control systems. The proposed approach holds significant potential for applications in marine resource development, aquaculture monitoring, and autonomous underwater navigation.
The main contributions of this paper are as follows:
To address the critical limitation posed by the scarcity of high-quality fish motion posture datasets, we have developed a custom-designed fish motion visualization experimental platform. The system consists of a transparent water tank (dimensions: 120 cm × 60 cm × 60 cm), a high-performance computing workstation, dual synchronized cameras, and two auxiliary lighting sources. By synchronously capturing multi-view motion sequences of fish from orthogonal perspectives, the platform enables precise annotation and facilitates the construction of a multi-view, fine-grained fish posture dataset, thereby providing foundational data support for downstream pose estimation tasks.
To enhance the robustness of pose estimation under conditions of occlusion and motion ambiguity, we introduce a novel architectural modification to the fully supervised baseline model by incorporating a Bidirectional Convolutional Recurrent Neural Network (Bi-ConvRNN) into the head of the network. This module fuses spatial-temporal features from adjacent frames—specifically, two preceding and two succeeding frames relative to the labeled target frame—allowing the model to capture motion continuity and spatial correlation across time. This temporal context integration significantly improves prediction accuracy in scenarios where the target fish body parts are partially or fully occluded.
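The bidirectional fusion performed by the Bi-ConvRNN head can be illustrated with a minimal NumPy sketch. For clarity, the learned convolutional gates are replaced here by scalar weights `wx` and `wh` (hypothetical values); in the actual network these are trainable convolution kernels, and the function name is ours.

```python
import numpy as np

def bi_convrnn_fuse(frames, wx=0.5, wh=0.5):
    """Fuse a 5-frame feature sequence (T, H, W) into a temporally
    aware representation of the centre frame.

    Simplified sketch: the per-pixel recurrence
        h_t = tanh(wx * x_t + wh * h_{t-1})
    stands in for the Bi-ConvRNN's convolutional gates.
    """
    T, H, W = frames.shape
    # Forward pass: accumulate context from the preceding frames.
    h_fwd = np.zeros((H, W))
    fwd_states = []
    for t in range(T):
        h_fwd = np.tanh(wx * frames[t] + wh * h_fwd)
        fwd_states.append(h_fwd)
    # Backward pass: accumulate context from the succeeding frames.
    h_bwd = np.zeros((H, W))
    bwd_states = [None] * T
    for t in reversed(range(T)):
        h_bwd = np.tanh(wx * frames[t] + wh * h_bwd)
        bwd_states[t] = h_bwd
    # The centre frame's fused feature map combines both directions.
    mid = T // 2
    return np.concatenate([fwd_states[mid], bwd_states[mid]], axis=0)

frames = np.random.rand(5, 8, 8)   # 5 frames of 8x8 feature maps
fused = bi_convrnn_fuse(frames)
```

The centre-frame output thus carries information from the two preceding and two succeeding frames, which is what allows the head to bridge frames where a body part is occluded.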
We propose an enhanced loss function framework by introducing two unsupervised loss terms: the temporal continuity loss, which enforces consistency in predicted keypoint trajectories across consecutive frames, and the pose plausibility loss, which constrains predicted poses to adhere to biologically valid configurations. During training, both labeled and unlabeled frames are fed into the network. The unsupervised loss terms enable effective utilization of unlabeled data, thereby improving generalization performance and reducing the reliance on large-scale manual annotations. This semi-supervised learning strategy ensures greater scalability and applicability of the model in real-world underwater environments.
3. Experimental Results and Analysis
Since the temporal context-aware network and the unsupervised loss act independently (one modifies the network architecture, the other the loss function), the two can be seamlessly combined into a semi-supervised temporal context-aware network. For labeled data, a continuous sequence of 5 frames is used to predict the pose of the middle frame, which is then compared against the ground truth. The unlabeled video data likewise uses continuous 5-frame sequences to generate middle-frame predictions, but since no ground-truth labels exist, the unsupervised losses are applied to these predictions instead. The overall structural block diagram is shown in Figure 6.
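The way labeled and unlabeled clips share one objective can be sketched as below. The weighting factors `lam_t` and `lam_p` are assumed for illustration, and the supervised term is reduced to a simple coordinate MSE.

```python
import numpy as np

def combined_loss(pred_seq, target_mid=None, lam_t=0.1, lam_p=0.1,
                  pairs=(), ref_lengths=()):
    """Loss for one 5-frame clip in the semi-supervised network.
    pred_seq: (5, K, 2) predicted keypoints for the clip;
    target_mid: (K, 2) ground truth for the middle frame, or None
    when the clip comes from unlabeled video."""
    loss = 0.0
    if target_mid is not None:          # supervised term: labeled clips only
        loss += float(np.mean((pred_seq[2] - target_mid) ** 2))
    # Temporal continuity term: applies to labeled and unlabeled clips alike.
    diffs = np.diff(pred_seq, axis=0)
    loss += lam_t * float(np.mean(np.sum(diffs ** 2, axis=-1)))
    # Pose plausibility term on the middle-frame prediction.
    for (i, j), ref in zip(pairs, ref_lengths):
        loss += lam_p * (np.linalg.norm(pred_seq[2, i] - pred_seq[2, j]) - ref) ** 2
    return loss
```

Passing `target_mid=None` reproduces the unlabeled branch of the diagram: only the unsupervised terms contribute to the gradient for those clips.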
3.1. Datasets
Most existing fish datasets focus on species classification and are therefore unsuited to the objective of this study, which is to systematically characterize fish movement patterns in order to guide the posture control of biomimetic robotic fish in underwater environments. To this end, a fish posture and motion estimation platform was built, covering the collection, processing, and visualization of fish motion data. The platform consists of a carp (350 mm in length), a fish tank (120 cm × 60 cm × 60 cm), a high-performance computer, two synchronized cameras, and two fill lights, and is shown in Figure 7. The two cameras simultaneously capture the motion of the fish in the tank and store the footage on the computer’s hard disk, while the fill lights provide uniform illumination throughout the recording process. Using this platform, a total of 16 videos were recorded, each no shorter than 2 min. A portion of the dataset is shown in Figure 8.
3.2. Experimental Environment
The proposed semi-supervised temporal context-aware network for fish pose estimation represents a significant advancement over standard fully supervised models. This method integrates two core enhancements: (1) the incorporation of a temporal context modeling module, and (2) the introduction of unsupervised loss functions for learning from unlabeled video sequences. By seamlessly combining temporal modeling with a semi-supervised learning paradigm, the framework is capable of capturing both spatial and temporal dependencies while leveraging large volumes of unlabeled data to improve prediction accuracy and robustness. This architecture effectively addresses several limitations of conventional approaches, including the reliance on densely labeled datasets, the lack of temporal reasoning, and vulnerability to occlusions or anatomical ambiguities. Through the joint optimization of supervised and unsupervised objectives, the model learns temporally coherent and anatomically plausible keypoint trajectories without sacrificing inference efficiency.
The implementation details and hardware configuration of the experimental platform used for model training and evaluation are summarized in Table 2.
3.3. Network Training
The fish pose estimation experiments conducted in this study are based on a self-constructed carp dataset, specifically curated to support the development and evaluation of the proposed semi-supervised temporal context-aware network. The dataset consists of 400 manually annotated static images and 20 unlabeled video sequences, each with a duration of 10 s, capturing various postures and motion patterns of carp in natural aquatic environments. This dataset provides both spatially labeled frames for supervised learning and temporally continuous sequences for unsupervised learning, enabling comprehensive evaluation under the semi-supervised learning paradigm.
During training, each input image or video frame is resized to 384 × 384 pixels. The backbone of the pose estimation network adopts the ResNet-50 architecture, pre-trained on the ImageNet dataset to accelerate convergence and improve feature representation quality. The network is optimized using the Adam optimizer [30], which provides adaptive per-parameter learning rates and generally improves training stability.
A dynamic learning rate schedule is employed: the initial learning rate is set to 0.001 and is halved at epochs 150, 200, and 250. To ensure stable convergence and prevent early disruption of the pretrained feature extractor, the weights of the ResNet-50 backbone are frozen during the first 20 epochs. The batch size is set to 6, and the total number of training epochs is 300.
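The step schedule can be written down directly (milestone and decay values are taken from this section; the helper name is ours):

```python
def learning_rate(epoch, base_lr=1e-3, milestones=(150, 200, 250), gamma=0.5):
    """Step-decay schedule: start at 0.001 and halve the learning
    rate at epochs 150, 200, and 250."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```

The rate is therefore 0.001 until epoch 149, 0.0005 from epoch 150, 0.00025 from epoch 200, and 0.000125 from epoch 250 onward. (The first 20 epochs additionally keep the ResNet-50 backbone frozen, which this helper does not model.)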
This carefully designed training protocol balances learning stability and flexibility, allowing the model to effectively learn both spatial localization from labeled data and temporal-spatial regularities from unlabeled sequences.
3.4. Results Analysis
The experimental design mainly includes ablation experiments and comparative experiments. The ablation experiment aims to evaluate the impact of different improvements on the model training results, while the comparative experiment is used to verify the effectiveness of the algorithm in this paper and conduct an in-depth analysis of the training results before and after the improvement.
(1) Ablation experiment
In order to verify the contribution of each added module to network performance, this section designs an ablation experiment. First, the standard fully supervised network (Base network) is trained as the baseline model. Improvements are then made on top of the Base network: first, the unsupervised losses are added to build a semi-supervised network; second, the temporal context-aware mechanism is introduced to form a temporal context-aware network. Finally, both the unsupervised losses and the temporal context-aware mechanism are integrated into the Base network to build the semi-supervised temporal context-aware network, further verifying the effect of combining the two modules. The results of the ablation experiment are shown in Table 3.
As summarized in the preceding table, the proposed semi-supervised temporal context-aware network achieves the best performance in the fish pose estimation task, yielding a Root Mean Square Error (RMSE) of 9.71 pixels. This represents a significant reduction of 4.41 pixels compared to the baseline fully supervised network, demonstrating a substantial enhancement in prediction accuracy. The incorporation of unsupervised loss functions alongside the use of unlabeled video frames during training enables the model to learn richer spatiotemporal patterns, thereby improving its generalization capability and reducing estimation errors.
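For reference, the RMSE metric reported here can be computed as below. This sketch assumes the common convention of the root mean square of per-keypoint Euclidean errors, which the text does not spell out.

```python
import numpy as np

def keypoint_rmse(pred, gt):
    """RMSE in pixels over all frames and keypoints.
    pred, gt: (N, K, 2) predicted / ground-truth pixel coordinates."""
    err = np.linalg.norm(pred - gt, axis=-1)   # Euclidean error per keypoint
    return float(np.sqrt(np.mean(err ** 2)))
```

Under this definition, a uniform (3, 4)-pixel offset on every keypoint yields an RMSE of exactly 5 pixels; the reported 4.41-pixel reduction implies a baseline RMSE of 14.12 pixels.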
The temporal context-aware network, which takes a five-frame sequence as input (the target frame together with its two immediate predecessors and two immediate successors), benefits from leveraging adjacent temporal information to inform keypoint predictions. However, compared to the semi-supervised model, its error reduction is more modest, suggesting that temporal information alone is insufficient to fully resolve the ambiguities and occlusions inherent in fish posture estimation. By contrast, the semi-supervised temporal context-aware network synergistically combines the strengths of semi-supervised learning and temporal modeling. This hybrid approach not only exploits unlabeled data to enhance model robustness but also incorporates temporal continuity to stabilize predictions across frames. Consequently, the proposed framework significantly improves pose estimation accuracy and attains the best overall performance.
These experimental findings underscore the effectiveness of jointly leveraging semi-supervised learning and temporal context information as a strategic solution for advancing fish pose estimation accuracy.
Figure 9 illustrates the RMSE values for individual fish body parts evaluated on both the training set and the validation set. The relatively low RMSE observed for the overall mean across both datasets indicates the model’s strong generalization and reliable predictive capability for diverse anatomical landmarks.
Figure 10 shows the loss as a function of Epoch for two models: the semi-supervised temporal context-aware network (black curve) and the standard network (red curve) on the training set. Both curves exhibit a downward trend, indicating that the models successfully reduced errors and improved their fitting ability during training. In the initial training phase (approximately 0~50 epochs), both models show a sharp decrease in loss, suggesting that the models quickly learned the key features. However, the loss curve of the temporal context-aware network drops more rapidly, indicating that the network learned more effective features. After 50 epochs, the training loss becomes stable, suggesting that the models gradually converged. Yet, the loss of the semi-supervised temporal context-aware network remains lower than that of the standard network, indicating that its training error is smaller and its fitting ability is stronger.
Figure 11 shows the predicted trajectory of the carp’s dorsal-fin keypoint for the base model and the semi-supervised temporal context-aware network. The red curve represents the prediction of the base model, and the black curve the prediction of the semi-supervised temporal context-aware network. The first sub-figure shows the two models’ predictions of the dorsal-fin keypoint in the X direction, and the second sub-figure their predictions in the Y direction. As the figure shows, the trajectory predicted by the base model exhibits more jitter and abrupt jumps, whereas the trajectory predicted by the semi-supervised temporal context-aware network is smoother, with markedly fewer outliers and more stable overall behavior.
(2) Comparison experiments
To verify the effectiveness of the algorithm proposed in this paper, we compared it with two representative fully supervised algorithms, DeepLabCut and SLEAP (both trained with default parameters as per their official implementations), on the fish pose dataset built in this paper. The models were trained and their performance was analyzed; the experimental results are shown in Table 4.
As evidenced by the experimental data, the proposed algorithm significantly outperforms both DeepLabCut and SLEAP in terms of Root Mean Squared Error (RMSE). Specifically, DeepLabCut and SLEAP achieve RMSE values of 15.21 pixels and 14.84 pixels, respectively, whereas the proposed method attains a substantially lower RMSE of 9.71 pixels. This corresponds to an error reduction of 5.50 pixels relative to DeepLabCut and 5.13 pixels compared to SLEAP. These results unequivocally demonstrate that the proposed semi-supervised temporal context-aware network delivers superior accuracy in fish pose estimation tasks, effectively mitigating prediction errors and enhancing model robustness.
Figure 12 illustrates the predicted trajectories of the head keypoint of a carp obtained from three different networks. The black curve represents predictions generated by the SLEAP network, the red curve corresponds to those from the DeepLabCut network, and the blue curve depicts the results of the proposed semi-supervised temporal context-aware algorithm. A comparative analysis of the figure reveals that the trajectory produced by the proposed method is markedly smoother and more continuous, exhibiting minimal jitter and absence of abrupt fluctuations relative to the predictions of SLEAP and DeepLabCut. This temporal stability in keypoint localization suggests enhanced robustness and higher reliability of the proposed algorithm in modeling fish pose dynamics over time. Such smoothness is critical in practical applications, as it reflects the model’s ability to maintain consistent tracking across frames, thereby reducing erroneous detections caused by occlusions, noise, or morphological ambiguities.
4. Discussion
Underwater biomimetic robotic fish have emerged as vital platforms for ocean exploration, offering unique advantages in environments that are otherwise inaccessible, hazardous, or costly for human divers and traditional equipment. Their bio-inspired locomotion provides higher maneuverability and energy efficiency compared to propeller-driven robots, making them particularly suitable for long-duration tasks such as environmental monitoring, biological observation, and seabed surveying. Within this broader context, high-precision fish pose estimation serves as a cornerstone technology. Accurate estimation of fish kinematics not only facilitates in-depth analysis of locomotion patterns and ecological adaptability but also provides critical guidance for the design optimization, real-time control, and autonomous navigation of robotic fish. Beyond robotics, precise pose estimation also holds promise for applications in intelligent fishery management, behavioral ecology, and marine biodiversity conservation.
This study tackles one of the most pressing challenges in aquatic pose estimation: the scarcity of large-scale, high-quality annotated datasets. The custom-designed fish motion visualization platform developed here represents a significant contribution toward filling this gap. By integrating dual synchronized cameras, auxiliary lighting, and a transparent tank setup, the platform allows high-resolution, multi-view recording of fish motion under controlled conditions. The resulting dataset not only provides fine-grained annotations of 21 biologically meaningful keypoints but also captures diverse swimming behaviors including straight, backward, and turning locomotion. This dataset constitutes a valuable benchmark resource for both supervised and semi-supervised pose estimation studies.
Building on this foundation, the proposed Semi-supervised Temporal Context-Aware Network (STC-Net) demonstrates how architectural innovations and loss design can jointly address the limitations of existing methods. The integration of a Bi-directional Convolutional Recurrent Neural Network (Bi-ConvRNN) introduces temporal continuity by incorporating contextual cues from preceding and succeeding frames. This design is particularly advantageous in aquatic environments, where self-occlusion, inter-fin interference, and motion blur frequently occur. In addition, the proposed unsupervised loss functions—temporal continuity loss and pose plausibility loss—enable effective utilization of unlabeled frames, thereby reducing dependence on manual annotation. This semi-supervised paradigm provides a scalable pathway for building robust models under real-world conditions where labeled data is inevitably limited.
Experimental results confirm that STC-Net achieves reliable performance under complex motion scenarios, maintaining robustness even when body parts are partially or fully occluded. Compared with fully supervised baselines, the proposed framework demonstrates superior generalization, highlighting the importance of incorporating temporal dynamics and biologically inspired constraints into pose estimation. Similar perspectives on the value of robust signal modeling have also been observed in other engineering domains. For example, Saleem et al. [31] proposed an AE-based pipeline monitoring approach combining the Empirical Wavelet Transform with a customized one-dimensional DenseNet, illustrating the potential of advanced signal decomposition and deep learning for robust leak detection. Notably, similar perspectives on balancing model complexity and performance have been explored in recent studies on ballistic target recognition and aero-engine bearing fault diagnosis [32,33], which provide useful methodological references for our work.
Nevertheless, some limitations of this study warrant discussion. First, the dataset was collected in a controlled tank environment, which may not fully reflect the visual complexity of natural underwater habitats, where factors such as variable lighting, turbidity, and multi-species interactions occur. Future work should extend the framework to field deployments in open-water conditions to evaluate its robustness in natural ecosystems. Second, while the proposed STC-Net incorporates temporal context, the temporal window is currently limited. Exploring long-range temporal modeling, potentially via transformer-based architectures, may further enhance the ability to capture extended motion dependencies. Third, although carp were chosen as the representative species, fish morphology and swimming mechanics vary significantly across species. Cross-species validation and transfer learning approaches will be essential to ensure broader applicability. Finally, while the current focus is on single-fish pose estimation, real-world ecological and robotic applications often involve multiple interacting agents. Extending the framework to multi-fish pose tracking and interaction modeling remains an important direction.
In summary, this work contributes a novel dataset and a semi-supervised temporal context-aware network for fish pose estimation, providing a strong foundation for future advancements in underwater robotics and biological observation. By bridging biological insight, computer vision, and deep learning, this study highlights a promising pathway toward intelligent, adaptive, and scalable underwater robotic systems. Future research should emphasize ecological validity, cross-species adaptability, and real-world deployment, thereby advancing both fundamental biological understanding and the practical utility of biomimetic robotic fish.