1. Introduction
Accurate fish identification and classification are fundamental to biodiversity research and fisheries conservation. These identifications matter because they support programs for maintaining healthy fish populations, the detection of both cryptic and invasive species, and the protection of critical ecological systems [
1]. Deep learning techniques applied to underwater photography and video from coastal marine ecosystems provide reliable and accurate fish detections with lower ecological impact than earlier approaches such as scientific trawling [
2]. These ecosystems serve as essential spawning, nursery, and feeding grounds for a diverse collection of marine creatures. Precise species classification further enables scientists and resource managers to associate species with specific habitats and implement targeted conservation measures at appropriate times [
3]. Because fish constitute a major source of protein globally, mapping habitats with high abundance is essential for sustaining fisheries, guiding conservation priorities, and preserving the coupled human–natural systems that depend on them [
4,
5]. Global food demand will increase in the future. Fisheries, when sustainably managed, can provide a key source of protein [
6].
Challenges are especially acute in marine environments. Unlike freshwater systems [
7] with extensive taxonomic baselines, estuary and ocean environments impose severe observational constraints, including limited visibility, low illumination, degraded image resolution, and complex, noisy backgrounds [
4]. These conditions complicate species detection and classification, and the scarcity of high-quality labeled datasets further restricts methodological progress.
Traditional approaches to fish classification often rely on divers to collect fish samples, a process that is not only labor-intensive and time-consuming but also potentially destructive to marine habitats [
2]. Consequently, marine biologists and researchers have increasingly sought automated and non-invasive methods for fish classification [
4]. In recent years, machine learning (ML) models have been widely applied for this purpose. Among them, Support Vector Machine (SVM) has been used to classify fish species based on features extracted from underwater image datasets [
8,
9]. However, these traditional ML algorithms heavily depend on manually engineered features, limiting their scalability and adaptability to diverse aquatic environments.
Recent advances in computer vision have turned to deep learning to address these obstacles. CNNs can automatically extract discriminative features from underwater imagery and have demonstrated resilience to noise, low contrast, and motion blur [
10,
11,
12]. However, much of the existing literature relies on single-environment datasets [
5,
13,
14], which limits out-of-distribution generalization across sites, seasons, depths, and camera setups. Moreover, training a CNN requires large collections of labeled images and substantial computational resources. To mitigate the data limitation, researchers often turn to transfer learning, which leverages pretrained visual representations from large-scale datasets, thereby reducing data dependency, accelerating convergence, and enhancing classification performance in domain-specific tasks such as underwater fish identification. AlexNet [
15], ResNet, and VGG16 are commonly used transfer learning models in the case of fish classification (FC).
Against this backdrop, this study systematically evaluates the performance of multiple ResNet models for underwater fish image classification by jointly considering predictive accuracy and computational efficiency. Experiments are conducted on a large-scale and well-balanced underwater image dataset consisting of approximately 40,000 images collected from 20 distinct aquatic habitats, allowing classification accuracy (ACC) to serve as a reliable primary evaluation metric. A transfer learning framework is employed using four widely adopted convolutional neural network backbones—ResNet-18, ResNet-50, ResNet-101, and ResNet-152—selected to represent increasing network depth and computational complexity. All models are trained and fine-tuned under a unified experimental protocol to ensure a fair and reproducible comparison. Beyond accuracy, we introduce an Energy-Weighted Score (EWS) to quantify the computational resources required to train each model to a given performance level, integrating model size, memory consumption, training time, and energy-related resource usage into a single efficiency metric. By jointly analyzing accuracy and EWS, this study provides a comprehensive assessment of the trade-offs between performance and computational cost, offering practical guidance for model selection in deployment scenarios where resource availability, energy efficiency, and scalability are critical considerations.
The remainder of this paper is organized as follows.
Section 2 reviews relevant and recent literature.
Section 3 describes the ResNet models and outlines the key methodological steps.
Section 4 presents the performance of the evaluated models on the target dataset and provides comparative analyses in terms of model parameters, memory consumption, and runtime.
Section 5 interprets the results and offers guidance for selecting appropriate models for related applications. Finally,
Section 6 summarizes the main findings and discusses the study’s limitations.
2. Literature Review
In this section, we review recent and representative studies on underwater image classification by first summarizing commonly used datasets for fish detection and classification that have supported model development and benchmarking. We discuss their scope, scale, and limitations, particularly with respect to environmental diversity, annotation quality, and class imbalance. Building on this foundation, we examine prior work on deep learning-based computer vision methods, focusing on convolutional neural network architectures for fish detection and recognition, the adoption of transfer learning to mitigate data scarcity, and emerging efforts to balance classification performance with computational efficiency in resource-constrained underwater environments.
The existing datasets for fish identification and classification play a crucial role in advancing research in underwater biodiversity. One prominent dataset is the Fish4Knowledge dataset [
16], which presents an efficient backbone for fish classification from composited underwater images. This dataset, however, has limitations in terms of its environmental diversity, which may restrict its generalizability across various underwater habitats. Similarly, Kuswantori et al. [
17] developed a dataset for fish detection and classification, aiming to support automatic sorting systems using an optimized YOLO algorithm. However, this dataset is primarily focused on specific operational conditions, which limits its applicability in broader ecological studies. The UVOT400 dataset, introduced by Alawode et al. [
13], aims to enhance underwater visual tracking. Yet, it faces challenges related to the noise and complexity of underwater imagery, which may hinder accurate classification. Furthermore, the FishInTurbidWater dataset, presented by Jahanbakht et al. [
14], employs semi-supervised and weakly supervised deep neural networks for fish detection in turbid underwater videos, but its reliance on semi-supervised methods may result in inconsistencies due to limited labeled data. Lastly, the FishNet dataset by Ma et al. [
18] advances species recognition for aquatic biodiversity monitoring through semi-supervised learning, yet it still confronts the issue of insufficient labeled examples, which is a common challenge across many datasets.
Early work in automated FC relied primarily on traditional computer vision techniques based on manually designed features. Research in this area dates back to 1994, when Castignolles et al. [
19] developed a vision-based system to automatically detect, recognize, and count migratory fish passing through backlit observation windows in river fish passages. Their approach employed offline detection methods to segment fish from S-VHS video frames and enhanced visibility by improving background illumination conditions. Subsequent studies emphasized shape-based feature extraction for fish identification. Lee et al. introduced curvature function (CF)-based descriptors to represent fish contours [
20], and later evaluated multiple contour-based representations—including line segments, polygonal approximations, Fourier descriptors, and CF analysis—for fish classification [
21]. While these approaches demonstrated the feasibility of automated FC, they were often limited by measurement inaccuracies, sensitivity to image quality, and the need for manual determination of feature locations. To address some of these challenges, Islam et al. [
22] proposed a content-based method that integrated both local and global visual features, resulting in improved classification accuracy and outperforming several contemporaneous approaches.
As research progressed, machine learning models became increasingly prevalent in fish classification tasks. These approaches leveraged morphometric measurements and mathematical transform-based descriptors to further automate the classification process and enhance accuracy. Widely adopted algorithms included Support Vector Machines (SVM), Random Forests, and Artificial Neural Networks (ANN), all of which consistently reported superior performance compared to purely rule-based or handcrafted feature-driven methods [
23,
24]. This transition toward data-driven learning frameworks laid the foundation for subsequent advances in fish classification, particularly the emergence of deep learning methods capable of learning discriminative features directly from raw image data in an end-to-end manner.
The emergence of deep learning marked a fundamental shift in image classification research by enabling automatic feature extraction directly from raw image data, thereby reducing reliance on manually engineered descriptors. One of the earliest applications of deep learning to unconstrained underwater fish imagery was introduced by Salman et al. [
25]. In their study, the authors designed a custom CNN comprising three convolutional layers to learn discriminative visual features from fish images. The learned representations were subsequently fed into conventional classifiers, including Support Vector Machines (SVM) and k-Nearest Neighbors (kNN), for species identification. Despite its relatively shallow architecture, the proposed CNN substantially outperformed traditional hand-crafted feature pipelines by autonomously capturing salient visual patterns such as fish body morphology and texture characteristics.
A further paradigm shift in image classification research emerged with the introduction of AlexNet [
26], which demonstrated the effectiveness of deeper convolutional neural network (CNN) architectures when combined with innovations such as ReLU activations, dropout regularization, and large-scale GPU-based training. The success of AlexNet accelerated the widespread adoption of deep learning in visual recognition tasks and significantly influenced subsequent applications in marine and ecological domains, particularly through benchmark initiatives such as the LifeCLEF and SeaCLEF challenges. Building on this advancement, Iqbal et al. [
27] employed AlexNet trained from scratch for fish species classification and achieved an accuracy of 90.48%, while Tamou et al. [
28] applied transfer learning using a pre-trained AlexNet model and reported a substantially higher accuracy of 99.45%. These results underscore the effectiveness of transfer learning for underwater fish classification, representing an early and influential demonstration that CNNs trained on large-scale natural image datasets can provide robust and transferable representations for challenging underwater visual environments.
The use of deeper neural networks has significantly improved image classification performance; however, increasing depth also introduced optimization challenges, particularly the vanishing gradient problem. This issue was effectively addressed by the introduction of residual learning in ResNet [
29], which enables stable training of very deep architectures through identity shortcut connections. Although several deep models—such as VGG16, InceptionV3, Xception, DenseNet, and MobileNet—have achieved strong performance in image classification tasks, ResNet has consistently demonstrated superior optimization efficiency and representational capability across a wide range of benchmarks.
With the availability of large, annotated datasets, end-to-end deep learning models have become dominant in underwater fish classification. Researchers have increasingly fine-tuned deep architectures, especially ResNet variants, to address domain-specific challenges in underwater imagery. For example, Zhang et al. [
30] proposed AdvFish, an adversarial fish recognition framework that fine-tunes a ResNet-50 backbone and incorporates an additional loss term to suppress background noise while emphasizing salient fish features, resulting in improved accuracy in complex scenes. Similarly, Pang et al. [
31] employed a teacher–student knowledge distillation strategy to mitigate underwater image interference, enabling the model to learn more robust representations under conditions of turbidity and illumination variation. Together, these studies highlight the adaptability of modern deep learning techniques in improving robustness and performance for underwater image classification tasks.
3. Materials and Methods
3.1. Dataset
One of the primary requirements in computer vision tasks, particularly in object detection and classification, is the availability of an appropriate and well-curated dataset. Furthermore, deep learning models generally require a large volume of training images to achieve optimal performance. The DeepFish dataset, introduced by Bradley et al. [
32] in 2019, fulfills these requirements. Originally, the dataset was developed to examine the influence of local habitat characteristics and environmental contexts on the assemblage composition of juvenile fish, rather than for classification or segmentation purposes. Subsequently, Saleh et al. [
33] annotated and partitioned the dataset for classification, counting, localization, and segmentation tasks, thereby establishing it as a benchmark dataset for fish identification research.
Primarily, the dataset was captured in video format. The videos in the dataset were collected from 20 distinct habitats located in remote coastal marine regions of tropical Australia. All recordings were captured during daylight hours under low-turbidity conditions using low-disturbance techniques to minimize environmental interference. The footage was acquired in full high-definition resolution (1920 × 1080 pixels) with a digital camera.
The DeepFish dataset contains labeled image data for three distinct fish-identification tasks: classification, segmentation, and localization. In this paper we focus specifically on the classification data to test the ResNet models. The classification subset comprises a total of 39,766 labeled images, including 22,357 negative-class (“No Fish”) images and 17,409 positive-class (“Fish”) images, a 56.2% negative to 43.8% positive distribution, indicating a dataset slightly skewed toward the majority class. The images were extracted as frames from three-channel (RGB) video clips captured at full HD resolution (1920 × 1080 pixels) [
34]. The authors provide a GitHub repository (
https://github.com/alzayats/DeepFish, accessed on 10 January 2026) containing documentation and the code used in their paper [
35]. There is also a download link (
http://data.qld.edu.au/public/Q5842/2020-AlzayatSaleh-00e364223a600e83bd9c3f5bcd91045-DeepFish/, accessed on 10 January 2026) to retrieve the entire DeepFish dataset [
36].
For the classification task, Saleh et al. [
33] reported a data split of 50%, 20%, and 30% for training, validation, and testing, respectively. However, we observed a discrepancy between the split described in the paper and the partition specified in the accompanying dataset CSV files, which indicate a 40%, 10%, and 50% distribution. We note the split defined in Saleh et al.’s dataset files for reproducibility and consistency with the released resources. In this study, however, to enhance model generalization, we partitioned the dataset into a stratified 80%, 10%, and 10% split for the training, validation, and test sets, respectively.
We resized all images to 224 × 224 pixels to match the input dimensions required by the ResNet models, then normalized them by scaling pixel values to the range [0, 1] to improve convergence during training. The data were shuffled with a fixed random seed for repeatability.
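Assuming the images are available as decoded arrays, these preprocessing steps can be sketched with `tf.data` (a minimal illustration; names such as `build_dataset` and the seed value are ours, not from the released code):

```python
import tensorflow as tf

IMG_SIZE = (224, 224)  # input resolution expected by the ResNet backbones
SEED = 42              # illustrative fixed seed for repeatable shuffling

def preprocess(image, label):
    """Resize to 224x224 and scale pixel values to [0, 1]."""
    image = tf.image.resize(image, IMG_SIZE)
    return tf.cast(image, tf.float32) / 255.0, label

def build_dataset(images, labels, batch_size=32):
    """images: decoded video frames; labels: 0 = No Fish, 1 = Fish."""
    ds = tf.data.Dataset.from_tensor_slices((images, labels))
    ds = ds.shuffle(buffer_size=len(labels), seed=SEED)
    ds = ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)
```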
3.2. Models
Our long-term research goals involve complex reef fish detection and classification tasks that require fine-grained visual discrimination. Such tasks are expected to benefit from models with large representational capacity capable of capturing subtle spatial features. Historically, very deep neural networks suffered from vanishing gradients as layer depth increased. Residual Networks (ResNets) addressed this limitation by introducing residual connections that improve gradient flow and enable effective training of substantially deeper architectures [
29].
We utilized four widely adopted convolutional neural network models from the ResNet family—ResNet-18, ResNet-50, ResNet-101, and ResNet-152—as feature extractors in our experiments. The numerical suffix denotes the number of layers in each variant. As network depth increases, so does the number of trainable parameters and representational capacity.
In this set of models, network depth increases progressively from 18 to 152 layers. This progression enables a controlled assessment of whether increased representational capacity yields measurable performance gains on the DeepFish binary classification task or whether performance saturates at lower depths.
All of the pretrained models were initialized with ImageNet-pretrained weights to leverage their strong representational capabilities. Although several alternative architectures, such as VGG, DenseNet, MobileNetV2, AlexNet, InceptionV3, and GoogleNet, are commonly employed for vision tasks involving limited datasets, we prioritized the ResNet family due to its residual (skip) connections [
29]. These connections effectively mitigate the vanishing gradient problem, enabling more stable and efficient training of deeper networks.
To comprehensively assess model performance, we evaluated each model under multiple training configurations. First, the models were trained from scratch using randomly initialized weights to establish baseline performance. Subsequently, transfer learning was employed by fine-tuning ImageNet-pretrained models, enabling the networks to leverage general visual representations learned through large-scale pretraining.
Figure 1 shows the architecture of the models used in our experiments. The final layer of the network, originally designed for 1000 ImageNet categories, was replaced with a fully connected classification head consisting of a single neuron. This neuron outputs a scalar value $z$ representing the logit for the positive (fish) class. The sigmoid activation function converts this logit into a probability between 0 and 1, mathematically defined as

$$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}},$$

where $\hat{y}$ represents the predicted probability that an image belongs to the positive class. The model was trained using the Binary Cross-Entropy (BCE) loss, a suitable choice for binary classification problems. The BCE loss measures the discrepancy between the predicted probabilities $\hat{y}_i$ and the true labels $y_i \in \{0, 1\}$:

$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right].$$
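As a concrete numerical check, the sigmoid and BCE computations can be sketched in a few lines of NumPy (illustrative only; training uses the framework's built-in loss):

```python
import numpy as np

def sigmoid(z):
    """Map a logit z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred)
                    + (1.0 - y_true) * np.log(1.0 - y_pred))

# A confident correct prediction incurs a small loss;
# a confident wrong prediction incurs a large one.
p = sigmoid(np.array([4.0, -4.0]))        # probabilities near 0.98 and 0.02
low = bce_loss(np.array([1.0, 0.0]), p)   # labels agree with the logits
high = bce_loss(np.array([0.0, 1.0]), p)  # labels contradict the logits
```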
The Adam optimizer updates model parameters by maintaining exponential moving averages of the first and second moments of the gradients. Let $\theta_t$ denote the model parameters at iteration $t$, and let $g_t = \nabla_\theta \mathcal{L}(\theta_{t-1})$ represent the gradient of the loss function $\mathcal{L}$ with respect to $\theta$. The first moment estimate $m_t$ and the second moment estimate $v_t$ are updated as

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2,$$

where $m_t$ is an exponentially weighted moving average of past gradients (first moment), $v_t$ is an exponentially weighted moving average of the squared gradients (second raw moment), and $\beta_1$ and $\beta_2$ are decay coefficients that control the influence of historical gradient information.

To compensate for the bias introduced by initializing $m_0$ and $v_0$ to zero, bias-corrected estimates are computed as

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}.$$

The parameter update rule is then given by

$$\theta_t = \theta_{t-1} - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},$$

where $\eta$ denotes the learning rate and $\epsilon$ is a small positive constant added to ensure numerical stability. This formulation enables Adam to adaptively adjust learning rates for individual parameters while incorporating momentum through the first moment estimate.
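The update equations can be expressed as a standalone NumPy routine (illustrative only; the experiments use the framework's Adam implementation, and the coefficient values below are the common defaults):

```python
import numpy as np

def adam_step(theta, grad, m, v, t,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moment updates, bias correction, parameter step."""
    m = beta1 * m + (1.0 - beta1) * grad        # first moment (momentum)
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # second raw moment
    m_hat = m / (1.0 - beta1 ** t)              # bias-corrected estimates
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Sanity check: minimizing f(theta) = theta^2 from theta = 3
# drives the iterates toward the optimum at 0.
theta, m, v = np.array(3.0), 0.0, 0.0
for t in range(1, 2001):
    grad = 2.0 * theta                          # df/dtheta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.05)
```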
During training, we monitored both validation AUC and validation loss to assess convergence behavior. Models were intentionally allowed to run for additional epochs to examine training dynamics, evaluate the effects of different learning strategies on loss convergence, and assess potential overfitting. This approach ensured that the training process was convergent and that the models were sufficiently trained.
3.3. Transfer Learning
Transfer learning is an effective strategy for addressing limited labeled data in specialized visual domains such as deep-sea imaging. Formally, transfer learning aims to improve performance on a target task by leveraging knowledge learned from a related source task. Let the source domain be defined as $\mathcal{D}_S = \{\mathcal{X}_S, P_S(X)\}$ with the corresponding source task $\mathcal{T}_S = \{\mathcal{Y}_S, f_S(\cdot)\}$, where $\mathcal{X}$ denotes the input space, $P(X)$ the data distribution, $\mathcal{Y}$ the label space, and $f(\cdot)$ the predictive function learned from the source data. Similarly, the target domain and task are denoted by $\mathcal{D}_T$ and $\mathcal{T}_T$, respectively. Transfer learning is applicable when either the data distributions or the tasks differ, that is, $\mathcal{D}_S \neq \mathcal{D}_T$ or $\mathcal{T}_S \neq \mathcal{T}_T$.

In deep neural networks, the predictive function can be decomposed as $f(x) = h(\phi(x))$, where $\phi(\cdot)$ represents the feature extraction layers and $h(\cdot)$ denotes the task-specific classification head. During pretraining, the network parameters $\theta_S$ are learned on the source dataset by minimizing the empirical risk

$$\theta_S = \arg\min_{\theta} \frac{1}{N_S} \sum_{i=1}^{N_S} \ell\big(h(\phi(x_i; \theta)),\, y_i\big),$$

where $\ell$ is the loss function and $(x_i, y_i)$ are source samples.

For the target task, the pretrained feature extractor $\phi(\cdot\,; \theta_S)$ is reused, and the model is adapted by minimizing

$$\theta_T = \arg\min_{\theta} \frac{1}{N_T} \sum_{j=1}^{N_T} \ell\big(h(\phi(x_j; \theta)),\, y_j\big),$$

where $(x_j, y_j)$ are samples from the target domain. During fine-tuning, the feature extraction layers are initialized with $\theta_S$ and selectively updated using a smaller learning rate, allowing the learned representations to adapt to the statistical properties of deep-sea imagery while preserving general visual features.
Figure 2 illustrates the transfer learning framework adopted in this study. We employ ResNet models pretrained on the ImageNet dataset to extract hierarchical visual representations, ranging from low-level edges and textures to higher-level shape-based patterns. These representations are subsequently fine-tuned using domain-specific underwater imagery to improve discrimination between fish and non-fish classes. Prior work in ecological monitoring and underwater computer vision has shown that such pretrained representations generalize well to novel object categories due to their domain-agnostic nature.
For implementation, we used multiple ResNet variants pretrained on ImageNet [
37]. The original classification head was replaced with a global average pooling layer followed by a dense output layer tailored for binary classification. Training was conducted using the Adam optimizer with a learning rate of
and binary cross-entropy loss. Models were trained for up to 200 epochs with early stopping (patience = 20) to mitigate overfitting, using a batch size of 32 in our 14 accuracy-focused experiments. This transfer learning strategy enables robust classification under challenging underwater imaging conditions, where labeled data are limited and visual characteristics differ substantially from the images used to train the original model.
Our experiments were conducted entirely within the Keras 3.13.2 environment to ensure a consistent and controlled implementation across all evaluated models. At the time of experimentation, ImageNet-pretrained weights for ResNet-18 were not natively available within the Keras Applications module.
To preserve implementation consistency—particularly for fair comparison of training dynamics and EWS measurements—we limited our experiments to models officially supported within the same framework. This approach ensured that all models shared identical software, optimization settings, and hardware measurement procedures.
3.4. Fine-Tuning Parameters
To examine the impact of optimization strategies during transfer learning, we conducted a series of fine-tuning experiments using a pretrained ResNet-50 model. The experiments evaluated two batch size configurations (32 and 256) in combination with a diverse set of learning-rate schedulers, including no scheduling (NONE), Cosine Annealing (COSINE), Cosine Annealing with Restarts (COSINE_RESTARTS), Reduce-on-Plateau (PLATEAU), Exponential Decay (EXPONENTIAL), Piecewise Constant Decay (PIECEWISE), Polynomial Decay (POLYNOMIAL), Linear Warmup (LINEAR_WARMUP), and Linear Warmup followed by Cosine Annealing (LINEAR_WARMUP_COSINE).
All configurations were trained under identical experimental conditions to ensure a fair comparison, enabling a systematic analysis of how batch size and learning-rate scheduling jointly influence convergence behavior, classification performance, and computational efficiency. Across all experiments, the initial learning rate was set to the base value specified in the experiment configuration and served as the maximum learning rate during training. When no scheduling was applied, both the initial and minimum learning rates were equal to this base value.
When learning-rate scheduling was enabled, the learning rate varied dynamically over training while remaining bounded above by the initial value. Cosine-based and warmup-based schedules reduced the learning rate toward zero by the end of training, yielding an effective minimum learning rate approaching 0. In contrast, monotonic decay strategies enforced explicit lower bounds: Exponential Decay, Polynomial Decay, and Piecewise Constant Decay reduced the learning rate to approximately 0.01 times the initial value (implemented by setting the schedule’s alpha parameter to 0.01), while Reduce-on-Plateau adaptively decreased the learning rate with a minimum bound of 0.00001. This design ensured a consistent maximum learning rate across configurations while allowing controlled variation in minimum learning rates and decay behavior, facilitating a fair comparison of optimization strategies.
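As one concrete example, the Linear Warmup followed by Cosine Annealing schedule can be sketched in plain Python (an illustration of the bounds described above: the rate never exceeds the base value and decays toward zero; step counts are arbitrary):

```python
import math

def warmup_cosine_lr(step, total_steps, base_lr, warmup_steps):
    """Linear warmup to base_lr, then cosine annealing toward 0."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Learning rate over 100 steps with a 10-step warmup.
schedule = [warmup_cosine_lr(s, total_steps=100, base_lr=1e-3, warmup_steps=10)
            for s in range(100)]
```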
3.5. Gradient Clipping
Gradient clipping was employed to mitigate unstable and excessively large parameter updates during training. This technique constrains the magnitude of gradient updates, thereby improving optimization stability, particularly in large-batch training regimes. Although larger batch sizes reduce gradient variance, they can increase the magnitude of individual parameter updates, making optimization more sensitive to rare but extreme gradient excursions. By explicitly bounding the global gradient norm, clipping limits abrupt movements in parameter space and helps stabilize training dynamics. In this study, we evaluated the impact of gradient clipping using three global clip-norm thresholds (1.0, 0.75, and 0.5) to assess how progressively tighter bounds on update magnitude influence convergence behavior and validation stability.
The clip-norm values were selected after examining the gradient behavior of the unclipped ResNet-50 (ImageNet-pretrained) configuration. Without clipping, the global gradient norms exhibited two clearly separated regimes: a stable regime with values below 0.03 and intermittent extreme spikes ranging from 45 to 513. As seen in
Figure 3, the smallest spike exceeded the largest stable gradient magnitude by approximately a factor of 1500, indicating a pronounced separation between ordinary updates and rare high-magnitude excursions. Based on this structure, we began with a commonly used baseline threshold of 1.0 and progressively reduced it to 0.75 and 0.5. These thresholds were chosen to remain substantially above the stable gradient range while imposing increasingly strict constraints on extreme spikes, enabling a systematic evaluation of how different clipping intensities affect training smoothness and stability.
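Global-norm clipping as used here can be sketched in NumPy (mirroring the behavior of the `clipnorm` option in Keras optimizers, with the gradient magnitudes below chosen to echo the two regimes observed above):

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Rescale a list of gradient arrays so that their joint L2 norm does
    not exceed clip_norm; gradients below the threshold pass unchanged."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm <= clip_norm:
        return grads, global_norm
    scale = clip_norm / global_norm
    return [g * scale for g in grads], global_norm

# A rare spike (norm 500) is rescaled down to the threshold,
# while a typical update (norm 0.03) passes through untouched.
spike, _ = clip_by_global_norm([np.full(4, 250.0)], clip_norm=1.0)
stable, _ = clip_by_global_norm([np.full(4, 0.015)], clip_norm=1.0)
```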
3.6. Threshold Selection
During the initial evaluation of model performance, a default decision threshold of 0.5 was used to assign class labels in the binary classification task. To identify a better operating point, we optimized the F1 score, which balances precision and recall. Specifically, we swept candidate threshold values below 0.5 on the validation set and selected the threshold that maximized the F1 score. This optimized threshold was subsequently fixed and applied when evaluating the trained classifier on the held-out test set.
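The sweep reduces to a few lines of NumPy (a minimal sketch; `val_labels` and `val_probs` stand in for the validation ground truth and predicted probabilities):

```python
import numpy as np

def f1_at_threshold(labels, probs, thr):
    """F1 score of the binary decision probs >= thr."""
    preds = (probs >= thr).astype(int)
    tp = np.sum((preds == 1) & (labels == 1))
    fp = np.sum((preds == 1) & (labels == 0))
    fn = np.sum((preds == 0) & (labels == 1))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(labels, probs, candidates=np.arange(0.05, 0.50, 0.01)):
    """Pick the candidate threshold (all below 0.5) maximizing validation F1."""
    scores = [f1_at_threshold(labels, probs, t) for t in candidates]
    return float(candidates[int(np.argmax(scores))])
```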
3.7. Evaluation Metrics
To assess model performance, we employed several standard classification metrics, including Accuracy (ACC), Precision, Recall, F1 Score, and an Energy-Weighted Score (EWS). These metrics jointly capture predictive performance and computational efficiency.
Classification Metrics
Let $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives, respectively, where the positive class corresponds to fish images.

ACC measures the proportion of correctly classified samples and is defined as

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}.$$

Precision measures the reliability of positive (fish) predictions and is given by

$$\mathrm{Precision} = \frac{TP}{TP + FP}.$$

Recall evaluates the model’s ability to identify all positive instances and is defined as

$$\mathrm{Recall} = \frac{TP}{TP + FN}.$$

The F1 Score provides a balanced measure of Precision and Recall through their harmonic mean:

$$\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$

In addition to threshold-dependent metrics, we also report the Area Under the Receiver Operating Characteristic Curve (AUC), which evaluates the model’s ability to discriminate between positive and negative classes across all possible classification thresholds. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR), where

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}.$$

The AUC is defined as the area under the ROC curve:

$$\mathrm{AUC} = \int_0^1 \mathrm{TPR}\big(\mathrm{FPR}\big)\, d\big(\mathrm{FPR}\big).$$

An AUC value of 0.5 indicates random classification performance, while an AUC of 1.0 corresponds to perfect discrimination between fish and non-fish images.
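The threshold-dependent metrics translate directly into a small self-contained helper (illustrative; in practice these values come from the confusion matrix of the test set):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute ACC, Precision, Recall, and F1 from confusion-matrix counts
    (positive class = fish)."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"acc": acc, "precision": precision, "recall": recall, "f1": f1}

# Example counts: 90 fish found, 80 non-fish rejected,
# 10 false alarms, 20 missed fish.
m = classification_metrics(tp=90, tn=80, fp=10, fn=20)
```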
3.8. Wilson Score Confidence Interval for Accuracy
To quantify the statistical uncertainty associated with classification accuracy, we computed 95% Wilson score confidence intervals based on the confusion matrix results.
Let $n$ denote the total number of test samples, $k$ the number of correctly classified samples obtained from the confusion matrix, $\hat{p} = k/n$ the observed accuracy, and $z$ the standard normal quantile corresponding to the desired confidence level (for 95% confidence, $z = 1.96$).

Since classification accuracy represents a binomial proportion, the Wilson score interval was used due to its improved coverage properties compared to the standard Wald approximation, particularly when $\hat{p}$ is close to 0 or 1.

The Wilson confidence interval is defined as:

$$\frac{\hat{p} + \frac{z^2}{2n} \pm z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}.$$

Accordingly, the lower and upper bounds are computed as:

$$p_{\mathrm{lower}} = \frac{\hat{p} + \frac{z^2}{2n} - z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}, \qquad p_{\mathrm{upper}} = \frac{\hat{p} + \frac{z^2}{2n} + z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}.$$
This interval provides a statistically robust estimate of the uncertainty associated with the observed accuracy values.
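The Wilson interval can be computed with a short standalone function (the example counts below are hypothetical, not results from this study):

```python
import math

def wilson_interval(k, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion k/n
    at the confidence level implied by the normal quantile z."""
    p_hat = k / n
    denom = 1.0 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n
                                   + z ** 2 / (4 * n ** 2))
    return center - half, center + half

# Example: 9,800 of 10,000 test images classified correctly.
lo, hi = wilson_interval(k=9800, n=10000)
```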
In addition to these quantitative metrics, all misclassified samples were manually reviewed to determine whether errors resulted from model behavior or potential inaccuracies in the ground-truth annotations, thereby ensuring the reliability of the evaluation process.
3.9. Energy-Weighted Score (EWS)
Beyond classification accuracy, this study evaluates computational efficiency by systematically monitoring CPU, GPU, and memory utilization during model training. All experiments were conducted on identical hardware, and resource usage was sampled at fixed 0.5 s intervals to ensure comparability across runs.
At each sampling interval, system-level metrics were recorded, including a timestamp (converted to elapsed time in seconds), CPU utilization percentage, and system memory usage measured in megabytes. These measurements provide a temporal profile of processor and memory demand throughout training.
GPU-level metrics were collected from an NVIDIA A100 (NVIDIA Corporation, Santa Clara, CA, USA) accelerator operating within a cloud-based environment. The recorded variables included GPU utilization percentage, GPU memory usage (MB), GPU temperature (°C), and total available GPU memory. Although the hardware supports direct power queries (e.g., instantaneous power draw and power limits), such metrics were not used in this study due to access restrictions commonly imposed in multi-tenant computing environments.
Using the recorded utilization data, energy consumption was estimated for each hardware component. GPU energy consumption ($E_{\mathrm{GPU}}$) was computed as the time-integrated product of GPU utilization and the maximum rated GPU wattage. CPU energy consumption ($E_{\mathrm{CPU}}$) was estimated analogously using CPU utilization and an assumed processor wattage. Memory energy consumption ($E_{\mathrm{RAM}}$) was derived from RAM usage (in gigabytes) multiplied by an estimated watts-per-GB factor over time. Total energy consumption was then defined as
$$E_{\mathrm{total}} = E_{\mathrm{GPU}} + E_{\mathrm{CPU}} + E_{\mathrm{RAM}}.$$
To enable a unified comparison of computational efficiency across experimental conditions, we introduce the EWS, a composite metric that aggregates energy consumption across hardware components using cost-based weighting factors. These weights were derived from the relative pricing of GPU, CPU, and memory resources offered by the cloud provider at the time of experimentation (
Table 1). The EWS is defined as
$$\mathrm{EWS} = w_1 E_1 + w_2 E_2 + w_3 E_3,$$
which can be written compactly as a summation over components:
$$\mathrm{EWS} = \sum_{i=1}^{3} w_i E_i,$$
where $w_i$ denotes the infrastructure cost weight for component $i$, and $E_i$ denotes the estimated energy consumption (in Wh) for component $i$. In our experiments, $i = 1, 2, 3$ correspond to GPU, CPU, and RAM, respectively.
The motivation for this weighting scheme is economic rather than physical. Resource cost serves as a proxy for relative scarcity and infrastructure value, reflecting the operational expense associated with providing each component in a managed cloud environment. As such, EWS should be interpreted as a normalized, hardware-dependent efficiency indicator rather than an absolute measure of energy consumption. Through its cost-based weighting, EWS accounts for the relative scarcity of the high-demand hardware used in the experiments and provides a single scalar metric for comparing experimental runs under identical hardware conditions.
It is important to emphasize that computational efficiency metrics are inherently hardware-specific, and substantial ongoing research seeks to standardize energy-aware evaluation in machine learning. In this work, the objective is not to claim universal efficiency but to provide a consistent, interpretable basis for comparing experimental configurations executed under identical hardware conditions.
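To make the estimation procedure concrete, the sketch below illustrates the time integration and cost weighting described above. The 0.5 s sampling interval matches the monitoring setup; any wattage or weight values passed in are placeholders chosen by the caller, not the figures from Table 1.

```python
# Sampling interval of 0.5 s, expressed in hours so energies come out in Wh.
DT_HOURS = 0.5 / 3600.0

def component_energy(util_fracs, max_watts, dt_hours=DT_HOURS):
    """Time-integrated energy (Wh) for one component: at each sample,
    utilization fraction x rated power x sampling interval."""
    return sum(u * max_watts * dt_hours for u in util_fracs)

def ews(energies_wh, weights):
    """Energy-Weighted Score: cost-weighted sum over the per-component
    energies, ordered as (GPU, CPU, RAM)."""
    return sum(w * e for w, e in zip(weights, energies_wh))
```

Because the weights are dimensionless cost ratios, the resulting score is a relative indicator for comparing runs on the same hardware, consistent with the interpretation above.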
3.10. Experiments
This study was designed to systematically evaluate the trade-offs between predictive performance and computational efficiency in ResNet-based fish versus non-fish image classification. In total, 32 experiments were conducted under consistent preprocessing pipelines, evaluation protocols, and hardware conditions to ensure fair comparison. All experiments were executed on a Google Colab environment equipped with 167 GB of system RAM and an NVIDIA A100 GPU with 80 GB of memory. Models were trained on the DeepFish dataset, which comprises 39,766 labeled images containing both fish (positive) and non-fish (negative) samples. The class distribution was preserved as provided in the dataset, and no explicit class weighting was applied during training. The experimental framework incorporated dataset split testing (replicating the dataset-provided split as well as an alternative 80%/10%/10% train/validation/test split), comparison of randomly initialized models versus ImageNet-pretrained models, and evaluation of multiple ResNet models (ResNet-18, ResNet-50, ResNet-101, and ResNet-152).
In addition to model and data-related factors, we conducted a series of optimization-focused experiments to analyze training dynamics and efficiency. These included learning-rate scheduler experiments using NONE, COSINE, COSINE_RESTARTS, PLATEAU, EXPONENTIAL, PIECEWISE, POLYNOMIAL, LINEAR_WARMUP, and LINEAR_WARMUP_COSINE, as well as batch size comparisons between 32 and 256 to study their impact on convergence behavior. Post-training threshold optimization was performed by maximizing the F1 score on the validation set and applying the selected threshold to the held-out test set. Computational efficiency across all experiments was evaluated using both accuracy (ACC) and the proposed EWS, derived from detailed CPU, GPU, and memory utilization measurements. Finally, to mitigate instability observed during large-batch training, gradient clipping experiments were conducted using multiple clip-norm values to assess their effect on stabilizing optimization and smoothing training curves.
We also conducted an experiment to evaluate the impact of data augmentation techniques designed to address potential issues arising from reshaping and skewing the images. As shown in
Figure 4, the augmentation procedure involved letter-boxing the image by adding black padding to reshape it into a square format. The square image was then resized to 224 × 224.
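A minimal sketch of the letter-boxing step, assuming a NumPy image array (the helper name is ours); in the actual pipeline the padded square would subsequently be resized to 224 × 224, e.g., with OpenCV’s cv2.resize:

```python
import numpy as np

def letterbox_square(img):
    """Pad an HxWxC (or HxW) image with black borders to a centered
    square, preserving the aspect ratio of the original content."""
    h, w = img.shape[:2]
    side = max(h, w)
    top = (side - h) // 2
    left = (side - w) // 2
    canvas = np.zeros((side, side) + img.shape[2:], dtype=img.dtype)
    canvas[top:top + h, left:left + w] = img
    return canvas
```

Padding before resizing avoids the aspect-ratio distortion that direct resizing to 224 × 224 would introduce, which is the skewing issue this experiment was designed to probe.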
4. Results
This section presents the outcomes of all experimental evaluations conducted in this study. A detailed interpretation and contextual discussion of these results on the DeepFish dataset is provided in
Section 5.
All experiments were conducted in a Google Colab Python 3.12.12 environment equipped with 167 GB of system RAM and an NVIDIA A100 GPU with 80 GB of GPU memory. Model training and inference were implemented using TensorFlow 2.19.0 and Keras 3.13.2, while dataset partitioning and stratification were performed using scikit-learn. Image loading and resizing were handled with OpenCV 4.13.0. NumPy 2.0.2 and Pandas 2.2.2 were used for numerical operations and management of comma-separated value (CSV) files. Hardware utilization and energy-related metrics were monitored using NVIDIA-SMI 580.82.07 and psutil 5.9.5 throughout the experiments.
4.1. Split and Pretraining Experiments
Table 2 summarizes the results of experiments comparing dataset partitioning strategies and model pretraining configurations. Specifically, performance is reported for both the dataset-provided split used in prior work [
33] (40%/10%/50% train/validation/test) and the proposed 80%/10%/10% split. Experiments were conducted using four ResNet models—ResNet-18, ResNet-50, ResNet-101, and ResNet-152—under identical training and evaluation conditions. ResNet-18 is included only in its randomly initialized form, as ImageNet-pretrained weights are not available for this model in the native Keras implementation.
Overall, the proposed 80/10/10 split yielded the highest observed test accuracy (0.99975, achieved by ResNet-18). These results suggest that increasing the proportion of training data leads to substantial improvements in classification performance for this task on the DeepFish data.
4.2. Confidence Intervals
We calculated 95% confidence intervals (CI) for all primary evaluation metrics using the Wilson score method, which is more robust than the standard normal approximation, particularly for high-performance metrics near 1.0. The confidence intervals for accuracy under the 80/10/10 split are presented in
Table 3. ResNet-18 (R18) reports the highest point estimate, with an accuracy of 0.9997 and a 95% CI of [0.9986, 1.0000], whereas ResNet-50 (R50) reports the lowest accuracy at 0.9967. Overall, the confidence intervals are narrow, indicating that model performance is statistically stable on the held-out test set. Additional analysis and interpretation are provided in
Section 5.
4.3. Not Pretrained Versus Pretrained Models
Table 2 presents the results comparing models trained from scratch with those initialized using ImageNet-pretrained weights. This comparison was conducted for ResNet-50, ResNet-101, and ResNet-152, for which pretrained weights are available. ResNet-18 is reported only in its randomly initialized form, as ImageNet-pretrained weights are not available in the native Keras implementation used in this study. To preserve experimental consistency and ensure comparability of EWS measurements across models, we maintained a single implementation environment without introducing external pretrained weight sources that could alter training dynamics or system-level resource consumption.
We present
Figure 5 to illustrate the effect of transfer learning on convergence behavior. The randomly initialized ResNet-50 (R50) requires more epochs to stabilize and reach its optimal validation performance compared to the ImageNet-pretrained ResNet-50 (R50-PT). While both models ultimately converge to similar performance levels, the pretrained model exhibits faster stabilization and reduced training duration, indicating that initialization with ImageNet weights improves optimization efficiency. This difference in convergence speed demonstrates the practical benefit of transfer learning, even when final accuracy is comparable.
Overall, all evaluated models demonstrated strong classification performance. Across experiments, pretrained models exhibited faster convergence than their randomly initialized counterparts, reflecting the benefits of transfer learning. Under the proposed 80%/10%/10% data split, the highest accuracy observed was 0.99975, achieved by the ResNet-18 model trained from scratch.
4.4. Data Augmentation
To determine whether skewed resizing impacts overall model performance, we conducted experiments using letter-boxed images resized to 224 × 224 with black padding. We evaluated both R50 and R50-PT under Original (ORG) and Augmented (AUG) conditions. The results are presented in
Table 4. Accuracy slightly improved in both models following augmentation, with R50 increasing by +0.00226 and R50-PT increasing by +0.00025.
4.5. Energy-Weighted Score (EWS)
Table 5 presents the Energy-Weighted Score (EWS), training duration, and energy consumption for each model configuration on the 80/10/10 split.
Table 6 reports the EWS for each model along with the corresponding reduction in accuracy (REDUCED-ACC) relative to the highest-performing configuration, ResNet-18 without pretraining. The table shows substantial variation in EWS across models, reflecting differences in computational cost associated with model depth and pretraining.
Among all evaluated models, ResNet-18 without pretraining achieved the lowest EWS value of 340.83, indicating the smallest weighted energy consumption during training. In contrast, deeper models exhibited markedly higher EWS values regardless of pretraining. Although these deeper models incurred higher computational costs, the corresponding reductions in accuracy were relatively small, with REDUCED-ACC values remaining below 0.0031 across all configurations. This result highlights that modest accuracy differences are associated with substantial increases in energy consumption as model complexity increases.
4.6. Threshold Optimization
To improve overall classification performance, decision thresholds were tuned by maximizing the F1 score on the validation set.
Table 7 summarizes the impact of this threshold optimization. The left set of columns reports test-set performance using the default fixed threshold ($t = 0.5$), while the right set presents results obtained using the optimized threshold values selected exclusively from the validation set. For most models, threshold optimization improved F1 on the held-out test set; one configuration (R101-PT) decreased slightly, and some were unchanged. The most notable improvement was observed for the ResNet-18 model, where adjusting the threshold from 0.50 to 0.15 eliminated all false positives and false negatives, resulting in perfect classification performance on the test set. These results demonstrate that validation-based threshold selection can substantially improve model performance on the DeepFish dataset, even when baseline accuracy is already high.
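The validation-based threshold selection can be sketched as a simple grid search that maximizes F1 on the validation set (the grid resolution and helper name are our own illustrative choices, not the study’s implementation):

```python
def best_f1_threshold(val_labels, val_scores, grid=None):
    """Pick the decision threshold that maximizes F1 on the validation
    set; the chosen threshold is then applied unchanged to the test set."""
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]  # 0.01 .. 0.99
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        tp = sum(1 for y, s in zip(val_labels, val_scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(val_labels, val_scores) if y == 0 and s >= t)
        fn = sum(1 for y, s in zip(val_labels, val_scores) if y == 1 and s < t)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

Selecting the threshold on the validation set and freezing it before touching the test set keeps the test results unbiased, which is the protocol followed throughout.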
4.7. Learning Rate Scheduler and Batch Size Experiments
Learning rate scheduler experiments were conducted using two batch sizes, 32 and 256, to evaluate the interaction between optimization strategy and batch size on model performance. For each batch size, multiple learning rate schedulers were examined, and the results are summarized in
Table 8 and
Table 9. Each table reports the scheduler achieving the highest classification accuracy (ACC) and the scheduler yielding the lowest EWS, allowing direct comparison of predictive performance and computational efficiency.
For the smaller batch size of 32, the results show noticeable variation across learning rate schedulers, indicating that convergence behavior and generalization are sensitive to optimization choices under small-batch training. In contrast, experiments with a batch size of 256 demonstrate more consistent performance across schedulers, with the cosine learning rate scheduler emerging as the top-performing configuration when jointly considering ACC and EWS.
4.8. Gradient Clipping and Learning-Rate Scaling for Batch Size 256
When training with a batch size of 256, pronounced instability was observed in the validation learning curves (
Figure 6). To investigate whether this behavior was attributable to learning-rate configuration under large-batch training, a set of controlled experiments was conducted while holding the model, dataset, data split, optimizer, and random seed constant. Learning-rate scaling was first applied following the square-root heuristic, adjusting the initial and minimum learning rates from 0.001 and 0.00001 to 0.003 and 0.00003, respectively, to account for the eightfold increase from batch size 32 to batch size 256. Despite this adjustment, validation instability persisted. Consequently, global-norm gradient clipping was introduced with thresholds of 1.0, 0.75, and 0.5, and global gradient norms were logged throughout training to characterize optimization dynamics. As summarized in
Table 9, the combination of learning-rate scaling with gradient clipping at a threshold of 0.75 substantially stabilized the validation curves while maintaining strong test-set performance, achieving an accuracy of 0.99799.
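The two stabilization techniques can be illustrated in isolation. The sketch below expresses square-root learning-rate scaling and global-norm gradient clipping as plain functions over flat gradient lists; in Keras, clipping would typically be enabled via the optimizer’s global_clipnorm argument rather than implemented manually. Note that $0.001 \cdot \sqrt{8} \approx 0.00283$, which the experiments round to 0.003.

```python
import math

def scale_lr_sqrt(base_lr, base_batch, new_batch):
    """Square-root learning-rate scaling heuristic for a larger batch."""
    return base_lr * math.sqrt(new_batch / base_batch)

def clip_global_norm(grads, max_norm):
    """Rescale gradients so their global L2 norm does not exceed
    max_norm, leaving them untouched when already within the bound."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return grads
    scale = max_norm / norm
    return [g * scale for g in grads]
```

Clipping by global norm preserves the direction of the update while bounding its magnitude, which is why it damps the occasional large gradient spikes without otherwise altering optimization.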
5. Discussion
We evaluated multiple ResNet models to investigate the relationship between network depth, predictive accuracy, and computational cost on the DeepFish dataset. Although deeper models provide increased representational capacity, classification performance on this binary task approached saturation. On the 80/10/10 split, ResNet-18 achieved the highest observed accuracy of 0.99975. Prior studies have reported that increasing network depth may initially improve accuracy; however, beyond a certain point, performance can plateau or even decline depending on dataset characteristics and training configuration [
38].
In our experiments, deeper models exhibited comparable or slightly lower performance, indicating that increasing architectural depth did not improve predictive accuracy for this binary classification task. Wilson 95% confidence intervals further support this interpretation. Several models, including ResNet-50, ResNet-101, and ResNet-152, exhibit overlapping confidence intervals, suggesting functional equivalence in classification performance on this dataset. In contrast, there is no confidence interval overlap between ResNet-18 and ResNet-50: the lower bound of the ResNet-18 interval exceeds the upper bound of the ResNet-50 interval (
Table 3), indicating that ResNet-18 outperforms ResNet-50 at the 95% confidence level for the DeepFish binary classification task.
The DeepFish classification dataset exhibits only mild class imbalance, with approximately 56.2% negative samples and 43.8% positive samples. The imbalance ratio (IR), defined as the ratio of the majority class size to the minority class size, $\mathrm{IR} = N_{\mathrm{maj}} / N_{\mathrm{min}}$, is approximately 1.28. Despite this slight skew, the evaluated models achieved consistently high performance across multiple complementary metrics. For example, ResNet-18 attained 99.975% accuracy, an AUC of 1.0, and an F1 score of 0.99971 (
Table 2). The near-perfect F1 score reflects both high precision and high recall for the positive (fish) class, suggesting that the observed performance is not primarily driven by majority-class bias and remains robust under mild class imbalance.
In contrast to the near-saturated accuracy results, computational cost increased substantially with model depth. The EWS rose markedly from ResNet-18 to deeper models under identical training conditions, reflecting longer training times and greater resource utilization. While predictive performance remained statistically similar across models of different depths, the associated computational expenditure increased significantly. From an efficiency perspective, the shallower model therefore represents a more practical choice for this binary classification setting.
Among the evaluated configurations, pretrained variants converged faster than randomly initialized models but did not alter the overall performance plateau observed for this task. These findings indicate that architectural scaling or pretraining alone does not necessarily yield improved predictive outcomes when the underlying classification problem is already highly separable.
Optimization strategies also played an important role in stabilizing training and improving performance. Cosine learning-rate scheduling promoted smooth convergence, particularly at larger batch sizes, while gradient clipping mitigated instability caused by occasional large gradient updates. In addition, validation-based threshold optimization further enhanced classification outcomes, highlighting the importance of post-training calibration. Collectively, these results demonstrate that architectural design, optimization strategy, and computational efficiency should be considered jointly when developing robust underwater computer vision models.
Binary fish detection represents a focused yet practically important application of underwater computer vision; however, it constitutes a simplified task relative to more ecologically relevant multi-species or fine-grained recognition problems. The conclusions of this study are therefore limited to the binary DeepFish setting and may not directly generalize to more complex underwater environments or species-level classification tasks.
As future work, we plan to evaluate the proposed framework on more diverse underwater datasets and incorporate advanced computer vision models and data augmentation strategies to better assess generalization and robustness. Expanding toward multi-species classification will enhance ecological relevance and further support applications in biodiversity monitoring, fisheries management, and marine conservation.
6. Conclusions
This study evaluated multiple Residual Network (ResNet) models—ResNet-18, ResNet-50, ResNet-101, and ResNet-152—for binary underwater fish classification by jointly examining predictive performance and computational efficiency. Experiments were conducted using two dataset partitioning strategies (40/10/50 and 80/10/10 train/validation/test splits) and included both randomly initialized and ImageNet-pretrained models. To quantify training cost, we introduced the Energy-Weighted Score (EWS), which aggregates cost-weighted GPU, CPU, and memory energy consumption during training and serves as a practical tool to support model selection.
Across all experiments, performance on the DeepFish binary classification task approached saturation. A non-pretrained ResNet-18 achieved 99.975% accuracy on the 80/10/10 split, which increased to 100% following validation-based threshold optimization. Deeper models did not yield statistically meaningful improvements in predictive performance, while computational cost increased substantially with depth. These findings suggest that, for well-separable binary classification tasks, shallower ResNet models can achieve near-perfect accuracy while offering significant efficiency advantages. In such settings, computational efficiency becomes a critical criterion for model selection.
These conclusions are specific to the ResNet family and the binary DeepFish classification task. In future work, we will investigate more complex datasets and modern architectures to further examine how model design and learning strategies influence performance and computational efficiency in challenging environments.