1. Introduction
Accurate fish identification and classification are fundamental to biodiversity research and fisheries conservation. These identifications matter because they support programs for maintaining healthy fish populations, the detection of both cryptic and invasive species, and the protection of critical ecological systems [
1]. Deep learning techniques applied to underwater photography and video from coastal marine ecosystems provide reliable and accurate fish detections with lower ecological impact than earlier approaches such as scientific trawling [
2]. These ecosystems serve as essential spawning, nursery, and feeding grounds for a diverse collection of marine creatures. Precise species classification further enables scientists and resource managers to associate species with specific habitats and implement targeted conservation measures at appropriate times [
3]. Because fish constitute a major source of protein globally, mapping habitats with high abundance is essential for sustaining fisheries, guiding conservation priorities, and preserving the coupled human–natural systems that depend on them [
4,
5]. Global food demand will increase in the future. Fisheries, when sustainably managed, can provide a key source of protein [
6].
Challenges are especially acute in marine environments. Unlike freshwater systems [
7] with extensive taxonomic baselines, estuary and ocean environments impose severe observational constraints, including limited visibility, low illumination, degraded image resolution, and complex, noisy backgrounds [
4]. These conditions complicate species detection and classification, and the scarcity of high-quality labeled datasets further restricts methodological progress.
Traditional approaches to fish classification often rely on divers to collect fish samples, a process that is not only labor-intensive and time-consuming but also potentially destructive to marine habitats [
2]. Consequently, marine biologists and researchers have increasingly sought automated and non-invasive methods for fish classification [
4]. In recent years, machine learning (ML) models have been widely applied for this purpose. Among them, Support Vector Machine (SVM) has been used to classify fish species based on features extracted from underwater image datasets [
8,
9]. However, these traditional ML algorithms heavily depend on manually engineered features, limiting their scalability and adaptability to diverse aquatic environments.
Recent advances in computer vision have turned to deep learning to address these obstacles. CNNs can automatically extract discriminative features from underwater imagery and have demonstrated resilience to noise, low contrast, and motion blur [
10,
11,
12]. However, much of the existing literature relies on single-environment datasets [
5,
13,
14], which limits out-of-distribution generalization across sites, seasons, depths, and camera setups. Moreover, training a CNN requires large collections of labeled images and substantial computational resources. To mitigate the data limitation, researchers often turn to transfer learning, which leverages pretrained visual representations from large-scale datasets, thereby reducing data dependency, accelerating convergence, and enhancing classification performance in domain-specific tasks such as underwater fish identification. AlexNet [
15], ResNet, and VGG16 are commonly used transfer learning models in the case of fish classification (FC).
Against this backdrop, this study systematically evaluates the performance of multiple ResNet models for underwater fish image classification by jointly considering predictive accuracy and computational efficiency. Experiments are conducted on a large-scale and well-balanced underwater image dataset consisting of approximately 40,000 images collected from 20 distinct aquatic habitats, allowing classification accuracy (ACC) to serve as a reliable primary evaluation metric. A transfer learning framework is employed using four widely adopted convolutional neural network backbones—ResNet-18, ResNet-50, ResNet-101, and ResNet-152—selected to represent increasing network depth and computational complexity. All models are trained and fine-tuned under a unified experimental protocol to ensure a fair and reproducible comparison. Beyond accuracy, we introduce an Energy-Weighted Score (EWS) to quantify the computational resources required to train each model to a given performance level, integrating model size, memory consumption, training time, and energy-related resource usage into a single efficiency metric. By jointly analyzing accuracy and EWS, this study provides a comprehensive assessment of the trade-offs between performance and computational cost, offering practical guidance for model selection in deployment scenarios where resource availability, energy efficiency, and scalability are critical considerations.
The remainder of this paper is organized as follows.
Section 2 reviews relevant and recent literature.
Section 3 describes the ResNet models and outlines the key methodological steps.
Section 4 presents the performance of the evaluated models on the target dataset and provides comparative analyses in terms of model parameters, memory consumption, and runtime.
Section 5 interprets the results and offers guidance for selecting appropriate models for related applications. Finally,
Section 6 summarizes the main findings and discusses the study’s limitations.
2. Literature Review
In this section, we review recent and representative studies on underwater image classification by first summarizing commonly used datasets for fish detection and classification that have supported model development and benchmarking. We discuss their scope, scale, and limitations, particularly with respect to environmental diversity, annotation quality, and class imbalance. Building on this foundation, we examine prior work on deep learning-based computer vision methods, focusing on convolutional neural network architectures for fish detection and recognition, the adoption of transfer learning to mitigate data scarcity, and emerging efforts to balance classification performance with computational efficiency in resource-constrained underwater environments.
The existing datasets for fish identification and classification play a crucial role in advancing research in underwater biodiversity. One prominent dataset is the Fish4Knowledge dataset [
16], which presents an efficient backbone for fish classification from composited underwater images. This dataset, however, has limitations in terms of its environmental diversity, which may restrict its generalizability across various underwater habitats. Similarly, Kuswantori et al. [
17] developed a dataset for fish detection and classification, aiming to support automatic sorting systems using an optimized YOLO algorithm. However, this dataset is primarily focused on specific operational conditions, which limits its applicability in broader ecological studies. The UVOT400 dataset, introduced by Alawode et al. [
13], aims to enhance underwater visual tracking. Yet, it faces challenges related to the noise and complexity of underwater imagery, which may hinder accurate classification. Furthermore, the FishInTurbidWater dataset, presented by Jahanbakht et al. [
14], employs semi-supervised and weakly supervised deep neural networks for fish detection in turbid underwater videos, but its reliance on semi-supervised methods may result in inconsistencies due to limited labeled data. Lastly, the FishNet dataset by Ma et al. [
18] advances species recognition for aquatic biodiversity monitoring through semi-supervised learning, yet it still confronts the issue of insufficient labeled examples, which is a common challenge across many datasets.
Early work in automated FC relied primarily on traditional computer vision techniques based on manually designed features. Research in this area dates back to 1994, when Castignolles et al. [
19] developed a vision-based system to automatically detect, recognize, and count migratory fish passing through backlit observation windows in river fish passages. Their approach employed offline detection methods to segment fish from S-VHS video frames and enhanced visibility by improving background illumination conditions. Subsequent studies emphasized shape-based feature extraction for fish identification. Lee et al. introduced curvature function (CF)-based descriptors to represent fish contours [
20], and later evaluated multiple contour-based representations—including line segments, polygonal approximations, Fourier descriptors, and CF analysis—for fish classification [
21]. While these approaches demonstrated the feasibility of automated FC, they were often limited by measurement inaccuracies, sensitivity to image quality, and the need for manual determination of feature locations. To address some of these challenges, Islam et al. [
22] proposed a content-based method that integrated both local and global visual features, resulting in improved classification accuracy and outperforming several contemporaneous approaches.
As research progressed, machine learning models became increasingly prevalent in fish classification tasks. These approaches leveraged morphometric measurements and mathematical transform-based descriptors to further automate the classification process and enhance accuracy. Widely adopted algorithms included Support Vector Machines (SVM), Random Forests, and Artificial Neural Networks (ANN), all of which consistently reported superior performance compared to purely rule-based or handcrafted feature-driven methods [
23,
24]. This transition toward data-driven learning frameworks laid the foundation for subsequent advances in fish classification, particularly the emergence of deep learning methods capable of learning discriminative features directly from raw image data in an end-to-end manner.
The emergence of deep learning marked a fundamental shift in image classification research by enabling automatic feature extraction directly from raw image data, thereby reducing reliance on manually engineered descriptors. One of the earliest applications of deep learning to unconstrained underwater fish imagery was introduced by Salman et al. [
25]. In their study, the authors designed a custom CNN comprising three convolutional layers to learn discriminative visual features from fish images. The learned representations were subsequently fed into conventional classifiers, including Support Vector Machines (SVM) and k-Nearest Neighbors (kNN), for species identification. Despite its relatively shallow architecture, the proposed CNN substantially outperformed traditional hand-crafted feature pipelines by autonomously capturing salient visual patterns such as fish body morphology and texture characteristics.
A further paradigm shift in image classification research emerged with the introduction of AlexNet [
26], which demonstrated the effectiveness of deeper convolutional neural network (CNN) architectures when combined with innovations such as ReLU activations, dropout regularization, and large-scale GPU-based training. The success of AlexNet accelerated the widespread adoption of deep learning in visual recognition tasks and significantly influenced subsequent applications in marine and ecological domains, particularly through benchmark initiatives such as the LifeCLEF and SeaCLEF challenges. Building on this advancement, Iqbal et al. [
27] employed AlexNet trained from scratch for fish species classification and achieved an accuracy of 90.48%, while Tamou et al. [
28] applied transfer learning using a pre-trained AlexNet model and reported a substantially higher accuracy of 99.45%. These results underscore the effectiveness of transfer learning for underwater fish classification, representing an early and influential demonstration that CNNs trained on large-scale natural image datasets can provide robust and transferable representations for challenging underwater visual environments.
The use of deeper neural networks has significantly improved image classification performance; however, increasing depth also introduced optimization challenges, particularly the vanishing gradient problem. This issue was effectively addressed by the introduction of residual learning in ResNet [
29], which enables stable training of very deep architectures through identity shortcut connections. Although several deep models—such as VGG16, InceptionV3, Xception, DenseNet, and MobileNet—have achieved strong performance in image classification tasks, ResNet has consistently demonstrated superior optimization efficiency and representational capability across a wide range of benchmarks.
With the availability of large, annotated datasets, end-to-end deep learning models have become dominant in underwater fish classification. Researchers have increasingly fine-tuned deep architectures, especially ResNet variants, to address domain-specific challenges in underwater imagery. For example, Zhang et al. [
30] proposed AdvFish, an adversarial fish recognition framework that fine-tunes a ResNet-50 backbone and incorporates an additional loss term to suppress background noise while emphasizing salient fish features, resulting in improved accuracy in complex scenes. Similarly, Pang et al. [
31] employed a teacher–student knowledge distillation strategy to mitigate underwater image interference, enabling the model to learn more robust representations under conditions of turbidity and illumination variation. Together, these studies highlight the adaptability of modern deep learning techniques in improving robustness and performance for underwater image classification tasks.
3. Materials and Methods
3.1. Dataset
One of the primary requirements in computer vision tasks, particularly in object detection and classification, is the availability of an appropriate and well-curated dataset. Furthermore, deep learning models generally require a large volume of training images to achieve optimal performance. The DeepFish dataset, introduced by Bradley et al. [
32] in 2019, fulfills these requirements. Originally, the dataset was developed to examine the influence of local habitat characteristics and environmental contexts on the assemblage composition of juvenile fish, rather than for classification or segmentation purposes. Subsequently, Saleh et al. [
33] annotated and partitioned the dataset for classification, counting, localization, and segmentation tasks, thereby establishing it as a benchmark dataset for fish identification research.
Primarily, the dataset was captured in video format. The videos in the dataset were collected from 20 distinct habitats located in remote coastal marine regions of tropical Australia. All recordings were captured during daylight hours under low-turbidity conditions using low-disturbance techniques to minimize environmental interference. The footage was acquired in full high-definition resolution (1920 × 1080 pixels) with a digital camera.
The DeepFish dataset contains labeled image data for three distinct fish-identification tasks: classification, segmentation, and localization. In this paper we focus specifically on the classification data to test the ResNet models. The classification subset comprises a total of 39,766 labeled images, including 22,357 negative-class (“No Fish”) images and 17,409 positive-class (“Fish”) images, a 56.2% negative to 43.8% positive distribution, indicating a dataset slightly skewed toward the majority class. The images were extracted as frames from three-channel (RGB) video clips captured at full HD resolution (1920 × 1080 pixels) [
34]. The authors provide a GitHub repository (
https://github.com/alzayats/DeepFish, accessed on 10 January 2026) containing documentation and the code used in their paper [
35]. There is also a download link (
http://data.qld.edu.au/public/Q5842/2020-AlzayatSaleh-00e364223a600e83bd9c3f5bcd91045-DeepFish/, accessed on 10 January 2026) to retrieve the entire DeepFish dataset [
36].
For the classification task, Saleh et al. [
33] reported a data split of 50%, 20%, and 30% for training, validation, and testing, respectively. However, we observed a discrepancy between the split described in the paper and the partition specified in the accompanying dataset CSV files, which indicate a 40%, 10%, and 50% distribution. We note the split defined in Saleh et al.’s dataset files for reproducibility and consistency with the released resources. In this study, however, to enhance model generalization, we partitioned the dataset into a stratified 80%, 10%, and 10% split for the training, validation, and test sets, respectively.
We resized all images to 224 × 224 pixels to match the input dimensions required by the ResNet models, then normalized them by scaling pixel values to the range [0, 1] to improve convergence during training. The data were shuffled with a fixed random seed for repeatability.
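Assuming the images are available as decoded arrays, these preprocessing steps can be sketched with `tf.data` (a minimal illustration; names such as `build_dataset` and the seed value are ours, not from the released code):

```python
import tensorflow as tf

IMG_SIZE = (224, 224)  # input resolution expected by the ResNet backbones
SEED = 42              # illustrative fixed seed for repeatable shuffling

def preprocess(image, label):
    """Resize to 224x224 and scale pixel values to [0, 1]."""
    image = tf.image.resize(image, IMG_SIZE)
    return tf.cast(image, tf.float32) / 255.0, label

def build_dataset(images, labels, batch_size=32):
    """images: decoded video frames; labels: 0 = No Fish, 1 = Fish."""
    ds = tf.data.Dataset.from_tensor_slices((images, labels))
    ds = ds.shuffle(buffer_size=len(labels), seed=SEED)
    ds = ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)
```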
3.2. Models
Our long-term research goals involve complex reef fish detection and classification tasks that require fine-grained visual discrimination. Such tasks are expected to benefit from models with large representational capacity capable of capturing subtle spatial features. Historically, very deep neural networks suffered from vanishing gradients as layer depth increased. Residual Networks (ResNets) addressed this limitation by introducing residual connections that improve gradient flow and enable effective training of substantially deeper architectures [
29].
We utilized four widely adopted convolutional neural network models from the ResNet family—ResNet-18, ResNet-50, ResNet-101, and ResNet-152—as feature extractors in our experiments. The numerical suffix denotes the number of layers in each variant. As network depth increases, so does the number of trainable parameters and representational capacity.
In this set of models, network depth increases progressively from 18 to 152 layers. This progression enables a controlled assessment of whether increased representational capacity yields measurable performance gains on the DeepFish binary classification task or whether performance saturates at lower depths.
All of the pretrained models were initialized with ImageNet-pretrained weights to leverage their strong representational capabilities. Although several alternative architectures, such as VGG, DenseNet, MobileNetV2, AlexNet, InceptionV3, and GoogleNet, are commonly employed for vision tasks involving limited datasets, we prioritized the ResNet family due to its residual (skip) connections [
29]. These connections effectively mitigate the vanishing gradient problem, enabling more stable and efficient training of deeper networks.
To comprehensively assess model performance, we evaluated each model under multiple training configurations. First, the models were trained from scratch using randomly initialized weights to establish baseline performance. Subsequently, transfer learning was employed by fine-tuning ImageNet-pretrained models, enabling the networks to leverage general visual representations learned through large-scale pretraining.
Figure 1 shows the architecture of the models used in our experiments. The final layer of the network, originally designed for 1000 ImageNet categories, was replaced with a fully connected classification head consisting of a single neuron. This neuron outputs a scalar value $z$ representing the logit for the positive (fish) class. The sigmoid activation function converts this logit into a probability between 0 and 1, mathematically defined as

$$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}},$$

where $\hat{y}$ represents the predicted probability that an image belongs to the positive class. The model was trained using the Binary Cross-Entropy (BCE) loss, a suitable choice for binary classification problems. The BCE loss measures the discrepancy between the predicted probabilities $\hat{y}_i$ and the true labels $y_i \in \{0, 1\}$:

$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right].$$
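As a concrete numerical check, the sigmoid and BCE computations can be sketched in a few lines of NumPy (illustrative only; training uses the framework's built-in loss):

```python
import numpy as np

def sigmoid(z):
    """Map a logit z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred)
                    + (1.0 - y_true) * np.log(1.0 - y_pred))

# A confident correct prediction incurs a small loss;
# a confident wrong prediction incurs a large one.
p = sigmoid(np.array([4.0, -4.0]))        # probabilities near 0.98 and 0.02
low = bce_loss(np.array([1.0, 0.0]), p)   # labels agree with the logits
high = bce_loss(np.array([0.0, 1.0]), p)  # labels contradict the logits
```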
The Adam optimizer updates model parameters by maintaining exponential moving averages of the first and second moments of the gradients. Let $\theta_t$ denote the model parameters at iteration $t$, and let $g_t = \nabla_\theta \mathcal{L}(\theta_{t-1})$ represent the gradient of the loss function $\mathcal{L}$ with respect to $\theta$. The first moment estimate $m_t$ and the second moment estimate $v_t$ are updated as

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2,$$

where $m_t$ is an exponentially weighted moving average of past gradients (first moment), $v_t$ is an exponentially weighted moving average of the squared gradients (second raw moment), and $\beta_1$ and $\beta_2$ are decay coefficients that control the influence of historical gradient information.

To compensate for the bias introduced by initializing $m_0$ and $v_0$ to zero, bias-corrected estimates are computed as

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}.$$

The parameter update rule is then given by

$$\theta_t = \theta_{t-1} - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},$$

where $\eta$ denotes the learning rate and $\epsilon$ is a small positive constant added to ensure numerical stability. This formulation enables Adam to adaptively adjust learning rates for individual parameters while incorporating momentum through the first moment estimate.
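The update equations can be expressed as a standalone NumPy routine (illustrative only; the experiments use the framework's Adam implementation, and the coefficient values below are the common defaults):

```python
import numpy as np

def adam_step(theta, grad, m, v, t,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moment updates, bias correction, parameter step."""
    m = beta1 * m + (1.0 - beta1) * grad        # first moment (momentum)
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # second raw moment
    m_hat = m / (1.0 - beta1 ** t)              # bias-corrected estimates
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Sanity check: minimizing f(theta) = theta^2 from theta = 3
# drives the iterates toward the optimum at 0.
theta, m, v = np.array(3.0), 0.0, 0.0
for t in range(1, 2001):
    grad = 2.0 * theta                          # df/dtheta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.05)
```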
During training, we monitored both validation AUC and validation loss to assess convergence behavior. Models were intentionally allowed to run for additional epochs to examine training dynamics, evaluate the effects of different learning strategies on loss convergence, and assess potential overfitting. This approach ensured that the training process was convergent and that the models were sufficiently trained.
3.3. Transfer Learning
Transfer learning is an effective strategy for addressing limited labeled data in specialized visual domains such as deep-sea imaging. Formally, transfer learning aims to improve performance on a target task by leveraging knowledge learned from a related source task. Let the source domain be defined as $\mathcal{D}_S = \{\mathcal{X}_S, P_S(X)\}$ with the corresponding source task $\mathcal{T}_S = \{\mathcal{Y}_S, f_S(\cdot)\}$, where $\mathcal{X}$ denotes the input space, $P(X)$ the data distribution, $\mathcal{Y}$ the label space, and $f(\cdot)$ the predictive function learned from the source data. Similarly, the target domain and task are denoted by $\mathcal{D}_T$ and $\mathcal{T}_T$, respectively. Transfer learning is applicable when either the data distributions or the tasks differ, that is, $\mathcal{D}_S \neq \mathcal{D}_T$ or $\mathcal{T}_S \neq \mathcal{T}_T$.

In deep neural networks, the predictive function can be decomposed as $f(x) = h(\phi(x))$, where $\phi(\cdot)$ represents the feature extraction layers and $h(\cdot)$ denotes the task-specific classification head. During pretraining, the network parameters $\theta_S$ are learned on the source dataset by minimizing the empirical risk

$$\theta_S = \arg\min_{\theta} \frac{1}{N_S} \sum_{i=1}^{N_S} \ell\big(h(\phi(x_i; \theta)),\, y_i\big),$$

where $\ell$ is the loss function and $(x_i, y_i)$ are source samples.

For the target task, the pretrained feature extractor $\phi(\cdot\,; \theta_S)$ is reused, and the model is adapted by minimizing

$$\theta_T = \arg\min_{\theta} \frac{1}{N_T} \sum_{j=1}^{N_T} \ell\big(h(\phi(x_j; \theta)),\, y_j\big),$$

where $(x_j, y_j)$ are samples from the target domain. During fine-tuning, the feature extraction layers are initialized with $\theta_S$ and selectively updated using a smaller learning rate, allowing the learned representations to adapt to the statistical properties of deep-sea imagery while preserving general visual features.
Figure 2 illustrates the transfer learning framework adopted in this study. We employ ResNet models pretrained on the ImageNet dataset to extract hierarchical visual representations, ranging from low-level edges and textures to higher-level shape-based patterns. These representations are subsequently fine-tuned using domain-specific underwater imagery to improve discrimination between fish and non-fish classes. Prior work in ecological monitoring and underwater computer vision has shown that such pretrained representations generalize well to novel object categories due to their domain-agnostic nature.
For implementation, we used multiple ResNet variants pretrained on ImageNet [
37]. The original classification head was replaced with a global average pooling layer followed by a dense output layer tailored for binary classification. Training was conducted using the Adam optimizer with a learning rate of
and binary cross-entropy loss. Models were trained for up to 200 epochs with early stopping (patience = 20) to mitigate overfitting, using a batch size of 32 in our 14 accuracy-focused experiments. This transfer learning strategy enables robust classification under challenging underwater imaging conditions, where labeled data are limited and visual characteristics differ substantially from the images used to train the original model.
Our experiments were conducted entirely within the Keras 3.13.2 environment to ensure a consistent and controlled implementation across all evaluated models. At the time of experimentation, ImageNet-pretrained weights for ResNet-18 were not natively available within the Keras Applications module.
To preserve implementation consistency—particularly for fair comparison of training dynamics and EWS measurements—we limited our experiments to models officially supported within the same framework. This approach ensured that all models shared identical software, optimization settings, and hardware measurement procedures.
3.4. Fine-Tuning Parameters
To examine the impact of optimization strategies during transfer learning, we conducted a series of fine-tuning experiments using a pretrained ResNet-50 model. The experiments evaluated two batch size configurations (32 and 256) in combination with a diverse set of learning-rate schedulers, including no scheduling (NONE), Cosine Annealing (COSINE), Cosine Annealing with Restarts (COSINE_RESTARTS), Reduce-on-Plateau (PLATEAU), Exponential Decay (EXPONENTIAL), Piecewise Constant Decay (PIECEWISE), Polynomial Decay (POLYNOMIAL), Linear Warmup (LINEAR_WARMUP), and Linear Warmup followed by Cosine Annealing (LINEAR_WARMUP_COSINE).
All configurations were trained under identical experimental conditions to ensure a fair comparison, enabling a systematic analysis of how batch size and learning-rate scheduling jointly influence convergence behavior, classification performance, and computational efficiency. Across all experiments, the initial learning rate was set to the base value specified in the experiment configuration and served as the maximum learning rate during training. When no scheduling was applied, both the initial and minimum learning rates were equal to this base value.
When learning-rate scheduling was enabled, the learning rate varied dynamically over training while remaining bounded above by the initial value. Cosine-based and warmup-based schedules reduced the learning rate toward zero by the end of training, yielding an effective minimum learning rate approaching 0. In contrast, monotonic decay strategies enforced explicit lower bounds: Exponential Decay, Polynomial Decay, and Piecewise Constant Decay reduced the learning rate to approximately 0.01 times the initial value (implemented by setting the schedule’s alpha parameter to 0.01), while Reduce-on-Plateau adaptively decreased the learning rate with a minimum bound of 0.00001. This design ensured a consistent maximum learning rate across configurations while allowing controlled variation in minimum learning rates and decay behavior, facilitating a fair comparison of optimization strategies.
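As one concrete example, the Linear Warmup followed by Cosine Annealing schedule can be sketched in plain Python (an illustration of the bounds described above: the rate never exceeds the base value and decays toward zero; step counts are arbitrary):

```python
import math

def warmup_cosine_lr(step, total_steps, base_lr, warmup_steps):
    """Linear warmup to base_lr, then cosine annealing toward 0."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Learning rate over 100 steps with a 10-step warmup.
schedule = [warmup_cosine_lr(s, total_steps=100, base_lr=1e-3, warmup_steps=10)
            for s in range(100)]
```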
3.5. Gradient Clipping
Gradient clipping was employed to mitigate unstable and excessively large parameter updates during training. This technique constrains the magnitude of gradient updates, thereby improving optimization stability, particularly in large-batch training regimes. Although larger batch sizes reduce gradient variance, they can increase the magnitude of individual parameter updates, making optimization more sensitive to rare but extreme gradient excursions. By explicitly bounding the global gradient norm, clipping limits abrupt movements in parameter space and helps stabilize training dynamics. In this study, we evaluated the impact of gradient clipping using three global clip-norm thresholds (1.0, 0.75, and 0.5) to assess how progressively tighter bounds on update magnitude influence convergence behavior and validation stability.
The clip-norm values were selected after examining the gradient behavior of the unclipped ResNet-50 (ImageNet-pretrained) configuration. Without clipping, the global gradient norms exhibited two clearly separated regimes: a stable regime with values below 0.03 and intermittent extreme spikes ranging from 45 to 513. As seen in
Figure 3, the smallest spike exceeded the largest stable gradient magnitude by approximately a factor of 1500, indicating a pronounced separation between ordinary updates and rare high-magnitude excursions. Based on this structure, we began with a commonly used baseline threshold of 1.0 and progressively reduced it to 0.75 and 0.5. These thresholds were chosen to remain substantially above the stable gradient range while imposing increasingly strict constraints on extreme spikes, enabling a systematic evaluation of how different clipping intensities affect training smoothness and stability.
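Global-norm clipping as used here can be sketched in NumPy (mirroring the behavior of the `clipnorm` option in Keras optimizers, with the gradient magnitudes below chosen to echo the two regimes observed above):

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Rescale a list of gradient arrays so that their joint L2 norm does
    not exceed clip_norm; gradients below the threshold pass unchanged."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm <= clip_norm:
        return grads, global_norm
    scale = clip_norm / global_norm
    return [g * scale for g in grads], global_norm

# A rare spike (norm 500) is rescaled down to the threshold,
# while a typical update (norm 0.03) passes through untouched.
spike, _ = clip_by_global_norm([np.full(4, 250.0)], clip_norm=1.0)
stable, _ = clip_by_global_norm([np.full(4, 0.015)], clip_norm=1.0)
```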
3.6. Threshold Selection
During the initial evaluation of model performance, a default decision threshold of 0.5 was used to assign class labels in the binary classification task. To identify a better operating point, we optimized the F1 score, which balances precision and recall. Specifically, we swept candidate threshold values below 0.5 on the validation set and selected the threshold that maximized the F1 score. This optimized threshold was subsequently fixed and applied when evaluating the trained classifier on the held-out test set.
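The sweep reduces to a few lines of NumPy (a minimal sketch; `val_labels` and `val_probs` stand in for the validation ground truth and predicted probabilities):

```python
import numpy as np

def f1_at_threshold(labels, probs, thr):
    """F1 score of the binary decision probs >= thr."""
    preds = (probs >= thr).astype(int)
    tp = np.sum((preds == 1) & (labels == 1))
    fp = np.sum((preds == 1) & (labels == 0))
    fn = np.sum((preds == 0) & (labels == 1))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(labels, probs, candidates=np.arange(0.05, 0.50, 0.01)):
    """Pick the candidate threshold (all below 0.5) maximizing validation F1."""
    scores = [f1_at_threshold(labels, probs, t) for t in candidates]
    return float(candidates[int(np.argmax(scores))])
```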
3.7. Evaluation Metrics
To assess model performance, we employed several standard classification metrics, including Accuracy (ACC), Precision, Recall, F1 Score, and an Energy-Weighted Score (EWS). These metrics jointly capture predictive performance and computational efficiency.
Classification Metrics
Let $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives, respectively, where the positive class corresponds to fish images.

ACC measures the proportion of correctly classified samples and is defined as

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}.$$

Precision measures the reliability of positive (fish) predictions and is given by

$$\mathrm{Precision} = \frac{TP}{TP + FP}.$$

Recall evaluates the model’s ability to identify all positive instances and is defined as

$$\mathrm{Recall} = \frac{TP}{TP + FN}.$$

The F1 Score provides a balanced measure of Precision and Recall through their harmonic mean:

$$\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$

In addition to threshold-dependent metrics, we also report the Area Under the Receiver Operating Characteristic Curve (AUC), which evaluates the model’s ability to discriminate between positive and negative classes across all possible classification thresholds. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR), where

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}.$$

The AUC is defined as the area under the ROC curve:

$$\mathrm{AUC} = \int_0^1 \mathrm{TPR}\big(\mathrm{FPR}\big)\, d\big(\mathrm{FPR}\big).$$

An AUC value of 0.5 indicates random classification performance, while an AUC of 1.0 corresponds to perfect discrimination between fish and non-fish images.
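The threshold-dependent metrics translate directly into a small self-contained helper (illustrative; in practice these values come from the confusion matrix of the test set):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute ACC, Precision, Recall, and F1 from confusion-matrix counts
    (positive class = fish)."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"acc": acc, "precision": precision, "recall": recall, "f1": f1}

# Example counts: 90 fish found, 80 non-fish rejected,
# 10 false alarms, 20 missed fish.
m = classification_metrics(tp=90, tn=80, fp=10, fn=20)
```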
3.8. Wilson Score Confidence Interval for Accuracy
To quantify the statistical uncertainty associated with classification accuracy, we computed 95% Wilson score confidence intervals based on the confusion matrix results.
Let $n$ denote the total number of test samples, $k$ the number of correctly classified samples obtained from the confusion matrix, $\hat{p} = k/n$ the observed accuracy, and $z$ the standard normal quantile corresponding to the desired confidence level (for 95% confidence, $z = 1.96$).

Since classification accuracy represents a binomial proportion, the Wilson score interval was used due to its improved coverage properties compared to the standard Wald approximation, particularly when $\hat{p}$ is close to 0 or 1.

The Wilson confidence interval is defined as:

$$\frac{\hat{p} + \frac{z^2}{2n} \pm z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}.$$

Accordingly, the lower and upper bounds are computed as:

$$p_{\mathrm{lower}} = \frac{\hat{p} + \frac{z^2}{2n} - z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}, \qquad p_{\mathrm{upper}} = \frac{\hat{p} + \frac{z^2}{2n} + z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}.$$
This interval provides a statistically robust estimate of the uncertainty associated with the observed accuracy values.
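The Wilson interval can be computed with a short standalone function (the example counts below are hypothetical, not results from this study):

```python
import math

def wilson_interval(k, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion k/n
    at the confidence level implied by the normal quantile z."""
    p_hat = k / n
    denom = 1.0 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n
                                   + z ** 2 / (4 * n ** 2))
    return center - half, center + half

# Example: 9,800 of 10,000 test images classified correctly.
lo, hi = wilson_interval(k=9800, n=10000)
```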
In addition to these quantitative metrics, all misclassified samples were manually reviewed to determine whether errors resulted from model behavior or potential inaccuracies in the ground-truth annotations, thereby ensuring the reliability of the evaluation process.
3.9. Energy-Weighted Score (EWS)
Beyond classification accuracy, this study evaluates computational efficiency by systematically monitoring CPU, GPU, and memory utilization during model training. All experiments were conducted on identical hardware, and resource usage was sampled at fixed 0.5 s intervals to ensure comparability across runs.
At each sampling interval, system-level metrics were recorded, including a timestamp (converted to elapsed time in seconds), CPU utilization percentage, and system memory usage measured in megabytes. These measurements provide a temporal profile of processor and memory demand throughout training.
GPU-level metrics were collected from an NVIDIA A100 (NVIDIA Corporation, Santa Clara, CA, USA) accelerator operating within a cloud-based environment. The recorded variables included GPU utilization percentage, GPU memory usage (MB), GPU temperature (°C), and total available GPU memory. Although the hardware supports direct power queries (e.g., instantaneous power draw and power limits), such metrics were not used in this study due to access restrictions commonly imposed in multi-tenant computing environments.
Using the recorded utilization data, energy consumption was estimated for each hardware component. GPU energy consumption ($E_{\mathrm{GPU}}$) was computed as the time-integrated product of GPU utilization and the maximum rated GPU wattage. CPU energy consumption ($E_{\mathrm{CPU}}$) was estimated analogously using CPU utilization and an assumed processor wattage. Memory energy consumption ($E_{\mathrm{RAM}}$) was derived from RAM usage (in gigabytes) multiplied by an estimated watts-per-GB factor over time. Total energy consumption was then defined as
$$E_{\mathrm{total}} = E_{\mathrm{GPU}} + E_{\mathrm{CPU}} + E_{\mathrm{RAM}}.$$
To enable a unified comparison of computational efficiency across experimental conditions, we introduce the EWS, a composite metric that aggregates energy consumption across hardware components using cost-based weighting factors. These weights were derived from the relative pricing of GPU, CPU, and memory resources offered by the cloud provider at the time of experimentation (
Table 1). The EWS is defined as
$$\mathrm{EWS} = w_1 E_1 + w_2 E_2 + w_3 E_3,$$
which can be written compactly as a summation over components:
$$\mathrm{EWS} = \sum_{i=1}^{3} w_i E_i,$$
where $w_i$ denotes the infrastructure cost weight for component $i$, and $E_i$ denotes the estimated energy consumption (in Wh) for component $i$. In our experiments, $i = 1, 2, 3$ correspond to GPU, CPU, and RAM, respectively.
The motivation for this weighting scheme is economic rather than physical. Resource cost serves as a proxy for relative scarcity and infrastructure value, reflecting the operational expense associated with providing each component in a managed cloud environment. As such, EWS should be interpreted as a normalized, hardware-dependent efficiency indicator rather than an absolute measure of energy consumption. Through its cost-based weighting, EWS accounts for the relative scarcity of the high-demand hardware used in the experiments and provides a single scalar metric for comparing experimental runs under identical hardware conditions.
It is important to emphasize that computational efficiency metrics are inherently hardware-specific, and substantial ongoing research seeks to standardize energy-aware evaluation in machine learning. In this work, the objective is not to claim universal efficiency but to provide a consistent, interpretable basis for comparing experimental configurations executed under identical hardware conditions.
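To make the estimation procedure concrete, the sketch below illustrates the time integration and cost weighting described above. The 0.5 s sampling interval matches the monitoring setup; any wattage or weight values passed in are placeholders chosen by the caller, not the figures from Table 1.

```python
# Sampling interval of 0.5 s, expressed in hours so energies come out in Wh.
DT_HOURS = 0.5 / 3600.0

def component_energy(util_fracs, max_watts, dt_hours=DT_HOURS):
    """Time-integrated energy (Wh) for one component: at each sample,
    utilization fraction x rated power x sampling interval."""
    return sum(u * max_watts * dt_hours for u in util_fracs)

def ews(energies_wh, weights):
    """Energy-Weighted Score: cost-weighted sum over the per-component
    energies, ordered as (GPU, CPU, RAM)."""
    return sum(w * e for w, e in zip(weights, energies_wh))
```

Because the weights are dimensionless cost ratios, the resulting score is a relative indicator for comparing runs on the same hardware, consistent with the interpretation above.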
3.10. Experiments
This study was designed to systematically evaluate the trade-offs between predictive performance and computational efficiency in ResNet-based fish versus non-fish image classification. In total, 32 experiments were conducted under consistent preprocessing pipelines, evaluation protocols, and hardware conditions to ensure fair comparison. All experiments were executed on a Google Colab environment equipped with 167 GB of system RAM and an NVIDIA A100 GPU with 80 GB of memory. Models were trained on the DeepFish dataset, which comprises 39,766 labeled images containing both fish (positive) and non-fish (negative) samples. The class distribution was preserved as provided in the dataset, and no explicit class weighting was applied during training. The experimental framework incorporated dataset split testing (replicating the dataset-provided split as well as an alternative 80%/10%/10% train/validation/test split), comparison of randomly initialized models versus ImageNet-pretrained models, and evaluation of multiple ResNet models (ResNet-18, ResNet-50, ResNet-101, and ResNet-152).
In addition to model and data-related factors, we conducted a series of optimization-focused experiments to analyze training dynamics and efficiency. These included learning-rate scheduler experiments using NONE, COSINE, COSINE_RESTARTS, PLATEAU, EXPONENTIAL, PIECEWISE, POLYNOMIAL, LINEAR_WARMUP, and LINEAR_WARMUP_COSINE, as well as batch size comparisons between 32 and 256 to study their impact on convergence behavior. Post-training threshold optimization was performed by maximizing the F1 score on the validation set and applying the selected threshold to the held-out test set. Computational efficiency across all experiments was evaluated using both accuracy (ACC) and the proposed EWS, derived from detailed CPU, GPU, and memory utilization measurements. Finally, to mitigate instability observed during large-batch training, gradient clipping experiments were conducted using multiple clip-norm values to assess their effect on stabilizing optimization and smoothing training curves.
We also conducted an experiment to evaluate the impact of data augmentation techniques designed to address potential issues arising from reshaping and skewing the images. As shown in
Figure 4, the augmentation procedure involved letter-boxing the image by adding black padding to reshape it into a square format. The square image was then resized to 224 × 224.
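A minimal sketch of the letter-boxing step, assuming a NumPy image array (the helper name is ours); in the actual pipeline the padded square would subsequently be resized to 224 × 224, e.g., with OpenCV’s cv2.resize:

```python
import numpy as np

def letterbox_square(img):
    """Pad an HxWxC (or HxW) image with black borders to a centered
    square, preserving the aspect ratio of the original content."""
    h, w = img.shape[:2]
    side = max(h, w)
    top = (side - h) // 2
    left = (side - w) // 2
    canvas = np.zeros((side, side) + img.shape[2:], dtype=img.dtype)
    canvas[top:top + h, left:left + w] = img
    return canvas
```

Padding before resizing avoids the aspect-ratio distortion that direct resizing to 224 × 224 would introduce, which is the skewing issue this experiment was designed to probe.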
4. Results
This section presents the outcomes of all experimental evaluations conducted in this study. A detailed interpretation and contextual discussion of these results on the DeepFish dataset is provided in
Section 5.
All experiments were conducted in a Google Colab Python 3.12.12 environment equipped with 167 GB of system RAM and an NVIDIA A100 GPU with 80 GB of GPU memory. Model training and inference were implemented using TensorFlow 2.19.0 and Keras 3.13.2, while dataset partitioning and stratification were performed using scikit-learn. Image loading and resizing were handled with OpenCV 4.13.0. NumPy 2.0.2 and Pandas 2.2.2 were used for numerical operations and management of comma-separated value (CSV) files. Hardware utilization and energy-related metrics were monitored using NVIDIA-SMI 580.82.07 and psutil 5.9.5 throughout the experiments.
4.1. Split and Pretraining Experiments
Table 2 summarizes the results of experiments comparing dataset partitioning strategies and model pretraining configurations. Specifically, performance is reported for both the dataset-provided split used in prior work [
33] (40%/10%/50% train/validation/test) and the proposed 80%/10%/10% split. Experiments were conducted using four ResNet models—ResNet-18, ResNet-50, ResNet-101, and ResNet-152—under identical training and evaluation conditions. ResNet-18 is included only in its randomly initialized form, as ImageNet-pretrained weights are not available for this model in the native Keras implementation.
Overall, the proposed 80/10/10 split yielded the highest observed test accuracy (0.99975, achieved by ResNet-18). These results suggest that increasing the proportion of training data leads to substantial improvements in classification performance for this task on the DeepFish data.
4.2. Confidence Intervals
We calculated 95% confidence intervals (CI) for all primary evaluation metrics using the Wilson score method, which is more robust than the standard normal approximation, particularly for high-performance metrics near 1.0. The confidence intervals for accuracy under the 80/10/10 split are presented in
Table 3. ResNet-18 (R18) reports the highest point estimate, with an accuracy of 0.9997 and a 95% CI of [0.9986, 1.0000], whereas ResNet-50 (R50) reports the lowest accuracy at 0.9967. Overall, the confidence intervals are narrow, indicating that model performance is statistically stable on the held-out test set. Additional analysis and interpretation are provided in
Section 5.
4.3. Not Pretrained Versus Pretrained Models
Table 2 presents the results comparing models trained from scratch with those initialized using ImageNet-pretrained weights. This comparison was conducted for ResNet-50, ResNet-101, and ResNet-152, for which pretrained weights are available. ResNet-18 is reported only in its randomly initialized form, as ImageNet-pretrained weights are not available in the native Keras implementation used in this study. To preserve experimental consistency and ensure comparability of EWS measurements across models, we maintained a single implementation environment without introducing external pretrained weight sources that could alter training dynamics or system-level resource consumption.
We present
Figure 5 to illustrate the effect of transfer learning on convergence behavior. The randomly initialized ResNet-50 (R50) requires more epochs to stabilize and reach its optimal validation performance compared to the ImageNet-pretrained ResNet-50 (R50-PT). While both models ultimately converge to similar performance levels, the pretrained model exhibits faster stabilization and reduced training duration, indicating that initialization with ImageNet weights improves optimization efficiency. This difference in convergence speed demonstrates the practical benefit of transfer learning, even when final accuracy is comparable.
Overall, all evaluated models demonstrated strong classification performance. Across experiments, pretrained models exhibited faster convergence than their randomly initialized counterparts, reflecting the benefits of transfer learning. Under the proposed 80%/10%/10% data split, the highest accuracy observed was 0.99975, achieved by the ResNet-18 model trained from scratch.
4.4. Data Augmentation
To determine whether skewed resizing impacts overall model performance, we conducted experiments using letter-boxed images resized to 224 × 224 with black padding. We evaluated both R50 and R50-PT under Original (ORG) and Augmented (AUG) conditions. The results are presented in
Table 4. Accuracy slightly improved in both models following augmentation, with R50 increasing by +0.00226 and R50-PT increasing by +0.00025.
4.5. Energy-Weighted Score (EWS)
Table 5 presents the Energy-Weighted Score (EWS), training duration, and energy consumption for each model configuration on the 80/10/10 split.
Table 6 reports the EWS for each model along with the corresponding reduction in accuracy (REDUCED-ACC) relative to the highest-performing configuration, ResNet-18 without pretraining. The table shows substantial variation in EWS across models, reflecting differences in computational cost associated with model depth and pretraining.
Among all evaluated models, ResNet-18 without pretraining achieved the lowest EWS value of 340.83, indicating the smallest weighted energy consumption during training. In contrast, deeper models exhibited markedly higher EWS values regardless of pretraining. Although these deeper models incurred higher computational costs, the corresponding reductions in accuracy were relatively small, with REDUCED-ACC values remaining below 0.0031 across all configurations. This result highlights that modest accuracy differences are associated with substantial increases in energy consumption as model complexity increases.
4.6. Threshold Optimization
To improve overall classification performance, decision thresholds were tuned by maximizing the F1 score on the validation set.
Table 7 summarizes the impact of this threshold optimization. The left set of columns reports test-set performance using the default fixed threshold ($t = 0.5$), while the right set presents results obtained using the optimized threshold values selected exclusively from the validation set. For most models, threshold optimization improved F1 on the held-out test set; one configuration (R101-PT) decreased slightly, and some were unchanged. The most notable improvement was observed for the ResNet-18 model, where adjusting the threshold from 0.50 to 0.15 eliminated all false positives and false negatives, resulting in perfect classification performance on the test set. These results demonstrate that validation-based threshold selection can substantially improve model performance on the DeepFish dataset, even when baseline accuracy is already high.
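The validation-based threshold selection can be sketched as a simple grid search that maximizes F1 on the validation set (the grid resolution and helper name are our own illustrative choices, not the study’s implementation):

```python
def best_f1_threshold(val_labels, val_scores, grid=None):
    """Pick the decision threshold that maximizes F1 on the validation
    set; the chosen threshold is then applied unchanged to the test set."""
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]  # 0.01 .. 0.99
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        tp = sum(1 for y, s in zip(val_labels, val_scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(val_labels, val_scores) if y == 0 and s >= t)
        fn = sum(1 for y, s in zip(val_labels, val_scores) if y == 1 and s < t)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

Selecting the threshold on the validation set and freezing it before touching the test set keeps the test results unbiased, which is the protocol followed throughout.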
4.7. Learning Rate Scheduler and Batch Size Experiments
Learning rate scheduler experiments were conducted using two batch sizes, 32 and 256, to evaluate the interaction between optimization strategy and batch size on model performance. For each batch size, multiple learning rate schedulers were examined, and the results are summarized in
Table 8 and
Table 9. Each table reports the scheduler achieving the highest classification accuracy (ACC) and the scheduler yielding the lowest EWS, allowing direct comparison of predictive performance and computational efficiency.
For the smaller batch size of 32, the results show noticeable variation across learning rate schedulers, indicating that convergence behavior and generalization are sensitive to optimization choices under small-batch training. In contrast, experiments with a batch size of 256 demonstrate more consistent performance across schedulers, with the cosine learning rate scheduler emerging as the top-performing configuration when jointly considering ACC and EWS.
4.8. Gradient Clipping and Learning-Rate Scaling for Batch Size 256
When training with a batch size of 256, pronounced instability was observed in the validation learning curves (
Figure 6). To investigate whether this behavior was attributable to learning-rate configuration under large-batch training, a set of controlled experiments was conducted while holding the model, dataset, data split, optimizer, and random seed constant. Learning-rate scaling was first applied following the square-root heuristic, adjusting the initial and minimum learning rates from 0.001 and 0.00001 to 0.003 and 0.00003, respectively, to account for the eightfold increase from batch size 32 to batch size 256. Despite this adjustment, validation instability persisted. Consequently, global-norm gradient clipping was introduced with thresholds of 1.0, 0.75, and 0.5, and global gradient norms were logged throughout training to characterize optimization dynamics. As summarized in
Table 9, the combination of learning-rate scaling with gradient clipping at a threshold of 0.75 substantially stabilized the validation curves while maintaining strong test-set performance, achieving an accuracy of 0.99799.
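The two stabilization techniques can be illustrated in isolation. The sketch below expresses square-root learning-rate scaling and global-norm gradient clipping as plain functions over flat gradient lists; in Keras, clipping would typically be enabled via the optimizer’s global_clipnorm argument rather than implemented manually. Note that $0.001 \cdot \sqrt{8} \approx 0.00283$, which the experiments round to 0.003.

```python
import math

def scale_lr_sqrt(base_lr, base_batch, new_batch):
    """Square-root learning-rate scaling heuristic for a larger batch."""
    return base_lr * math.sqrt(new_batch / base_batch)

def clip_global_norm(grads, max_norm):
    """Rescale gradients so their global L2 norm does not exceed
    max_norm, leaving them untouched when already within the bound."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return grads
    scale = max_norm / norm
    return [g * scale for g in grads]
```

Clipping by global norm preserves the direction of the update while bounding its magnitude, which is why it damps the occasional large gradient spikes without otherwise altering optimization.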
5. Discussion
We evaluated multiple ResNet models to investigate the relationship between network depth, predictive accuracy, and computational cost on the DeepFish dataset. Although deeper models provide increased representational capacity, classification performance on this binary task approached saturation. On the 80/10/10 split, ResNet-18 achieved the highest observed accuracy of 0.99975. Prior studies have reported that increasing network depth may initially improve accuracy; however, beyond a certain point, performance can plateau or even decline depending on dataset characteristics and training configuration [
38].
In our experiments, deeper models exhibited comparable or slightly lower performance, indicating that increasing architectural depth did not improve predictive accuracy for this binary classification task. Wilson 95% confidence intervals further support this interpretation. Several models, including ResNet-50, ResNet-101, and ResNet-152, exhibit overlapping confidence intervals, suggesting functional equivalence in classification performance on this dataset. In contrast, there is no confidence interval overlap between ResNet-18 and ResNet-50: the lower bound of the ResNet-18 interval exceeds the upper bound of the ResNet-50 interval (
Table 3), indicating that ResNet-18 outperforms ResNet-50 at the 95% confidence level for the DeepFish binary classification task.
The DeepFish classification dataset exhibits only mild class imbalance, with approximately 56.2% negative samples and 43.8% positive samples. The imbalance ratio (IR), defined as the ratio of the majority class size to the minority class size, $\mathrm{IR} = N_{\mathrm{maj}} / N_{\mathrm{min}}$, is approximately 1.28. Despite this slight skew, the evaluated models achieved consistently high performance across multiple complementary metrics. For example, ResNet-18 attained 99.975% accuracy, an AUC of 1.0, and an F1 score of 0.99971 (
Table 2). The near-perfect F1 score reflects both high precision and high recall for the positive (fish) class, suggesting that the observed performance is not primarily driven by majority-class bias and remains robust under mild class imbalance.
In contrast to the near-saturated accuracy results, computational cost increased substantially with model depth. The EWS rose markedly from ResNet-18 to deeper models under identical training conditions, reflecting longer training times and greater resource utilization. While predictive performance remained statistically similar across models of different depths, the associated computational expenditure increased significantly. From an efficiency perspective, the shallower model therefore represents a more practical choice for this binary classification setting.
Among the evaluated configurations, pretrained variants converged faster than randomly initialized models but did not alter the overall performance plateau observed for this task. These findings indicate that architectural scaling or pretraining alone does not necessarily yield improved predictive outcomes when the underlying classification problem is already highly separable.
Optimization strategies also played an important role in stabilizing training and improving performance. Cosine learning-rate scheduling promoted smooth convergence, particularly at larger batch sizes, while gradient clipping mitigated instability caused by occasional large gradient updates. In addition, validation-based threshold optimization further enhanced classification outcomes, highlighting the importance of post-training calibration. Collectively, these results demonstrate that architectural design, optimization strategy, and computational efficiency should be considered jointly when developing robust underwater computer vision models.
Binary fish detection represents a focused yet practically important application of underwater computer vision; however, it constitutes a simplified task relative to more ecologically relevant multi-species or fine-grained recognition problems. The conclusions of this study are therefore limited to the binary DeepFish setting and may not directly generalize to more complex underwater environments or species-level classification tasks.
As future work, we plan to evaluate the proposed framework on more diverse underwater datasets and incorporate advanced computer vision models and data augmentation strategies to better assess generalization and robustness. Expanding toward multi-species classification will enhance ecological relevance and further support applications in biodiversity monitoring, fisheries management, and marine conservation.
6. Conclusions
This study evaluated multiple Residual Network (ResNet) models—ResNet-18, ResNet-50, ResNet-101, and ResNet-152—for binary underwater fish classification by jointly examining predictive performance and computational efficiency. Experiments were conducted using two dataset partitioning strategies (40/10/50 and 80/10/10 train/validation/test splits) and included both randomly initialized and ImageNet-pretrained models. To quantify training cost, we introduced the Energy-Weighted Score (EWS), which aggregates cost-weighted GPU, CPU, and memory energy consumption during training and serves as a practical tool to support model selection.
Across all experiments, performance on the DeepFish binary classification task approached saturation. A non-pretrained ResNet-18 achieved 99.975% accuracy on the 80/10/10 split, which increased to 100% following validation-based threshold optimization. Deeper models did not yield statistically meaningful improvements in predictive performance, while computational cost increased substantially with depth. These findings suggest that, for well-separable binary classification tasks, shallower ResNet models can achieve near-perfect accuracy while offering significant efficiency advantages. In such settings, computational efficiency becomes a critical criterion for model selection.
These conclusions are specific to the ResNet family and the binary DeepFish classification task. In future work, we will investigate more complex datasets and modern architectures to further examine how model design and learning strategies influence performance and computational efficiency in challenging environments.