Non-Intrusive Fish Weight Estimation in Turbid Water Using Deep Learning and Regression Models

Underwater fish monitoring is one of the most challenging problems in feeding and harvesting fish efficiently while remaining environmentally friendly. The proposed 2D computer vision method non-intrusively estimates the weight of Tilapia in turbid water environments. It avoids high-cost stereo cameras and instead uses only a low-cost video camera that observes the underwater scene through a single-channel recording. An in-house curated Tilapia-image dataset and Tilapia-file dataset covering Tilapia of various ages are used. The proposed method consists of a Tilapia detection step and a Tilapia weight-estimation step. First, a Mask Region-based Convolutional Neural Network (Mask R-CNN) model is trained to detect the fish and extract their image dimensions (in pixels). Second, in the weight-estimation step, the method estimates the depth of the fish in the tank and converts the extracted image dimensions from pixels to centimeters. The Tilapia's weight is then estimated by a trained regression model. Linear regression, random forest regression, and support vector regression were developed to determine the best models for weight estimation. The experimental results demonstrate that the proposed method yields a Mean Absolute Error of 42.54 g, an R² of 0.70, and an average weight error of 30.30 (±23.09) g in a turbid water environment, which shows the practicality of the proposed framework.


Introduction
Thailand's economy has generally been based on agricultural production, with the sector employing around one-third of the country's labour force. Aquaculture production in Thailand has continuously increased since 1995 [1]. Fish are a healthy food and an excellent source of protein, minerals, and essential nutrients, which has led to enormous demand that exceeds the production capacity. Therefore, developing fish farming with modern technology will improve fish monitoring operations so that fish can be fed and harvested efficiently while remaining environmentally friendly. In addition, non-contact measurement of fish body size and weight will reduce stress and injury to the fish. This research is a first step towards modern fish farming in Thailand that measures fish weight by non-intrusive methods. Modern aquaculture has developed rapidly in recent years. The extensive expansion of traditional aquaculture has resulted in its transformation into modern 5G aquaculture, in which automatic and precise task-based machines perform classification, prediction, and estimation, with many benefits, including reduced operational time.
Least-squares support vector machines (LS-SVM) have been employed to estimate the volume and weight of apples. A 3D image can be obtained directly via special cameras, for example, binocular stereo cameras, a laser-based camera, or an RGB-depth camera, which provide the spatial information of the X-Y dimensions together with the height information along Z.
Machine learning approaches for weight prediction can be categorized into two groups: the first is based on regression and the second on deep learning. The regression approach is simple and fast but requires a feature extraction process. The selected features are vital and significantly affect the prediction performance. Generally, the regression approach needs more than five features. Therefore, feature acquisition is still a challenging issue and is time- and cost-consuming. On the other hand, deep learning approaches deliver a compact algorithm that takes input images and returns a result. However, deep learning approaches for weight estimation usually require special cameras and have high computational complexity. Underwater monitoring is one of the most challenging problems because an image captured by an underwater camera is influenced by complex non-linear factors: changes in illumination and shadow, turbidity, varied backgrounds, underwater-aquatic confusion, camera limitations, and moving aquatic animals. These result in low-quality image capture. Therefore, a practical fish weight-prediction method for turbid water that can be used in real fish-farming applications is still an open problem.
The present paper proposes a novel, low-cost, practical single-sensor imaging system with deep and regression learning algorithms for the non-intrusive estimation of Tilapia weight in turbid water environments. The proposed method makes the following contributions. Firstly, only a low-cost single camera is required for observing the underwater fish; no other special equipment or sensor is used for monitoring the fish. Thus, the fish are not injured during the weighing process, which benefits their health. Secondly, the proposed method can determine the fish's weight in a turbid underwater environment, processing the video frames with or without an image-enhancement process. This flexibility favors practicality in real fish-farming applications. Thirdly, as few as three attributes are required for predicting the fish's weight: (i) the fish's age, (ii) the length and width of the fish, and (iii) the depth between the fish and the camera. Apart from the age, which the user supplies, these attributes are computed automatically by the proposed algorithm in one go. Finally, the proposed method is computationally simple and comprises two major steps, i.e., Tilapia detection based on deep transfer learning and Tilapia weight estimation based on regression learning. This gives the proposed method a low computational time and thus faster execution. The proposed machine learning models are also interpretable by users. For example, once a fish is detected, the estimated length and height of the fish, as well as its depth from the camera, are made known to the user. By manually entering the age of the fish, the user can then determine the weight of the intended fish. This paper is organized as follows: Section 2 presents the machine vision algorithm to estimate the weight of Tilapia in an underwater environment. Next, Section 3 evaluates and elucidates the performance of the proposed Tilapia weight-estimation algorithm. Finally, Section 4 summarizes the proposed estimation method and future research prospects.

Methodology
The proposed method combines two steps: a Tilapia detection step and a Tilapia weight-estimation step. The models are first trained, and the trained models are then used in an evaluation phase. The training phase performs data preparation for Tilapia detection training and then generates a Tilapia detection (TDet) model based on deep transfer learning. In the Tilapia weight-estimation step, three models are trained by regression learning: Tilapia depth estimation (TDepE), Tilapia pixel-to-centimeter estimation (TP2CME), and Tilapia weight estimation (TWE). The proposed algorithm is therefore named the Tilapia weight estimation deep regression learning ("TWE-DRL") algorithm and is illustrated in Figure 1.
The input parameters of the TDepE model are the age of the fish and the fish's length and width in pixel units. During data acquisition, the ages of the fish were recorded along with the fish-image capture every two weeks during the feeding process. The length and width of the fish were obtained by manually extracting this information from the image-annotated labels of the fish. Therefore, the training dataset of the TDepE model contains the actual values of the fish's age, length, and width. In practice, the age of the fish will be obtained from a fish farmer with prior knowledge. The TP2CME model uses the same parameter set as TDepE and adds the distance between the fish and the camera as the depth parameter. The depth dataset contains three independent attributes: the age, the fish's length and width in pixel units, and the depth. Depth information was determined manually. Stripes are marked on the floor and sides of the tank from the front of the camera to the end of the tank. The stripes are 10 cm apart and serve as reference distances from the camera. Hence, each fish's distance was estimated from the stripe nearest to where the fish was located. The depth affects the apparent size of the fish: when a fish is close to the camera, the depth is small and its length and width in the image are larger than when it is further away. The input parameters of the TWE model follow the same pattern: the output of the TP2CME model is an independent parameter of the TWE model, together with all of the independent parameters of the TP2CME dataset. For the TWE training dataset, the actual length and width of the fish were provided by the studio photography.
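The stripe-based depth labelling described above can be illustrated with a small sketch. In the study the nearest stripe was read off by a human annotator; the function below is only a hypothetical stand-in that snaps a rough distance to the nearest 10 cm stripe, and the name `nearest_stripe_depth` and its arguments are assumptions made for illustration.

```python
import math


def nearest_stripe_depth(distance_cm: float, spacing_cm: float = 10.0) -> float:
    """Snap a rough camera-to-fish distance to the nearest reference stripe.

    Assumption: stripes start at the camera (0 cm) and repeat every
    `spacing_cm`; ties are rounded up to the farther stripe.
    """
    if distance_cm < 0 or spacing_cm <= 0:
        raise ValueError("distance must be >= 0 and spacing must be > 0")
    stripe_index = math.floor(distance_cm / spacing_cm + 0.5)
    return stripe_index * spacing_cm


if __name__ == "__main__":
    for rough in (23.0, 47.0, 25.0):
        print(rough, "->", nearest_stripe_depth(rough))
```

Labelling depth in 10 cm steps means the depth target of the TDepE model is quantized, which is worth keeping in mind when interpreting its error.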
Sensors 2022, 22, 5161 6 of 29
The proposed TWE-DRL algorithm has two major processes: detecting and extracting the size of an individual Tilapia in an image, and estimating the depth of the fish from the camcorder and then converting the size of the Tilapia from pixels to centimeters given the estimated depth. Finally, the weight of the Tilapia is predicted from the fish's size together with the fish's age in weeks. To achieve these goals, four trained models are required, named TDet, TDepE, TP2CME, and TWE. The details of each individual step are elucidated in the following sections.
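The way the three regression models feed one another can be sketched as a simple chain. This is a minimal illustration, not the authors' implementation: `FishObservation`, `estimate_weight`, and the three callables are hypothetical names, and the toy lambdas in the usage part are placeholders rather than trained TDepE, TP2CME, or TWE models.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class FishObservation:
    """One detected fish: age supplied by the farmer, size taken from TDet."""
    age_weeks: float
    length_px: float
    width_px: float


DepthModel = Callable[[float, float, float], float]
SizeModel = Callable[[float, float, float, float], Tuple[float, float]]
WeightModel = Callable[[float, float, float, float, float, float], float]


def estimate_weight(obs: FishObservation,
                    tdepe: DepthModel,
                    tp2cme: SizeModel,
                    twe: WeightModel) -> float:
    """Chain the models: depth -> size in cm -> weight in grams."""
    depth_cm = tdepe(obs.age_weeks, obs.length_px, obs.width_px)
    length_cm, width_cm = tp2cme(obs.age_weeks, obs.length_px,
                                 obs.width_px, depth_cm)
    return twe(obs.age_weeks, obs.length_px, obs.width_px,
               depth_cm, length_cm, width_cm)


if __name__ == "__main__":
    # Placeholder models with made-up behaviour, for illustration only.
    toy_depth = lambda age, l_px, w_px: 50.0
    toy_size = lambda age, l_px, w_px, d: (l_px * d / 1000.0, w_px * d / 1000.0)
    toy_weight = lambda age, l_px, w_px, d, l_cm, w_cm: l_cm * w_cm
    print(estimate_weight(FishObservation(20, 300, 100),
                          toy_depth, toy_size, toy_weight))
```

Passing the models in as callables keeps the chain independent of which regressor (LR, RFR, or SVR) is selected for each stage.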

Tilapia Detection
Tilapia detection based on deep transfer learning is used to create a model for detecting Tilapia in digital images. The detector is built on a deep learning network as its backbone, which extracts features from the input images, and a detection network, which localizes the fish. Object detection approaches can be categorized into two types, i.e., one-stage detectors and two-stage detectors. One-stage detectors use a single network to predict object bounding boxes and class probability scores directly from images; examples are YOLO, SSD, and RetinaNet.
Two-stage detectors first propose candidate regions of the target instead of learning from the whole image, and then pass the proposed regions to a classifier and a regressor. In the first stage, a Region Proposal Network (RPN) searches for possible target regions. The second stage extracts significant features from each candidate region with a region-of-interest (RoI) pooling operation for the subsequent classification and bounding-box regression. Examples of two-stage detectors are Faster R-CNN and Mask R-CNN.
RetinaNet is a one-stage object detector that uses focal loss as its classification loss and ResNet as its backbone. RetinaNet inherits the fast speed of previous one-stage detectors by avoiding the use of RPNs. Faster R-CNN extracts features from region proposals and passes them through the RoI pooling layer to turn proposals of various sizes into fixed-size features, which are the input of the subsequent fully-connected layers for classification and bounding-box regression. Mask R-CNN [16] extends Faster R-CNN by using RoIAlign to extract a small feature map from each RoI and by adding a parallel mask branch. With a feature pyramid network (FPN) as the backbone, RoI features are extracted from different levels of the feature pyramid, which achieves excellent accuracy and processing speed. Because higher-resolution feature maps are important for detecting small objects while lower-resolution feature maps are rich in semantic information, a feature pyramid network extracts significant features at both levels.
Deep transfer learning comprises two steps: a pre-training step and a post-training step. The pre-training step loads the learned weights from a pre-trained model as initial values for the deep learning network. In the post-training step, the deep learning network learns and fine-tunes the weights on the Tilapia-image dataset. Deep transfer learning has the advantage of reducing learning time and increasing the accuracy of the model. The COCO-pre-trained Mask R-CNN model was employed to determine the initial values of the deep learning architecture. Mask R-CNN is an object detection algorithm that performs target detection, target classification, and instance segmentation simultaneously in one neural network. For each candidate object, Mask R-CNN returns two outputs, a class and a bounding-box offset, as illustrated in Figure 2, where FC depicts fully-connected layers, in addition to an m × m mask. The m × m mask representation encodes the spatial structure of an input object through the pixel-to-pixel correspondence provided by convolutions. The m × m mask is generated from a region of interest (RoI) by a fully convolutional network (FCN) with a per-pixel sigmoid and a binary loss for segmentation. This naturally leads Mask R-CNN to maintain the 2-dimensional spatial layout rather than transform it into a vector representation.
Mask R-CNN consists of two components. Firstly, the backbone network of the proposed method is based on ResNet. ResNet consists of many stacks of residual units. Each unit can be expressed as in Equation (1), where $x_l$ and $x_L$ indicate the input feature to the $l$th Residual Unit and the output of any deeper unit $L$ [24]:

$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i) \tag{1}$$

where $F(\cdot)$ is a residual function and $W_l = \{W_{l,k} \mid 1 \le k \le K\}$ is the set of weights (and biases) associated with the $l$th Residual Unit, with $K$ here being the number of layers in a Residual Unit. The backbone network uses a 101-layer ResNet, and a 3 × 3 convolution layer has been set for the RPN. Secondly, RoIAlign performs per-pixel preservation of spatial features extraction by using a fully convolutional network and RoIPool for the feature map. Mask R-CNN applies a multi-loss function during learning to evaluate the model and ensure its fit to unseen data. This loss function is computed as a weighted total sum of the losses at every training phase of the model on each proposed RoI, as shown by Equation (2) [25]:

$$L = L_{class} + L_{BB} + L_{mask} \tag{2}$$

where $L_{class}$, $L_{BB}$, and $L_{mask}$ represent the classification loss, the bounding-box loss, and the average binary cross-entropy loss, respectively. $L_{class}$ shows the convergence of the predictions to the true class and combines the classification loss during the training of the RPN and Mask R-CNN heads. $L_{BB}$ shows how well the model localizes objects and combines the bounding-box localization loss during the training of the RPN and Mask R-CNN heads. The $L_{class}$ and $L_{BB}$ losses are computed by Equations (3) and (4):

$$L_{class}(p, u) = -\log p_u \tag{3}$$

where $p_u$ is the predicted probability of the ground-truth class $u$ for each positive bounding box, and

$$L_{BB}(t^u, v) = \sum_{i \in \{x, y, w, h\}} L_1^{smooth}(t_i^u - v_i) \tag{4}$$

where

$$L_1^{smooth}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$

and $t^u$ and $v$ are the predicted bounding box for class $u$ and the ground-truth bounding box for each input $i$.
$L_{mask}$ has a $K \times m \times m$ dimensional output for each RoI, where $K$ represents the number of classes and $m \times m$ is the matrix representation of the class. A per-pixel sigmoid is applied, and $L_{mask}$ is computed using the average binary cross-entropy loss, where the $K$th mask is associated with the $K$th class, i.e., $K = 1$ = Tilapia. $L_{mask}$ can be expressed as in Equation (5) [26,27]:

$$L_{mask} = -\frac{1}{m^2} \sum_{1 \le i, j \le m} \left[ y_{i,j} \log P_{i,j}^{K} + (1 - y_{i,j}) \log\left(1 - P_{i,j}^{K}\right) \right] \tag{5}$$

where $P_{i,j}^{K}$ denotes the predicted value at pixel $(i, j)$ of the generated mask for class $K$ and $y_{i,j}$ is the corresponding ground-truth label. This network outputs a $K \times m \times m$ mask representation that is upscaled, and the channels are reduced to 256 using a m × m convolution, where $K$ is the number of classes, i.e., $K = 1$, and $m = 28$ for the ResNet_101 network as a backbone. All training parameters use the same values: the batch size is 128 images, the learning rate is 2.5 × 10⁻⁴, and the maximum number of iterations is 300.
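The bounding-box and mask terms of the loss can be made concrete with a short sketch. It assumes the standard smooth-L1 and average binary cross-entropy definitions used by Fast/Mask R-CNN; the function names are hypothetical, plain Python lists stand in for tensors, and the clipping constant `eps` is an implementation detail added here to avoid `log(0)`.

```python
import math


def smooth_l1(x: float) -> float:
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def bbox_loss(predicted, target) -> float:
    """Sum of smooth-L1 terms over the box coordinates (x, y, w, h)."""
    return sum(smooth_l1(t - v) for t, v in zip(predicted, target))


def mask_loss(pred_mask, true_mask, eps: float = 1e-12) -> float:
    """Average binary cross-entropy over an m x m mask.

    pred_mask holds per-pixel sigmoid outputs in (0, 1);
    true_mask holds ground-truth labels 0 or 1.
    """
    m = len(pred_mask)
    total = 0.0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, y in zip(pred_row, true_row):
            p = min(max(p, eps), 1.0 - eps)  # keep log() finite
            total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / (m * m)


if __name__ == "__main__":
    print(bbox_loss([0.5, 2.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]))
    print(mask_loss([[0.9, 0.1], [0.8, 0.2]], [[1, 0], [1, 0]]))
```

The quadratic region of the smooth-L1 term keeps gradients small for nearly correct boxes, while the linear region limits the influence of badly wrong ones.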
The TDet model delivers the bounding-box output as a set of coordinate points (x, y) of a detected fish. The coordinate points of the bounding box are used to compute the length and width of the detected fish. However, these measurements are in pixel units and are subject to perspective projection: the apparent fish size depends on the depth between the fish and the camera, so a fish that is closer to the camera appears wider and longer than one further away. Thus, the fish size under perspective projection must be converted into real measurement units of the fish's actual size before estimating the weight of the Tilapia.
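Extracting the pixel dimensions from a detection can be sketched as follows. The box layout `(x_min, y_min, x_max, y_max)` and the function name `bbox_size_px` are assumptions for illustration; the sketch also assumes the fish is seen side-on, so that the horizontal extent corresponds to its length and the vertical extent to its width.

```python
from typing import Tuple


def bbox_size_px(box: Tuple[float, float, float, float]) -> Tuple[float, float]:
    """Return (length_px, width_px) of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    length_px = abs(x_max - x_min)  # horizontal extent of the fish
    width_px = abs(y_max - y_min)   # vertical extent of the fish
    return length_px, width_px


if __name__ == "__main__":
    print(bbox_size_px((100, 200, 420, 330)))
```

These pixel values are only meaningful together with the depth, which is why they are passed on to the depth and pixel-to-centimeter models rather than used directly.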

Tilapia Weight Estimation
The next step estimates the weight and comprises three sub-steps: first, estimating the depth of the fish; second, converting the fish's width and length from pixels to centimetres; and finally, determining the fish's weight from all of the estimated data of the fish. These sub-steps are carried out by the trained TDepE, TP2CME, and TWE models, respectively. The independent data that each model requires and the dependent output that it delivers are shown in Table 1.
Table 1. Independent data and dependent output of TDepE, TP2CME, and TWE models.

| Model | Independent Data | Dependent Output |
| --- | --- | --- |
| TDepE | age of fish (weeks); length of fish (pixel); width of fish (pixel) | actual depth (cm) |
| TP2CME | age of fish (weeks); length of fish (pixel); width of fish (pixel); depth (cm) | length of fish (cm); width of fish (cm) |
| TWE | age of fish (weeks); length of fish (pixel); width of fish (pixel); depth (cm); length of fish (cm); width of fish (cm) | weight of fish (g) |

The three models are sequentially related to one another: the output of the previous model is an input of the next model. The regression models of the TDepE ($\hat{y}_{depth}$), TP2CME ($\hat{y}_{l\_cm}, \hat{y}_{w\_cm}$), and TWE ($\hat{y}_{w}$) models can be expressed mathematically as in Equations (6)-(8), respectively:

$$\hat{y}_{depth} = f\left(x_{age}, x_{w\_pix}, x_{h\_pix}, a_{age}, a_{w\_pix}, a_{h\_pix}\right) + e_{depth} \tag{6}$$

$$\hat{y}_{l\_cm}, \hat{y}_{w\_cm} = f\left(x_{age}, x_{w\_pix}, x_{h\_pix}, \hat{y}_{depth}, a_{age}, a_{w\_pix}, a_{h\_pix}, a_{depth}\right) + e_{cm} \tag{7}$$

$$\hat{y}_{w} = f\left(x_{age}, x_{w\_pix}, x_{h\_pix}, \hat{y}_{depth}, \hat{y}_{l\_cm}, \hat{y}_{w\_cm}, a_{age}, a_{w\_pix}, a_{h\_pix}, a_{depth}, a_{w\_cm}, a_{h\_cm}\right) + e_{w} \tag{8}$$

where $e_{depth}$, $e_{cm}$, and $e_{w}$ denote additive error terms. The closed-form equation that links the above equations together is determined by the machine learning model. To achieve this goal, the regression models, i.e., Tilapia depth estimation, Tilapia pixel-to-centimetre estimation, and Tilapia weight estimation, were constructed by employing three well-known regression methods: linear regression (LR), random forest regression (RFR), and support vector regression (SVR). Linear regression is a linear model of the relationship between the independent variables and a dependent variable. The linear model is expressed in Equation (9):

$$y = a_0 + \sum_{j=1}^{J} a_j x_j \tag{9}$$

where $x_j$ and $y$ denote the $j$-th independent variable and the dependent variable, respectively. The terms $a_j$, $j = 0, 1, \ldots, J$, are the coefficients of the model and $J$ is the total number of features used for the regression. Secondly, random forest extends the decision tree by constructing a multitude of trees during training. Random forest is an ensemble learning method for classification or regression tasks. Within the multitude of trees, each individual tree randomly selects a subset of features.
The optimal splitting point is determined using the predicted squared error as the criterion of the regression model. The RFR output ($\hat{y}$) is based on a weighted sum of datapoints, as expressed in Equation (10):

$$\hat{y} = \sum_{i=1}^{n} w_i(x)\, y_i \tag{10}$$

where $x_i$ and $y_i$ denote the dataset of $n$ points and $w_i$ is the weight of $y_i$. The $x$ term represents the neighbour node that shares the same leaf in a tree $j$ with the point $x_i$ [28]. The squared error is expressed in Equation (11):

$$SE = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 \tag{11}$$

Finally, support vector regression is an extension of the support vector machine for solving regression problems. The objective function of SVR minimizes the coefficients by using the $l_2$-norm of the coefficient vector [29,30] instead of the squared error, as expressed in Equation (12). The constraint, called the maximum error ($\varepsilon$), is represented by the absolute error in Equation (13):

$$\min \frac{1}{2} \lVert a \rVert^2 \tag{12}$$

$$\left| y_i - \left(a_0 + \sum_{j=1}^{J} a_j x_{i,j}\right) \right| \le \varepsilon \tag{13}$$

The parameter $\varepsilon$ is tuned to obtain the best-fit line, where the hyperplane has a maximum number of points [31]. The $\varepsilon$ value determines the distance by which the support-vector line (the so-called decision boundary) deviates from the hyperplane line.
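As a worked illustration of the linear model and of the error measures reported for the method, the sketch below fits a one-feature least-squares line and computes the mean absolute error and R². It is a simplified stand-in, not the trained TWE model: the helper names are hypothetical, only a single feature is used, and the length and weight values are synthetic numbers made up for the example.

```python
def fit_simple_linear(xs, ys):
    """Ordinary least squares for y = a0 + a1 * x with one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a1 = sxy / sxx
    a0 = mean_y - a1 * mean_x
    return a0, a1


def mean_absolute_error(y_true, y_pred):
    """Average of |y - y_hat| over all samples."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Synthetic data for illustration only (not from the Tilapia datasets).
    lengths_cm = [10.0, 12.0, 14.0, 16.0, 18.0]
    weights_g = [60.0, 95.0, 150.0, 190.0, 260.0]
    a0, a1 = fit_simple_linear(lengths_cm, weights_g)
    predicted = [a0 + a1 * x for x in lengths_cm]
    print("MAE:", mean_absolute_error(weights_g, predicted))
    print("R2:", r_squared(weights_g, predicted))
```

With several features, the same least-squares idea applies to all coefficients at once, and the same two metrics can be used to compare LR against RFR and SVR.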
The training phase delivers the TDet, TDepE, TP2CME, and TWE models. The evaluation phase, as shown in Figure 1, uses these models to estimate the weight of the Tilapia from an observed video input. An overview of the proposed Tilapia weight-estimation evaluation phase is given in Algorithm 1.

Algorithm 1. Overview of the Proposed Tilapia Weight-Estimation Evaluation Phase
(1) Convert an observed video input to images: $s[n] = s(nT)$
(2) Enhance images in the case of turbid water:
(2.1) Image sharpening by the convolution function $g_1(x, y) = \omega * f(x, y)$, where $f(x, y)$ denotes the original image and $\omega(\cdot)$ is the filter kernel, i.e., a sharpening filter.
(2.2) Color correction matrix (CCM) [32], where $S_R$, $S_G$, $S_B$, and $S_W$ denote the red, green, blue, and white spaces; $C$ is the color-component vector; $\alpha$ is the gain; and $\beta$ represents the bias parameter.
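Steps (1) and (2.1) can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, the image is a single-channel list of lists with 8-bit values, border pixels are left unchanged, and the 3 × 3 kernel below is a commonly used sharpening kernel chosen for the example, since the kernel actually used is not specified here.

```python
# A commonly used 3 x 3 sharpening kernel (assumed for illustration).
SHARPEN_KERNEL = [
    [0, -1, 0],
    [-1, 5, -1],
    [0, -1, 0],
]


def sample_frame_indices(fps: float, period_s: float, duration_s: float):
    """Step (1): frame indices n*T for a video sampled every `period_s` seconds."""
    step = max(1, round(fps * period_s))
    total_frames = int(fps * duration_s)
    return list(range(0, total_frames, step))


def convolve3x3(image, kernel):
    """Step (2.1): 2D convolution of a grayscale image with a 3 x 3 kernel."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]  # copy; border pixels stay as they are
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            acc = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    acc += kernel[dy + 1][dx + 1] * image[y - dy][x - dx]
            out[y][x] = min(255, max(0, acc))  # clamp to the 8-bit range
    return out


if __name__ == "__main__":
    print(sample_frame_indices(fps=60, period_s=0.5, duration_s=2))
    blurry = [[10, 10, 10], [10, 20, 10], [10, 10, 10]]
    print(convolve3x3(blurry, SHARPEN_KERNEL))
```

Because the kernel weights sum to one, flat regions keep their intensity while local differences, such as the outline of a fish against turbid water, are amplified.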

Data Collection
The Tilapia were raised in 3 tanks, each containing 30 fish. The tanks are round with a radius of 1.5 m and a depth of 1.8 m. A newer fish cultivation method called biofloc culture was used to feed the fish efficiently. In a biofloc tank, the fish are raised together with cultured microorganisms, which turn the water turbid. Bacteria are put into the aquaculture system to convert nitrogen from the water into protein, which becomes food for the fish. The wastewater, which contains nitrates, nitrites, and ammonia, is treated and reused as supermolecule feed. Biofloc fish feeding is thus a technology that feeds aquaculture systems with macroaggregates, which decreases the fish diet cost and improves the aquatic environment of the fish tank.
The datasets developed in this research are categorized into (a) Tilapia-image datasets and (b) Tilapia-file datasets. Firstly, the Tilapia-image datasets are curated in-house from two sources: studio-based photography outside the tanks and video recordings of the tanks, as shown in Figure 3.
The studio-based photography was set up with a camera (Canon EOS 200D II) mounted in a fixed position 0.5 m from the fish and parallel to the platform, at a resolution of 1920 × 1080 pixels. The fish were weighed with an electronic scale before being photographed. The video recording (GoPro Hero 8 with a waterproof case) was carried out by sampling five fish from the tanks and putting them into the recording tanks. The videos were recorded at a resolution of 1920 × 1080 pixels, with a frame rate of 60 fps and 8-bit RGB. For each fish from the studio and the videos, the collected data include the age (weeks), the width and length of the whole fish in centimeters (cm), and the weight of the fish in grams. Secondly, the Tilapia-file dataset was created for training the regression models. The Tilapia-file dataset includes three attributes: the fish's age, the physical dimensions of the fish in pixel and centimetre units, and the depth between the fish and the camera. The two Tilapia datasets were employed for training the models to estimate the Tilapia's weight.

Data Preparation
Data pre-processing of the videos comprises three processes: converting video to images, enhancing the images from the biofloc tanks, and annotating the images. All fish images are 24-bit, with red, green, and blue channels of 256 intensity levels each. Both the studio images and the videos are required in the annotation process. For the videos, the video-to-image process first reduces a continuous-time signal s(t) to a discrete-time signal. The original signal is sampled with period T to obtain a series of discrete samples, each equal to the instantaneous value of the continuous signal at the sampling instant. The sampling process can be expressed as Equation (14):
$$ s[n] = s(nT) \tag{14} $$
where n denotes the sequence index of the sampling period T.
In a biofloc tank, the fish are cultured together with microorganisms, and these biofloc microorganisms turn the water turbid. Therefore, the sampled images of the biofloc tanks were pre-processed and enhanced so that the fish could be identified. The image enhancement process consists of four steps, i.e., image sharpening, color filtering, color balancing, and exposure adjustment, in which the values of the individual channels of an image are modified to improve the image quality. Image sharpening involves increasing the contrast, edge detection, noise suppression, and Gaussian blur algorithms [33,34]. Next, the color filter and color balance adjust the color temperature by curve shifting [35]. Color balance removes any unwanted color that dominates an image by estimating the illumination and applying a correction to the image [36]. Finally, exposure adjustment controls the amount of light in an image via two parameters: the exposure time and the light sensitivity of the image [37]. Enhanced images are presented in Figure 4.
Image annotation is the process of describing the target objects in an image, as shown in Figure 5. The descriptive data allow the computer to interpret the image in a way similar to human understanding. A computer understands digital images by extracting data from a real-world image as numerical information and then interpreting that information via a deep learning algorithm. In the same way that a human learns to recognize an object, image annotation labels the images used to train a deep learning model. The deep learning algorithm then transforms the image by disentangling symbolic information into sparse numerical information through the convolution process. Finally, an objective model is learned by fully-connected MLP networks from the information produced by the convolution phase. Three attributes were defined to describe a fish: the age (weeks); the distance between the fish and the camera, i.e., the so-called depth (cm); and the coordinate-position set of the fish. The fish annotation yields a JSON file as the output of the process. This process is performed via the Visual Geometry Group Image Annotator website (https://www.robots.ox.ac.uk/~vgg/software/via/via_demo.html accessed on 20 June 2022).
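As a minimal sketch of how such an annotation file might be read, the following parses a VIA-style JSON export and derives the pixel extent of each annotated fish from its polygon. The sample file content, the attribute names "age" and "depth", and the helper `fish_records` are assumptions made for illustration; the actual export used in this work may be organized differently.

```python
import json

# Hypothetical VIA-style export: one image with one polygon region.
# The attribute names "age" and "depth" are assumed, not taken from the paper.
via_json = """
{
  "frame_0001.jpg123456": {
    "filename": "frame_0001.jpg",
    "size": 123456,
    "regions": [
      {
        "shape_attributes": {
          "name": "polygon",
          "all_points_x": [100, 400, 420, 110],
          "all_points_y": [200, 190, 300, 310]
        },
        "region_attributes": {"age": "20", "depth": "30"}
      }
    ],
    "file_attributes": {}
  }
}
"""

def fish_records(via_export):
    """Yield one record per annotated fish with its pixel extent."""
    for image in via_export.values():
        for region in image["regions"]:
            shape = region["shape_attributes"]
            xs, ys = shape["all_points_x"], shape["all_points_y"]
            attrs = region["region_attributes"]
            yield {
                "filename": image["filename"],
                "age_weeks": int(attrs["age"]),
                "depth_cm": float(attrs["depth"]),
                "length_px": max(xs) - min(xs),  # horizontal extent
                "width_px": max(ys) - min(ys),   # vertical extent
            }

records = list(fish_records(json.loads(via_json)))
print(records)
```

Taking the horizontal and vertical extents of the polygon as length and width presumes a fish aligned with the image axes, which is a simplification.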
The experiment ran for 3 months, starting with 20-week-old Tilapia. The work assumes that the Tilapia weight can be estimated with a good level of accuracy. The input of the proposed TWE-DRL algorithm, as illustrated in Figure 2, is of two types, studio images and video signals, which were collected in five sessions at two-week intervals. The studio photographs were taken with a camera mounted in a fixed position 0.5 m from the fish and parallel to the platform, at a resolution of 1920 × 1080 pixels.
The Tilapia were recorded on video in a turbid water recording tank at a resolution of 1920 × 1080 pixels, a frame rate of 60 fps, and 8-bit RGB. At the first actual weighing of the 20-week-old Tilapia, the average weight was 166.45 ± 26.38 g, while it was 482.24 ± 91.64 g at the last weighing of the 28-week-old Tilapia. The Tilapia-image dataset contains 5037 images, of which 750 are from the studio and 4287 are from video, while the Tilapia-file dataset contains 2777 files. The video recordings are converted to images at one frame per second, and the quality of the images is then improved by the image enhancement process. Next, the enhanced images are used as input data for the Tilapia detection step, which is based on deep transfer learning. All one-class detection models are trained with the same parameters: the backbone is a ResNet network, the batch size is 128 images, the learning rate is 2.5 × 10⁻⁴, and the maximum number of iterations is 300. The output of the detection step is the input of the Tilapia weight-estimation step, which is based on regression models. The regression models are LR, RFR with a maximum depth of 2 levels, and SVR with a radial basis function (RBF) kernel. The inputs of the individual TDepE, TP2CME, and TWE models are given in Table 1 and Equations (6)-(8). Finally, the proposed method delivers the estimated weight of the Tilapia in a data file.
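The video-to-image step can be illustrated with a short sketch that lists which frame indices are kept when a 60 fps recording is sampled once per second. The function name `sampled_frame_indices` and the 5-second clip are hypothetical; decoding the frames themselves would require a video library and is left out.

```python
def sampled_frame_indices(total_frames, fps, period_s=1.0):
    """Return indices of frames kept when sampling every `period_s` seconds."""
    step = round(fps * period_s)  # frames between consecutive samples
    if step < 1:
        raise ValueError("sampling period is shorter than one frame")
    return list(range(0, total_frames, step))

# A 5-second clip at 60 fps, sampled once per second.
indices = sampled_frame_indices(total_frames=300, fps=60, period_s=1.0)
print(indices)  # [0, 60, 120, 180, 240]
```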
The experiments were conducted in two major parts. The first part determines the optimal models for Tilapia detection, i.e., TDet, and for Tilapia weight estimation, i.e., TDepE, TP2CME, and TWE. The second part verifies the effectiveness of the proposed Tilapia weight-estimation method. The Tilapia-images dataset has 4287 images of fish at various ages, which were split into 60% for training and the rest for testing. The Tilapia-file dataset contains 2777 files, which were partitioned into 70% for training and the rest for testing. The number of training and testing data corresponding to each model is presented in Table 2. The proposed TWE algorithm is used to train the various regression models, and its effectiveness is assessed using the mean absolute error (MAE) and the coefficient of determination, R², given in Equations (15) and (16):
$$ \mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right| \tag{15} $$
$$ R^2 = 1 - \frac{\sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{N} \left( y_i - \bar{y} \right)^2} \tag{16} $$
where, in the standard form of these measures, y_i is the actual value, ŷ_i is the estimated value, ȳ is the mean of the actual values, and N is the number of test samples.
The experiments were conducted in the following environment. The hardware was an AMD Ryzen 9 4900H with Radeon Graphics at 3.30 GHz, an Nvidia GeForce GTX 1660 Ti, and 16.00 GB of DDR4 memory. The software tools were Python 3.x, TensorFlow-GPU v2.3.0, and Keras v2.4.3 on the Windows 10 operating system.
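For concreteness, the two measures can be computed as in the sketch below, which uses the standard definitions of MAE and R². The weights are made-up numbers for illustration only and are not data from this study.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Made-up weights in grams, for illustration only.
actual = [160.0, 250.0, 340.0, 480.0]
predicted = [170.0, 240.0, 360.0, 450.0]

mae = mean_absolute_error(actual, predicted)
r2 = r_squared(actual, predicted)
print(f"MAE = {mae:.2f} g, R^2 = {r2:.4f}")
```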

Determining the Optimal Tilapia Detection Models
State-of-the-art deep learning networks, namely Mask R-CNN, Faster R-CNN, RetinaNet, and YOLO, were compared to determine the optimal Tilapia detection model. YOLOv5 was used in the Tilapia detection experiment with the following parameters: a scaled weight decay of 0.0005, 300 training epochs, a batch size of 128, a learning rate of 0.01, and gradient descent with momentum as the optimizer. For the other networks, all training parameters used the same values: a batch size of 128 images, a learning rate of 2.5 × 10⁻⁴, and a maximum of 300 iterations.
The object-detecting performance of the three methods was measured by the average precision (AP), which averages the scores over 10 Intersection-over-Union (IoU) thresholds. The experimental results are shown in Table 3. The detection results of the above detection networks are presented for three scenarios: a single Tilapia, two Tilapia with more than 50% of the body visible, and multiple overlapping Tilapia. Samples of the observed images from the three scenarios are shown in Figure 6, and the detected results are illustrated in Figures 7-9. The results in Figures 7-9 show that Mask R-CNN yields the highest AP scores among the three thresholds. The reason is the RoIAlign operation of Mask R-CNN, which is able to extract features from small objects, i.e., Tilapia against blurred, low-light, and noisy backgrounds. This leads to a higher accuracy than the Faster R-CNN and RetinaNet models. Therefore, TDet is built on the Mask R-CNN model for determining the length and width of Tilapia from images. The TDet model obtained with the YOLOv5 framework is able to detect a single Tilapia. In more complex scenarios, where the fish appear blurry and small, as in Figure 6b, or chaotic, as in Figure 6c, the YOLOv5 model is unable to detect the fish. Mask R-CNN, on the other hand, outperformed the YOLOv5-based TDet in the complex scenarios. The YOLO network architecture employs convolutional neural networks (CNN) for extracting the significant features of the fish.
YOLO treats detection as a regression problem that is solved in a single forward propagation to provide the class probabilities of the detected Tilapia. It is therefore difficult for YOLOv5 to extract key features from intricate images, because the grid location on the spatial plane constrains the algorithm. Mask R-CNN takes advantage of the RoI and RoIAlign processes for selecting the high-level features. This leads to a higher accuracy than all the other comparison methods.
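The IoU criterion behind the AP score can be sketched as follows for axis-aligned bounding boxes. The ten thresholds are assumed here to follow the common COCO convention of 0.50 to 0.95 in steps of 0.05, which the text does not state explicitly, and the two boxes are hypothetical.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Assumed COCO-style thresholds 0.50, 0.55, ..., 0.95 (ten values).
THRESHOLDS = [(50 + 5 * k) / 100 for k in range(10)]

ground_truth = (100, 100, 300, 200)  # hypothetical annotated box
detection = (125, 100, 300, 200)     # hypothetical predicted box

score = iou(ground_truth, detection)
passed = [t for t in THRESHOLDS if score >= t]
print(score, passed)
```

A detection counts as correct at a given threshold only if its IoU reaches that threshold, so averaging over stricter thresholds rewards tighter localization.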

Determining the Regression Learning Methods for the TDepE, TP2CME, and TWE Models
The Tilapia-file dataset was used for training the TDepE, TP2CME, and TWE models, with 80% of the data used for training and the remaining data used for testing. The three sub-steps of Tilapia weight estimation are performed sequentially. A grid search and a validation dataset were used to find the optimal parameters of the TDepE, TP2CME, and TWE models. Hyperparameters are the settings, fixed before training, that govern how the parameters of a model are learned. Grid search passes every combination of the specified hyperparameter values into the model, one by one, and delivers the set of hyperparameters that gives the best performance on the validation dataset. The hyperparameters determined for RFR were {maximum depth, maximum features, minimum samples per leaf, minimum samples per split, number of estimators}, and those for SVR were {regularization parameter, kernel coefficient, kernel type}. Next, the TDepE model is presented with its chosen regression method, followed by the remaining steps in succession.
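The exhaustive search described above can be sketched in a few lines. The grid values and the scoring function `toy_validation_error` are placeholders: in practice the scoring function would train a regressor on the training split and return its error on the validation split, and a library routine such as scikit-learn's `GridSearchCV` could be used instead.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every hyperparameter combination and return the best one.

    `score_fn` maps a dict of hyperparameters to a validation error
    (lower is better), e.g. the MAE of a model trained with them.
    """
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Illustrative SVR-style grid; the candidate values are assumptions.
svr_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1], "kernel": ["rbf", "linear"]}

# Stand-in for "train on the training split, return validation MAE".
def toy_validation_error(params):
    penalty = 0.0 if params["kernel"] == "rbf" else 1.0
    return abs(params["C"] - 1) + abs(params["gamma"] - 0.1) + penalty

best, err = grid_search(svr_grid, toy_validation_error)
print(best, err)
```

The cost grows with the product of the list lengths (here 3 × 2 × 2 = 12 evaluations), which is why the candidate lists are usually kept short.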

Tilapia Depth Estimation Performance
The TDepE model was trained on data consisting of the age, the length and width of the fish (in pixels), and the actual depth of the fish. The TDepE models obtained with LR, RFR with a maximum depth of 2 levels, and SVR with a radial basis function (RBF) kernel [38,39] are presented in Table 2 and Figure 10, respectively. The RBF kernel [40] is expressed in Equation (17) as:
$$ K(X_1, X_2) = \exp\left( -\frac{\lVert X_1 - X_2 \rVert^2}{2\sigma^2} \right) \tag{17} $$
where σ² denotes the variance, which is a hyperparameter, and ‖X₁ − X₂‖ represents the Euclidean (L2-norm) distance between two points X₁ and X₂. The distance between the fish and the camera is between 5 cm and 60 cm. The depth data of the Tilapia-file dataset were collected by manual visual distance estimation with reference to distance markers installed every 10 cm in the fish recording cube. The depth-estimating performance of the LR, RFR, and SVR models is presented in Figure 10 and Table 3. The SVR model provides the best scores for MAE, R², and the MAE ratio over the LR and RFR models, at 5.52 cm and 1.56 cm for the MAE values, 0.46 and 0.12 for the R² values, and 18.67 and 2.82 for the MAE ratio values, respectively. According to Table 4, the SVR method yields outstanding performance for estimating the depth of the fish. Therefore, the SVR-based TDepE model is used for the depth estimation step. Next, the experiment determines the regression method for TP2CME and TWE by measuring the weight-estimating accuracy.
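To make the RBF kernel concrete, the sketch below evaluates it for two feature vectors. The feature values and the choice σ = 5 are hypothetical and serve only to show how the kernel value falls as two samples move apart.

```python
import math

def rbf_kernel(x1, x2, sigma):
    """RBF kernel exp(-||x1 - x2||^2 / (2 * sigma^2)) for two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

# Hypothetical (age in weeks, length in px, width in px) feature vectors.
fish_a = (20.0, 310.0, 120.0)
fish_b = (20.0, 313.0, 124.0)

same = rbf_kernel(fish_a, fish_a, sigma=5.0)  # identical points
near = rbf_kernel(fish_a, fish_b, sigma=5.0)  # Euclidean distance of 5
print(same, near)
```

Libraries that parameterize the kernel as exp(−γ‖X₁ − X₂‖²), as scikit-learn's SVR does, correspond to γ = 1/(2σ²).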

Tilapia Pixel-to-Centimeter Estimation and Tilapia Weight Performance
The three investigational cases were set as presented in Table 4 for TP2CME and TWE. Each case starts from the TDet and TDepE steps. The TP2CME model learned from the fish attributes, including the age, the length and width of the fish in pixel units, and the depth of the fish. The TWE model requires the length and the width of the fish in cm units. The experimental cases consist of the two steps of TP2CME and TWE. The TP2CME step of each case uses a different regression learning method. Hence, there are three main cases, SVR, RFR, and LR, in each of which the depth estimation is based on SVR, as shown in Table 5. Finally, all three regression methods are applied in the TWE step of each of the three cases to estimate the weight of the fish, giving nine candidates. The box plots of the weight-estimation errors of the three cases are illustrated in Figure 11. The LR-based TP2CME and TWE models yield the minimum error and deviation, as is evident from the smallest weight-error box, that of SLL (SVR for TDepE, LR for TP2CME, and LR for TWE), with an average error of 43.80 ± 47.69 g.
Figure 11. Box-plot comparison of Tilapia-weight estimating errors of the nine candidates corresponding to Case 1, Case 2, and Case 3 for determining the regression method to TP2CME and TWE.
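The way the three regression steps feed into one another can be sketched as a simple chain. The function `estimate_weight` and the three toy stand-in models are hypothetical, and their coefficients are made up; they are not the trained TDepE, TP2CME, or TWE models and only show which quantities pass between the stages.

```python
def estimate_weight(age_weeks, length_px, width_px, tdepe, tp2cme, twe):
    """Chain the three regressors: depth -> size in cm -> weight in g."""
    depth_cm = tdepe(age_weeks, length_px, width_px)
    length_cm, width_cm = tp2cme(age_weeks, length_px, width_px, depth_cm)
    return twe(length_cm, width_cm)

# Toy stand-ins with made-up coefficients, NOT the trained models.
def toy_tdepe(age, l_px, w_px):
    return 40.0  # constant depth in cm

def toy_tp2cme(age, l_px, w_px, depth):
    scale = depth / 500.0  # cm per pixel grows with depth
    return l_px * scale, w_px * scale

def toy_twe(l_cm, w_cm):
    return 10.0 * l_cm + 5.0 * w_cm

weight_g = estimate_weight(20, 300.0, 100.0, toy_tdepe, toy_tp2cme, toy_twe)
print(weight_g)
```

Passing the models in as callables mirrors the experimental design, in which a different regression method can be plugged into each step.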
The MAE and R² scores for all cases are presented in Table 6. The SLL method yields the best estimating performance among all cases, with MAE, R², and MAE ratio values of 42.54 g, 0.70, and 60.77, respectively. According to Table 6, the weight-estimating procedure can be summarized by the regression-learning solutions chosen for the TDepE, TP2CME, and TWE steps, which are the SVR model, the LR model, and the LR model, respectively.
The relationship between the weight and the size of Tilapia under linear regression, as measured by R², is shown in Figure 12. The R² value of LR is 0.95 for the weight-length relationship and 0.85 for the weight-width relationship. This result shows that the length and width of Tilapia are strongly correlated with the weight of Tilapia. According to Figure 12, the R² values indicate the strength of the relationship between the weight and the length and width variables, at 95.17% and 85.19%, respectively.
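A weight-length relationship of this kind can be fitted by ordinary least squares, as in the sketch below. The five length and weight pairs are invented for illustration and are not measurements from this study, so the resulting slope, intercept, and R² carry no experimental meaning.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Invented sample: fish length (cm) and weight (g).
lengths_cm = [20.0, 22.0, 24.0, 26.0, 28.0]
weights_g = [160.0, 230.0, 310.0, 380.0, 470.0]

slope, intercept = fit_line(lengths_cm, weights_g)
fitted = [slope * x + intercept for x in lengths_cm]

# R^2 of the fitted line on the same sample.
mean_w = sum(weights_g) / len(weights_g)
ss_res = sum((w - f) ** 2 for w, f in zip(weights_g, fitted))
ss_tot = sum((w - mean_w) ** 2 for w in weights_g)
r2_fit = 1.0 - ss_res / ss_tot
print(slope, intercept, r2_fit)
```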

Tilapia Weight Estimation Performance
This section compares the weight estimation performance of the proposed TWE-DRL method against seven benchmark fish weight-estimation methods that are based on the area (A) of the fish's body [6]. The area-based weight estimation methods, with their various coefficients, are expressed in Equations (18)-(24), where the area (A) of the fish's body in cm² is computed by multiplying the length and the width of the fish, obtained from the Tilapia detection phase, by a coefficient, i.e., A = length × width × coefficient. The coefficients in Equations (20)-(24) were obtained by fitting the line of each individual equation to the relationship between the actual area of the fish and its actual weight. The plots are illustrated in Figure 13. The evaluated Tilapia datasets were collected over 3 months, with recordings every 2 weeks, starting with 20-week-old Tilapia. All comparison methods were given the estimated length and width of the Tilapia obtained from the TDet and TP2CME models of the proposed method. The estimated weight results are presented in Table 7.
According to the results in Table 7, the proposed method obtained the smallest MAE score and the highest R² score, with an average error of 42.54 g from the actual weight of the fish. The R² of 0.70 means that the regression models of the proposed method explain about 70% of the variation in the actual weight of the Tilapia. The proposed method estimates the fish weight from the length and width of the fish, while the other methods use the area of the fish. From Figure 12, the R² values of length and width are 0.9517 and 0.8519, while the maximum R² value of the area-based methods in Equations (18)-(24) is 0.7507. Hence, the length and width of the fish are more accurate predictors of the weight of the fish than its area. Therefore, the proposed TWE-DRL method yields a higher accuracy than the area-based weight-estimation methods.
The average weight estimated by the proposed method for each week is plotted in Figure 14 against the average actual weight of the Tilapia. The weights estimated from turbid water by the proposed TWE-DRL method vary with the age of the fish. Using the TDet, TDepE, TP2CME, and TWE models, the proposed TWE-DRL method estimated the Tilapia weights consistently, and the estimates tally with the pattern of the actual Tilapia weights. The results show that, across the eight weeks, the proposed method accrued an average weight-estimation error of only 30.30 (±23.09) grams. The proposed approach achieves high accuracy and is able to track the weight evolution of the fish in the tank from week to week. In addition, once the system has completed the estimation processes, all the estimated results are saved to a Microsoft Excel file as the output of the system.
Examples of the fish body and size detection results are shown in Figure 15, where the fish were recorded underwater at various depths. The TDet model can detect multiple fish whose bodies are aligned horizontally in the image. The proposed method can precisely detect the body size of each fish even when the fish overlap, as presented in Figure 15.
The proposed TWE-DRL method can detect fish in turbid water at a variety of distances, both near to and far from the camera. The probability criterion for the TDet results is set at 0.8, so that only detections with a probability equal to or greater than 0.8 are passed on for further processing. Subsequently, the size of the fish in pixels is converted to cm by the TP2CME model, using the fish size data from the detection process together with the depth information obtained from the TDepE model. Turbid water and the depth of the fish have a major influence on fish detection, for example, when two fish overlap with one another far from the camera. The performance of the Tilapia size estimation of the proposed TWE-DRL method is measured by MAE, and the box plots are shown in Figure 16. The estimation errors accrued by the proposed method are 2.3 cm and 0.96 cm for length and width, respectively. The actual length and width of the fish range from 20 to 30 cm and from 7 to 12 cm, respectively, depending on the age of the fish. The estimated-length error, as shown in Figure 16, has a wider spread than the estimated-width error. This is caused by the wider range of the actual length of the fish compared with that of its width, and the errors are therefore consistent with the respective ranges of the two dimensions. In some cases, the proposed TWE method may detect overlapping fish as a single fish. The Tilapia were raised in 3 biofloc tanks for 3 months and were 20 weeks old at the start. The Tilapia were recorded underwater every two weeks. The estimated weights of the Tilapia from 20 weeks old to 28 weeks old, obtained from the video, are plotted against their actual weights, which are related to the actual length of the Tilapia, as illustrated in Figure 17.
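The 0.8 probability criterion amounts to a simple filter on the detector output, sketched below. The tuple layout of a detection and the three sample detections are hypothetical and are not the output format of the TDet model.

```python
SCORE_THRESHOLD = 0.8  # probability criterion stated in the text

# Hypothetical detector output: (score, length_px, width_px) per detection.
detections = [(0.97, 320, 118), (0.79, 150, 60), (0.80, 280, 101)]

# Keep only detections whose score is equal to or greater than the threshold.
kept = [d for d in detections if d[0] >= SCORE_THRESHOLD]
print(kept)
```

Only the kept detections would be passed on to the depth, pixel-to-cm, and weight estimation steps.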
Note that, at 24 weeks of age, the second tank has no data because all of its fish died and a new set of fish was supplied from a reserve tank. The proposed TWE-DRL method estimated the Tilapia weight from the observed videos, and the results closely resemble the actual weight. This demonstrates the correctness of the proposed method.
Next, the performance of the proposed TWE-DRL method is examined on a dataset of estimates derived from the models. All attributes in the estimated-value dataset were obtained by the models proposed in this paper, i.e., TDepE, TP2CME, and TWE. This dataset was used to train the TDepE, TP2CME, and TWE models by following the same steps as in Sections 3.4.1 and 3.4.2. The experiments found that the SVR, RFR, and LR methods yield the best estimation results for the TDepE, TP2CME, and TWE models, respectively. The fish weights predicted by the estimated-value models were compared with the weights obtained from the actual-value models, as shown in Figure 18. The models trained on estimated values perform with a slightly higher error than the models trained on actual values, with an MAE of 14.50 g across the test dataset.
Well-known fish weight-estimation methods can be categorized into two cases: off-water and underwater scenarios. Firstly, for the off-water case, CNN-based fish weight estimation is proposed in Refs. [5,41], which use ResNet-34 and LinkNet-34 for segmenting fish images; the weight of the fish is then computed from the surface area of the fish. The datasets from this research contain 2445 images of fish with weights in the range of 15 g to 2500 g, where the distance between the fish and the camera is constant in all images. Thus, the depth of the fish is provided as a priori information. The mass estimation performance of Ref. [42] yields an R² value of 0.976. Another off-water method is presented in Ref. [5], whose dataset contains 694 images of fish from 22 species captured at 9 tributaries. The fish's weight is between 500 g and 1200 g. Six cameras were set at a fixed distance, three being near-infrared cameras and three being general cameras. The output of the DCNN phase is passed into the regression phase, where the final output is the averaged value of nine images. The weight estimation of Ref. [5] attains an MAE of 634 g. Secondly, underwater fish-weight estimation is presented in Ref. [7] as a weight prediction system for Nile Tilapia. This method uses stereo cameras for distance measurements and captured 10 Tilapia in a tank of clear water for 3 weeks. The fish's weight is in the range of 24 g to 41 g. CNNs are used for fish detection. Regression equations are proposed for computing the depth of the fish, converting pixels to cm, and predicting the weight. The correlation of weight and length based on linear regression has an R² value of 0.87. In comparison, the fish's weight in the proposed TWE method is between 155 g and 561 g, and the R² value is 0.95. Moreover, underwater fish weight estimation was explored in Ref. [43]. A unidirectional-tunnel controlled underwater studio was established with a single camera. A fish is assumed to be positioned along the x-axis.
A combination of 2D saliency detection and morphological operators is used for fish segmentation. The curve for length measurement is estimated from the segmented images by fitting a third-degree polynomial regression to the fish mid-points. Several regression algorithms were investigated to compute the weight of the fish. The method from Ref. [43] obtained an R2 value of 0.97.
Based on the current state-of-the-art fish weight-estimation methods, a special camera or a controlled environment is commonly required for collecting fish images. CNN approaches are used to identify fish in images, and regression learning approaches are applied to estimate the weight of the fish and to identify the fish features that are significantly related to its weight. Those methods were used in different scenarios. The proposed TWE method requires only a single camera and no controlled environment. The general CNN and regression learning models are formulated through a process similar to that of the other well-known methods. However, the TWE-DRL algorithm requires only three features, i.e., the age, length, and width of the fish.
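As an illustration of the regression step discussed above, the following minimal Python sketch fits an ordinary least-squares model on the three features (age, length, and width) and scores it with MAE and R2. It is not the authors' implementation: the function names, the six synthetic samples, and the linear rule that generates their weights are all hypothetical, chosen only so that the example is self-contained.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the system is non-singular (no zero pivot check is performed).
    """
    n = len(b)
    # Augmented matrix so that row operations apply to both sides.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear_regression(features: Sequence[Sequence[float]],
                          weights: Sequence[float]) -> List[float]:
    """Least-squares fit of weight = b0 + b1*age + b2*length + b3*width."""
    rows = [[1.0, *f] for f in features]  # prepend the intercept column
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, weights)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(beta: Sequence[float], feature: Sequence[float]) -> float:
    return beta[0] + sum(b * v for b, v in zip(beta[1:], feature))


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical samples: (age in weeks, length in cm, width in cm).
FEATURES = [
    (10.0, 15.0, 5.0),
    (12.0, 18.0, 6.5),
    (14.0, 20.0, 6.0),
    (16.0, 22.5, 8.0),
    (18.0, 24.0, 7.5),
    (20.0, 27.0, 9.5),
]
# Synthetic weights in grams from an arbitrary linear rule (illustration only).
WEIGHTS = [-100.0 + 5.0 * a + 12.0 * l + 20.0 * w for a, l, w in FEATURES]

beta = fit_linear_regression(FEATURES, WEIGHTS)
predictions = [predict(beta, f) for f in FEATURES]
print("MAE (g):", round(mean_absolute_error(WEIGHTS, predictions), 6))
print("R2:", round(r2_score(WEIGHTS, predictions), 6))
```

Because the synthetic weights follow an exact linear rule, the fit here is essentially perfect; on real measurements the same two metrics would quantify how far the predictions deviate from the weighed fish.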
The limitations of existing underwater fish weight-estimation methods stem mostly from the requirement for special cameras and/or a controlled environment for collecting fish images. Deep-learning-based fish weight estimation has high computational complexity, while the regression learning approach is mostly applied to off-water weight estimation. On the other hand, the limitation of our proposed method is that it requires a priori information on the fish's age. In addition, the turbidity of the water influences fish detection to a certain degree. This is evident in the experimental results across the different weeks, owing to the biofloc. For future work, a pseudo-stereo image will be introduced for extracting the depth of the fish directly from a single-channel image recording, and this will be used to produce the depth estimation [44,45].
The computational complexity of the proposed algorithm can be expressed in big-O notation. The proposed method has two major components: firstly, Tilapia detection based on deep learning and, secondly, Tilapia weight estimation based on regression methods. For the deep learning algorithm, the computational complexity is dominated by the number of iterations and the number of network layers corresponding to the number of input data. The computational complexity of a neural network [46,47] in the fully connected (FC) layers is $O(n^4)$, $O(n)$, $O(n^2)$, $O(k \cdot n \cdot \log(n) \cdot m)$, where $n$ denotes the number of neighbors, $m$ is the number of training data, and $k$ represents the number of features [48]. The deep learning algorithm involves a large number of model parameters, which leads to large memory consumption. The Mask R-CNN architecture comprises three major components, i.e., the Backbone, Head, and Mask Branch.
Each RoI needs to be calculated separately, which is time-consuming. In addition, the number of feature channels after RoI pooling is large, which makes the two FC layers consume a lot of memory and potentially affects the computational speed. The number of ResNet-50 parameters varies with the number of layers, as presented in Table 8. Therefore, in our proposed method, fish detection using Mask R-CNN consumes the most computational time. However, Mask R-CNN yields higher accuracy, and given the current GPU configuration, this computational complexity is relatively modest.
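To make the memory argument concrete, the sketch below counts the parameters of two stacked FC layers applied to a flattened RoI feature map, together with the multiply-accumulate operations incurred when every RoI is processed separately. The sizes used (7 × 7 pooling, 256 channels, 1024 hidden units, 100 RoIs, 32-bit floats) are assumptions chosen for illustration, not values reported in this paper, and the helper names are hypothetical.

```python
def fc_layer_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


def two_fc_head_params(pool_size: int, channels: int, hidden: int) -> int:
    """Parameters of two stacked FC layers fed by a flattened RoI feature."""
    flattened = pool_size * pool_size * channels
    return fc_layer_params(flattened, hidden) + fc_layer_params(hidden, hidden)


def two_fc_head_macs(pool_size: int, channels: int, hidden: int,
                     num_rois: int) -> int:
    """Multiply-accumulate count when each RoI passes through both layers."""
    flattened = pool_size * pool_size * channels
    per_roi = flattened * hidden + hidden * hidden
    return per_roi * num_rois


def to_mebibytes(num_params: int, bytes_per_param: int = 4) -> float:
    """Storage for the parameters, assuming 4 bytes (float32) by default."""
    return num_params * bytes_per_param / (1024 ** 2)


# Assumed configuration (illustration only, not taken from the paper).
POOL_SIZE, CHANNELS, HIDDEN, NUM_ROIS = 7, 256, 1024, 100

params = two_fc_head_params(POOL_SIZE, CHANNELS, HIDDEN)
macs = two_fc_head_macs(POOL_SIZE, CHANNELS, HIDDEN, NUM_ROIS)
print("FC parameters:", params)
print("FC parameter memory (MiB):", to_mebibytes(params))
print("FC multiply-accumulates for all RoIs:", macs)
```

Under these assumptions, the first FC layer dominates the count because its input is the full flattened feature map, and the per-image cost grows linearly with the number of RoIs.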

Conclusions
Fish monitoring in underwater environments remains a challenging task due to many factors, such as the dynamics of fish movement, lighting conditions, water quality, and background noise. The focus of this paper lies in developing a low-cost, practical, single-sensor imaging system with deep and regression learning algorithms for the non-intrusive estimation of fish weight. The proposed method consists of a Tilapia detection step and a Tilapia weight-estimation step. The curated Tilapia datasets are of two types, one for estimating the fish's depth from the camera and another for estimating the fish's physical dimensions. A low-cost off-the-shelf camera is used for recording the fish. The Tilapia detection model has been trained on the image datasets using a deep neural network, Mask R-CNN, with transfer learning. The Tilapia weight-estimating models are based on regression learning and require only three features of the fish: the fish's length and width, depth, and age. Three regression learning methods have been investigated for Tilapia weight estimation. The experimental results show that the proposed algorithm estimates Tilapia weight with an MAE of 40.78 g, an R2 of 0.74, and an average weight error of only 30.30 (±23.09) grams in a turbid water environment, which shows the practicality of the proposed framework. The principal strengths of the proposed method are the continuous extraction of only three fish features, which results in less time-consuming training processes, and its ability to estimate the weight of Tilapia in turbid water using low-cost video recording. The proposed algorithm has been demonstrated to be highly amenable to real-world fish farms, as it uses only low-cost video cameras without other special sensors.