An Experimental Study on Estimating the Quantity of Fish in Cages Based on Image Sonar

Abstract: To address the highly demanding task of assessing the quantity of fish in cages, a method for estimating the fish quantity in cages based on image sonar is proposed. In this method, forward-looking image sonar is employed for continuous detection in cages, and a YOLO target detection model with an attention mechanism is combined with a BP neural network to achieve real-time automatic estimation of the fish quantity in cages. A quantitative experiment was conducted in the South China Sea to build a database for training the YOLO model and the neural network. The experimental results show that the average detection accuracy mAP50 of the improved YOLOv8 is 3.81% higher than that of the original algorithm. The accuracy of the neural network in fitting the fish quantity reaches 84.63%, which is 0.72% better than cubic polynomial fitting. In conclusion, the accurate assessment of the fish quantity in cages contributes to the scientific and intelligent management of aquaculture and the rational formulation of feeding and fishing plans.


Introduction
As China is a major agricultural country, the development of its agricultural economy is closely tied to the development of the national economy [1]. As an important branch of aquaculture, fishery farming has always been an important pillar of China's agricultural economy. With the development of society, science, and technology, the level of agricultural modernization has rapidly improved, and the intelligent development of fish farming has accelerated. The monitoring and regulation of the breeding environment and the decision making of feed feeding have gradually shifted from complete reliance on manual diagnosis, decision making, and adjustment to mechanized and precise monitoring equipment, and then to digitalized and intelligent systems [2].
At present, fish farming in China varies between pond and cage cultures. Among these, cage culture exhibits the highest level of intensification, with a myriad of issues arising during fish farming [3]. Fish quantity monitoring, as an important part of cage aquaculture production management, is of profound significance mainly in the following three aspects: (1) intelligent management of aquaculture, allowing aquaculture managers to adjust the feeding amount and make fishery harvesting plans according to the fish production; (2) early warning of fishnet safety and fishnet breakage in the case of an abnormal fish quantity, so that the net can be repaired in time to reduce losses; (3) facilitation of the assessment of the financial assets of the catch, rendering the necessary technical conditions for treating fishery harvests as financial assets [4].
Given the above requirements, experts and scholars at home and abroad have put forward solutions based on different monitoring methods. Baumgartner et al. [5] observed fish in artificial ponds and calculated the fish quantity and body length by software, concluding that sonar was effective in observing fish activities and obtaining quantitative information. Ding et al. [6] collected 59 h of underwater data using ARIS sonar and completed the automatic processing of a large amount of acoustic data through an image processing algorithm, including target extraction and counting. A remote cage monitoring system that combines light and sound with motor rotation scanning was jointly developed by the Massachusetts Institute of Technology and the Woods Hole Oceanographic Institution, which can identify individual fish well to achieve safe monitoring of the fishnet [7]. However, its high cost and the prolonged acoustic imaging time required by motor rotation detection (compared to the standard imaging time of 3 min) cause repeated detection of swimming fish in cages, resulting in a large error in fish quantity estimation. Domestically, the Fishery Machinery and Instrument Research Institute of the Chinese Academy of Fishery Sciences developed a multi-angle cage monitor based on optical means [8]. Because the sea water is turbid in most coastal areas of China except Hainan, the instrument was limited by its small effective observation range for underwater targets and its high power consumption. Given the limitations of the above optical monitoring technology under actual cage conditions, most domestic research has prioritized acoustic monitoring methods. The Shanghai Acoustics Laboratory of the Chinese Academy of Sciences put forward the acoustic warning tape method and the remotely operated vehicle patrol method, which were mainly used for monitoring the size of the netting and the fish but were less able to obtain quantity data. Xiamen University has successively developed acoustic monitoring systems based on the vertical detection method and the single-beam transducer motor-rotating horizontal scanning method. The circular multi-beam scanning detection method had high requirements for the estimation of fish swimming speed, and either underestimation or overestimation would lead to some fish being missed or repeatedly detected [9][10][11]. Yihan Feng et al. [12] introduced an automated method for estimating fish abundance in sonar images based on a modified MCNN (multi-column convolutional neural network), named FS-MCNN. They also proposed the multi-dilation rate fusion loss, which improved the accuracy and robustness of the model. This method mitigated the impact of the low pixel count and blurry target edges of sonar images.
The target recognition technique is indispensable for locating and counting fish in acoustic images. Since the R-CNN (Regions with CNN Features) was put forward in 2014, target detection methods based on deep learning have become the main technique, replacing traditional methods [13]. Initially, the two-stage method was adopted for deep-learning-based target detection; that is, the detection process was explicitly divided into two stages, candidate region selection and target region judgment, with high detection accuracy but slow detection speed. Later, in 2016, the one-stage target detection method represented by YOLOv1 came into being. Instead of extracting candidate regions in advance, the method directly predicts the category probability and position of the output target object, which attracted more attention by greatly reducing the consumption of computing resources and improving the detection speed [14]. The YOLO series of target detection methods, developed alongside single-stage object detection, is regarded as a typical representative of the one-stage method. Ye Zhaobing et al. [15] proposed the YOLOv3-SPP (Spatial Pyramid Pooling) underwater target detection algorithm to solve the problem of missed and false detections caused by unclear images and the complex underwater environment. Chen Yuliang et al. [16] put forward a method for detecting and identifying underwater biological targets in shallow water based on the YOLOv3 network, aiming to overcome the low detection accuracy caused by color distortion, rough images, local overexposure, and large size differences in underwater images.
In response to the deficiencies of the aforementioned detection methods, this paper proposes a method for estimating the quantity of fish in net cage farming based on forward-looking imaging sonar. This method utilizes forward-looking sonar to generate acoustic images of aquaculture net cages, employs a YOLOv8 neural network model with an added attention mechanism to identify fish targets, and uses a BP neural network to invert the feature data and estimate the overall quantity of fish. Quantitative detection experiments were conducted in constructed fish cages, with multiple sets of experimental results showing that the average accuracy of fish quantity assessment reached 84.63%, thereby validating the feasibility of this method. With this method, fish farmers can gain real-time insights into the quantity of fish inside net cages during the farming process, enabling scientific aquaculture management and reducing farming risks.

Overall Process
On the whole, the adopted method is divided into three steps. Firstly, the image sonar is fixed on one side of the cage and observes for more than 10 min, with the sonar data recorded and exported as video. Secondly, the improved YOLOv8 model is used to detect all frames of the current video, with only one detection category, namely fish. Thirdly, the number of fish detected in each frame of the video is sorted from largest to smallest, and the actual quantity of fish in the cage is estimated by the trained neural network model from the top 20 detection counts. For the second step, the YOLOv8 model must be trained with fish sonar image data; the neural network in the third step is trained on the mapping between previous observation data and the actual quantity.
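As a rough sketch, the three steps above can be expressed as the following pipeline. The names `detect_frame` and `estimate_quantity` are hypothetical stand-ins for the trained YOLOv8 detector and BP network, which are not reproduced here:

```python
# Hypothetical sketch of the three-step pipeline described above.

def detect_frame(frame):
    """Placeholder: return the number of fish boxes found in one frame.
    A real implementation would run YOLOv8 inference here."""
    return frame["fish_count"]

def estimate_quantity(top_counts):
    """Placeholder: a trained BP network would map the top detection
    counts to an estimated total; here we average them for illustration."""
    return sum(top_counts) / len(top_counts)

def estimate_cage_population(frames, top_k=20):
    # Step 2: detect fish in every frame of the recorded video
    counts = [detect_frame(f) for f in frames]
    # Step 3: sort per-frame counts descending, keep the top_k largest
    top = sorted(counts, reverse=True)[:top_k]
    return estimate_quantity(top)

frames = [{"fish_count": c} for c in [3, 7, 5, 9, 2, 8]]
print(estimate_cage_population(frames, top_k=3))  # averages 9, 8, 7 -> 8.0
```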

Introduction to Image Sonar
The ARIS1800 (Adaptive Resolution Imaging Sonar) used in the present study was introduced by Sound Metrics in 2012. When forward-looking sonar performs detection, the transducer at its top emits ultrasonic waves in the forward direction; the objects illuminated by these waves reflect them, forming echo signals, and the sonar receives these signals to generate acoustic images. Typically, dividing the detection beam horizontally into multiple smaller fan-shaped beams can enhance imaging precision, with the vertical angle of each beam group remaining unchanged. Table 1 below lists the specific parameters of the ARIS1800 [17], and Figure 1 is a photograph of the ARIS1800 sonar. The process of acoustic image generation is usually accompanied by water reverberation, channel changes, interference, and self-noise generated by target activity. The non-sequential emission of the ARIS transducer elements can effectively reduce the influence of self-noise and crosstalk. In operation, the ARIS transducer actively emits sound waves across the field of view, and acoustic images with different light and dark characteristics are formed according to the strength of the reflected echoes. The acoustic image includes a bright area corresponding to the bottom, a bright area representing the fish target, and a dark area corresponding to the water background, as shown in Figure 2 [18].

Sonar Data Acquisition of Fish Quantity in Cages
The experimental data were measured in the sea area of Guishan Island, Zhuhai City, Guangdong Province, China, in March 2023 (longitude 113.84473° E, latitude 22.12571° N). Figure 3 shows a satellite image of the experimental sea area, and Figure 4 shows an aerial image of the experimental base. The cage used in this experiment is shown in Figure 5, with the fishnet measuring 6 × 3 × 4 m (length × width × height). During the experiment, iron blocks were tied to the four corners of the fishnet as counterweights to keep the netting open. The object of sonar detection was golden pomfret with a body length of about 15 cm, which was placed in the experimental cage. As can be seen from Figure 6, the ARIS sonar was tied to a lifebuoy floating in the water, with the sonar probe placed at a depth of 30 to 40 cm and inclined 45 degrees to the left. The sonar was placed in the middle of the short side of the netting so that the sonar signal covered as much of the water space as possible. The sonar was then connected to a laptop computer, and the supporting software ARISFish (v2.6.3) was used for data acquisition. The upper computer software ARISFish communicates with the sonar device to receive and process the collected data and displays the real-time processing results graphically. The sonar was used in high-frequency mode, that is, at a frequency of 1.8 MHz, and the detection distance was set to just reach the netting on the opposite side, which was about 4.6 m away. The quantitative experiment was carried out in standard groups of 20 fish, with 20, 40, 60, and 80 golden pomfrets put into the experimental cage in turn. Each group of fish was continuously detected by sonar, and every 10 min of data was recorded as an ARIS source file and saved on the computer for subsequent processing.
The sonar images of the different groups of fish are presented in Figure 7, in which the clear outlines of the fish and netting are visible. The orientation of the images is not constant because the waves constantly beat against the sonar, causing the probe to swing left and right within a certain range. Meanwhile, the sonar opening angle of only 28° made it impossible for the detection beams to cover the entire cage; that is, not every fish was visible, which placed higher demands on the subsequent estimation method.

Introduction to YOLO Algorithm
YOLO is a one-stage target detection model based on a neural network. Firstly, the input image is divided into S × S grids, and each grid generates B prediction boxes, each represented by a corresponding feature vector, generally taking S = 7 and B = 2. The feature vector is composed of the coordinates of the center point of the prediction box, the width and height of the box, and the confidence that an object exists; each grid also generates a classification prediction feature vector. Finally, the prediction boxes with high confidence and their classifications are mapped back to the original input image [19].
By adding a feature fusion method to the feature extraction network, the algorithm adopts the Darknet-53 backbone network. The feature extraction network structure of the YOLO model is shown in Figure 8. The network is a fully convolutional network, which is trained and tested on the COCO dataset and finally outputs a feature map of size 13 × 13 × 255. After the feature map is input to target detection layer 1, position regression and classification regression are performed. Moreover, the feature map of the last layer and the feature map of the middle layer are fused by up-sampling and input into target detection layer 3 and target detection layer 2, respectively, achieving position regression and classification regression on feature maps of multiple sizes [20]. Considering the performance and stability of the model comprehensively, the YOLOv8 model was used for fish target detection in this study. YOLOv8 directly transforms the fish detection problem into a regression problem. After one regression, not only the position coordinates of each fish group are generated, but also the probability of each candidate region belonging to the category is obtained.

YOLO Algorithm Improvement
On the premise of satisfying real-time performance and high detection accuracy, a target detection model based on an improved YOLOv8 algorithm is proposed. Considering that every fish is a small target in a sonar image, the core idea of the improved algorithm is to improve the network's ability to perceive small-target feature information [21]. Firstly, the CBAM (Convolutional Block Attention Module) [22] is improved by using the attention mechanism, and the channel-spatial attention module CSAM is proposed, which is lighter and can focus on the spatial-dimension features of small targets. The CSAM is embedded after each convolution of the backbone network to extract features. Then, a 4-fold down-sampling branch is added to the YOLOv8 backbone network, giving 4-scale detection. After the input image is down-sampled by a factor of 4, a large shallow feature map is obtained. Because of its small receptive field, this feature map contains rich position information, improving the detection of small targets [23].
CAM is the channel attention module in CBAM. It consists of two fully connected layers that capture non-linear cross-channel interaction. However, the introduction of the fully connected layers causes a large amount of computation: even if the channel features are compressed, the parameter quantity is still proportional to the square of the number of channels [24]. To reduce the computational burden, a one-dimensional convolution with kernel length k is used to achieve local cross-channel interaction, following the idea of ECANet, aiming to extract the dependency between channels [25]. L-CAM denotes the improved lightweight channel attention module, and the convolution kernel length k is calculated by Formula (1):

k = | lb(C)/γ + b/γ |_odd (1)

where C is the number of channels of the input feature map; γ and b are set to 2 and 1, respectively; "lb" denotes the base-2 logarithm; and |t|_odd denotes the odd number nearest to t.
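Formula (1) follows ECA-Net's adaptive kernel-size rule. A minimal sketch (the function name `eca_kernel_size` is our own) is:

```python
import math

def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive 1-D convolution kernel length, Formula (1):
    k = |lb(C)/gamma + b/gamma| rounded to the nearest odd number."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1  # force an odd kernel length

print(eca_kernel_size(64))   # 3
print(eca_kernel_size(256))  # 5
```

Wider feature maps thus get a slightly longer kernel, so the local cross-channel interaction range grows roughly logarithmically with the channel count.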
SAM stands for the spatial attention module in CBAM. In this study, a new channel-spatial attention structure, CSAM, was constructed by using the improved L-CAM and SAM modules, as shown in Figure 9. Firstly, L-CAM and SAM were used to obtain the channel attention weight M_c and the spatial attention weight M_s, respectively. Then, the attention maps M_c and M_s were expanded to the size R^(W×H×C), where W and H represent the width and height of the image, respectively, and C represents the number of channels. Element-wise summation and sigmoid normalization were carried out to obtain the attention weight matrix M based on both space and channel. The weight reflects the attention distribution in the feature map so that the model can obtain more effective features in the more accurate attention area, as shown in Formula (2):

M = σ(M_c ⊕ M_s) (2)

Finally, the mixed attention weight matrix M was multiplied element by element with the input feature map F and added to the original input feature map to obtain a refined feature map F′, calculated as shown in Formula (3):

F′ = (F ⊗ M) ⊕ F (3)

The attention mechanism tells the model where to concentrate more computation, improving the expressive force of the region of interest [26]. The idea of CSAM is to obtain the attention weight matrices M_c and M_s from the input feature map F along the channel and spatial dimensions, respectively, to improve the effective flow of feature information in the network. This module emphasizes meaningful features, focusing on important features and suppressing invalid features in the two dimensions of channel and space. For small targets, a single feature region gains more weight and contains more effective targets. The model will place a much higher premium on learning the features of this region to extract features better with limited computing resources.
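Formulas (2) and (3) amount to a broadcast sum, a sigmoid, and a residual reweighting. A NumPy sketch, assuming channel-last W×H×C arrays (the real module operates on PyTorch feature maps inside the network), is:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def csam_refine(F, Mc, Ms):
    """Sketch of Formulas (2) and (3): broadcast the channel weight Mc
    (shape 1 x 1 x C) and spatial weight Ms (shape W x H x 1) to
    W x H x C, sum them, squash with a sigmoid, then reweight F and
    add the original input back (residual connection)."""
    M = sigmoid(Mc + Ms)   # broadcasting expands both maps to W x H x C
    return F * M + F       # element-wise product plus residual input

W, H, C = 4, 4, 8
F = np.ones((W, H, C))
Mc = np.zeros((1, 1, C))   # dummy channel attention weights
Ms = np.zeros((W, H, 1))   # dummy spatial attention weights
out = csam_refine(F, Mc, Ms)
print(out.shape)           # (4, 4, 8)
```

With zero attention logits the sigmoid yields 0.5 everywhere, so the refined map is 1.5 × the input, illustrating that the residual term keeps features from being fully suppressed.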

Dataset Making
One hundred sonar images of fish schools were extracted from the experimental data and labeled with the MakeSense online data labeling website. There was only one labeling category, namely fish. After all labeling was completed, the label files were exported, with each sonar image corresponding to a text file of the same name recording the labeling results. The labeled dataset was divided into two parts by random numbers, with 80 images as the training set and 20 as the test set.
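The random 80/20 split can be reproduced with a few lines of Python; the file names below are placeholders:

```python
import random

def split_dataset(images, train_fraction=0.8, seed=42):
    """Randomly split labeled images into training and test sets,
    mirroring the 80/20 split used for the 100 sonar images."""
    shuffled = images[:]
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = split_dataset([f"img_{i:03d}.png" for i in range(100)])
print(len(train), len(test))  # 80 20
```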

Experimental Environment
The Windows 10 system was used in the experiment, with an NVIDIA GeForce RTX 3070 (8 GB) as the GPU and an Intel i9-12900H as the processor. The software environment was Python 3.9.13, PyTorch 1.13.1, and CUDA 11.7.

Evaluation Indicators
For the detection performance, the mean average precision (mAP), parameter quantity (Params), computation quantity (GFLOPs), and speed (FPS) were used as evaluation indexes [27]. In the process of calculating the mAP, it was necessary to calculate the average precision (AP) first, which represents the average precision of one category in the dataset. The calculation process is shown in Formula (4):

AP = ∫₀¹ P(r) dr (4)

Then, the AP values of the different categories were averaged to obtain the mAP, as shown in Formula (5):

mAP = (1/N) Σᵢ APᵢ (5)

where P represents the precision ratio, that is, the proportion of correct results among all results recognized by the model; r represents the recall ratio, that is, the proportion of correct model recognitions among all results that should be recognized in the dataset; and N represents the number of sample categories, with N = 1 in this study.
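Formulas (4) and (5) can be sketched as follows, with AP approximated by trapezoidal integration over sampled precision-recall points (a simplification of the interpolated AP used by COCO-style evaluators):

```python
def average_precision(pr_points):
    """Formula (4): AP as the area under the precision-recall curve.
    `pr_points` is a list of (recall, precision) pairs sorted by recall."""
    area = 0.0
    for (r1, p1), (r2, p2) in zip(pr_points, pr_points[1:]):
        area += (r2 - r1) * (p1 + p2) / 2.0  # trapezoid between samples
    return area

def mean_average_precision(ap_values):
    """Formula (5): the mean of per-category AP values; N = 1 in this study."""
    return sum(ap_values) / len(ap_values)

# Perfect precision over the whole recall range gives AP = 1.0
ap = average_precision([(0.0, 1.0), (0.5, 1.0), (1.0, 1.0)])
print(ap, mean_average_precision([ap]))  # 1.0 1.0
```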

Training Process
When training the detection network model, the number of iterations was set to 300, the weight attenuation coefficient to 0.0005, the initial learning rate to 0.01, the learning rate momentum to 0.937, and the batch size to 16.As shown in Figures 10 and 11, the model triggered "Early Stopping" to stop training after 120 iterations, at which time the loss decreased to 0.6 and the mAP50 reached 73.02%.

Ablation Experiments
To verify the effectiveness of the channel-spatial attention mechanism CSAM proposed in this paper, different modules were added to the YOLOv8 detection algorithm under the same experimental conditions, and the influence of each module on the performance of the detection algorithm was evaluated. The results are shown in Table 2. Among the added attention modules, CSAM improved the accuracy of the detection algorithm the most, by 3.81 percentage points, while also keeping fewer parameters, less computation, and the real-time performance of the algorithm. To verify the superiority of the improved detection algorithm, three mainstream detection algorithms were selected for comparative experiments, as shown in Table 3. With all input sizes set to 640 × 640 pixels, the detection accuracy of the improved algorithm was better than that of the other algorithms while ensuring real-time detection. Compared to Faster R-CNN, mAP50 increased by 18.06 percentage points, while Params and FLOPs decreased by 1.59 × 10^8 and 8.56 × 10^10, respectively. Compared to YOLOv5, the mAP50 of this algorithm increased by 8.24%. On the whole, the improved detection algorithm adds the attention module CSAM to the backbone network, which improves the feature extraction ability for small targets and makes the model better at detecting fish in sonar images. A comparison between the algorithm in this paper and YOLOv8 without an attention mechanism in detecting fish sonar images is presented in Figure 12.
Figure 12a–c show the results of the algorithm in this paper, and Figure 12d–f show those of YOLOv8; the upper and lower parts correspond to the same frame. It can be seen that the model can distinguish fish from the netting, with each detected fish marked by a red identification box. Comparing Figure 12a and Figure 12d, the algorithm in this paper detected the leftmost small fish, while YOLOv8 did not and instead mislabeled the rightmost black fish. Comparing Figure 12b and Figure 12e, the algorithm in this paper recognized one more small fish than the original algorithm. Figure 12c turns on the label and confidence display, and it can be seen that the average confidence of fish identification was higher than 80%, which shows that the neural network model can identify fish well.

Introduction to the BP Neural Network
In this study, mapping the detected fish quantity to the actual quantity is a nonlinear mapping problem. Neural networks have strong applicability in dealing with nonlinear mappings and are widely used as an effective method of data fitting [28].

The BP (back propagation) neural network is a widely used algorithm at present. The training steps are: initializing the weights and thresholds of each layer, inputting sample data at the input layer, and finally outputting the results at the output layer after calculation in the hidden layers. In the forward transmission process, each layer only affects the adjacent next layer. If the results of the output layer do not meet the expected output value, the error with respect to the expected value is propagated back through the network, so that the error function decreases along the negative gradient direction [29].

The BP neural network includes one input layer, one or more hidden layers, and one output layer. The basic topological structure of the BP neural network (taking one hidden layer as an example) is shown in Figure 13. In a neuron, the weighted sum of the inputs is passed through another function, the activation function. The role of the activation function in a neural network is to transform multiple linear inputs into a nonlinear relationship, achieving the mapping from linear to nonlinear. The definition of the sigmoid function is shown in Formula (6) [30]:

σ(x) = 1 / (1 + e^(−x)) (6)
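A minimal forward pass through one hidden layer with the sigmoid of Formula (6) can be sketched as follows (biases are omitted for brevity and the weights are illustrative, not trained values):

```python
import math

def sigmoid(x):
    """Formula (6): the logistic activation 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, hidden_weights, output_weights):
    """Forward pass of a one-hidden-layer BP network, as in Figure 13:
    each hidden neuron takes a weighted sum of the inputs and applies
    the sigmoid; the output is a weighted sum of hidden activations."""
    hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs)))
              for row in hidden_weights]
    return sum(w * h for w, h in zip(output_weights, hidden))

print(sigmoid(0.0))  # 0.5
print(forward([1.0, 2.0], [[0.0, 0.0], [0.0, 0.0]], [1.0, 1.0]))  # 1.0
```

Training would then back-propagate the output error through these same weights along the negative gradient, as described above.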

Training Data
To automatically obtain the fish quantity in the cage, human subjective factors and manual intervention should be minimized. In this paper, all the images collected by sonar were selected for target recognition and detection. The sonar data were divided into four groups of 20, 40, 60, and 80 fish, and each group had 10 continuous detection videos with a frame rate of 15 frames per second. Seven videos from each group were randomly selected as fitting data, and the remaining three were used as detection data. These 40 sonar videos were detected by this algorithm, and the identification data of each frame were saved as a text file.
Each 10-minute video contained nearly 10,000 images. If all of this vast dataset were used for fitting, it would not only require a vast amount of computation but also make it difficult for the algorithm to learn the key features of the data. Considering that the goal is to obtain the fish quantity in the cage, and that there is a certain mapping relationship between the quantity of fish in the detection images and the actual quantity, the fish counts detected in single frames of each video were sorted from large to small, and the top 30 detected counts were taken. The statistical results are shown in Figure 14.
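The reduction from roughly 10,000 frames per video to a short descending sequence is a one-line sort; `per_frame_counts` is a hypothetical list of detection counts, one per frame:

```python
def top_detection_counts(per_frame_counts, k=30):
    """Sort single-frame detection counts from large to small and keep
    the k largest, as done for each 10-minute video."""
    return sorted(per_frame_counts, reverse=True)[:k]

print(top_detection_counts([4, 9, 1, 7, 7, 3, 8], k=3))  # [9, 8, 7]
```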

Evaluation Indicators
In the process of neural network training, the error between the predicted or fitted data and the measured data can be expressed by the MSE (mean square error), as shown in Formula (7):

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)² (7)

where n represents the data quantity, yᵢ represents the measured data, and ŷᵢ represents the data predicted or fitted by the neural network model.
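Formula (7) can be written directly in Python; the sample values below are illustrative, not experimental measurements:

```python
def mean_squared_error(measured, predicted):
    """Formula (7): MSE = (1/n) * sum((y_i - y_hat_i)^2)."""
    n = len(measured)
    return sum((y - y_hat) ** 2
               for y, y_hat in zip(measured, predicted)) / n

print(mean_squared_error([20, 40, 60], [22, 37, 60]))  # (4 + 9 + 0) / 3
```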

Training Process
The top 10, top 20, and top 30 detected fish counts were input into the network for training, with the corresponding actual quantity as the fitting target. After comparative experiments, the best results were obtained when fitting with the top 20 detected counts and with 30 neurons in the hidden layer. Bayesian regularization was used for training. There were 28 groups of data, of which 85% were randomly selected as training data and the remaining 15% as test data.
Based on the above parameters, the neural network was trained, and the training results are shown in Figures 15 and 16. Figure 15 shows the change in the sample mean square error. After 45 training iterations, the MSE of the training group reached its best value of 85.1716. Figure 16 shows the prediction errors of the training group and the test group, in which the vast majority of sample errors were between −12 and 12, with positive numbers indicating that the prediction was greater than the actual quantity and negative numbers indicating that it was lower. The learning results of the BP neural network are shown in Figure 17. The regression results of the training group, the test group, and all data, that is, the fitting degree between the output value and the target value, are shown in these three small graphs. As can be seen from the figure, most of the data are concentrated near the diagonal, with some data farther away, and the fitting coefficients are all above 0.82, indicating that the fitting effect is relatively good.

Fitting Test
The BP neural network was used to estimate the top 20 fish counts of the three tests in each group of test data, and the fitting results are shown in Table 4. Here, error quantity = fitted total − actual quantity; error percentage = |error quantity| / actual quantity; the average error is the mean of all error percentages; and average accuracy = 1 − average error. It can be seen from Table 4 that the algorithm in this paper fitted the sonar image data of the 20-, 40-, and 60-fish groups with high accuracy, achieving single-digit errors. However, when fitting the sonar data of 80 fish, the error was large: the detected quantity sequence was small, resulting in an error of about 27%. Manual inspection of the detection videos with serial numbers 11 and 12 showed that there were few fish in the sonar images. It is speculated that the sonar probe shook badly during this period due to heavy sea waves and that the swimming trajectories of the fish differed from usual, so the data detected by the sonar did not reflect the real situation in the cage. A solution is to observe over multiple periods, obtain multiple groups of sonar image data for target recognition and detection, eliminate detection sequences with excessively large deviations, and then estimate with the neural network. The averaged data are more objective and closer to reality.
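The error metrics defined above can be sketched as follows; the sample estimates are illustrative, not values from Table 4:

```python
def fitting_errors(estimates, actuals):
    """Error metrics as defined for Table 4: signed error quantity,
    average error (mean of |error| / actual), and average accuracy."""
    errors = [est - act for est, act in zip(estimates, actuals)]
    pct = [abs(e) / act for e, act in zip(errors, actuals)]
    avg_error = sum(pct) / len(pct)
    return errors, avg_error, 1.0 - avg_error

errs, avg_err, accuracy = fitting_errors([22, 38, 61], [20, 40, 60])
print(errs)  # [2, -2, 1]
```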

Data Fitting and Comparison
The commonly used data fitting methods are linear fitting and polynomial fitting. Because they can only deal with one-to-one mapping relationships, it is necessary to extract key data from the detection sequence [31]. In this paper, the quantity of fish detected in a single image in each video was sorted from large to small, and the maximum detected quantity, the average of the top 10 counts, and the average of the top 20 counts were statistically analyzed.
The training data were the same as those used for neural network fitting. Firstly, linear fitting and cubic polynomial fitting were carried out for these three statistics, and the results are shown in Figure 18. The upper left corner of each small graph shows the fitting formula and the coefficient of determination R², which reflects the overall accuracy of the model, that is, the goodness of fit. The closer its value is to 1, the more accurately the model reflects the changes in the observed data and the better the reliability of the data. It can be seen from the figure that the R² of cubic polynomial fitting is greater than that of the corresponding linear fitting, and the fitting results for the average of the top 10 fish counts are better in both fitting methods than those for the maximum detected quantity and the average of the top 20 counts. When polynomial fitting was performed on the average of the top 10 counts, the fitting results of the quadratic, cubic, and quartic polynomials were compared, as shown in Figure 19. The best fitting result was that of the quartic polynomial, with R² = 0.8387, but its degree is too high: it fits the sample data better, but performance on test data declines, that is, over-fitting occurs. Generally, the degree should not exceed three when polynomial fitting is used. To sum up, the cubic polynomial was selected to fit the fish quantity in the comparison test, and the equation is shown in Formula (8):

y = −0.0262x³ + 1.3648x² − 18.273x + 91.322 (8)

where x is the average of the top 10 detected fish counts, y is the estimated fish quantity in the cage, and the R² of the equation is 0.8135, which can be understood as a theoretical data-fitting accuracy of about 81.35%. Formula (8) was used to estimate the average of the top 10 fish counts of the three tests in each group of test data, and the fitting results are shown in Table 5. It can be seen that the average accuracy is 83.58%, which is lower than the 84.63% achieved by neural network fitting.
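Formula (8) can be evaluated directly; a sketch with the fitted coefficients, where the input value is illustrative:

```python
def estimate_fish_quantity(x):
    """Evaluate the cubic fit of Formula (8), where x is the average of
    the top 10 detected fish counts (coefficients from the fitted model)."""
    return -0.0262 * x**3 + 1.3648 * x**2 - 18.273 * x + 91.322

print(round(estimate_fish_quantity(10.0), 3))  # 18.872
```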

Comparison of the Fish Quantity Estimation Methods
The traditional methods of obtaining the fish abundance in cages are the mark-recapture method, fish finder measurement, annular underwater acoustic multi-beam detection, and others [32]. Because comparison experiments cannot be conducted in the same environment, the instruments and equipment used in the various methods differ. At present, there are few reports on estimation algorithms and estimation accuracy for fish abundance in cages. In this paper, a comparison table of the different estimation methods was compiled from previous studies, as shown in Table 6.

Methods | Equipment Used | Precision | Advantages | Disadvantages
Mark-recapture method [33] | Fishing net, stain | Large discrete interval | No electronic equipment is needed | Low precision; time-consuming and laborious; affects the growth of fish
Fish finder measurement [34] | Fish detector | About 50% | Low equipment cost | Low accuracy; sometimes vast errors at high fish density
Annular underwater acoustic multi-beam detection [35] | Annular multi-beam detector | 60-70% | Wide detection angle, high precision | Expensive equipment, difficult layout
The method in this study | Image sonar | About 84% | High precision, automatic measurement, simple layout | Expensive equipment

It can be seen from Table 6 that the forward-looking image sonar used in the method presented in this paper is more expensive than the equipment used in the previous methods, with an average purchase price of about USD 30,000, but its layout is relatively simple. After the sonar is installed, the data can be acquired and processed automatically, and the estimated fish abundance in the cage can be obtained without additional manual intervention. Compared to the traditional methods, the accuracy of this method is significantly improved, reaching about 84%.
As a high-definition imaging sonar, the ARIS1800 is widely used in fisheries. Both in China and abroad, imaging sonar has mainly been applied to the study of fish behavior rather than to the assessment of fish quantity. This paper therefore makes a meaningful attempt to evaluate the quantity of fish in cages by using the imaging characteristics of the ARIS1800, based on fixed detection and prediction methods. The ARIS1800 can display the size, shape, and position of fish in the cage in high-definition images, eliminating the limitation of traditional fish finders, which can only assess fish quantity by target strength, and achieving higher credibility.

Error Analysis
The estimation of fish quantity in cages is a major challenge in fisheries acoustics research and is influenced by various factors: complicated and changeable ocean conditions, such as wind, waves, and tidal currents in the aquaculture area; interference from feeding, sailing vessels, and other activities; the transducer reverberation blind zone and strong seafloor reflection; occlusion of the detection beam by fish in dense schools; fish swimming close to the netting, which makes the fish echo and the net echo overlap and difficult to distinguish; and repeated detection of swimming fish [36]. These uncontrollable factors make the data collected by the sonar at different times inconsistent; in turn, the trained neural network model has unavoidable errors, which affect the final estimated quantity of fish.
Figure 20 is a histogram of the estimates and errors of the fitting test in Table 4. The error of the neural network estimation in groups 1-10 is relatively small, while the predictions in groups 11 and 12 are affected by large wave fluctuations. By observing the cage in different periods, the average of multiple groups of data can be used to reduce the error. Given the measurement results for 80 fish, three additional observations at different times were selected in this paper and combined with the three data points from the previous fitting test to form a new test dataset. The fish quantity was estimated with this method, and the results are shown in Table 7. Table 7 reveals fluctuations in the data measured and predicted in different periods, but only the two groups with serial numbers 2 and 3 have a deviation of 20; the other groups have errors of less than 10. The average of the six groups' predictions was 69.31, the error was −10.69, and the average accuracy was 85.67%, an improvement of 5.83 percentage points over the 79.84% average accuracy obtained using data from a single period only.
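The multi-period averaging step can be sketched as below. The six per-period estimates are illustrative numbers chosen only to reproduce the reported mean of 69.31 for the 80-fish cage, not the paper's raw data, and the accuracy convention (one minus relative error against the true count) is an assumption.

```python
def averaged_estimate(period_estimates):
    """Average per-period fitted quantities to damp period-specific noise."""
    return sum(period_estimates) / len(period_estimates)

# Six illustrative per-period estimates for the 80-fish cage:
estimates = [74.0, 60.0, 60.0, 75.0, 73.0, 73.86]
true_quantity = 80

mean_est = averaged_estimate(estimates)                 # about 69.31
error = mean_est - true_quantity                        # about -10.69
accuracy = 100.0 * (1.0 - abs(error) / true_quantity)   # assumed convention
```

Averaging across periods damps the single-period fluctuations (e.g., those caused by wave motion) that dominate the error of any one observation window.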

Conclusions
This paper proposes a method for estimating the quantity of fish in net cages based on forward-looking imaging sonar. The method first investigates the YOLO neural network model and improves it for underwater fish identification. An attention mechanism is introduced into the YOLO model, allocating more computing power to small targets and thereby enhancing performance on regions of interest, especially small targets. Ablation experiments show that adding the CSAM module improves the accuracy of the detection algorithm by 3.81 percentage points, and compared to YOLOv5, the improved algorithm increases mAP50 by 8.24 percentage points. Subsequently, quantitative detection experiments with 80 oval damselfish were conducted in the constructed fish cage. Because of the sonar's limited viewing angle, the sonar was deployed on one side of the net cage and observed continuously to obtain video images. The improved and trained YOLOv8 model was used to detect fish shoals in the sonar images. The detection counts were sorted in descending order, and the quantity of fish in the net cage was estimated from the top 20 maximum counts using a trained BP neural network. Multiple experimental results show that the average accuracy of the fish quantity assessment reaches 84.63%, validating the feasibility of this method.
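The estimation pipeline summarized above (sort per-frame detection counts, keep the top 20, regress to a total quantity) can be sketched as follows. This is a minimal illustration with untrained, randomly initialized weights, not the paper's trained BP network; all names, layer sizes, and the frame-count simulation are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k_counts(frame_counts, k=20):
    """Sort per-frame YOLO detection counts descending and keep the top k."""
    return np.array(sorted(frame_counts, reverse=True)[:k], dtype=float)

def bp_forward(x, w1, b1, w2, b2):
    """One hidden tanh layer with a scalar output: a minimal BP-network forward pass."""
    h = np.tanh(x @ w1 + b1)
    return float(h @ w2 + b2)

# Untrained illustrative weights; the paper's network is fitted on the
# quantitative experiment's data before being used for prediction.
w1 = rng.normal(scale=0.1, size=(20, 8))
b1 = np.zeros(8)
w2 = rng.normal(scale=0.1, size=8)
b2 = 0.0

frame_counts = rng.integers(0, 25, size=300)  # simulated detections per sonar frame
features = top_k_counts(frame_counts)         # top-20 counts, descending
estimate = bp_forward(features, w1, b1, w2, b2)
```

Using the top-20 counts rather than a single maximum makes the input feature robust to occasional over- or under-detections in individual frames, which is the rationale the paper gives for sorting the detection results.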
Through research on detection methods, target identification, and quantity inversion for fish in net cages, a new method for estimating the fish quantity in net cage farming based on imaging sonar has been developed. This method achieves a high-precision assessment of fish quantity in net cage farming and provides technical support for the development of intelligent net cage equipment in China.
Nevertheless, this method still has the following problems, which need to be addressed in future research:
1. In the fish target recognition stage, the image background is not removed in advance, and the netting in the background fluctuates with the waves. In some cases, fish swim against the netting, and the two are mixed in the sonar image, which affects the recognition performance of the YOLO model and makes the recognized quantity fluctuate [37].
2. The YOLO target detection model and the neural network prediction model used in this method are highly dependent on training data. Therefore, quantitative fish data collection should be carried out under the same cage size and sonar layout before practical application; both models can be applied to fish quantity prediction only after learning from the collected data. Simplifying the model training process and producing general datasets require further in-depth research.
3. The quantitative experiment in this paper was carried out on a small fishing raft, and application to large deep-sea cages is planned in the future. As the cage scale and fish quantity increase, the fish density will rise markedly, and more fish will overlap and occlude one another. In theory, if the occlusion during training data collection is roughly the same as during estimation, the neural network will fit the total quantity relatively accurately; however, whether the actual prediction can match the precision of the small-scale quantitative experiment remains to be tested.

Figure 3. Satellite image of the experimental sea area.
Figure 4. Aerial image of the experimental base.
Figure 6. Schematic diagram of the sonar deployment.
Figure 7. Sonar images of different groups of fish. (a) Twenty fish; (b) forty fish; (c) sixty fish; and (d) eighty fish.
Figure 13. Topology of the neural network.
Figure 14. Statistical diagram of the maximum quantity detected.
Figure 15. The sample mean square error.
Figure 16. Prediction error of the training group and the test group.
Figure 17. Regression results of the BP neural network model.
Figure 18. Comparison of the data fitting results. (a) Linear fitting; (b) cubic polynomial fitting.
Figure 19. Comparison of the fitting results of higher-order polynomials. (a) Quadratic polynomial; (b) cubic polynomial; and (c) quartic polynomial.
Figure 20. Estimation and error bar chart of the fitting test.
Table 2. Comparative results of the ablation experiments.
Table 3. Comparative experimental results of the different detection algorithms.
Table 4. The statistical results of the method in this paper on the test dataset.
Table 5. The statistical results of high-order polynomial fitting on the test dataset.
Table 6. Comparison of the different estimation methods.
Table 7. The statistical results of the method in this paper across multiple time periods of datasets.

administration, X.H.; funding acquisition, X.H. and Y.H.All authors have read and agreed to the published version of the manuscript.