Underwater Fish Body Length Estimation Based on Binocular Image Processing

Abstract: Recently, underwater information analysis technology has developed rapidly, which is beneficial to underwater resource exploration, underwater aquaculture, etc. Dangerous and laborious manual work is being replaced by deep learning-based computer vision technology, which has gradually become the mainstream. Binocular camera-based visual analysis methods can not only collect seabed images but also construct 3D scene information. The parallax of the binocular image pair is used to calculate the depth information of the underwater object. In this paper, a binocular camera-based refined analysis method for underwater creature body length estimation is constructed. A fully convolutional network (FCN) is used to segment the corresponding underwater object in the image and obtain the object position. A fish body direction estimation algorithm is proposed based on the segmentation image. The semi-global block matching (SGBM) algorithm is used to calculate the depth of the object region and to estimate the object body length from the left and right views of the object. The combination of FCN and SGBM gives the algorithm certain advantages in time and accuracy for analyzing objects of interest. Experimental results show that this method effectively reduces unnecessary information and improves efficiency and accuracy compared with the original SGBM algorithm.


Introduction
Ocean exploration and underwater information analysis play a major role in preventing marine disasters, protecting the ocean's ecological environment, and developing and utilizing ocean resources [1]. At present, ocean exploration technology can realize underwater creature detection, ocean aquaculture, ocean monitoring and observation, etc. Aquaculture is an important means for humans to directly utilize ocean resources [2,3]. Aquaculture has made considerable progress, but many problems remain: around-the-clock monitoring of stock growth and water quality is costly yet inefficient. Fish are mainly fed automatically, so it is easy to feed them unsuitably and affect their growth, or to feed too much, wasting fodder and polluting the water. Additionally, the growth status of the creatures is unknown and is mainly judged empirically. Given these problems and the goal of sustainable mariculture, refined analysis and research on mariculture is particularly necessary.
Underwater creature detection is a basic task of underwater information analysis. With the development of deep neural networks, the capability of deep learning-based ocean object detection is increasing. Rafael et al. [4] proposed an image-based individual fish detection method, wherein a Mask Region-based Convolutional Neural Network (Mask R-CNN) was employed.

Camera Calibration
Camera calibration is an important step for binocular vision-based depth estimation; it determines whether the machine vision system can effectively identify, locate, and compute the depth of the object. Zhang's calibration method [16] is adopted: checkerboard images taken by the camera are used as the reference object, the mapping from the three-dimensional world to the imaging plane is established through digital image processing and spatial arithmetic operations, and the internal and external parameter matrices of the camera are then obtained to perform distortion correction on the collected images. The world coordinate system is (X_W, Y_W, Z_W), the camera coordinate system is (X_C, Y_C, Z_C), the image coordinate system is (x, y), and the pixel coordinate system is (u, v). The mapping among the coordinate systems is shown in Figure 2 and expressed as Equations (1)-(3).
where d_x and d_y represent the proportionality coefficients between image coordinates and pixels, (u_0, v_0) is the center pixel coordinate of the image, f is the focal length of the camera, R is a 3 × 3 rotation matrix, and T is a 3 × 1 translation vector. According to the above formulas, the mapping between the world coordinate system and the pixel coordinate system is described as Equation (4).

Figure 1. Flowchart of object body length analysis.
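As a concrete illustration of this world-to-pixel mapping, the following NumPy sketch projects a world point through the chain of Equations (1)-(4). The focal length, principal point, and camera pose below are made-up illustration values, not the calibrated parameters of this system:

```python
import numpy as np

# Hypothetical intrinsics, chosen for illustration only.
f = 800.0               # focal length in pixel units (f/d_x = f/d_y assumed)
u0, v0 = 320.0, 240.0   # principal point (center pixel coordinate)

# Internal parameter matrix (zero skew assumed).
K = np.array([[f, 0.0, u0],
              [0.0, f, v0],
              [0.0, 0.0, 1.0]])

# External parameters: identity rotation R and a translation T along Z.
R = np.eye(3)
T = np.array([[0.0], [0.0], [5.0]])   # camera 5 units in front of the origin

def project(Pw):
    """Map a world point (X_W, Y_W, Z_W) to pixel coordinates (u, v)."""
    Pc = R @ Pw.reshape(3, 1) + T      # world -> camera coordinates
    uvw = K @ Pc                       # camera -> homogeneous pixel coords
    return uvw[:2, 0] / uvw[2, 0]      # divide by depth Z_C

u, v = project(np.array([0.0, 0.0, 0.0]))
# The world origin lies on the optical axis, so it projects to (u0, v0).
```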

The internal and external parameters of the camera can be calculated through this mapping. Assuming that the chessboard plane lies at Z_W = 0 in Zhang's calibration method, the above formulas can be rewritten as Equations (5) and (6).
A is the camera internal parameter matrix, and s is a scale factor.
H is defined as the combination of the internal parameter matrix and the external parameter matrix. H is a 3 × 3 homogeneous matrix defined only up to scale; therefore, there are 8 unknown elements to be solved. There are 5 unknowns in A, so at least 3 different checkerboard pictures are required to solve for them.
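Because H has 8 unknowns up to scale, it can be recovered from at least four point correspondences per view by the standard direct linear transformation (DLT). The sketch below is illustrative: it uses a synthetic homography and noise-free correspondences rather than real calibration data:

```python
import numpy as np

def estimate_homography(src, dst):
    """Estimate the 3x3 homography H (8 DOF, fixed scale) mapping
    src -> dst from >= 4 point correspondences via the DLT method."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # H is the null vector of A (singular vector of the smallest value).
    _, _, Vt = np.linalg.svd(np.array(A))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]   # normalize so H[2,2] = 1, leaving 8 unknowns

def apply_h(H, p):
    """Apply a homography to a 2D point in homogeneous form."""
    q = H @ np.array([p[0], p[1], 1.0])
    return (q[0] / q[2], q[1] / q[2])

# Synthetic check: points related by a known homography are recovered.
H_true = np.array([[1.2, 0.1, 5.0],
                   [0.0, 0.9, -3.0],
                   [0.001, 0.0, 1.0]])
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 3.0)]
dst = [apply_h(H_true, p) for p in src]
H_est = estimate_homography(src, dst)
```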

Fully Convolutional Network
Object segmentation finds the specific contour information of an object through a segmentation algorithm. Segmentation algorithms include traditional methods based on thresholding [17] and edge detection [18], as well as popular deep learning-based methods such as Mask R-CNN [19], FCN [20], and ResNet [21]. In this paper, an underwater object data set is used to train a fully convolutional network (FCN) to accurately distinguish fish in the image. The FCN was proposed by Long et al. in 2015; it is a semantic image segmentation network that classifies every pixel of the picture. The FCN replaces the fully connected layers of a convolutional neural network with convolutional layers, which has two advantages. First, the size of a fully connected layer is fixed, which forces the input image size to be fixed, whereas a convolutional layer does not limit the input image size. Second, the output of a fully connected layer is a single value classifying the whole image, whereas the output of a convolutional layer is a feature map, so all pixels in the image can be classified after upsampling.
The main body of the FCN alternately stacks convolutional and pooling layers, which continuously process the picture and extract features. Generally, it is composed of several parts, each including several convolutional layers with the same kernel size followed by a pooling layer. The convolution layer slides a k × k convolution kernel over the feature map, multiplying corresponding elements. The convolution calculation and the image size after convolution are expressed as Equations (9) and (10).
where a_{i,j} is the element value of the output feature map after convolution, x is the input value of the convolution layer, and w_{m,n} are the parameters of the convolution kernel, also called weights. w_b is the bias term. W_1 is the size of the original image, and S is the step size (stride), which represents the number of interval elements. P is the padding size, which means P layers of zero elements are added around the input image. F is the size of the convolution kernel, and W_2 is the feature map size after convolution. The pooling layer selects the most representative features in the feature map to reduce the number of parameters. There are generally two pooling methods, maximum pooling and average pooling. Maximum pooling divides the feature map into multiple regions of the same size and selects the maximum value of each region to form a new feature map. At the end of each convolution, an activation function is used to remove negatively correlated features and ensure that the features are related to the final goal. The Rectified Linear Unit (ReLU) activation function is adopted, as Equation (11).
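The output-size formula of Equation (10) and the ReLU of Equation (11) can be sketched directly; the 224-pixel input size below is only an illustrative value:

```python
import numpy as np

def conv_output_size(W1, F, S=1, P=0):
    """Feature-map side length after convolution (Equation (10)):
    W2 = (W1 - F + 2P) / S + 1."""
    return (W1 - F + 2 * P) // S + 1

def relu(x):
    """Rectified Linear Unit (Equation (11)): max(0, x)."""
    return np.maximum(0, x)

# A 224x224 input with a 3x3 kernel, stride 1, padding 1 keeps its size,
# while a 2x2 max-pooling window with stride 2 halves it.
same = conv_output_size(224, F=3, S=1, P=1)
half = conv_output_size(224, F=2, S=2, P=0)
```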
One of the important characteristics of FCN is transposed convolution, which is also called upsampling. The function of transposed convolution is to expand the convolution image to the size of the original image without restoring the original value. The calculation method of transposed convolution is similar to convolution. The image size after transposed convolution is described in Equation (12).
where W_1, P, F, and W_2 have the same meaning as in the convolution operation and S is the step size, which here means that S − 1 zero elements are inserted between adjacent elements of the input. Transposed convolution can be seen as enlarging the feature map and then performing an ordinary convolution. As shown in Figure 3, a 3 × 3 feature map is expanded into a 5 × 5 feature map through a 3 × 3 convolution kernel after internal filling.
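The size formula of Equation (12) reproduces the 3 × 3 → 5 × 5 example from Figure 3:

```python
def transposed_conv_output_size(W1, F, S=1, P=0):
    """Feature-map side length after transposed convolution (Equation (12)):
    W2 = S * (W1 - 1) + F - 2P."""
    return S * (W1 - 1) + F - 2 * P

# The Figure 3 case: a 3x3 map with a 3x3 kernel, stride 1, no padding,
# expands to 5x5, which an ordinary 3x3 convolution shrinks back to 3x3.
expanded = transposed_conv_output_size(3, F=3, S=1, P=0)
```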
The baseline of the FCN is VGG-16, as shown in Figure 4. There are 7 convolutional layers and 5 pooling layers in the FCN, where the blue block represents a convolutional layer, the yellow block a pooling layer, the green block a feature fusion layer that sums corresponding elements of feature maps with the same dimensions, and the orange block a transposed convolution layer. The feature map is generated after a series of convolution and pooling operations on the input image. The skip structure also plays an important role in the FCN. The feature map of the pool4 layer is merged with the feature map of the pool3 layer to increase the details of the image. Finally, a transposed convolution is used to expand the image to the size of the original image. A softmax is applied to determine the probability of a pixel belonging to a certain class. The feature maps of the third and fourth pooling layers are sequentially added to conv7's feature map to take into account both local and global information.
The 1 × 1 convolution kernel is used to change the number of feature map channels. The feature map after the 7th convolution layer has the same dimensions as the feature map of the 4th pooling layer after the first transposed convolution, and their channel numbers are adjusted to the number of categories. The fusion layer merges the two feature maps by adding elements at corresponding positions. The same operation is adopted to merge the feature map after the third pooling layer.
The feature map is expanded to the size of the original image after the third transposed convolution.
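The skip fusion described above can be sketched with NumPy, using nearest-neighbour upsampling as a simplified stand-in for the learned stride-2 transposed convolution (the feature-map sizes below are toy values, not the network's real dimensions):

```python
import numpy as np

def upsample2(feat):
    """Nearest-neighbour stand-in for a stride-2 transposed convolution:
    each element is repeated 2x2, doubling both spatial dimensions."""
    return feat.repeat(2, axis=0).repeat(2, axis=1)

def skip_fuse(deep, shallow):
    """FCN skip structure: upsample the deeper (coarser) feature map and
    add it elementwise to a shallower map of matching size."""
    up = upsample2(deep)
    assert up.shape == shallow.shape
    return up + shallow

conv7 = np.ones((7, 7))      # coarse map after the 7th convolution (toy)
pool4 = np.ones((14, 14))    # finer map from the 4th pooling layer (toy)
fused = skip_fuse(conv7, pool4)
```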
The segmentation map is used to determine the object region. The region where the object is located is selected, and the irrelevant information is eliminated. The coordinates of the selected regions in the left and right views are used for the subsequent stereo matching algorithm.
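A minimal sketch of selecting the object region from a binary segmentation map (the mask below is synthetic; the bounding box it yields is what would be passed on to stereo matching):

```python
import numpy as np

def object_region(mask):
    """Bounding box (top, bottom, left, right) of the object pixels
    (nonzero entries) in a binary segmentation map."""
    rows = np.any(mask, axis=1)
    cols = np.any(mask, axis=0)
    top, bottom = np.where(rows)[0][[0, -1]]
    left, right = np.where(cols)[0][[0, -1]]
    return top, bottom, left, right

mask = np.zeros((10, 10), dtype=np.uint8)
mask[2:5, 3:8] = 1            # toy object occupying rows 2-4, columns 3-7
t, b, l, r = object_region(mask)
roi = mask[t:b + 1, l:r + 1]  # region handed to the stereo matcher
```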

Depth Prediction
Binocular vision calculates depth from the position difference of the same object captured by two cameras, based on the parallax principle, and is mostly used in three-dimensional reconstruction. Stereo matching is the key technology for finding the pixel pairs with the highest similarity between the two views. The three-dimensional coordinate system is shown in Figure 5: O_l and O_r represent the positions of the left and right cameras, f is the camera focal length, B is the baseline between the lenses, d represents the parallax (disparity), Z is the desired distance, and P is the pixel of the image. According to the principle of similar triangles, the depth calculation is expressed as Equation (13):

Z = f B / d. (13)
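Equation (13) can be applied directly; the focal length and baseline below are hypothetical illustration values, not the calibrated parameters of this system:

```python
def depth_from_disparity(f, B, d):
    """Equation (13): Z = f * B / d, from similar triangles.
    f: focal length in pixels, B: baseline, d: disparity in pixels."""
    if d <= 0:
        raise ValueError("disparity must be positive")
    return f * B / d

# With an assumed 800-pixel focal length and 60 mm baseline,
# a 40-pixel disparity corresponds to a depth of 1200 mm.
Z = depth_from_disparity(f=800.0, B=60.0, d=40.0)
```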

The SGBM [22] algorithm is used to estimate the object depth. SGBM is a semi-global matching method that uses mutual information for pixel matching and approximates a global two-dimensional smoothness constraint. A global energy function over the disparity map is formed from the disparities of all pixels, and the optimal disparity of each pixel is calculated by minimizing this energy function. The energy function is expressed as Equation (14):

E(D) = Σ_p ( C(p, D_p) + Σ_{q∈N_p} P_1 · I[|D_p − D_q| = 1] + Σ_{q∈N_p} P_2 · I[|D_p − D_q| > 1] ) (14)
where D refers to the disparity map and E(D) is the energy function corresponding to the disparity map; p and q represent pixels in the image; N_p refers to the neighborhood of pixel p, usually the 8 adjacent pixels around p. The first term, C(p, D_p), refers to the sum of the costs of all matching pixels when the current disparity map is D. The second term adds a constant P_1 for all pixels q in N_p for which the disparity changes by 1 pixel; this term adapts to slightly inclined and curved surfaces. The third term adds a larger constant P_2 for pixels whose disparity changes by more than 1 pixel; this term preserves the edge information of the image. It must always be ensured that P_1 ≤ P_2. I[·] is an indicator function that returns 1 if its argument is true and 0 otherwise. There are also parameters such as the initial disparity and the disparity range, which can be adjusted through the segmentation map: the object position difference between the left and right segmentation images can be used as a pre-parallax to constrain the maximum disparity range in the SGBM algorithm. Finding the optimal solution of this function over a two-dimensional image is an NP-complete problem. Therefore, the problem is approximately decomposed into multiple one-dimensional problems that can be solved by dynamic programming; in the algorithm, it is decomposed into 8 one-dimensional problems because each pixel has 8 neighboring directions.
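The energy of Equation (14) can be evaluated explicitly for a toy disparity map; the penalty values P1 and P2 below are illustrative defaults, not the ones used in the experiments:

```python
import numpy as np

def sgm_energy(D, cost, P1=8, P2=32):
    """Evaluate the energy of Equation (14) for a given disparity map D.
    cost[i, j] is the matching cost C(p, D_p) already looked up for D;
    N_p is taken as the 8 surrounding pixels, and P1 <= P2 must hold."""
    assert P1 <= P2
    H, W = D.shape
    E = cost.sum()
    for i in range(H):
        for j in range(W):
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    if di == 0 and dj == 0:
                        continue
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W:
                        diff = abs(int(D[i, j]) - int(D[ni, nj]))
                        if diff == 1:
                            E += P1       # small slant: penalty P1
                        elif diff > 1:
                            E += P2       # depth edge: larger penalty P2
    return E

D = np.array([[5, 5], [5, 6]])          # toy 2x2 disparity map
cost = np.zeros_like(D, dtype=float)    # zero matching cost assumed
E = sgm_energy(D, cost)
```

Minimizing this energy over all candidate disparity maps (rather than merely evaluating it, as here) is what the 8-path dynamic-programming approximation performs.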

Estimation of Fish Body Length
The object body length is calculated from the coordinates of the object's head and tail pixels. The head and tail pixel positions are found through the segmentation map, and the depth is calculated from the disparity map. The corresponding relationship is shown in Figure 6. The fish body length estimation is described as Equation (15):

d_r = d_i × Z_t / f (15)

where d_r is the fish body length in the real world, f is the focal length, d_i is the fish body length in the image, and Z_t is the depth from the camera to the object.
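Equation (15) is a direct similar-triangles conversion; the numbers below are hypothetical illustration values:

```python
def fish_body_length(d_i, Z_t, f):
    """Equation (15): d_r = d_i * Z_t / f, from similar triangles.
    d_i: body length in the image (pixels), Z_t: depth to the object,
    f: focal length in pixels."""
    return d_i * Z_t / f

# An assumed 200-pixel fish at 1200 mm depth with an 800-pixel focal
# length gives an estimated real-world body length of 300 mm.
length = fish_body_length(d_i=200.0, Z_t=1200.0, f=800.0)
```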

As shown in Figure 7, the object region is extracted based on the edge pixels of the segmentation region. Finding the head and tail of the object is a vital step in estimating the body length. First, the direction of the object is determined. The object orientation is divided into 4 modes: upper-right, upper-left, left-right, and up-down.

The bounding box is adjusted to a square and then divided into 4 areas of the same size by the centerlines. The number of pixels occupied by the object in each of the 4 areas is counted, as in Equation (16). When the sum of pixels in regions 1 and 4 is greater than the sum of pixels in regions 2 and 3, the direction of the fish is upper-left, which is expressed as l_t = 1 in Equation (17). On the contrary, the direction is upper-right, r_t = 1, as in Equation (18).
When the two sums are not much different, as shown in Figure 8, the longer side of the bounding rectangle decides: if the vertical side is longer, the direction of the object is up-down, as in Equation (19); if the horizontal side is longer, the direction is left-right, as in Equation (20).
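The quadrant-counting rule above can be sketched as follows. The threshold used for "not much different" is an assumed illustrative value, since the paper does not state one here:

```python
import numpy as np

def fish_direction(mask, box_h, box_w, tol=0.1):
    """Classify the object orientation from a square-adjusted binary mask.
    Regions are numbered 1 (top-left), 2 (top-right), 3 (bottom-left),
    4 (bottom-right); box_h/box_w are the original bounding-box sides.
    tol is an assumed threshold for 'not much different'."""
    m = mask.astype(bool)
    h, w = m.shape
    n1 = m[:h // 2, :w // 2].sum()   # region 1
    n2 = m[:h // 2, w // 2:].sum()   # region 2
    n3 = m[h // 2:, :w // 2].sum()   # region 3
    n4 = m[h // 2:, w // 2:].sum()   # region 4
    total = m.sum()
    if abs(int(n1 + n4) - int(n2 + n3)) > tol * total:
        # Diagonal comparison, Equations (16)-(18).
        return "upper-left" if n1 + n4 > n2 + n3 else "upper-right"
    # Otherwise fall back to the longer box side, Equations (19)-(20).
    return "up-down" if box_h > box_w else "left-right"

# Toy diagonal fish occupying regions 1 and 4.
mask = np.zeros((8, 8), dtype=np.uint8)
mask[0:4, 0:4] = 1
mask[4:8, 4:8] = 1
direction = fish_direction(mask, box_h=8, box_w=8)
```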
Then the head and tail points can be found. As shown in Figure 9, for a left-right oriented object, the midpoint pixel of each end is selected as the head and tail pixel; the same applies to the up-down orientation. For an upper-left object, the leftmost pixel on the top side of the object and the uppermost pixel on the left side of the object are selected, and a rectangular region is defined through these two points. The coordinate of the head point is the mean of the coordinates of all pixels belonging to the object in this region, as in Equation (21). The coordinate of the tail point is found through a similar operation on the lower-right region, as in Equation (22). The same applies to an upper-right object.

head = avg(sum(i, j)), (i, j) ∈ white_pixel_t (21)
tail = avg(sum(i, j)), (i, j) ∈ white_pixel_u (22)
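The averaging in Equations (21) and (22) can be sketched as below; the corner regions and the mask are toy assumptions for illustration:

```python
import numpy as np

def corner_point(mask, region_rows, region_cols):
    """Mean coordinate of the object pixels inside a corner region,
    as in Equations (21)-(22): avg of (i, j) over white pixels."""
    sub = mask[region_rows, region_cols]
    ii, jj = np.nonzero(sub)
    i0, j0 = region_rows.start, region_cols.start
    return (ii.mean() + i0, jj.mean() + j0)

mask = np.zeros((10, 10), dtype=np.uint8)
mask[2:5, 2:5] = 1      # toy "head" blob in the upper-left corner
mask[6:9, 6:9] = 1      # toy "tail" blob in the lower-right corner
head = corner_point(mask, slice(0, 5), slice(0, 5))
tail = corner_point(mask, slice(5, 10), slice(5, 10))
```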

Experiments
There are few binocular pictures of underwater creatures in public data sets, so we constructed data sets with binocular cameras. The binocular images were extracted from the video stream, and the LabelMe software was used to generate labels for the images. There are 200 pairs of left and right images; this part of the data was used for the FCN and the subsequent SGBM algorithm. This amount of data cannot meet the requirements of deep neural network training and easily leads to overfitting. Therefore, the underwater fish data set from the fish4knowledge project was also adopted for deep neural network training. The fish4knowledge data set is divided into 23 clusters with a total of 27,370 pictures; each cluster is represented by a representative species, based on the morphological characteristics of the monophyletic taxa. The experimental environment was Linux 16.04, Python 3.5, MATLAB 2019a, a TitanX GPU (12 GB), and OpenCV 4.1.

Camera Calibration
Camera calibration was used to correct the distortion of the images captured by the camera. In the calibration process, 20 chessboard pictures were captured. These images were calibrated through the Stereo Camera Calibrator toolbox in MATLAB 2019a. The calibration board and the result are shown in Figures 10 and 11.


Figure 10. Chessboard image.

Figure 11. Calibration results.
The distortion matrices of the left and right cameras are expressed as Equations (25) and (26). The rotation and translation matrices between the cameras are expressed as Equations (27).

Fish Segment Based on FCN
The FCN was trained on the fish4knowledge data set, and the performance of the trained model M1 determines whether it can be applied directly to segmenting the images of a practical binocular camera. The results of the trained model M1 are shown in Figures 12 and 13. Figure 12 shows the validation of the model on the fish4knowledge testing data set; the fish are segmented correctly. The model works well on the fish4knowledge data but not on the practical data, as shown in Figure 13: the fish from the practical data set cannot be segmented. The reason is that the background, tone, and shape of the fish differ between the two data sets, which makes the model parameters unsuitable. Therefore, the self-made fish data set was added to the fish4knowledge data set for training; including the practical data enhances the generalization ability of the model.
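The segmentation mask produced by the FCN is what later restricts stereo matching to the object region. A minimal numpy sketch of extracting that region's bounding box follows; the helper name `mask_bbox` is ours, not from the paper.

```python
import numpy as np

def mask_bbox(mask):
    """Bounding box (r0, r1, c0, c1) of the nonzero region of a binary mask,
    with half-open row/column ranges [r0, r1) and [c0, c1)."""
    rows = np.any(mask, axis=1)
    cols = np.any(mask, axis=0)
    if not rows.any():
        return None                    # nothing segmented in this frame
    r0, r1 = np.where(rows)[0][[0, -1]]
    c0, c1 = np.where(cols)[0][[0, -1]]
    return r0, r1 + 1, c0, c1 + 1

mask = np.zeros((6, 8), dtype=bool)
mask[2:4, 3:6] = True                  # a small "fish" blob
bbox = mask_bbox(mask)                 # (2, 4, 3, 6)
```

Cropping the left and right views to this box (plus a margin of at most the maximum disparity) is one way to limit the SGBM computation to the object region.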


The training data set was constructed from the fish4knowledge data set and the self-made data set; a total of 27,570 pictures were divided into training, validation, and test sets with a ratio of 8:1:1. The loss function of the model was the softmax cross-entropy, the learning rate was 1e-5, and the number of training iterations was limited to 15,000 to prevent over-fitting. The segmentation results for pictures captured by the camera are shown in Figure 14.
The loss changes during training are shown in Figures 15 and 16. The overall training process converges, but errors increase sharply at certain stages. This is because the difference between our two data sets is large: a change in part of the weights during training may increase the segmentation accuracy on part of the images while sharply decreasing it on another part. The test result is shown in Figure 16, whose curves represent accuracy, cross-entropy, and weight loss, respectively.

Figure 16. Accuracy, cross-entropy, and loss diagram.
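The 8:1:1 split described above can be sketched with the standard library; the helper name `split_811` and the fixed seed are our choices for reproducibility, not details from the paper.

```python
import random

def split_811(items, seed=0):
    """Shuffle and split a sequence into train/val/test with an 8:1:1 ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * 0.8)
    n_val = int(n * 0.1)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # remainder goes to the test set
    return train, val, test

# 27,570 pictures -> 22,056 / 2,757 / 2,757
train, val, test = split_811(range(27570))
```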

Depth Prediction
The difference of the object position in the left and right images is used as a maximum disparity constraint in the stereo matching algorithm. The penalty parameters were set to P1 = 600 and P2 = 2400 after multiple experiments, so that the disparity map balances smoothness within regions of the same type against differences between regions of different types. The sliding window size was set to 7. Several different filters were applied to preprocess the image and highlight meaningful features. Each pixel value in the disparity map represents the disparity of that pixel, and the depth map can be calculated using Formula (2).
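The disparity-to-depth conversion uses the standard pinhole-stereo relation Z = f·b/d, consistent with Formula (2). The sketch below uses placeholder values for the focal length and baseline, not the calibrated parameters.

```python
import numpy as np

f = 800.0      # focal length in pixels (placeholder value)
b = 60.0       # camera baseline in mm (placeholder value)

def depth_from_disparity(disp):
    """Convert a disparity map to a depth map via Z = f * b / d."""
    disp = np.asarray(disp, dtype=float)
    z = np.full_like(disp, np.inf)     # zero disparity means no match
    valid = disp > 0
    z[valid] = f * b / disp[valid]
    return z

z = depth_from_disparity([[96.0, 0.0],
                          [80.0, 48.0]])
```

Larger disparities map to nearer depths, which is why the maximum disparity constraint from the segmented object positions bounds the search range.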
The comparisons of the disparity maps between the original image and the preprocessed images are shown in Figure 17, indicated with different colors. Eleven preprocessing methods were tested: Guided Image Filter [23], Bilateral Filter, Histogram Normalization, HE (histogram equalization), CLAHE (contrast-limited adaptive histogram equalization) [24], Mean Shift Filter, Median Filter, RGF (Rolling Guidance Filter) [25], Gamma Filter [26], Gaussian Filter, and Wavelet Transform. Among them, the Guided Image Filter, Mean Shift Filter, Median Filter, and Gaussian Filter perform better. In the general area of the object, the values of the disparity maps are similar, but only the Guided Image Filter separates the object from the background. Therefore, the Guided Image Filter was adopted for preprocessing.
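The SGBM settings reported in this section (P1 = 600, P2 = 2400, window size 7) map onto OpenCV's matcher roughly as follows. This is a configuration sketch; the disparity search range is an assumption, not a value given in the paper.

```python
import cv2

# SGBM matcher configured with the penalty terms and window size used above.
# P1, P2, and blockSize come from the text; the disparity range is assumed.
sgbm = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=64,        # must be a multiple of 16 (assumed range)
    blockSize=7,              # sliding window size from the experiments
    P1=600,                   # penalty on small disparity changes
    P2=2400,                  # penalty on large disparity changes
)
# OpenCV returns fixed-point disparities scaled by 16:
# disparity = sgbm.compute(left_gray, right_gray).astype(float) / 16.0
```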


Fish Body Length Estimation
The body length of a fish is estimated from the positions of its head and tail. After gathering the positions of these pixels, the body length can be estimated by Formula (3). The body length of the fish in the image is shown in Figure 18.
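The length computation amounts to the Euclidean distance between the 3D positions of the head and tail pixels, each back-projected with its depth. A sketch under assumed intrinsics follows; `fx`, `fy`, `cx`, and `cy` are placeholders, not the calibrated values.

```python
import math

fx = fy = 800.0          # focal lengths in pixels (placeholder values)
cx, cy = 320.0, 240.0    # principal point (placeholder values)

def backproject(u, v, z):
    """Pixel (u, v) at depth z -> 3D point in camera coordinates."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)

def body_length(head_px, tail_px):
    """Euclidean distance between the back-projected head and tail pixels."""
    p = backproject(*head_px)
    q = backproject(*tail_px)
    return math.dist(p, q)

# Head and tail at the same depth of 500 mm, 160 px apart horizontally:
length = body_length((240.0, 240.0, 500.0), (400.0, 240.0, 500.0))  # 100 mm
```

With these placeholder intrinsics the two pixels resolve to a 100 mm separation, matching the scale of the 10 cm fish used in the experiments.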

To evaluate the body length estimation performance of our model, the fish was measured at different depth ranges: 450-500 mm, 500-550 mm, 550-600 mm, and 600-650 mm. The results are shown in Table 1. The actual length of the fish is 10 cm. Table 1 lists the average estimated body length at each depth; the error is about 4-5%, and the estimation accuracy is highest at a depth of 550-600 mm. The statistics of the results at the different depths are drawn as box plots in Figure 19, which confirms that the results at a depth of 550-600 mm are the most accurate.
The original SGBM cannot detect the fish and therefore cannot estimate its body length. A comparison of the computation for the disparity map calculation was also carried out. SGBM must compute the depth of the whole image to generate the disparity map, whereas the proposed model performs object segmentation first and generates the disparity map only for the object region. A total of 20 samples were randomly selected, and each sample was tested 10 times to obtain a mean value. The final result is shown in Figure 20. For the SGBM algorithm, the time consumption for each pair of pictures is between 100 ms and 120 ms; the proposed algorithm estimates the body length of the fish in 90-130 ms. The fluctuation in time consumption is largely due to the varying size of the object region. According to the experiment, the time consumption of the two algorithms is comparable.
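The timing protocol above (10 runs per sample, averaged) can be reproduced with a small helper; `mean_runtime_ms` and the stand-in workload are illustrative, not the paper's code.

```python
import time

def mean_runtime_ms(fn, repeats=10):
    """Average wall-clock time of fn() over several runs, in milliseconds."""
    t0 = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - t0) / repeats * 1000.0

# Stand-in workload; in the experiments fn would run the matching pipeline
# on one stereo pair.
ms = mean_runtime_ms(lambda: sum(range(10000)))
```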

Conclusions
Aiming at the fine-grained analysis of underwater creatures, a body length estimation algorithm combining image segmentation and stereo matching was proposed. The FCN segmentation algorithm is used to find the specific object region in the image, and the SGBM algorithm then generates the disparity map of the segmented object region. The object's body length is calculated from the disparity map and the object position. The algorithm can be applied to underwater creature detection and reduces the amount of depth estimation computation; at the same time, the accuracy is improved by the maximum disparity constraint obtained beforehand. The optimization of the stereo matching algorithm and multi-object matching will be our future work to improve the accuracy, speed, and generalization ability of the algorithm.
