Object segmentation is a method to extract the specific contour information of an object through a segmentation algorithm. Segmentation algorithms include traditional methods based on thresholding [17] and edge detection [18], as well as popular deep learning-based methods such as Mask R-CNN [19], FCN [20] and ResNet [21]. In this paper, an underwater object data set is adopted to train a fully convolutional network (FCN) to accurately distinguish fish in the image. The FCN, proposed by Long et al. in 2015, is a semantic image segmentation network that classifies every pixel of the picture. The FCN replaces the fully connected layers of a convolutional neural network with convolutional layers, which has two advantages. First, the size of a fully connected layer is fixed, which forces the size of the input image to be fixed, whereas a convolutional layer places no limit on the input size. Second, the output of a fully connected layer is a single value that classifies the whole image, whereas the output of a convolutional layer is a feature map, so all pixels in the image can be classified after upsampling.
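The size-independence of convolutional layers can be illustrated with a minimal numpy sketch (the function name and sizes below are illustrative, not from the paper): the same 3 × 3 kernel processes inputs of any size, simply producing output maps of different sizes.

```python
# Minimal sketch: a convolutional layer accepts inputs of any size,
# unlike a fully connected layer whose input dimension is fixed.
import numpy as np

def conv2d_valid(img: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Naive 'valid' 2-D convolution (no padding, stride 1)."""
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

k = np.ones((3, 3))                               # one fixed 3x3 kernel
print(conv2d_valid(np.zeros((8, 8)), k).shape)    # -> (6, 6)
print(conv2d_valid(np.zeros((32, 16)), k).shape)  # -> (30, 14)
```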

The main body of the FCN alternates convolutional and pooling layers, which progressively process the image and extract features. Generally, it is composed of several stages, and each stage consists of several convolutional layers with the same kernel size followed by a pooling layer. The convolutional layer slides a $k\times k$ convolution kernel over the feature map and combines it with the corresponding elements. The convolution calculation and the image size after convolution are expressed as Equations (9) and (10):

$${a}_{i,j}=f\left(\sum_{m=0}^{F-1}\sum_{n=0}^{F-1}{w}_{m,n}{x}_{i+m,j+n}+{w}_{b}\right)$$ (9)

$${W}_{2}=\frac{{W}_{1}-F+2P}{S}+1$$ (10)

where ${a}_{i,j}$ is the element value of the output feature map after convolution, $x$ is the input value of the convolution layer, ${w}_{m,n}$ are the parameters of the convolution kernel, also called weights, and ${w}_{b}$ is the bias term. ${W}_{1}$ is the size of the original image, $S$ is the stride, which represents the number of elements skipped between positions, $P$ is the padding, which means adding $P$ layers of zero elements around the input image, $F$ is the size of the convolution kernel, and ${W}_{2}$ is the size of the feature map after convolution.

The pooling layer selects the most representative features in the feature map to reduce the number of parameters. There are generally two pooling methods: maximum pooling and average pooling. Maximum pooling divides the feature map into multiple regions of the same size and combines the maximum value of each region into a new feature map. At the end of each convolution, an activation function removes negatively correlated features to ensure that the features are related to the final goal. The Rectified Linear Unit (ReLU) activation function is adopted, as in Equation (11):

$$f(x)=\mathrm{max}(0,x)$$ (11)
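The output-size formula and the ReLU activation can be sketched as short helper functions (names and the 224 × 224 example size are illustrative, not from the paper):

```python
# Sketch of the convolution output-size formula (Eq. 10) and ReLU (Eq. 11).

def conv_output_size(w1: int, f: int, p: int, s: int) -> int:
    """W2 = (W1 - F + 2P) / S + 1 for a square input of side W1."""
    return (w1 - f + 2 * p) // s + 1

def relu(x: float) -> float:
    """Rectified Linear Unit: max(0, x)."""
    return max(0.0, x)

# A 224x224 input with a 3x3 kernel, padding 1, stride 1 keeps its size:
print(conv_output_size(224, 3, 1, 1))  # -> 224
print(relu(-2.5), relu(3.0))           # -> 0.0 3.0
```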

One of the important characteristics of the FCN is transposed convolution, also called upsampling. Transposed convolution expands the convolved image to the size of the original image without restoring the original values. Its calculation method is similar to that of convolution. The image size after transposed convolution is described in Equation (12):

$${W}_{2}=S\left({W}_{1}-1\right)+F-2P$$ (12)

where ${W}_{1}$, $P$, $F$ and ${W}_{2}$ have the same meanings as in the convolution operation, and $S$ is the stride, which means that $S-1$ zero elements are inserted between neighboring elements. Transposed convolution can be seen as enlarging the feature map and then performing an ordinary convolution. As shown in Figure 3, a 3 × 3 feature map is expanded into a 5 × 5 feature map through a 3 × 3 convolution kernel after internal zero-filling.
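Equation (12) reproduces the Figure 3 example directly; a one-line sketch (the function name is illustrative):

```python
# Sketch of the transposed-convolution output size (Eq. 12):
# W2 = S * (W1 - 1) + F - 2P.

def transposed_conv_output_size(w1: int, f: int, p: int, s: int) -> int:
    return s * (w1 - 1) + f - 2 * p

# Figure 3 example: a 3x3 map with a 3x3 kernel, stride 1, no padding
# grows to 5x5.
print(transposed_conv_output_size(3, 3, 0, 1))  # -> 5
```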

The backbone of the FCN is VGG-16, as shown in Figure 4. There are 7 convolutional layers and 5 pooling layers in the FCN, where the blue blocks represent convolutional layers, the yellow blocks represent pooling layers, the green blocks represent feature fusion layers that sum corresponding elements of feature maps with the same dimensions, and the orange blocks represent transposed convolution layers. A feature map is generated after a series of convolution and pooling operations on the input image. The skip structure also plays an important role in the FCN: the feature map of the pool4 layer is merged with that of the pool3 layer to recover the details of the image. Finally, a transposed convolution expands the image to the size of the original, and a softmax determines the probability of each pixel belonging to each class. The feature maps of the third and fourth pooling layers are sequentially added to conv7's feature map to take into account both local and global information.

A 1 × 1 convolution kernel is used to change the number of feature map channels. After the first transposed convolution, the feature map of the 7th convolutional layer has the same dimensions as the feature map of the 4th pooling layer, and the channel numbers of both are adjusted to the number of categories. The fusion layer merges the two feature maps by adding elements at corresponding positions. The same operation merges in the feature map after the third pooling layer. The feature map is then expanded to the size of the original image by the third transposed convolution.
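The fusion steps above can be checked at the level of array shapes. The sketch below assumes the usual VGG-16 strides (pool3 at 1/8, pool4 at 1/16, conv7 at 1/32 of the input) and a hypothetical 224 × 224 input with 2 categories; nearest-neighbour repetition stands in for the learned transposed convolutions.

```python
# Shape-level sketch of the FCN skip fusion (assumed strides and input size;
# upsample2x is a stand-in for a stride-2 transposed convolution).
import numpy as np

n_classes = 2            # e.g. fish vs. background (assumed)
h = w = 224              # assumed input size

# Feature maps after the 1x1 convolutions set channels = n_classes
conv7 = np.zeros((n_classes, h // 32, w // 32))
pool4 = np.zeros((n_classes, h // 16, w // 16))
pool3 = np.zeros((n_classes, h // 8,  w // 8))

def upsample2x(x: np.ndarray) -> np.ndarray:
    """Double spatial resolution by nearest-neighbour repetition."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

fused = upsample2x(conv7) + pool4   # fuse at 1/16 resolution
fused = upsample2x(fused) + pool3   # fuse at 1/8 resolution
out = fused
for _ in range(3):                  # final 8x upsampling back to input size
    out = upsample2x(out)
print(out.shape)  # -> (2, 224, 224)
```

Element-wise addition requires both operands to have identical shapes, which is exactly why the 1 × 1 convolutions and the intermediate transposed convolutions are needed before each fusion.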

The segmentation map is used to determine the object region: the region where the object is located is selected, and irrelevant information is eliminated. The coordinates of the selected regions in the left and right views are then used by the subsequent stereo matching algorithm.