Underwater Target Detection and 3D Reconstruction System Based on Binocular Vision

To better solve the problem of target detection in marine environment and to deal with the difficulty of 3D reconstruction of underwater target, a binocular vision-based underwater target detection and 3D reconstruction system is proposed in this paper. Two optical sensors are used as the vision of the system. Firstly, denoising and color restoration are performed on the image sequence acquired by the vision of the system and the underwater target is segmented and extracted according to the image saliency using the super-pixel segmentation method. Secondly, aiming to reduce mismatch, we improve the semi-global stereo matching method by strictly constraining the matching in the valid target area and then optimizing the basic disparity map within each super-pixel area using the least squares fitting interpolation method. Finally, based on the optimized disparity map, triangulation principle is used to calculate the three-dimensional data of the target and the 3D structure and color information of the target can be given by MeshLab. The experimental results show that for a specific size underwater target, the system can achieve higher measurement accuracy and better 3D reconstruction effect within a suitable distance.


Introduction
With the rapid development of computer vision and robotics, now an enormous number of underwater operations can be conducted by underwater robots [1,2]. To succeed in autonomous underwater intervention, the integration process of the required robotic system is critical, which includes the mechatronics integration and the software integration. The mechatronics integration usually composes of three parts: underwater vehicle, robotic manipulator and a stereo vision system, while the software integration may include vehicle navigation, target identification, target tracking, arm control and visual control of the manipulator [3]. Underwater target detection and 3D reconstruction offered by the vision system, is a key issue for intervention missions carried out by underwater robots [4,5]. For intervention missions requiring grasping and manipulation of objects, the vision system must provide accurate target detection and distance estimation [6]. However, water not only influences the mechanical and electrical design of the robot sub-systems but also causes difficulties to the underwater vision. Water turbidity, color distortion, light absorption and scattering phenomena represent the major problems with underwater vision applications, which may affect the perceived features of the object and accordingly brings difficulties to underwater target detection and 3D reconstruction of the vision system [6]. Moreover, due to the flat-panel glass windows that are usually adopted for underwater vision housings, significant distortions due to refraction in

Description of the System
The underwater target detection and 3D reconstruction system based on binocular vision proposed in this paper is shown in Figure 1. The whole system can be divided into two parts: perception module in underwater environment and data processing module on land surface. Under the normal operation of the system, the binocular vision sensing module, which is composed of two optical cameras, is used as the system input. The client of the data processing module on the land surface is set as the output of the system. The whole process mainly includes three steps: (1) Underwater target image sequence is collected by binocular vision system through video input interface module. (2) The collected image sequence is transmitted by the embedded subsystem through the data sending module, received by the data receiving module of the server and stored in the server's memory. (3) The clients access the server's memory through the Wireless Local Area Network (WLAN) to acquire the binocular image sequence and accomplish the image processing, which includes image pre-processing, stereo matching and 3D reconstruction. The disparity map and 3D coordinate information of targets are finally output by the system software interface.

Hardware of the System
As shown in Figure 2, the system hardware consists of three parts: data acquisition module, embedded subsystem and data transmission module. The CMOS Sensors (model FMVU-03MTC supplied by Lingliang Photoelectric Technology company in Shanghai, China) are adopted in this paper. Their resolution and frame are 640/480 and 60 fps, respectively. They are connected to the embedded subsystem through LVDS/HiSPi interface. The subsystem works based on HUAWEI Hass Hi3519V101 scheme, which uses ARM Cortex-A17 as the control core and supports video image acquisition of 1920 × 1080 30 fps and Gigabit Ethernet connections. The data transmission module uses optical fiber communication technology to transmit the collected image sequence data to the data receiving module of the server through the TCP/IP protocol.

Hardware of the System
As shown in Figure 2, the system hardware consists of three parts: data acquisition module, embedded subsystem and data transmission module. The cmOS Sensors (model FMVU-03MTC supplied by Lingliang Photoelectric Technology company in Shanghai, China) are adopted in this paper. Their resolution and frame are 640/480 and 60 fps, respectively. They are connected to the embedded subsystem through LVDS/HiSPi interface. The subsystem works based on HUAWEI Hass Hi3519V101 scheme, which uses ARM Cortex-A17 as the control core and supports video image acquisition of 1920 × 1080 30 fps and Gigabit Ethernet connections. The data transmission module uses optical fiber communication technology to transmit the collected image sequence data to the data receiving module of the server through the TCP/IP protocol.

Description of the System
The underwater target detection and 3D reconstruction system based on binocular vision proposed in this paper is shown in Figure 1. The whole system can be divided into two parts: perception module in underwater environment and data processing module on land surface. Under the normal operation of the system, the binocular vision sensing module, which is composed of two optical cameras, is used as the system input. The client of the data processing module on the land surface is set as the output of the system. The whole process mainly includes three steps: (1) Underwater target image sequence is collected by binocular vision system through video input interface module. (2) The collected image sequence is transmitted by the embedded subsystem through the data sending module, received by the data receiving module of the server and stored in the server's memory. (3) The clients access the server's memory through the Wireless Local Area Network (WLAN) to acquire the binocular image sequence and accomplish the image processing, which includes image pre-processing, stereo matching and 3D reconstruction. The disparity map and 3D coordinate information of targets are finally output by the system software interface.

Hardware of the System
As shown in Figure 2, the system hardware consists of three parts: data acquisition module, embedded subsystem and data transmission module. The CMOS Sensors (model FMVU-03MTC supplied by Lingliang Photoelectric Technology company in Shanghai, China) are adopted in this paper. Their resolution and frame are 640/480 and 60 fps, respectively. They are connected to the embedded subsystem through LVDS/HiSPi interface. The subsystem works based on HUAWEI Hass Hi3519V101 scheme, which uses ARM Cortex-A17 as the control core and supports video image acquisition of 1920 × 1080 30 fps and Gigabit Ethernet connections. The data transmission module uses optical fiber communication technology to transmit the collected image sequence data to the data receiving module of the server through the TCP/IP protocol.  Figure 2. System hardware diagram. Figure 2. System hardware diagram.

Processing of The Binocular Vision System
As shown in Figure 3, the binocular vision system processing consists of stereo calibration, image rectification, image denoising, color correction, image segmentation, stereo matching, depth calculation and 3D reconstruction. Stereo calibration is done off-line by an in-water calibration using a checkerboard in our current work, which will be described in Section 3.1.1. Image rectification projects the left and right images onto a common plane in such a way that the corresponding points have the same row coordinates and will be given in Section 3.1.2. Image denoising and color correction, offered in Sections 3.2.1 and 3.2.2 respectively, together improve the quality of underwater images for better image segmentation and stereo matching. Image segmentation in Section 3.2.3 is used for target extraction and matching area restriction. Stereo matching, which is described in Section 3.2.4, can produce the disparity map, which will then be used for depth calculation and 3D reconstruction. Based on the above processing, we have developed a software of the binocular vision system using the Microsoft Foundation Class (MFC) library and C++ language. The user can view both left and right images in the current frame, obtain the disparity map and depth information of the target and view the 3D scene of the target if necessary.

Processing of The Binocular Vision System
As shown in Figure 3, the binocular vision system processing consists of stereo calibration, image rectification, image denoising, color correction, image segmentation, stereo matching, depth calculation and 3D reconstruction. Stereo calibration is done off-line by an in-water calibration using a checkerboard in our current work, which will be described in Section 3.1.1. Image rectification projects the left and right images onto a common plane in such a way that the corresponding points have the same row coordinates and will be given in Section 3.1.2. Image denoising and color correction, offered in Section 3.2.1 and Section 3.2.2 respectively, together improve the quality of underwater images for better image segmentation and stereo matching. Image segmentation in Section 3.2.3 is used for target extraction and matching area restriction. Stereo matching, which is described in Section 3.2.4, can produce the disparity map, which will then be used for depth calculation and 3D reconstruction. Based on the above processing, we have developed a software of the binocular vision system using the Microsoft Foundation Class (MFC) library and C++ language. The user can view both left and right images in the current frame, obtain the disparity map and depth information of the target and view the 3D scene of the target if necessary.

Stereo Calibration
Stereo measurement based on binocular vision has the similar principles as triangulation technology. The target images are obtained by two cameras from different angles. Then the 3D geometric information is acquired from the 2D image coordinates of the feature points in the left and right views. The calibration accuracy of binocular stereo vision system is an important factor affecting the accuracy of 3D reconstruction. The fundamental task of calibration of stereo vision system is to determine the mapping relation between the 2D image coordinates and the 3D world coordinates, including the optical geometry parameters and distortion parameters (intrinsic parameters) that describe the internal structure of each camera and the structural parameters (external parameters) that describe the spatial relationship between the two cameras. Due to light refraction in air-glass-water transitions, an axial camera is more suitable for modeling an underwater camera [6,7]. However, an axial camera model is still difficult to use with stereo processing and therefore it is still common in underwater vision to adopt a pinhole model with the parameters obtained by an in-water calibration using a checkerboard [6]. The calibration method based on the chessboard template proposed by Zhang [36] has a higher calibration precision and is widely used. According to the convenience and maneuverability, the proposed system currently uses Zhang's chessboard calibration method to restrict the intrinsic parameters of cameras through the corner feature and the homographic matrix. The mapping relation between the 3D world coordinate system and the 2D pixel coordinate system is defined as follows: where P = [Xw Yw Zw 1] is the homogeneous coordinate of a given point in the world coordinate system; p = [u v 1] is the corresponding homogeneous coordinate of this point in the pixel coordinate

Stereo Calibration
Stereo measurement based on binocular vision has the similar principles as triangulation technology. The target images are obtained by two cameras from different angles. Then the 3D geometric information is acquired from the 2D image coordinates of the feature points in the left and right views. The calibration accuracy of binocular stereo vision system is an important factor affecting the accuracy of 3D reconstruction. The fundamental task of calibration of stereo vision system is to determine the mapping relation between the 2D image coordinates and the 3D world coordinates, including the optical geometry parameters and distortion parameters (intrinsic parameters) that describe the internal structure of each camera and the structural parameters (external parameters) that describe the spatial relationship between the two cameras. Due to light refraction in air-glass-water transitions, an axial camera is more suitable for modeling an underwater camera [6,7]. However, an axial camera model is still difficult to use with stereo processing and therefore it is still common in underwater vision to adopt a pinhole model with the parameters obtained by an in-water calibration using a checkerboard [6]. The calibration method based on the chessboard template proposed by Zhang [36] has a higher calibration precision and is widely used. According to the convenience and maneuverability, the proposed system currently uses Zhang's chessboard calibration method to restrict the intrinsic parameters of cameras through the corner feature and the homographic matrix. The mapping relation between the 3D world coordinate system and the 2D pixel coordinate system is defined as follows: where P = [X w Y w Z w 1] is the homogeneous coordinate of a given point in the world coordinate system; p = [u v 1] is the corresponding homogeneous coordinate of this point in the pixel coordinate system; s is a scaling factor; R is a rotation matrix and T is a translation vector; M is the camera intrinsic matrix and can be given by the following: where f x = f /d x , f y = f /d y ; d x and d y are the sizes of a single pixel in the direction of the X axis and that of the Y axis of the image coordinate system respectively; (u 0 , v 0 ) is the position of origin point of pixel coordinate system in the image coordinate system. If a point is at the position (u, v) in the pixel coordinate system and at (x c , y c ) in the image coordinate system, the corresponding coordinate relations will be given by the following: The above solving process of camera intrinsic parameters assumes that the lens is an ideal model. The image is usually distorted by the influence of the lens manufacturing process. According to Brown's theory for distortion parameter solution, the lens distortion can be described by radial distortion and tangential distortion and the expression is as follows: x rec y rec where (x, y) is the original position of the distortion point and (x rec , y rec ) is the new position after correction; r is the radius of the lens model; k 1 , k 2 and k 3 are radial distortion parameters; p 1 , p 2 are tangential distortion parameters; and D = [k 1 , k 2 , k 3 , p 1 , p 2 ] is the distortion parameter vector of the non-ideal lens model. The external parameters of the binocular camera include the rotation matrix R, the translation vector T and the re-projection matrix Q. The rotation matrix R and the translation vector T are used to describe the relative position of the binocular cameras and can be given by: where R r and T r are the rotation matrix and the translation vector for the right camera; while R l and T l are those for the left one. The re-projection matrix Q is used to convert the two-dimensional image point coordinates into three-dimensional coordinates, which can be calculated according to the following: where (c x , c y ) is the origin position of the left view image and c x is the horizontal ordinate of the right view image origin; T x is the component of the translation matrix in the direction of the X axis, which equals the baseline distance B between the binocular cameras in the ideal situation; and f is the focal length of the lenses. To measure the calibration accuracy, the error of the calibration result is characterized by calculating the re-projection error of the corner point of the checkerboard, which is defined as: where (u 1i , v 1i ) is the sub-pixel corner coordinates extracted from the feature points; (u 2i , v 2i ) is the coordinates calculated according to the re-projection matrix; and n is the total number of calibration checkerboard corners. After stereo calibration using a checkerboard with 1-m distance in water and in the air, intrinsic parameters of the binocular vision system in water are given in Table 1 and the corresponding intrinsic parameters in the air are given in Table 2. According to Tables 1 and 2, the average focal length ratio in water and in the air can be calculated by which is a little bigger than the ideal focal length ratio of 1.333. External parameters that gives the relative spatial position between the two optical sensors are given in Table 3. Besides, the calibration errors of 0.173 (unit: pixel) in water and 0.155 (unit: pixel) in the air are also given in Table 3. From Table 3, it can be seen that in water, the external parameters are relatively small, that is, the binocular images in water can be corrected by the rotation of a smaller angle and the translation of a smaller distance than in the air. Table 1. Intrinsic parameters of the binocular vision system (in water, at 1 m distance).

View Optical Geometry Parameters Distortion
Parameters  Table 2. Intrinsic parameters of the binocular vision system (in the air, at 1 m distance).

View Optical Geometry Parameters Distortion
Parameters  Table 3. External parameters of the binocular vision system (in water and in the air, at 1 m distance).

Image Rectification
The left and right images obtained by the binocular vision system usually have a certain image distortion due to the imaging principle of the camera and the structure of the device. Therefore, the same pixel may not be on the same pole line in the left and right images. This will bring difficulties to the subsequent stereo matching and cause the increase in time consumption and mismatch. To improve the accuracy of stereo matching, it is necessary to correct the image to achieve the strict coplane and line alignment of the left and right images in theory. The search range of image matching points is reduced from two-dimensional to one dimension, so the range of search is reduced and the speed of the operation can be improved. Image correction includes three steps: elimination of distortion, image correction and image cutting. The proposed system made use of OpenCV Library. Firstly, the 'pillow' distortion is eliminated based on the distortion parameters obtained from the calibration process in Section 3.1.1. Secondly, the left and right view images are horizontally aligned by parameters with the focal length and optical center of lens, rotation matrix and translation vector. Therefore, we can ensure the position consistence of optical centers, parallelism of light axis and alignment of polar lines. Finally, the image is cut and the irregular areas at the edges and corners of the image are deleted so that the overlapped area of the left and right images is maximized. The rectification results of the checkboard images are shown in Figure 4. reduced and the speed of the operation can be improved. Image correction includes three steps: elimination of distortion, image correction and image cutting. The proposed system made use of OpenCV Library. Firstly, the 'pillow' distortion is eliminated based on the distortion parameters obtained from the calibration process in Section 3.1.1. Secondly, the left and right view images are horizontally aligned by parameters with the focal length and optical center of lens, rotation matrix and translation vector. Therefore, we can ensure the position consistence of optical centers, parallelism of light axis and alignment of polar lines. Finally, the image is cut and the irregular areas at the edges and corners of the image are deleted so that the overlapped area of the left and right images is maximized. The rectification results of the checkboard images are shown in Figure 4.  As shown in Figure 4a is the image pair before image correction and Figure 4b is the image pair after image correction. Obviously, the 'pillow' distortion is eliminated by stereoscopic correction and the corresponding points of the same targets in the left and right images are basically aligned.

Image Denoising
Obtaining high-quality underwater images is important for accurate 3D reconstruction of underwater targets. To effectively remove noise in the images, the proposed system adopts the block matching and 3D filtering (BM3D) algorithm [37]. The BM3D algorithm can be divided into two steps: (1) Initial estimation and (2) Final estimation. In the first step, the similarity between search block and reference block is defined by a hard threshold. The similar blocks are stacked into three-dimensional arrays. After each three-dimensional array is filtered through cooperative filtering, that is, the spectrum transform is contracted, the initial estimate of the reference block is obtained by inverse transformation. Finally, the initial estimation is aggregated by a method of the non-local mean. In the second process, the original image and the initial estimation image are transformed by three-dimensional transform and Wiener filtering and the final output is obtained by inverse transform and aggregation. As shown in Figure 4a is the image pair before image correction and Figure 4b is the image pair after image correction. Obviously, the 'pillow' distortion is eliminated by stereoscopic correction and the corresponding points of the same targets in the left and right images are basically aligned.

Image Denoising
Obtaining high-quality underwater images is important for accurate 3D reconstruction of underwater targets. To effectively remove noise in the images, the proposed system adopts the block matching and 3D filtering (BM3D) algorithm [37]. The BM3D algorithm can be divided into two steps: (1) Initial estimation and (2) Final estimation. In the first step, the similarity between search block and reference block is defined by a hard threshold. The similar blocks are stacked into three-dimensional arrays. After each three-dimensional array is filtered through cooperative filtering, that is, the spectrum transform is contracted, the initial estimate of the reference block is obtained by inverse transformation. Finally, the initial estimation is aggregated by a method of the non-local mean. In the second process, the original image and the initial estimation image are transformed by three-dimensional transform and Wiener filtering and the final output is obtained by inverse transform and aggregation.

Color Correction
Due to the limitations of the digital camera's photosensitive device, there is a difference between the recorded color and the real color. This difference can be expressed as a relation of mapping. Hirschmuller [33] estimated the downlink attenuation coefficients of different colors under the water based on the multispectral or hyperspectral data of underwater images of target at a specific location. By reversely creating an unenhanced image, the over-enhanced phenomenon caused by the camera built-in functions of white balance and color enhancement is eliminated. Therefore, the color information of the underwater image is effectively restored. The proposed system used three basic color (RGB) calibration method to calibrate and restore the color information of images. Through a variety of curve fitting tests, it is found that the three polynomial fitting results are the best and the fitting formula of the photosensitive curve is as follows: where r, g and b are recorded values of digital cameras for red, green and blue, respectively; R, G and B are standard values for red, green and blue. We carried out the color correction experiment with 'ColorChecker 24' using the method of three basic color calibration and achieved the fitting coefficient of the photoreceptor curve. The photoreceptor curve fitting parameters of cameras are given in Table 4. To demonstrate the effectiveness of our proposal, Figure 5 gives a comparison of the original underwater images, the results using the dark channel prior method in Reference [38] and the results using BM3D filtering and color correction of our research. It can be seen from Figure 5 that, the 'atomization' phenomenon is effectively eliminated and the true color information of the image is well restored after BM3D filtering and color correction. By using BM3D filtering and color correction, the targets in the underwater images become much clearer, which will contribute to the following accurate stereo matching of the image pairs. Compared with our method, although the method in Reference [38] can also well remove the 'atomization' phenomenon, it may fail to recover the true color, which can be seen from the first three images in the second row. In the first image of the second row of Figure 5, the third and fifth blocks, which should have different colors, are made the same color; and that is the same case for the square target and vase in the second image, which should have different red colors; while in the third image, the red color is not well recovered. To clearly demonstrate this, the standard color checkboard, the color checkboard restored by the method in Reference [38] and that restored by BM3D filtering and color correction are further shown in detail in Figure 6, from which it can be seen that the color checkboard restored by BM3D filtering and color correction is much closer to the real one.

Image Segmentation
After image denoising and color restoration, the true colors of images are obtained. However, due to the particularity of marine environment, the background of underwater images usually contains less texture information, which may cause many mismatched regions. Therefore, the proposed system implemented the segmentation and extraction of the target from the background before the stereo image matching. Taking the advantages of the super-pixel segmentation algorithm that can reduce the complexity of the subsequent image processing, a segmentation algorithm based on super-pixel clustering is adopted in this paper. First, the brightness and texture features are extracted from the underwater image after the noise reduction. Next, the similarity of the two features is calculated to make a weighted fusion. Then, the pixels are clustered to generate the super-pixels by using the fusion similarity as the distance measurement. The calculation formula of the distance metric D is as follows: where dt, dc and ds are similarity distances for texture, color and spatial features, respectively; Ns is the maximum space distance within the cluster; Nc is the maximum color similarity; and  is a weight parameter. Obviously, the smaller the distance metric D is, the greater the similarity between the pixels.

Image Segmentation
After image denoising and color restoration, the true colors of images are obtained. However, due to the particularity of marine environment, the background of underwater images usually contains less texture information, which may cause many mismatched regions. Therefore, the proposed system implemented the segmentation and extraction of the target from the background before the stereo image matching. Taking the advantages of the super-pixel segmentation algorithm that can reduce the complexity of the subsequent image processing, a segmentation algorithm based on super-pixel clustering is adopted in this paper. First, the brightness and texture features are extracted from the underwater image after the noise reduction. Next, the similarity of the two features is calculated to make a weighted fusion. Then, the pixels are clustered to generate the super-pixels by using the fusion similarity as the distance measurement. The calculation formula of the distance metric D is as follows: where dt, dc and ds are similarity distances for texture, color and spatial features, respectively; Ns is the maximum space distance within the cluster; Nc is the maximum color similarity; and λ is a weight parameter. Obviously, the smaller the distance metric D is, the greater the similarity between

Image Segmentation
After image denoising and color restoration, the true colors of images are obtained. However, due to the particularity of marine environment, the background of underwater images usually contains less texture information, which may cause many mismatched regions. Therefore, the proposed system implemented the segmentation and extraction of the target from the background before the stereo image matching. Taking the advantages of the super-pixel segmentation algorithm that can reduce the complexity of the subsequent image processing, a segmentation algorithm based on super-pixel clustering is adopted in this paper. First, the brightness and texture features are extracted from the underwater image after the noise reduction. Next, the similarity of the two features is calculated to make a weighted fusion. Then, the pixels are clustered to generate the super-pixels by using the fusion similarity as the distance measurement. The calculation formula of the distance metric D is as follows: where d t , d c and d s are similarity distances for texture, color and spatial features, respectively; N s is the maximum space distance within the cluster; N c is the maximum color similarity; and λ is a weight parameter. Obviously, the smaller the distance metric D is, the greater the similarity between the pixels. Figure 7 shows the results of the target segmentation. Based on the feature, the generated super-pixels are detected and the super-pixels with saliency are marked by red line. All the super-pixels are clustered by the method of the Max-Flow/Min-Cut algorithm. After that, the proportion of the significant super-pixels in clustering is calculated and compared with the preset threshold. Thus, the segmentation result for the foreground object is obtained.
Sensors 2018, 18, x FOR PEER REVIEW 10 of 21 Figure 7 shows the results of the target segmentation. Based on the feature, the generated super-pixels are detected and the super-pixels with saliency are marked by red line. All the super-pixels are clustered by the method of the Max-Flow/Min-Cut algorithm. After that, the proportion of the significant super-pixels in clustering is calculated and compared with the preset threshold. Thus, the segmentation result for the foreground object is obtained.

Stereo Matching
Stereo matching is an important part of the system implementation. The system uses binocular cameras to get different views of the left and right images, to calculate the cost of stereo matching and get the matching disparity map. Considering both matching accuracy and time efficiency, the Semi-Global Matching (SGM) algorithm that has the advantages of fast matching speed and high matching accuracy is preferred. According to this, we proposed an improved algorithm based on the semi-global matching algorithm in our system. Taking the right view as the reference, Figure 8a shows the stereo matching principle of SGM algorithm. If there is a point to be matched in the right view image and its horizontal ordinate is x, we could search for the best match point starting from the position minDis within the range of Windows in the left view image. However, the disadvantage of this algorithm is that the color characteristics of underwater images are seriously disturbed and degraded by the influence of light and water scattering. Therefore, there are many mismatches in the background areas.

Stereo Matching
Stereo matching is an important part of the system implementation. The system uses binocular cameras to get different views of the left and right images, to calculate the cost of stereo matching and get the matching disparity map. Considering both matching accuracy and time efficiency, the Semi-Global Matching (SGM) algorithm that has the advantages of fast matching speed and high matching accuracy is preferred. According to this, we proposed an improved algorithm based on the semi-global matching algorithm in our system. Taking the right view as the reference, Figure 8a shows the stereo matching principle of SGM algorithm. If there is a point to be matched in the right view image and its horizontal ordinate is x, we could search for the best match point starting from the position minDis within the range of Windows in the left view image. However, the disadvantage of this algorithm is that the color characteristics of underwater images are seriously disturbed and degraded by the influence of light and water scattering. Therefore, there are many mismatches in the background areas.
Sensors 2018, 18, x FOR PEER REVIEW 10 of 21 Figure 7 shows the results of the target segmentation. Based on the feature, the generated super-pixels are detected and the super-pixels with saliency are marked by red line. All the super-pixels are clustered by the method of the Max-Flow/Min-Cut algorithm. After that, the proportion of the significant super-pixels in clustering is calculated and compared with the preset threshold. Thus, the segmentation result for the foreground object is obtained.

Stereo Matching
Stereo matching is an important part of the system implementation. The system uses binocular cameras to get different views of the left and right images, to calculate the cost of stereo matching and get the matching disparity map. Considering both matching accuracy and time efficiency, the Semi-Global Matching (SGM) algorithm that has the advantages of fast matching speed and high matching accuracy is preferred. According to this, we proposed an improved algorithm based on the semi-global matching algorithm in our system. Taking the right view as the reference, Figure 8a shows the stereo matching principle of SGM algorithm. If there is a point to be matched in the right view image and its horizontal ordinate is x, we could search for the best match point starting from the position minDis within the range of Windows in the left view image. However, the disadvantage of this algorithm is that the color characteristics of underwater images are seriously disturbed and degraded by the influence of light and water scattering. Therefore, there are many mismatches in the background areas.  The improved stereo matching algorithm based on SGM algorithm proposed in this paper can accurately extract the target area from the background in underwater images. The stereo matching process is strictly constrained within the valid target area. As shown in Figure 8b,c, the black pixels belong to background that are invalid for stereo matching, while the white ones belong to target areas that are valid for stereo matching; and the gray one is the current pixel to be matched. If the matching pixel in the left image of the current pixel to be matched in the right one is in the valid target area, the matching search process starts directly from the position x in the left image until the matching pixel reaches the boundary of search window. If the boundary of search window is in the invalid background area, the search process ends in advance. If the matching pixel in the left image of the current pixel to be matched in the right one is in the invalid background area, the matching search process starts from the first valid pixel in the search window until the matching pixel reaches the window boundary or the invalid background. The implementation of the improved algorithm can be divided into the following four steps: (1) Gradient information extraction. To further eliminate the effect of image noise on calculation of disparity map, the horizontal Sobel operator is used to extract the gradient information of the image. The Sobel operator is given as follows: (11) where I represents the pixel value of the image. After processed by the Sobel operator and smoothed by the Gauss filter, the original image is mapped to generate a new image. The mapping function is given by: where I represents the pixel value of the original image; I new indicates the pixel value of the remapping image; and T h is the threshold of the filter.
(2) Matching cost calculation. In practical applications, the different visual angle of the binocular vision system always leads to an inhomogeneous phenomenon between the left and right view images, which causes an increase in the mismatch rate. Mutual information has the advantage of insensitivity to light, so the semi-global matching algorithm is based on such information. The computational efficiency and accuracy of the stereo matching are improved by the cost calculation of the hierarchical mutual information instead of the traditional gray value calculation. The definition of mutual information is as follows: M I 1 ,I 2 = H I 1 + H I 2 − H I 1 ,I 2 (13) where H I 1 and H I 2 are the entropies of the left and right images, respectively; H I 1 ,I 2 is the combined entropy for the two images. According to the Taylor expansion formula, entropy H I and combined entropy H I 1 ,I 2 can be respectively expressed as: where P I 1 ,I 2 (i, k) represents the joint probability distribution of the images; g(i, k)is the Gauss kernel function. Therefore, the mutual information M I 1 ,I 2 can be finally given by: Then, the corresponding matching cost is defined as: C(p, d) = −m I p ,I q I p , I q (16) where I p is the value of point p and q is its corresponding point on the polar line in the left view image. If the horizontal ordinate of p is x, then the horizontal ordinate of q is x + d, where d is the disparity value.
(3) Cost aggregation. The matching cost based on the mutual information has been obtained after the above calculation process for matching cost. But such matching cost with the form of pixel by pixel can be easily affected by mismatch points or noise and other factors. Therefore, the penalty function based on the neighborhood disparity data is introduced to increase the smoothness constraint. Accordingly, the energy function can be defined as: where ∑ p C p, d p is the data item representing the matching costs of all pixels in the image and the next two items are used for punishment. If the disparity value between the point p and the point q equals 1, the punishment item P 1 works; otherwise if the disparity value is greater than 1 and P 2 is larger than P 1 at the same time, the punishment item P 2 works. Besides, q is a point within the neighborhood ( p ) of point p. To minimize the energy, the dynamic programming method is adopted and the idea of scanning line optimization is introduced. The matching cost on the direction r could be defined as: where C(p, d) indicates the matching cost of point p on disparity value d; the second term represents the minimum matching cost of the path adjacent point p−r based on the disparity smoothing constraint; and the third term represents the minimum matching cost of the path adjacent point p−r along the direction r. Therefore, the sum of the matching costs of point p can be obtained by aggregation of the path costs in the direction of each scan line, which is given by Equation (18): (4) Disparity map optimization. According to the above matching cost calculation method, the right view image was set as the reference and the left view image was the one to be matched. The effective area of the whole image is traversed by progressive scanning. Upon every valid pixel in the right view getting the best matching point with the lowest matching cost in the left view, the basic disparity map was consequently formed. Aiming at the problem of mismatching or invalid disparity in the weak texture area, the proposed method made use of the super-pixel segmentation data to optimize the basic disparity map within every super-pixel area by the least squares fitting interpolation method. The plane template used in this paper is given by Equation (20): The weighted least square method was used to calculate a, b and c, which formed the parameters set of the disparity plane template. The calculation formula of the weighted least square method is as follows: where N is the total number of pixels in the plane area; (x i , y i ) and d i are the coordinates and disparity values of the pixel indexed by i, respectively. As shown in Figure 9, compared with the basic disparity maps obtained by stereo matching, the optimization results of the least squares plane fitting interpolation method are smoother and more complete, with fewer holes. The invalid matching areas have been basically eliminated. In addition, the disparity plane in the same area is effectively smoothed and the transition of disparity values is more placid. The fitting parameters of the disparity plane templates obtained for optimizing the three basic disparity maps are given in Table 5. As shown in Figure 9, compared with the basic disparity maps obtained by stereo matching, the optimization results of the least squares plane fitting interpolation method are smoother and more complete, with fewer holes. The invalid matching areas have been basically eliminated. In addition, the disparity plane in the same area is effectively smoothed and the transition of disparity values is more placid. The fitting parameters of the disparity plane templates obtained for optimizing the three basic disparity maps are given in Table 5.   Among the five methods, the proposed method usually can provide best disparity maps, which are smooth and continuous, with fewer black holes. The two global matching methods, that is, the AD-Census method and the ARWR method, have better performance than the FCVF method and the SGM method. By constraining the matching within the valid target area and further optimizing the basic disparity map using the least squares plane fitting interpolation, the proposed method has made a remarkable improvement of the SGM method.  In Figure 10, we provide a comparison of the disparity maps produced by our method and four state of the art stereo matching methods, which are AD-Census method by Mei et al., Fast Cost-Volume Filtering (FCVF) method by Hosni et al., Adaptive Random Walk with Restart (ARWR) method by Lee et al. and Semi Global Matching (SGM) method by Hirschmuller et al. Among the five methods, the proposed method usually can provide best disparity maps, which are smooth and continuous, with fewer black holes. The two global matching methods, that is, the AD-Census method and the ARWR method, have better performance than the FCVF method and the SGM method. By constraining the matching within the valid target area and further optimizing the basic disparity map using the least squares plane fitting interpolation, the proposed method has made a remarkable improvement of the SGM method.  Figure 10. Comparison of the disparity maps produced by the AD-Census method, the FCVF method, the ARWR method, the SGM method and the proposed method, from top to bottom.

Results and Discussion
Based on the disparity map, the depth information that gives the distance from the target to the camera can finally be calculated. In this section, we first give the measurement principle of binocular vision and then provide a comparison of the distance measurement accuracy using different disparity maps produced by the five stereo matching methods mentioned above. The 3D reconstruction results using different disparity maps are also shown here. Figure 11 shows the measurement principle of binocular vision. Assuming the imaging planes of binocular cameras are coplanar and the distances to the camera's optical centers ( l O and r O ) are f. The baseline distance between two camera optical centers is B. If the objective point P is located at p1(xl, yl) in the left view image and at p2(xr, yr) in the right view image, then the distance Z from the objective point P to the imaging plane, also known as the depth value, can be calculated according to the following: Figure 10. Comparison of the disparity maps produced by the AD-Census method, the FCVF method, the ARWR method, the SGM method and the proposed method, from top to bottom.

Results and Discussion
Based on the disparity map, the depth information that gives the distance from the target to the camera can finally be calculated. In this section, we first give the measurement principle of binocular vision and then provide a comparison of the distance measurement accuracy using different disparity maps produced by the five stereo matching methods mentioned above. The 3D reconstruction results using different disparity maps are also shown here. Figure 11 shows the measurement principle of binocular vision. Assuming the imaging planes of binocular cameras are coplanar and the distances to the camera's optical centers (O l and O r ) are f. The baseline distance between two camera optical centers is B. If the objective point P is located at p 1 (x l , y l ) in the left view image and at p 2 (x r , y r ) in the right view image, then the distance Z from the objective point P to the imaging plane, also known as the depth value, can be calculated according to the following: where d represents the disparity value; c x and c x are the horizontal ordinates of the center points of the two imaging planes. In addition to the depth value, we can also use the reprojection matrix computed in the previous stereoscopic calibration process to get the horizontal and vertical coordinates of the objective point P in X-axis and Y-axis directions. Based on the dense disparity maps, we can transform 2D coordinate into 3D coordinate information according to the following: where (x, y) is the pixel coordinate of a point in the disparity map and d is the disparity value of the point. Then, the real-world coordinate of the point can be expressed as where d represents the disparity value; x c and ' x c are the horizontal ordinates of the center points of the two imaging planes. In addition to the depth value, we can also use the reprojection matrix computed in the previous stereoscopic calibration process to get the horizontal and vertical coordinates of the objective point P in X-axis and Y-axis directions. Based on the dense disparity maps, we can transform 2D coordinate into 3D coordinate information according to the following: where (x, y) is the pixel coordinate of a point in the disparity map and d is the disparity value of the point. Then, the real-world coordinate of the point can be expressed as To evaluate the accuracy and precision of the proposed method, we carried out a series of distance measurement experiments. We sampled several points from different underwater targets and obtained 10 distance values measured by our method and the compared methods, separately. According to the implementation, we set a valid distance range of measurement within [0,200]. And the distance unit is centimeter. The sampling points positions are shown in Figure 12  In our implementation, all experiment data are obtained in Intel (R) Core (TM) i5-2467M processor, 4 GB memory, CPU frequency 1.6 GHz, Windows 7 system PC machine, Visual Studio 2015. Table 6 shows the data of measurement experiments for underwater targets. As known from Table 6, all the values of sampling points are valid. The proposed method has the lowest error rate To evaluate the accuracy and precision of the proposed method, we carried out a series of distance measurement experiments. We sampled several points from different underwater targets and obtained 10 distance values measured by our method and the compared methods, separately. According to the implementation, we set a valid distance range of measurement within [0,200]. And the distance unit is centimeter. The sampling points positions are shown in Figure 12 where d represents the disparity value; x c and ' x c are the horizontal ordinates of the center points of the two imaging planes. In addition to the depth value, we can also use the reprojection matrix computed in the previous stereoscopic calibration process to get the horizontal and vertical coordinates of the objective point P in X-axis and Y-axis directions. Based on the dense disparity maps, we can transform 2D coordinate into 3D coordinate information according to the following: where (x, y) is the pixel coordinate of a point in the disparity map and d is the disparity value of the point. Then, the real-world coordinate of the point can be expressed as To evaluate the accuracy and precision of the proposed method, we carried out a series of distance measurement experiments. We sampled several points from different underwater targets and obtained 10 distance values measured by our method and the compared methods, separately. According to the implementation, we set a valid distance range of measurement within [0,200]. And the distance unit is centimeter. The sampling points positions are shown in Figure 12  In our implementation, all experiment data are obtained in Intel (R) Core (TM) i5-2467M processor, 4 GB memory, CPU frequency 1.6 GHz, Windows 7 system PC machine, Visual Studio 2015. Table 6 shows the data of measurement experiments for underwater targets. As known from Table 6, all the values of sampling points are valid. The proposed method has the lowest error rate  In our implementation, all experiment data are obtained in Intel (R) Core (TM) i5-2467M processor, 4 GB memory, CPU frequency 1.6 GHz, Windows 7 system PC machine, Visual Studio 2015. Table 6 shows the data of measurement experiments for underwater targets. As known from Table 6, all the values of sampling points are valid. The proposed method has the lowest error rate (mean error of 2.320%) compared with the other four methods and has made a great progress of SGM algorithm (mean error of 10.114%).
To display the structure information of the underwater targets more intuitively and effectively, the 3D scene can be reconstructed. Figure 13 gives the results of AD-Census method, the FCVF method, the ARWR method, the SGM method and the proposed method, respectively. The proposed system adopts the open source and extensible 3D geometric processing software MeshLab to reconstruct the color information and 3D spatial data of the targets. Figure 13 effectively demonstrates the accuracy of the system for underwater target detection and 3D reconstruction.
Sensors 2018, 18, x FOR PEER REVIEW 17 of 21 Figure 13. Comparison of 3D reconstruction of underwater targets using disparity maps provided by methods of AD-Census method, the FCVF method, the ARWR method, the SGM method and the proposed method, which are demonstrated from the first row to the last row, respectively.

Limitations of the Proposed Method
We have given a complete vision system in our work and successfully tested it on targets around 1-m distance in a reasonably clear water pool. Now the system can stably capture the images underwater for more than one hour and processes a pair of left and right images with a total latency of about 100 milliseconds. But the system does have a restriction on measuring distance, because we Figure 13. Comparison of 3D reconstruction of underwater targets using disparity maps provided by methods of AD-Census method, the FCVF method, the ARWR method, the SGM method and the proposed method, which are demonstrated from the first row to the last row, respectively. From Table 6 and Figure 13, it can be seen that: (1) Among the four methods compared with our method, the ARWR and AD-Census methods have better performance than the other two because their matching cost computations are both global and have taken account of the overall scene structure. As shown in Table 6, they can produce more accurate distance measurement results than the other two, with mean error rates at 3.553% and 6.752%, respectively. But meanwhile, they reduce the computational efficiency of the system. (2) The SGM method has a trade-off between accuracy and computation requirements, which is suitable for practical applications. However, many invalid matches are made on the surface where there is a large area or a smooth changing, which can be clearly seen in Figure 13e. This is because that considering pixel correspondence only within a small window are more susceptible to noise and that the costs of neighboring support regions in the texture-less area are aggregated together to give a less accurate result. (3) The proposed method is based on the SGM method and therefore it inherits the advantage of high efficiency. For the above testing images, the average execution time of the proposed method is about 90 milliseconds, which is close to the average execution time of the SGM method at 70 milliseconds. The FCVF method is the fastest one with an average 40 milliseconds. While the AD-Census method and the ARWR method cost about 600 milliseconds and 5 s, respectively. Moreover, by strictly constraining the stereo matching in the valid target area and optimizing the basic disparity map within each super-pixel area using the least squares fitting interpolation method, the accuracy of measurement can be greatly improved. Therefore, the proposed method has the lowest error rate (mean error of 2.320%) among all the five methods.

Limitations of the Proposed Method
We have given a complete vision system in our work and successfully tested it on targets around 1-m distance in a reasonably clear water pool. Now the system can stably capture the images underwater for more than one hour and processes a pair of left and right images with a total latency of about 100 milliseconds. But the system does have a restriction on measuring distance, because we currently focus on improving the stereo matching, without thoroughly considering the problem of camera calibration caused by light refraction in air-glass-water transitions, which is another important issue. In our current work, following the common practice, we adopt a pinhole model with the parameters obtained by an in-water calibration using a checkerboard at about 1 m away. According to the work in Reference [7], the distance measurement accuracy will descend if the target is not close to the distance at which the camera was calibrated. To find the maximum distance of our system, we then test it on other three underwater targets at different distances (1-m, 1.5-m and 2-m distances). The images with the observed points marked are given in Figure 14 and the measured distances of these points are given in Table 7. Table 7 also gives the true distance values of the observed corner points when the targets are put about 1 m away and the true values of these points should be added 50 cm and 100 cm when the targets are measured at 1.5 m and 2 m, respectively. From Table 7, it can be seen that the proposed method has the lowest distance measurement error rate (mean error of 0.82%, about 0.85 cm) at 1-m distance, which is the distance at which the camera was calibrated. The error rate (mean error of 1.533%, about 2.5 cm) becomes a little bigger at 1.5-m distance. However, when the distance comes to 2 m, the error rate becomes much bigger (mean error of 5.260%, about 11 cm), which may not be acceptable. The main reason for this is that the pinhole model used in our system can only very well approximates an axial camera model around the distance at which the camera was calibrated [7]. Therefore, the working distance of the current system should be restricted around 1 m and within 2 m distance. distances). The images with the observed points marked are given in Figure 14 and the measured distances of these points are given in Table 7. Table 7 also gives the true distance values of the observed corner points when the targets are put about 1 m away and the true values of these points should be added 50 cm and 100 cm when the targets are measured at 1.5 m and 2 m, respectively.   Table 7, it can be seen that the proposed method has the lowest distance measurement error rate (mean error of 0.82%, about 0.85 cm) at 1-m distance, which is the distance at which the camera was calibrated. The error rate (mean error of 1.533%, about 2.5 cm) becomes a little bigger at

Conclusions
To deal with the difficulty of underwater target detection and 3D reconstruction in optical image, a binocular vision based underwater target detection and 3D reconstruction system is proposed in this paper. The left and right views of the valid target area are obtained by image preprocessing technologies, such as image denoising, color restoration, salient region segmentation and so on, which will help to reduce mismatch caused by noise and distortion.
Based on the improved semi-global matching algorithm and the least squares plane fitting method, the accuracy of the three-dimensional reconstruction of the targets are effectively improved, which is good for practical engineering applications. However, due to the pinhole model adopted for camera calibration in our research, the current system can only work well within 2 m distance. Therefore, it is necessary to do more research on the pinax model, which is more accurate and apply it to our system in the future work. Moreover, for the convenience of the experiments and the validation, the current work is carried out only in a reasonably clear water pool, its robustness in a more turbid water needs to be verified in the future research.