Visual Navigation for Recovering an AUV by Another AUV in Shallow Water

Autonomous underwater vehicles (AUVs) play important roles in underwater missions. However, the reliability of the automated recovery of AUVs has still not been well addressed. We propose a vision-based framework for automatically recovering an AUV by another AUV in shallow water. The proposed framework contains a detection phase for the robust detection of underwater landmarks mounted on the docking station in shallow water and a pose-estimation phase for estimating the pose between AUVs and underwater landmarks. We propose a Laplacian-of-Gaussian-based coarse-to-fine blockwise (LCB) method for the detection of underwater landmarks to overcome ambient light and nonuniform spreading, the two main problems in shallow water. We also propose a novel method for pose estimation in practical cases where landmarks are broken or covered by biofouling. In the experiments, we show that our proposed LCB method outperforms the state-of-the-art method in remote landmark detection. We then combine our proposed vision-based framework with acoustic sensors in field experiments to demonstrate its effectiveness in the automated recovery of AUVs.


Introduction
Autonomous underwater vehicles (AUVs) are of great interest to oceanographers and navies for marine scientific research and reconnaissance due to their wide scope of activity and high efficiency. However, long-duration missions struggle with the limited on-board energy and storage capacity of AUVs. The autonomous recovery of AUVs is therefore highly desirable for recharging, data transfer, and relaunch. Various recovery systems have been proposed in previous works, and they usually contain a docking station. Docking stations can be either stationary [1][2][3][4], such as landed on the seafloor, or mobile [5][6][7][8][9][10], such as carried by unmanned surface vehicles (USVs) [5] or submarines [7][8][9]. Stationary docking stations can recharge AUVs and transfer data, but only at fixed locations. Mobile docking stations are more flexible: they can recharge and transfer data almost anywhere, and can also carry AUVs back to home harbors. All recovery systems aim at precise physical contact between docking stations and AUVs. Precise contact can be either passive or active. In passive contact, the AUV is captured by a custom mechanism; for example, in [9], an AUV was captured by a robotic manipulator once it entered the docking envelope of a submarine. Active contact means that AUVs actively connect to docking stations with the help of custom mechanisms. One common form of active contact is that the AUV catches a cable [5,8] or pole [11][12][13] with an incorporated latch or hook. This form allows the AUV to approach the cable or pole from any direction, but requires additional space to install the mechanism on the AUV, which limits its wide application across AUVs. The other common form is to home into a funnel-shaped docking station guided by landmarks, which were leveraged in [1,3,4,9,16,19,21] for pose estimation.
Perspective-n-point (PnP) algorithms can accurately solve points-based pose estimation, but they fail to estimate the pose when not all predefined landmarks are detected, namely under incomplete observations. However, incomplete observations are quite likely in long-duration underwater tasks and may be caused in practice by broken landmarks or biofouling coverage on landmarks. To solve this problem, we propose a novel pose-estimation method for incomplete observations in AUV recovery. To the best of our knowledge, we are the first to solve the problem of pose estimation under incomplete observations in PnP-based VBAR.
Our contributions are four-fold:

• Detection of landmarks in shallow water suffers from ambient light and nonuniform spreading. To overcome these problems, we propose the LCB method for the detection of underwater landmarks.

• Incomplete observation of the predefined landmarks, which may be caused by broken landmarks or biofouling coverage, is a practical problem in pose estimation when recovering AUVs in long-duration missions. A novel pose-estimation method for incomplete observations in AUV recovery is proposed in this work. To the best of our knowledge, the proposed method is the first points-based method for pose estimation under incomplete observations in VBAR algorithms.

• Field experiments were performed to evaluate the proposed LCB method and the pose-estimation method.

• We provide an underwater active-landmark dataset that was collected in our field experiments. The 2D location of each landmark was manually labeled in each image. The dataset can help researchers develop and validate underwater vision-based recovery or docking algorithms in the absence of related underwater infrastructure.

System Overview
In this section, we give an overview of our recovery system. The recovery process is illustrated in Figure 1a,b. The recovery system consisted of a mother vessel and a sub-AUV (SAUV) to be recovered. The SAUV was recovered by docking with a docking station that was physically and autonomously carried by the mother vessel. The docking station was rigidly fixed to the underbelly of the mother vessel. Its entrance was funnel-like, 2 m in diameter, and equipped with several active landmarks mounted around its rim, as shown in Figure 1c. These landmarks were used for detecting the docking station and computing its pose, as detailed in Section 3. Blue LEDs with a 460-nm wavelength were employed as landmarks due to their good propagation in water. We utilized another AUV as the mother vessel in our system, although the mother vessel could also take other forms, such as an unmanned surface vehicle (USV).
The SAUV was a torpedo-shaped vehicle, 384 mm in diameter and 5486 mm in length. It was equipped with a control computer, a Doppler velocity log (DVL), an inertial measurement unit (IMU), GPS, battery units, and motors. An acoustic sensor (Evologics S2C R) was used for rough long-range positioning, and our vision-based underwater-docking module was installed in the SAUV for short-range precision guidance, as shown in Figure 1d. A NanoSeaCam monocular camera with a frame rate of 20 fps was rigidly mounted on the SAUV head to capture underwater images. The captured images were fed into an embedded computer, a Jetson TX2, for detecting the docking station and computing its pose (Section 3). The Jetson TX2 is a power-efficient embedded computer with a 6-core CPU, 8 GB of RAM, and a 256-core GPU. The computed pose was then transmitted to the SAUV controller for short-range precision navigation.

Underwater-Docking Recovery Algorithm
Our proposed underwater-docking algorithm for recovery consisted of two phases. The first phase was landmark detection, and the second was pose estimation. In the detection phase, the detector localizes 2D landmark coordinates on images once the landmarks appear in the field of view of the camera. The pose-estimation phase recovers the relative 3D pose, including position and orientation, between the docking station and SAUV. The estimated pose is then transmitted to the control computer of the SAUV for short-range precision navigation.

Landmark Detection
Landmarks in our task were LEDs actively emitting energy. The main problems in the detection of such landmarks are nonuniform spreading and complex ambient light. To overcome these problems, we propose the LCB method for landmark detection. In the coarse detection phase, we deploy a convolutional neural network, DoNN, proposed in [16], to detect the bounding box enclosing the landmarks. Then, a Laplacian-of-Gaussian (LoG) filter that exploits blob features is applied to overcome ambient light, and blockwise segmentation is performed on the filtered image to deal with nonuniform spreading. We show in Section 4 that the proposed detector outperforms state-of-the-art methods in the remote detection of landmarks.
Coarse detection phase. We leveraged a convolutional neural network, DoNN, proposed in [16], to detect the landmark bounding box. DoNN showed superior performance in the robust bounding-box detection of underwater landmarks across various underwater environments [16]. Hence, DoNN was employed to detect the bounding box enclosing all underwater landmarks, which we refer to as coarse detection. We also validated the importance of coarse detection by comparing performance in Table 2: Rows 3 and 5 show that the method leveraging coarse detection significantly outperformed the method without it. Instead of first detecting the bounding box of all landmarks, an alternative is to detect each individual landmark directly. However, this alternative is not as stable and robust as detecting the pattern formed by all landmarks. There are many noisy light sources in water that resemble landmarks in appearance; they may come from external light sources or from mirror images of landmarks reflected from the water surface due to total internal reflection, as noted in [1,16]. If detection is performed on individual landmarks, these noisy light sources become potential false detections. In contrast, the pattern formed by all landmarks represents the docking station as a whole, which is more stable and robust to detect. For these reasons, the bounding box of all landmarks was detected in our work.
For the sake of completeness, we briefly introduce DoNN. DoNN contains nine convolution layers and seven pooling layers. It maps a captured underwater image I to a vector v = {x, y, w, h, s}, DoNN : I → v, where {x, y} and {w, h} represent the position and size of the predicted bounding box, respectively, and s is the confidence score predicted for the bounding box. DoNN first divides the input underwater image into G × G grid cells and then, for each cell, predicts multiple bounding boxes encoded by {x, y, w, h} along with their confidence scores s. Finally, the bounding box with the highest confidence score is selected as the final prediction. To summarize, the coarse detection phase outputs the predicted confidence score and the landmark bounding box.
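The final selection step of coarse detection can be sketched as follows. The grid size, the number of boxes per cell, and the (G, G, B, 5) array layout are illustrative assumptions of this sketch, not details of DoNN.

```python
import numpy as np

def select_bounding_box(preds):
    """Pick the final landmark bounding box from grid predictions.

    preds holds (x, y, w, h, s) for each of B candidate boxes in each of
    G x G grid cells, shape (G, G, B, 5). The box with the highest
    confidence score s is kept, as in the coarse detection phase.
    """
    flat = preds.reshape(-1, 5)
    x, y, w, h, s = flat[np.argmax(flat[:, 4])]
    return (x, y, w, h), s

# Toy example: G = 7 grid, B = 2 boxes per cell, one confident prediction.
preds = np.zeros((7, 7, 2, 5))
preds[3, 4, 1] = [0.5, 0.6, 0.2, 0.1, 0.9]
box, score = select_bounding_box(preds)
```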
LoG filter. After coarse detection by DoNN, the bounding box of the landmarks is localized in the image. The following processing localizes the individual landmarks within the image patch given by the bounding box. LEDs actively emitting energy were designed as our landmarks due to their good propagation in water. Since such landmarks are quite bright, most previous works detect and extract landmarks by exploiting their luminance features. However, methods based on luminance features suffer from ambient light, and the problem becomes more acute in shallow water, as shown in Figure 2a,b. Besides luminance, landmarks have blob-like structures that were neglected by previous works. We employ blob features to overcome the interference of strong ambient light in shallow water. The LoG filter is a differential blob detector. It is defined as:

∇²G_σ(x, y) = ∂²G_σ(x, y)/∂x² + ∂²G_σ(x, y)/∂y², (1)

where σ is the standard deviation of the Gaussian filter G_σ(x, y), ∇² is the Laplacian operator, and (x, y) are image coordinates. LoG filtering can be implemented by first convolving the image with a Gaussian filter G_σ(x, y) and then applying the Laplacian operator ∇² to the smoothed image, since:

∇²(G_σ(x, y) ∗ f(x, y)) = (∇²G_σ(x, y)) ∗ f(x, y), (2)

where f(x, y) is the pixel value of the image at (x, y), ∗ denotes convolution, and G_σ(x, y) is a Gaussian kernel defined as:

G_σ(x, y) = (1/(2πσ²)) exp(−(x² + y²)/(2σ²)). (3)

The Laplacian operator is very sensitive to noise since it is a second-derivative measurement on the image. To counter this, the Gaussian filter smooths the image and attenuates high-frequency noise before the Laplacian is applied.
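The LoG filtering described above can be sketched directly in code. The 4σ kernel support and the naive sliding-window convolution are implementation choices of this sketch, not of the paper.

```python
import numpy as np

def log_kernel(sigma, half_width=None):
    """Sampled Laplacian-of-Gaussian kernel for a Gaussian of std sigma."""
    if half_width is None:
        half_width = int(np.ceil(4 * sigma))  # cover the kernel support
    ax = np.arange(-half_width, half_width + 1)
    x, y = np.meshgrid(ax, ax)
    r2 = x ** 2 + y ** 2
    g = np.exp(-r2 / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)
    return g * (r2 - 2 * sigma ** 2) / sigma ** 4  # analytic Laplacian of G

def log_filter(img, sigma):
    """Convolve img with the LoG kernel (naive sliding window, for clarity)."""
    k = log_kernel(sigma)
    p = k.shape[0] // 2
    padded = np.pad(img.astype(float), p, mode="edge")
    out = np.empty(img.shape)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

# A bright blob on a dark background gives a strong negative response at
# its center, while flat background regions give a near-zero response.
img = np.zeros((41, 41))
yy, xx = np.mgrid[:41, :41]
img[(yy - 20) ** 2 + (xx - 20) ** 2 <= 3 ** 2] = 255.0  # blob of radius 3
resp = log_filter(img, sigma=2.1)
```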
When convolved with an image, the LoG filter produces a strong response for dark blobs of extent σ on a light background, or for light blobs on a dark background. The "3σ" rule indicates that 99% of the energy of a Gaussian is concentrated within three standard deviations of its mean. Hence, the standard deviation is usually computed as σ = (s − 1)/3 to detect blobs of radius s [22]. Since the radius s of the blobs in our task is in the range of 10-25, we analyzed the performance of our method with σ ∈ {3, 5, 7, 9}. Figure 3 shows the convolution of LoG filters (σ ∈ {3, 5, 7, 9}) with landmarks in one dimension for ease of understanding. The response attains its strongest value at the peak of the original signal and becomes smoother as σ increases.

Blockwise segmentation. Nonuniform spreading is primarily due to the viewpoint of the camera, as illustrated in Figure 2c,d, and poses great challenges to the extraction of landmarks from the filtered image. Landmarks far from the camera appear less bright than landmarks close to the camera, which results in a weak response in the filtered image. To overcome this problem, we propose to segment the filtered image in a blockwise manner. The filtered image is first divided equally into four non-overlapping blocks so that the illumination within each block is approximately uniform. Then, a global threshold method is applied to each block for segmentation. Here, Otsu's method [23] was adopted for blockwise segmentation; it works well because the illumination within each block is nearly uniform.
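The blockwise segmentation step can be sketched as below. The sketch assumes an 8-bit input (the filtered response would first be normalized to this range), which is an assumption of this example rather than a detail of the paper.

```python
import numpy as np

def otsu_threshold(block):
    """Otsu's threshold on an 8-bit block (maximizes between-class variance)."""
    hist = np.bincount(block.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    omega = np.cumsum(p)                  # class-0 probability
    mu = np.cumsum(p * np.arange(256))    # cumulative mean
    mu_t = mu[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1 - omega))
    sigma_b[np.isnan(sigma_b)] = 0        # empty classes contribute nothing
    return int(np.argmax(sigma_b))

def blockwise_segment(img):
    """Split the image into four equal blocks and binarize each with its own
    Otsu threshold, so dim (far-away) landmarks are not suppressed by
    brighter regions elsewhere in the frame."""
    H, W = img.shape
    out = np.zeros_like(img, dtype=bool)
    for i0, i1 in ((0, H // 2), (H // 2, H)):
        for j0, j1 in ((0, W // 2), (W // 2, W)):
            blk = img[i0:i1, j0:j1]
            out[i0:i1, j0:j1] = blk > otsu_threshold(blk)
    return out

# Toy image: a bright landmark in one block, a dimmer one in another.
img = np.zeros((8, 8), dtype=np.uint8)
img[1, 1] = 200
img[5, 5] = 80
mask = blockwise_segment(img)
```

Because each block gets its own threshold, the dim landmark at (5, 5) is still segmented even though it is far darker than the brightest pixel in the frame.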

Pose Estimation
The pose is defined as the relative 3D position, represented by X = (x, y, z), and orientation, represented by the Euler angles (yaw, pitch, roll), between the landmarks and the SAUV, as shown in Figure 4. Pose estimation is the process of recovering this relative pose from the 2D coordinates of the landmarks localized by the LCB method in Section 3.1. In this section, we first introduce the method for recovering the relative pose in the case where all predefined landmarks are observed, namely complete observation. Then, our proposed method is presented for pose estimation in the case where not all predefined landmarks can be observed, namely incomplete observation. For a complete observation of the landmarks, pose estimation can be solved by the perspective-n-point (PnP) algorithm. However, the PnP algorithm cannot handle incomplete observations, which may be caused in practice by broken landmarks or biofouling coverage on landmarks. Pose estimation under incomplete observations is imperative for the long-term use of unmanned systems. Previous works [1,16,21] that used the PnP algorithm required a complete observation of the landmarks and cannot estimate the pose under incomplete observations. To the best of our knowledge, we are the first to solve pose estimation for incomplete observations in points-based underwater recovery algorithms.

Pose Estimation in the Case of Complete Observations
In order to estimate the pose between the landmarks and the SAUV, three coordinate frames are first defined, as illustrated in Figure 4. The first is the camera coordinate frame, a 3D frame whose origin is the optical center of the camera. The second is a reference frame attached to the docking station, whose origin resides at the geometric center of all landmarks. The third is the image coordinate frame, a 2D frame established on the image plane. The relative pose between the landmarks and the SAUV is described by the transformation between the camera and reference coordinate frames and is estimated by determining this transformation, which we detail in the following. Assuming a pinhole camera, the transformation between the image and the camera coordinate frame can be expressed as a set of linear equations:

λ (u, v, 1)^T = Λ (x_c, y_c, z_c, 1)^T, (4)

where λ is a scale factor, (u, v) are the coordinates in the image frame, (x_c, y_c, z_c) are the coordinates in the camera frame, and Λ ∈ R^{3×4} is the intrinsic matrix of the inherent parameters of the camera, mapping 3D points in the camera frame to 2D points in the image. Λ contains five parameters:

Λ = [φ_x γ δ_x 0; 0 φ_y δ_y 0; 0 0 1 0].

(δ_x, δ_y) ∈ R² is the principal point, whose unit is pixels. φ_x ∈ R and φ_y ∈ R are the scaling factors converting space metrics to pixel units, and γ is the skew coefficient. The intrinsic matrix Λ can be calibrated before the camera is used for pose estimation. A matrix called the extrinsic matrix E relates the camera frame and the reference frame. It is defined by:

(x_c, y_c, z_c, 1)^T = E (x_r, y_r, z_r, 1)^T, E = [R T; 0 1], (5)

where E ∈ R^{4×4} is the extrinsic matrix, (x_r, y_r, z_r) are the coordinates in the reference frame, and R and T are the rotation and translation matrix, respectively. The relative 3D position and orientation of the pose can be obtained from T and R, respectively. Combining Equations (4) and (5), the relationship between the image frame and the reference frame is:

λ (u, v, 1)^T = Λ E (x_r, y_r, z_r, 1)^T = Ω (x_r, y_r, z_r, 1)^T, (6)

where Ω = ΛE is the projection matrix. Given a known intrinsic matrix Λ, n 3D points w = (x_r, y_r, z_r) in the reference frame, and their corresponding projections x = (u, v) in the image, the rotation matrix R and translation matrix T can be computed. This problem is known as the PnP problem; a unique solution can be obtained when n ≥ 4. Given n point correspondences, Equation (6) can be solved with the direct linear transformation (DLT) [24], a noniterative method. Noniterative methods are efficient but less accurate and stable in the presence of noise. Iterative methods enhance solution accuracy by minimizing a nonlinear error function, such as the image-space error [25,26] or the object-space error [27], but their computational cost is usually high. RPnP [28] is a noniterative method that is as accurate as iterative algorithms with much less computational time. Hence, RPnP was employed in our work for pose estimation. The main idea of RPnP consists of three steps.
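The projection of Equation (6) can be sketched numerically as follows. The intrinsic values used in the example are arbitrary stand-ins, not the calibration of the camera used in this work.

```python
import numpy as np

def intrinsic_matrix(phi_x, phi_y, delta_x, delta_y, gamma=0.0):
    """Intrinsic matrix Λ with its five parameters (Equation (4))."""
    return np.array([[phi_x, gamma, delta_x, 0.0],
                     [0.0, phi_y, delta_y, 0.0],
                     [0.0, 0.0, 1.0, 0.0]])

def extrinsic_matrix(R, T):
    """Extrinsic matrix E built from rotation R and translation T (Equation (5))."""
    E = np.eye(4)
    E[:3, :3] = R
    E[:3, 3] = T
    return E

def project(Lam, E, pts_r):
    """Map reference-frame points to pixel coordinates via Ω = ΛE (Equation (6))."""
    pts_h = np.hstack([pts_r, np.ones((len(pts_r), 1))])  # homogeneous coords
    proj = (Lam @ E @ pts_h.T).T
    return proj[:, :2] / proj[:, 2:3]                     # divide out the scale λ

# Example: camera 5 m in front of the reference origin, no rotation.
Lam = intrinsic_matrix(800, 800, 320, 240)
E = extrinsic_matrix(np.eye(3), np.array([0.0, 0.0, 5.0]))
uv = project(Lam, E, np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]))
# the reference origin projects to the principal point (320, 240)
```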
Firstly, the n reference points are divided into (n − 2) three-point subsets. According to the law of cosines, each point pair (Q_i, Q_j) in a subset gives the constraint:

d_i² + d_j² − 2 d_i d_j cos θ_ij − d_ij² = 0, (7)

where d_i = ||O_c Q_i||, d_j = ||O_c Q_j||, d_ij = ||Q_i Q_j||, and θ_ij is the view angle, as illustrated in Figure 5. Q_i and Q_j are 3D reference points; q_i and q_j are the projections of Q_i and Q_j on the image plane, respectively. Based on Equation (7), each subset {Q_1, Q_2, Q_k} yields the polynomial system:

d_1² + d_2² − 2 d_1 d_2 cos θ_12 − d_12² = 0,
d_1² + d_k² − 2 d_1 d_k cos θ_1k − d_1k² = 0, (8)
d_2² + d_k² − 2 d_2 d_k cos θ_2k − d_2k² = 0.

Equation (8) is then converted to a fourth-degree polynomial:

g_k(x) = a_k x⁴ + b_k x³ + c_k x² + d_k x + e_k = 0 (9)

for the kth subset. Second, a cost function G = ∑_{k=1}^{n−2} g_k²(x) is defined as the sum of squares of the polynomials in Equation (9). It was demonstrated in [28] that the cost function G has at most four minima.
Finally, the minima of G were explored by finding the roots of its first derivative, G′(x) = ∑_{k=1}^{n−2} 2 g_k(x) g_k′(x) = 0, which is a seventh-degree polynomial. The rotation matrix R and translation matrix T were computed from each local minimum, and the result with the least reprojection residual was selected as the final solution.
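The structure of the cost function G can be checked numerically: each g_k is a quartic, so G is an eighth-degree polynomial, its derivative is a seventh-degree polynomial, and the number of minima is bounded by four. The random coefficients below are arbitrary stand-ins for the subset polynomials, not values from a real pose problem.

```python
import numpy as np

# (n - 2) = 6 stand-in quartics g_k(x) with arbitrary coefficients.
rng = np.random.default_rng(0)
quartics = [np.polynomial.Polynomial(rng.normal(size=5)) for _ in range(6)]

G = sum(g * g for g in quartics)   # cost G(x) = sum_k g_k(x)^2, degree 8
dG = G.deriv()                     # G'(x), degree 7
assert G.degree() == 8 and dG.degree() == 7

roots = dG.roots()
real_roots = roots[np.abs(roots.imag) < 1e-9].real
# candidate minima: real critical points where the second derivative is positive
minima = [x for x in real_roots if G.deriv(2)(x) > 0]
```

Since G′ has degree seven, it has at most seven real roots, and minima and maxima alternate, so G can have at most four minima, matching the bound from [28].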
Rotation matrix R was then further converted to Euler angles represented by (yaw, pitch, roll). Translation matrix T denotes the coordinates of point O r in the camera frame. The Euler angles and translation matrix T were passed to the SAUV controller for navigation.
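The conversion from R to (yaw, pitch, roll) can be sketched as follows. The Z-Y-X rotation order is an assumption made for illustration, since the paper does not specify its Euler-angle convention.

```python
import numpy as np

def euler_to_rotation(yaw, pitch, roll):
    """Compose R = Rz(yaw) @ Ry(pitch) @ Rx(roll) (Z-Y-X convention)."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    return Rz @ Ry @ Rx

def rotation_to_euler(R):
    """Recover (yaw, pitch, roll) from R, away from the pitch = ±90° singularity."""
    pitch = -np.arcsin(np.clip(R[2, 0], -1.0, 1.0))
    roll = np.arctan2(R[2, 1], R[2, 2])
    yaw = np.arctan2(R[1, 0], R[0, 0])
    return yaw, pitch, roll
```

A round trip through both functions recovers the original angles as long as the pitch stays away from ±90°, which holds in a docking approach where the SAUV faces the station.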

Pose Estimation in the Case of Incomplete Observations
Pose estimation can be solved by normal PnP algorithms using 2D-3D point correspondences in the case of complete observations. However, PnP algorithms cannot give correct solutions when some landmarks are unobserved, because it is not known which landmarks are missing. In this subsection, we propose a method that estimates the pose in the case of incomplete observations by minimizing the reprojection error.
To estimate the pose in the case of incomplete observations, the method first identifies which landmarks are missing; the normal PnP algorithm can then compute the pose from the identified correspondences. Each landmark is numbered as its identification, as shown in Figure 6a. The identification of all detected landmarks is denoted by a vector i_m, which equals {1, 2, 3, · · · , m} when all m predefined landmarks are observed. For example, the vector i_8 = {1, 2, 3, · · · , 7} means that eight landmarks in total are mounted on the docking station, but the eighth landmark cannot be observed for some reason. Computing the identification vector i_m is not hard for complete observations. Take the configuration of landmarks in Figure 6b, which consists of eight landmarks, as an example. The upper-right connected component in the binary image obtained by the detection phase corresponds to the first landmark, the upper-left connected component corresponds to the eighth landmark, and the rest follow in the same manner. The 2D-3D point correspondences are obtained in this way in the case of complete observations. However, this method does not work for identifying landmarks in the case of incomplete observations. We propose to identify landmarks in the case of incomplete observations by finding the identification vector i_m that minimizes the reprojection error, that is,

i_m* = arg min_{i_m} ∑_{j=1}^{k} || x_j^{i_m} − π(Ω_{i_m} w_j) ||², (10)

where k (4 ≤ k ≤ m) denotes the number of landmarks detected in the detection phase, w_j are the 3D coordinates of the jth landmark defined in Equation (6), x_j^{i_m} are the 2D coordinates of the jth landmark determined by i_m, π(·) denotes the perspective division by the scale factor λ in Equation (6), and Ω_{i_m} is the projection matrix in Equation (6), computed using w_j and x_j^{i_m}. Specifically, k is computed by analyzing the connected components in the binary image obtained in the detection phase. Then, the vector i_m that minimizes the reprojection error is searched over the space of identifications of the k landmarks.
To search for the optimal i_m, x_j^{i_m} was first computed for each candidate i_m. Then, Ω_{i_m} was computed using the corresponding x_j^{i_m} and w_j. Next, w_j was projected onto the image plane by Ω_{i_m}, and the corresponding reprojection error was computed. Finally, the i_m with the minimal reprojection error was selected as the optimal i_m. After obtaining the optimal i_m, the relative pose can be computed using normal PnP algorithms.
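The search described above can be sketched as follows. This is a simplified illustration: it estimates the projection matrix of each candidate with a linear DLT, which needs k ≥ 6 non-coplanar correspondences, whereas the paper solves the pose with RPnP (k ≥ 4); the landmark layout and camera parameters below are synthetic stand-ins.

```python
import numpy as np
from itertools import combinations

def dlt_projection(pts3d, pts2d):
    """Linear DLT estimate of the 3x4 projection matrix from k >= 6 correspondences."""
    A = []
    for (X, Y, Z), (u, v) in zip(pts3d, pts2d):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    return Vt[-1].reshape(3, 4)  # right singular vector of smallest singular value

def project(Omega, pts3d):
    ph = np.hstack([pts3d, np.ones((len(pts3d), 1))])
    proj = (Omega @ ph.T).T
    return proj[:, :2] / proj[:, 2:3]

def identify_landmarks(world_pts, detected_2d):
    """Search the identification vector minimizing the reprojection error.

    detected_2d must be ordered consistently with increasing landmark id;
    each candidate assigns the detected points to a subset of world ids.
    """
    m, k = len(world_pts), len(detected_2d)
    best_err, best_ids = np.inf, None
    for ids in combinations(range(m), k):          # candidate identifications
        w = world_pts[list(ids)]
        Omega = dlt_projection(w, detected_2d)
        err = np.sum((project(Omega, w) - detected_2d) ** 2)
        if err < best_err:
            best_err, best_ids = err, ids
    return best_ids

# Synthetic scene: 8 non-coplanar landmarks, landmark 5 (0-based) unobserved.
rng = np.random.default_rng(7)
world = rng.uniform(-1.0, 1.0, size=(8, 3))
K = np.array([[800.0, 0, 320], [0, 800, 240], [0, 0, 1]])
Omega_true = K @ np.hstack([np.eye(3), [[0.0], [0.0], [5.0]]])
detected = project(Omega_true, world[[0, 1, 2, 3, 4, 6, 7]])
ids = identify_landmarks(world, detected)
```

The correct candidate reproduces the detected points exactly, so its reprojection error is numerically zero, while wrong identifications generically cannot be fit by any single projection matrix.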

Experiment Results
In this section, we present the experimental results of our proposed vision-based underwater recovery framework for AUVs. We first show that our proposed detection method outperforms the state-of-the-art method in the remote detection of landmarks. Next, the LCB method is analyzed on an underwater-docking dataset. Finally, we jointly validate our proposed detection and pose-estimation methods in field experiments.

Comparison of Detection: Algorithm Performance
A histogram-based adaptive thresholding (HATS) method was proposed for landmark detection in [20]. We compare our proposed LCB method with HATS in this subsection; the comparison results are shown in Figure 7, where the compared images are consecutive frames taken in an AUV recovery task. HATS assumes that landmark images contain a direct-path component and a glow component resulting from the scattering and absorption effects of the water body. The direct-path component carries much larger power than the glow component, so it corresponds to a peak in the tail of the histogram, as shown in Rows 5 and 6 of Figure 7. However, we found that HATS performed well in landmark detection at short distances but failed at remote distances. At short distances, pixels of the direct landmark component dominate the image, and the corresponding peak is obvious in the tail of the histogram (Rows 5 and 6 of Figure 7). At remote distances, this is not the case: the direct component of a landmark may correspond to only a few pixels in the image, resulting in a flat tail in the histogram, as illustrated in Rows 1-4 of Figure 7. In this case, landmarks cannot be detected using HATS. Remote detection, however, is imperative for recovery, since it provides the SAUV with more time and space to adjust its pose for docking, which vastly increases the docking success rate. Figure 7 shows that the LCB algorithm was effective for landmark detection at both short and remote distances, outperforming the state-of-the-art HATS method.
We further compared the LCB and HATS methods on a dataset, D_recovery (publicly available at https://drive.google.com/open?id=1m3OIQmtYUEwRmA-rN_Xg_eBYaUM1rarz), which contains 2306 images of active underwater landmarks. D_recovery was collected in shallow-water field experiments with the landmarks and docking station deployed in [16]. Eight landmarks (m = 8) were mounted on the docking station. The coordinates of each landmark in the 2D image were manually labeled as the ground truth. A detection was regarded as failed if the number of detected landmarks was not equal to the number of predefined landmarks (k ≠ m). If the number of detected landmarks is less than the number of predefined landmarks (4 ≤ k < m), the case can be regarded as an incomplete observation and solved by our proposed pose-estimation method. If k = m, the Euclidean distance between the location detected by the LCB algorithm and the ground truth was computed for each landmark as the location error. The location error was evaluated because it significantly affects the accuracy of the pose-estimation algorithm given the known intrinsic matrix Λ and the n 3D points w = (x_r, y_r, z_r) in the reference frame, as illustrated by Equation (6). The comparison results are shown in Table 1. For the methods in Rows 1 and 2 of Table 1, the threshold was computed using HATS if a peak existed in the tail of the histogram; otherwise, a fixed threshold was used. The first row shows the results when HATS was applied to the whole image; both the failed-detection count and the location error were very poor, primarily due to the ambient light illustrated in the third column of Figure 7. In the second row of Table 1, HATS was applied to the ROI detected in the coarse detection phase; both failed detections and location error improved thanks to coarse detection. The location error was 1.91, which is acceptable, but the number of failed detections remained very high. This was because HATS only works at short distances, as explained at the beginning of this subsection. The third row of Table 1 shows the results of our proposed LCB method, which performed much better than the two previous methods in both the number of failed detections and the location error.

Table 1. Performance comparison of histogram-based adaptive thresholding (HATS) and LCB methods.

Detection-Performance Analysis
The LCB method consists of three phases: (1) coarse detection, (2) LoG filtering, and (3) blockwise segmentation, as described in Section 3.1. In this section, we discuss the contribution of each phase to the detection performance for active underwater landmarks. The analysis was performed on D_recovery, and the results are shown in Table 2. The first row of Table 2 shows the detection performance when Otsu's method is applied globally to the image. Otsu's method failed on all images in D_recovery because it is severely affected by ambient light: it assumes that an image contains two classes of pixels following a bimodal histogram and finds the threshold minimizing the intraclass variance, so it tends to segment the region close to the light source instead of the individual landmarks, as shown in Figure 8b. The method that globally applies LoG and then uses Otsu's method on the filtered image (second row of Table 2) performed better in terms of resistance to ambient light, but poorly under nonuniform spreading, as seen in Figure 8c. Applying Otsu's method blockwise to the LoG-filtered image without coarse detection (Row 3 of Table 2) still separated blocks containing no landmarks into two classes, introducing much blob noise, as shown in Figure 8d. Subsequently, we discuss the performance of the methods with coarse detection (Rows 4-8). Comparing the methods in the second and fourth rows, coarse detection decreased the number of failed detections by 75 images. Comparing the methods in the fourth and fifth rows, adopting blockwise segmentation dramatically dropped the number of failed detections from 322 to 9 images. We compare the performance of methods with various σ in Rows 5-8 of Table 2. The results show that the method with σ = 3 had the worst performance, while the other methods (σ = 5, 7, 9) performed almost the same. The poor performance with σ = 3 is primarily because a small standard deviation makes the LoG filter sensitive to noise.

Pose-Estimation Algorithm Analysis
In this subsection, we analyze the accuracy of the pose-estimation algorithm and its sensitivity to the location error introduced by detection in 2D images. Since obtaining an underwater ground truth for the relative pose between the SAUV and the docking station is very hard, accuracy was evaluated in ground experiments. In the ground experiments, a docking station one meter in diameter was kept stationary on the ground, and a camera captured images of the landmarks mounted on the docking station from various positions and orientations. To estimate the relative pose, the 2D location of each landmark in the images was first annotated. Next, the relative pose was computed with the pose-estimation algorithm to assess accuracy in the absence of location error. Then, two levels of Gaussian noise, with standard deviations σ = 3 and σ = 5, were added to the annotated 2D location of each landmark to evaluate the accuracy of the pose-estimation algorithm under different degrees of location error. Note that the intrinsic parameters of the camera used in the ground experiments differ from those used in water due to the different optical medium. Figure 9 shows the images used for pose estimation, and Table 3 (part of the data in the table was published in our previous work [16]) gives the corresponding results, averaged over 1000 trials for each noise level. Table 3 shows that the mean orientation error was 1.823 degrees and the mean position error was 6.306 mm in the absence of location error, that is, σ = 0. The mean orientation and position errors increased to 2.168 degrees and 7.039 mm, respectively, at noise level σ = 3, and to 2.770 degrees and 9.818 mm, respectively, at σ = 5. These results validate the accuracy of the pose-estimation algorithm and its robustness to noise. Table 3. Ground pose-estimation performance.
Results in the table correspond to the figures given in Figure 9.

Field Experiments
We demonstrate our proposed LCB method and pose-estimation method through the automated recovery of AUVs in field experiments. The field experiments were conducted from an experimental ship, as shown in Figure 10a, anchored in Qiandao Lake, China, using the system described in Section 2. The method in the fifth row of Table 2 was employed for landmark detection. In order to validate our proposed pose-estimation method for both complete and incomplete observations, the field experiments were performed with two landmark configurations. The first was complete observation, that is, all landmarks could be observed (k = m = 8), as shown in Figure 11a. The second simulated incomplete observation (4 ≤ k < m): a landmark was removed in advance, as shown in Figure 11b, without giving the pose-estimation algorithm this prior information. Both the mother vessel (see Figure 10b,d) and the SAUV (see Figure 10b,c) started running at random positions with random headings, separated by about 500 m. The mother vessel ran near the water surface. Both acoustic and optical sensors were combined for recovery in the field experiments: the acoustic sensor was utilized for rough navigation at long distance, and once the SAUV received the navigation information provided by our proposed vision-based underwater recovery algorithm, it switched to using it for navigation. Two runs were performed for each landmark configuration, and the SAUV was recovered successfully by the mother vessel in all of them. Figure 12 shows the trajectories of all four runs, and Figures 13-16 give images of the recovery process. Runs 1 and 2 demonstrate the LCB algorithm and pose-estimation method under complete observations; Runs 3 and 4 validate them for incomplete observations. The average running time of the proposed framework was 0.17 s per frame.

Conclusions
A vision-based framework was proposed in this work for recovering an AUV with another AUV in shallow water. The framework consisted of a detection phase and a pose-estimation phase. In order to overcome challenges resulting from ambient light and nonuniform spreading in shallow water, an LCB algorithm was developed for the detection of underwater landmarks. Experiments showed that the LCB algorithm outperformed the state-of-the-art method in terms of location error. For the pose-estimation phase, a novel pose-estimation method was proposed for the incomplete observation of all underwater landmarks, which is practical, but has been neglected by most previous works. Pose estimation for incomplete observations enables the recovery of AUVs, even if some landmarks are broken or occluded by biofouling, which is very likely in long-duration missions.
In ground experiments, we observed that the mean orientation and position errors were 1.823 degrees and 6.306 mm, respectively, in the absence of noise, and 2.770 degrees and 9.818 mm, respectively, in the presence of strong noise. Field experiments were performed to recover a sub-AUV by a mother vessel in a lake using our proposed framework. The field experiments validated that our developed framework is effective for recovering AUVs in shallow water.