A Fast Binocular Localisation Method for AUV Docking

Docking technology plays a critical role in realising the long-time operation of autonomous underwater vehicles (AUVs). In this study, a binocular localisation method for AUV docking is presented. An adaptively weighted OTSU method is developed for feature extraction. The foreground object is extracted precisely without mixing or missing lamps, which is independent of the position of the AUV relative to the station. Moreover, this extraction process is more precise compared to other segmentation methods with a low computational load. The mass centre of each lamp on the binary image is used as matching feature for binocular vision. Using this fast feature matching method, the operation frequency of the binocular localisation method exceeds 10 Hz. Meanwhile, a relative pose estimation method is suggested for instances when the two cameras cannot capture all the lamps. The localisation accuracy of the distance in the heading direction as measured by the proposed binocular vision algorithm was tested at fixed points underwater. A simulation experiment using a ship model has been conducted in a laboratory pool to evaluate the feasibility of the algorithm. The test result demonstrates that the average localisation error is approximately 5 cm and the average relative location error is approximately 2% in the range of 3.6 m. As such, the ship model was successfully guided to the docking station for different lateral deviations.


Introduction
In recent decades, autonomous underwater vehicles (AUVs) have played an increasingly crucial role in marine exploration and development, such as resource detection, military technologies, and underwater structure inspections [1]. To facilitate the flexibility of their movement, the power capacity of AUVs is limited by their compact outfit. In addition, upon completion of undersea assignments, AUVs have to return to the bank to obtain a new mission. Docking technology makes it possible for the vehicles to upload data, download new assignments and recharge their batteries underwater [2,3]. This greatly extends the duration of their operation. To achieve this expected manipulation, AUVs must be guided to the docking station during the first step. The guiding process is divided into two stages: the homing stage and the docking stage [4]. AUVs are guided to the neighbourhood of the docking station during the homing stage and then enter into it during the docking stage. Compared with the homing stage, the docking stage requires higher localisation precision and directly determines whether AUVs can get back to the docking station successfully or not. As such, the docking technology for the docking stage is of vital importance.
In previous works, several types of docking technologies based on the use of different sensors have been introduced. described specifically in Section 3. By choosing the mass centres of the extracted lamp objects as the matching feature, the six-dimensional pose of the AUV can be computed with a binocular localisation method at an operation frequency greater than 10 Hz. To address situations in which two cameras cannot capture all the lamps, an efficacious estimation method (D. Li et al. [11]) is applied to obtain the relative two-dimensional position to control the heading direction only.
In the remaining part of this article, Section 2 presents the configuration for the proposed vision guidance system. Section 3 describes the realisation of the proposed system, which includes the processing of the raw images, the binocular vision algorithm, and the control strategy for the ship model. Section 4 presents the experiment results. The final section is a summary of the main conclusions of this study.

System Configuration
The platform applied in this article was a ship model. As shown in Figure 1, it contains five main components: two monochromatic Complementary Metal Oxide Semiconductor (CMOS) cameras, two rudders, one propeller, one control unit and one navigation computer. Table 1 presents the details of the equipment specification of the proposed system. We chose a monochromatic camera due to its higher sensitivity to light compared to colour cameras, whose colour filters result in a loss of more than half of the incident light energy [15]. The field angle of the camera is 60 degrees. The CMOS cameras are connected to the navigation computer by Universal Serial Bus (USB) port. The navigation computer is embedded in a Linux operating system and is utilised in processing the images captured by the cameras and then transmitting the visual computation results to the control unit by ethernet. The control unit implements the control strategy (described in Section 3.3) to control the direction and speed of the ship model. To be more specific, it can transmit Pulse Width Modulation (PWM) instructions to the rudders and propeller through General-Purpose Input/Output (GPIO) ports. method at an operation frequency greater than 10 Hz. To address situations in which two cameras cannot capture all the lamps, an efficacious estimation method (D. Li et al. [11]) is applied to obtain the relative two-dimensional position to control the heading direction only.
In the remaining part of this article, Section 2 presents the configuration for the proposed vision guidance system. Section 3 describes the realisation of the proposed system, which includes the processing of the raw images, the binocular vision algorithm, and the control strategy for the ship model. Section 4 presents the experiment results. The final section is a summary of the main conclusions of this study.

System Configuration
The platform applied in this article was a ship model. As shown in Figure 1, it contains five main components: two monochromatic Complementary Metal Oxide Semiconductor (CMOS) cameras, two rudders, one propeller, one control unit and one navigation computer. Table 1 presents the details of the equipment specification of the proposed system. We chose a monochromatic camera due to its higher sensitivity to light compared to colour cameras, whose colour filters result in a loss of more than half of the incident light energy [15]. The field angle of the camera is 60 degrees. The CMOS cameras are connected to the navigation computer by Universal Serial Bus (USB) port. The navigation computer is embedded in a Linux operating system and is utilised in processing the images captured by the cameras and then transmitting the visual computation results to the control unit by ethernet. The control unit implements the control strategy (described in Section 3.3) to control the direction and speed of the ship model. To be more specific, it can transmit Pulse Width Modulation (PWM) instructions to the rudders and propeller through General-Purpose Input/Output (GPIO) ports.
Most of the vision guidance systems mount the navigation lamps at the entrance of the docking station as active beacons. In this paper, we also adopt this method. However, instead of a real docking station, we used an aluminium profile model with lamps equipped on it, which operated similarly to the docking station in terms of the terminal docking process. Three common underwater green lamps were symmetrically positioned on the aluminium profile model around the centre of the three lamps, namely, the centre of the docking station ( Figure 2). Green lamps were chosen because of the wavelengths at which it is relatively difficult for seawater to either absorb or scatter the light [16].   Most of the vision guidance systems mount the navigation lamps at the entrance of the docking station as active beacons. In this paper, we also adopt this method. However, instead of a real docking station, we used an aluminium profile model with lamps equipped on it, which operated similarly to the docking station in terms of the terminal docking process. Three common underwater green lamps were symmetrically positioned on the aluminium profile model around the centre of the three lamps, namely, the centre of the docking station ( Figure 2). Green lamps were chosen because of the wavelengths at which it is relatively difficult for seawater to either absorb or scatter the light [16].

Vision Guidance Algorithm
As previously indicated, the vision guidance system primarily contains two cameras installed below the ship model and three active lamps equipped on the simple docking station. This article presents a fast binocular localisation algorithm to determine the relative position and attitude between the AUV and the docking station. A relative localisation algorithm is applied when two cameras cannot capture all three lamps.
When all three lamps are captured by both cameras, the fast binocular localisation algorithm is able to compute the six-dimensional pose (including three-dimensional position and threedimensional attitude) of the AUV via a mass-centre matching method. Moreover, a relative localisation method proposed by D. Li et al. [11] is applied when the two cameras cannot capture all the lamps. The combination of these algorithms can improve the reliability of the system without a significant increase in the computational burden. The flow chart of the visual localisation algorithm is shown in Figure 3.

Vision Guidance Algorithm
As previously indicated, the vision guidance system primarily contains two cameras installed below the ship model and three active lamps equipped on the simple docking station. This article presents a fast binocular localisation algorithm to determine the relative position and attitude between the AUV and the docking station. A relative localisation algorithm is applied when two cameras cannot capture all three lamps.
When all three lamps are captured by both cameras, the fast binocular localisation algorithm is able to compute the six-dimensional pose (including three-dimensional position and three-dimensional attitude) of the AUV via a mass-centre matching method. Moreover, a relative localisation method proposed by D. Li et al. [11] is applied when the two cameras cannot capture all the lamps. The combination of these algorithms can improve the reliability of the system without a significant increase in the computational burden. The flow chart of the visual localisation algorithm is shown in Figure 3.

Vision Guidance Algorithm
As previously indicated, the vision guidance system primarily contains two cameras installed below the ship model and three active lamps equipped on the simple docking station. This article presents a fast binocular localisation algorithm to determine the relative position and attitude between the AUV and the docking station. A relative localisation algorithm is applied when two cameras cannot capture all three lamps.
When all three lamps are captured by both cameras, the fast binocular localisation algorithm is able to compute the six-dimensional pose (including three-dimensional position and threedimensional attitude) of the AUV via a mass-centre matching method. Moreover, a relative localisation method proposed by D. Li et al. [11] is applied when the two cameras cannot capture all the lamps. The combination of these algorithms can improve the reliability of the system without a significant increase in the computational burden. The flow chart of the visual localisation algorithm is shown in Figure 3.

Image Processing
Light is attenuated and scattered during underwater propagation, and hence, images captured in this environment may contain noise and artefacts. In this section, we utilise a median filter to remove image noise and propose a weighted OTSU method to binarise the image and to extract the mass centres of the captured lamps in the binary image.

Image Filtering
An underwater lamp consists of several small LEDs. As such, the captured image of an individual lamp may be identified as several lamps instead of one. A median filter with a 5 × 5 pixel mask was thus adopted to smooth the image and to remove salt-and-pepper noise while still retaining the edge information of the object. Figure 4 shows the raw image and the image acquired after processing using the median filter.

Image Processing
Light is attenuated and scattered during underwater propagation, and hence, images captured in this environment may contain noise and artefacts. In this section, we utilise a median filter to remove image noise and propose a weighted OTSU method to binarise the image and to extract the mass centres of the captured lamps in the binary image.

Image Filtering
An underwater lamp consists of several small LEDs. As such, the captured image of an individual lamp may be identified as several lamps instead of one. A median filter with a 5 × 5 pixel mask was thus adopted to smooth the image and to remove salt-and-pepper noise while still retaining the edge information of the object. Figure 4 shows the raw image and the image acquired after processing using the median filter.

Adaptively Weighted OTSU Method
Before feature matching, the object needs to be extracted from the processed image. It is feasible to segment the foreground object and the background scene through thresholding. In this article, we propose an adaptively weighted OTSU method to obtain the global optimal threshold for the captured image. The traditional OTSU method defines the segmentation threshold as the solution that maximises the between-class variance, which is established as follows: where and denote the probability that a pixel is divided into the object and the background under the global threshold T, respectively. The variables and denote the average grey value of the object zone and the background zone and is the average greyscale value of the entire image. The number of pixels divided into the object and the background is denoted as and , respectively. Assuming m and n are the width and height of the image in pixel units, respectively, it can be induced from the definition that = /( × ) and = /( × ) as well as + = × . Therefore, it is obvious to induce that + = and + = 1.
After applying the traditional OTSU method, the image shown in Figure 4b becomes the binary image as shown in Figure 5a. The result appears to be non-ideal. The reason for this is that most regions of the image captured underwater are background scene. In other words, the proportion of the pixels with a small grey value (i.e., the background) is much larger than the proportion of the pixels with a large grey value (i.e., the object) in the captured image. Therefore, in order to maximise the between-class variance, the optimal global threshold T calculated by the traditional OTSU method is more likely to be smaller, that is, biases towards the grey value of the background. In this way, some background pixels are segmented as the foreground object, so that the segmented object area is much larger than the real object area and several lamps are mixed together as shown in Figure 5a. This method works well in the study by D. Li et al. [11], because only one light is used for guidance,

Adaptively Weighted OTSU Method
Before feature matching, the object needs to be extracted from the processed image. It is feasible to segment the foreground object and the background scene through thresholding. In this article, we propose an adaptively weighted OTSU method to obtain the global optimal threshold for the captured image. The traditional OTSU method defines the segmentation threshold as the solution that maximises the between-class variance, which is established as follows: where ω 0 and ω 1 denote the probability that a pixel is divided into the object and the background under the global threshold T, respectively. The variables µ 0 and µ 1 denote the average grey value of the object zone and the background zone and µ T is the average greyscale value of the entire image. The number of pixels divided into the object and the background is denoted as N 0 and N 1 , respectively. Assuming m and n are the width and height of the image in pixel units, respectively, it can be induced from the definition that ω 0 = N 0 /(m × n) and ω 1 = N 1 /(m × n) as well as N 0 + N 1 = m × n. Therefore, it is obvious to induce that ω 0 µ 0 + ω 1 µ 1 = µ T and ω 0 + ω 1 = 1. After applying the traditional OTSU method, the image shown in Figure 4b becomes the binary image as shown in Figure 5a. The result appears to be non-ideal. The reason for this is that most regions of the image captured underwater are background scene. In other words, the proportion of the pixels with a small grey value (i.e., the background) is much larger than the proportion of the pixels with a large grey value (i.e., the object) in the captured image. Therefore, in order to maximise the between-class variance, the optimal global threshold T calculated by the traditional OTSU method is more likely to be smaller, that is, biases towards the grey value of the background. In this way, some background pixels are segmented as the foreground object, so that the segmented object area is much larger than the real object area and several lamps are mixed together as shown in Figure 5a. This method works well in the study by D. Li et al. [11], because only one light is used for guidance, thereby avoiding the mixing problem. However, in this work, each lamp needs to be extracted individually for the subsequent feature matching process. output threshold will extremely bias towards the grey value of the objective region. Therefore, the lamps with relatively low grey values cannot be extracted. Taking this condition into consideration, we would like to let K be adaptive according to the number of lamps in the segmented image. The procedures involved in the implementation of the adaptively weighted OTSU method is illustrated in Figure 6. The final weight coefficient K * is selected by maximising the number of lamps in the segmented image N:

{N}
(3) In this method, the extracted lamps are neither mixed nor missed, irrespective of the distance of the AUV from the station.

Image Segmentation
As mentioned in the previous section, threshold segmentation methods (TSMs) are usually used to segment the image into two classes: the objective zone and the background zone. When the boundary threshold T is given, the segmented image can be obtained as follows: To fix the above problem, a weight coefficient is added to increase the proportion of the object in the between-class variance formula. The modified between-class variance is established as follows: where K is the added weight coefficient. Assuming ω 0 and ω 1 are constant coefficients between 0 and 1, the smaller the value of K is, the larger the value of ω K 0 and the smaller the value of ω 2−K 1 will be. This means that the proportion of the object in the between-class variance will be greater and the extracted objective region will be smaller. Then, in order to maximise the between-class variance, the output threshold of this modified method will bias toward the grey value of the objective region, i.e., the output threshold of this method will be higher than that of the original method. This conclusion is proven in Figure 5.
The traditional OTSU algorithm is a particular case of the weighted OTSU algorithm when K is equal to 1, as shown in Figure 5a. Figure 5b-k shows the binary images obtained through the weighted OTSU method with different weight coefficients. It can be observed that the smaller the weight coefficient K is, the larger the output global threshold T will be. When K is too small, the output threshold will extremely bias towards the grey value of the objective region. Therefore, the lamps with relatively low grey values cannot be extracted. Taking this condition into consideration, we would like to let K be adaptive according to the number of lamps in the segmented image. The procedures involved in the implementation of the adaptively weighted OTSU method is illustrated in Figure 6. The final weight coefficient K* is selected by maximising the number of lamps in the segmented image N: In this method, the extracted lamps are neither mixed nor missed, irrespective of the distance of the AUV from the station.

Image Segmentation
As mentioned in the previous section, threshold segmentation methods (TSMs) are usually used to segment the image into two classes: the objective zone and the background zone. When the boundary threshold T is given, the segmented image can be obtained as follows: where f(i,j) and g(i,j) are the grey values of pixel point in the location (i,j) on the processed image and on the binary image, respectively. The pixel point with a grey value greater than or equal to T(i,j) becomes the bright group (grey value equal to 255) while the pixel point with a grey value smaller than T(i,j) becomes the dark group (grey value equal to 0). As such, the greyscale image is divided into the objective zone and the background zone.
Depending on whether the threshold T varies with the location of the pixel point, TSMs can be divided into local TSMs and global TSMs. The threshold of the former method is determined adaptively by the neighbourhood window that is centred on each pixel [17]. In the latter method, the determination of the global threshold is of the most importance. Several studies on vision guidance technology have adopted diverse methods for determining the global threshold. Park et al. [12]

Image Segmentation
As mentioned in the previous section, threshold segmentation methods (TSMs) are usually used to segment the image into two classes: the objective zone and the background zone. When the boundary threshold T is given, the segmented image can be obtained as follows: where f(i,j) and g(i,j) are the grey values of pixel point in the location (i,j) on the processed image and on the binary image, respectively. The pixel point with a grey value greater than or equal to T(i,j) becomes the bright group (grey value equal to 255) while the pixel point with a grey value smaller than T(i,j) becomes the dark group (grey value equal to 0). As such, the greyscale image is divided into the objective zone and the background zone.
Depending on whether the threshold T varies with the location of the pixel point, TSMs can be divided into local TSMs and global TSMs. The threshold of the former method is determined adaptively by the neighbourhood window that is centred on each pixel [17]. In the latter method, the determination of the global threshold is of the most importance. Several studies on vision guidance technology have adopted diverse methods for determining the global threshold. Park et al. [12] adopted a pre-specified threshold value to segment the greyscale image. D. Li et al. [11] applied a traditional OTSU method to obtain the global optimal threshold, which has been commonly used for image segmentation due to its excellent performance. Y. Li et al. [13] used a Mean-Shift algorithm to extract the light source zone. This algorithm is an iterative optimisation approach, which is computationally intensive. The computational burdens of the four aforementioned segmentation methods and the proposed adaptively weighted OTSU approach are illustrated in Table 2. It is evident that the Mean-Shift algorithm takes more than ten times as long as the TSMs and is unsuitable for the real-time processing of images. Table 2. Computational costs of different segmentation method.

Segmentation Method Computational Time/ms
Adaptive local TSM [17] 0.51777 Pre-specified TSM [12] 0.01926 Traditional OTSU TSM [11] 0.63905 Mean-Shift algorithm [13] 183.239 Adaptively weighted OTSU TSM 17.641 The threshold segmentation results of the four TSMs for the image shown in Figure 4b are illustrated in Figure 7, and the threshold segmentation results for the image captured near the station are shown in Figure 8. It can be concluded that the adaptive local TSM is unsuitable for underwater lamp images due to its high sensitivity to the noise in the image. The pre-specified TSM is the fastest algorithm, but it is inflexible, owing to the brightness of the captured lamps varying with the distance between the AUV and the docking station. For example, when setting the pre-defined threshold as 125, this method appears to be suitable for images captured far from the station (Figure 4b), but it fails to precisely extract the object when the image is captured near the station (Figure 8c). With the traditional OTSU TSM, it is likely to mix up the lamps when the AUV is far from the station, in which cases most regions in the underwater captured images are background (Figure 7c). Moreover, the adaptively weighted OTSU TSM can extract the object precisely without incurring a heavy computational burden, irrespective of the distance between the AUV and the station. Therefore, it outperforms the other four methods. adopted a pre-specified threshold value to segment the greyscale image. D. Li et al. [11] applied a traditional OTSU method to obtain the global optimal threshold, which has been commonly used for image segmentation due to its excellent performance. Y. Li et al. [13] used a Mean-Shift algorithm to extract the light source zone. This algorithm is an iterative optimisation approach, which is computationally intensive. The computational burdens of the four aforementioned segmentation methods and the proposed adaptively weighted OTSU approach are illustrated in Table 2. It is evident that the Mean-Shift algorithm takes more than ten times as long as the TSMs and is unsuitable for the real-time processing of images.

Segmentation Method
Computational Time/ms Adaptive local TSM [17] 0.51777 Pre-specified TSM [12] 0.01926 Traditional OTSU TSM [11] 0.63905 Mean-Shift algorithm [13] 183.239 Adaptively weighted OTSU TSM 17.641 The threshold segmentation results of the four TSMs for the image shown in Figure 4b are illustrated in Figure 7, and the threshold segmentation results for the image captured near the station are shown in Figure 8. It can be concluded that the adaptive local TSM is unsuitable for underwater lamp images due to its high sensitivity to the noise in the image. The pre-specified TSM is the fastest algorithm, but it is inflexible, owing to the brightness of the captured lamps varying with the distance between the AUV and the docking station. For example, when setting the pre-defined threshold as 125, this method appears to be suitable for images captured far from the station (Figure 4b), but it fails to precisely extract the object when the image is captured near the station (Figure 8c). With the traditional OTSU TSM, it is likely to mix up the lamps when the AUV is far from the station, in which cases most regions in the underwater captured images are background (Figure 7c). Moreover, the adaptively weighted OTSU TSM can extract the object precisely without incurring a heavy computational burden, irrespective of the distance between the AUV and the station. Therefore, it outperforms the other four methods.

Feature Extraction
In this study, we chose the mass centres of the captured lamps as the matching feature of the binocular algorithm. We extracted all the contours of the lamps from the binary image as shown in Figure 9b. Then the coordinates of pixels on each contour were averaged to obtain the coordinate of the mass centre of each contour, i.e., the mass centre of each lamp (Figure 9c). The origin of the pixel coordinate frame is at the top left corner of the image while the positive directions of the x and y axis are respectively point to the left and upper side of the image as shown in Figure 9d.
The mass centres were firstly sorted in terms of their vertical coordinates. To remove the reflected lamps located at the top of the image, we only kept three mass centres that had larger vertical coordinates (Figure 9e). Then we sorted the mass centres of three lamps in terms of their horizontal coordinates and marked the three lamps in order. In this way, we can identify the lamps. When both cameras capture all three lamps, each lamp captured by the left and right camera can be matched in order.
When the roll angle of the AUV exceeds 60°, this sorting method fails. However, this situation rarely occurs. To ensure the normal functional operation of the AUV, IMU (Inertial Measurement Unit), compass, or other sensors will be applied to maintain the variation of the roll angle on a small fluctuation [18]. This is not discussed herein.

Feature Extraction
In this study, we chose the mass centres of the captured lamps as the matching feature of the binocular algorithm. We extracted all the contours of the lamps from the binary image as shown in Figure 9b. Then the coordinates of pixels on each contour were averaged to obtain the coordinate of the mass centre of each contour, i.e., the mass centre of each lamp (Figure 9c). The origin of the pixel coordinate frame is at the top left corner of the image while the positive directions of the x and y axis are respectively point to the left and upper side of the image as shown in Figure 9d.
The mass centres were firstly sorted in terms of their vertical coordinates. To remove the reflected lamps located at the top of the image, we only kept three mass centres that had larger vertical coordinates (Figure 9e). Then we sorted the mass centres of three lamps in terms of their horizontal coordinates and marked the three lamps in order. In this way, we can identify the lamps. When both cameras capture all three lamps, each lamp captured by the left and right camera can be matched in order.
When the roll angle of the AUV exceeds 60 • , this sorting method fails. However, this situation rarely occurs. To ensure the normal functional operation of the AUV, IMU (Inertial Measurement Unit), compass, or other sensors will be applied to maintain the variation of the roll angle on a small fluctuation [18]. This is not discussed herein.

Binocular Vision Algorithm
In this study, we apply a parallel binocular vision algorithm to realise the localisation. The position and attitude of the AUV relative to the docking station can be obtained using three co-planar lamps and two parallel cameras.
Firstly, we adopt the method proposed by Zhang Z. et al. [19] to calibrate the cameras. Using the calibrated internal coefficients and the external coefficients of two cameras, we can rectify the captured images and ensure that the image planes of both cameras are ideally co-planar ( Figure 10).

Binocular Vision Algorithm
In this study, we apply a parallel binocular vision algorithm to realise the localisation. The position and attitude of the AUV relative to the docking station can be obtained using three co-planar lamps and two parallel cameras.
Firstly, we adopt the method proposed by Zhang Z. et al. [19] to calibrate the cameras. Using the calibrated internal coefficients and the external coefficients of two cameras, we can rectify the captured images and ensure that the image planes of both cameras are ideally co-planar ( Figure 10). Then, according to the similar triangle theorem, the position of a matching point in the left camera coordinate O l -X l Y l Z l can be computed as: where f is the focal length of the camera, B is the baseline length of the two cameras, x l and x r are the horizontal positions of a point in the image coordinate of the left and right camera, respectively, while y l and y r are the vertical positions. c x and c y are, respectively, the horizontal and vertical position of the optical centre in the image plane. The coordinate frames of the vision guidance system in this study are shown in Figure 11. In the AUV body frame, the position of the matching point can be obtained as: Then, according to the similar triangle theorem, the position of a matching point in the left camera coordinate O -X Y Z can be computed as: where f is the focal length of the camera, B is the baseline length of the two cameras, x and x are the horizontal positions of a point in the image coordinate of the left and right camera, respectively, while y and y are the vertical positions. and are, respectively, the horizontal and vertical position of the optical centre in the image plane.
The coordinate frames of the vision guidance system in this study are shown in Figure 11. In the AUV body frame, the position of the matching point can be obtained as:  The dock coordinate refers to the earth or the inertial frame. The position of the matching points in the AUV coordinate P A can be computed from Equations (8)-(10), while the positions of the three lamps in the dock coordinate P D are determined. Therefore, using the three matching points, the relative translation and rotation of the two coordinates, namely the position and attitude of the AUV relative to the dock, can be computed by the transverse formula: where T is the translation vector and R is the rotation matrix. Given that the average position of the three lamps is the origin of the dock coordinate, T can be computed as the average position of the three matching points in the AUV coordinate as follows: Then, R can be computed from Equation (11) via the positions of the three matching points in the AUV and dock coordinate. Assuming that ψ, θ and ϕ are the rotation angles of the AUV coordinate relative to the dock coordinate in the X, Y, and Z direction, R can be described as: cos θ cos ϕ sin ψ sin θ cos ϕ − cos ψ cos ϕ cos ψ sin θ cos ϕ + sin ψ sin ϕ cos θ sin ϕ sin ψ sin θ sin ϕ + cos ψ cos ϕ cos ψ sin θ cos ϕ − sin ψ cos ϕ − sin θ sin ψ cos θ cos ψ cos θ    Assuming R ij is the element of matrix R in the ith row and the jth column, the attitude of AUV can be obtained from Equation (13): Hence, the three-dimensional position and three-dimensional attitude of AUV relative to the docking station are deduced. The matching features of the images captured by the two cameras are discussed in Section 3.1 and are shown in Figure 12. With this feature matching method, the operation frequency of the binocular ranging algorithm can exceed 10 Hz and the entire visual process can reach 5 Hz, which satisfies the control requirement.
where T is the translation vector and R is the rotation matrix. Given that the average position of the three lamps is the origin of the dock coordinate, T can be computed as the average position of the three matching points in the AUV coordinate as follows: Then, R can be computed from Equation (11) via the positions of the three matching points in the AUV and dock coordinate. Assuming that ψ, θ and φ are the rotation angles of the AUV coordinate relative to the dock coordinate in the X, Y, and Z direction, R can be described as: R = cos θ cos φ sin ψ sin θ cos φ − cos ψ cos φ cos ψ sin θ cos φ + sin ψ sin φ cos θ sin φ sin ψ sin θ sin φ + cos ψ cosφ cos ψ sin θ cos φ − sin ψ cos φ − sin θ sin ψ cos θ cos ψ cos θ Assuming R is the element of matrix R in the ith row and the jth column, the attitude of AUV can be obtained from Equation (13): Hence, the three-dimensional position and three-dimensional attitude of AUV relative to the docking station are deduced. The matching features of the images captured by the two cameras are discussed in Section 3.1 and are shown in Figure 12. With this feature matching method, the operation frequency of the binocular ranging algorithm can exceed 10 Hz and the entire visual process can reach 5 Hz, which satisfies the control requirement. When both cameras cannot capture all the lamps, we can only calculate the relative distance at the X and Y dimension between the docking station and the AUV. The average coordinates of the mass centres of the captured lamps in the left and right image plane can be computed from the processed image, denoted as (x , y ) and (x , y ), respectively, in pixel units. Then the position of the docking station relative to the AUV (x , y ) in pixel units can be computed as follows: When both cameras cannot capture all the lamps, we can only calculate the relative distance at the X and Y dimension between the docking station and the AUV. The average coordinates of the mass centres of the captured lamps in the left and right image plane can be computed from the processed image, denoted as (x l , y l ) and (x r , y r ), respectively, in pixel units. Then the position of the docking station relative to the AUV (x, y) in pixel units can be computed as follows:

Control Strategy
The motion control system for AUV consists of two independent parts, namely, the control strategies for the horizontal and vertical plane [20]. The control schemes for motion in two planes are similar [11]. To avoid redundancy, we only focus on the tracking control strategy on one plane and the other plane can be derived in a similar manner. Furthermore, during the docking stage in a real scenario, the AUV tends to sail at a setting depth and to move on the horizontal two-dimensional plane.
Hence, in this study, we only discuss the two-dimensional motion control on the horizontal plane and carry out the experiment in the same way. The classical Proportion Integration Differentiation (PID) control strategy is used to control the yaw angle of the ship model on the horizontal plane.
When both cameras are able to capture all three lamps, we can obtain the three-dimensional position and attitude of the AUV relative to the docking station from the binocular vision algorithm. From the obtained three-dimensional position, the horizontal yaw angle of the ship model relative to the docking station can be obtained as follows: where X A and Z A denote the position of the docking station in the X and Z direction of the AUV coordinate, respectively. When the two cameras cannot capture all three lamps, we can only obtain the relative position of the docking station relative to the ship model in pixel units (x, y). In such cases, Equation (17) is not applicable. Assuming that the field angle of the camera is α, we can obtain the following equation via the geometric relationship and the camera model shown in Figure 13: where k is the proportional coefficient.
The motion control system for AUV consists of two independent parts, namely, the control strategies for the horizontal and vertical plane [20]. The control schemes for motion in two planes are similar [11]. To avoid redundancy, we only focus on the tracking control strategy on one plane and the other plane can be derived in a similar manner. Furthermore, during the docking stage in a real scenario, the AUV tends to sail at a setting depth and to move on the horizontal two-dimensional plane. Hence, in this study, we only discuss the two-dimensional motion control on the horizontal plane and carry out the experiment in the same way. The classical Proportion Integration Differentiation (PID) control strategy is used to control the yaw angle of the ship model on the horizontal plane.
When both cameras are able to capture all three lamps, we can obtain the three-dimensional position and attitude of the AUV relative to the docking station from the binocular vision algorithm. From the obtained three-dimensional position, the horizontal yaw angle of the ship model relative to the docking station can be obtained as follows: where X and Z denote the position of the docking station in the X and Z direction of the AUV coordinate, respectively. When the two cameras cannot capture all three lamps, we can only obtain the relative position of the docking station relative to the ship model in pixel units (x , y ). In such cases, Equation (17) is not applicable. Assuming that the field angle of the camera is α, we can obtain the following equation via the geometric relationship and the camera model shown in Figure 13: where k is the proportional coefficient. Therefore, the horizontal yaw angle of the ship model can be derived as: The PID control scheme is illustrated in Figure 14. The position of the docking station in the AUV coordinate P can be obtained by the proposed vision localisation algorithm, while P denotes the origin of the AUV coordinate. The deviation between P and P , denoted as ∆P, is equal Therefore, the horizontal yaw angle of the ship model can be derived as: The PID control scheme is illustrated in Figure 14. The position of the docking station in the AUV coordinate P A can be obtained by the proposed vision localisation algorithm, while P O denotes the origin of the AUV coordinate. The deviation between P A and P O , denoted as ∆P, is equal to P A . Thereby, the yaw angle θ can be calculated by Equations (17) and (20). The compass can measure the angle of the centre line of the AUV deviated from the centre line of the docking station θ dock . The deviation between θ and θ dock , denoted as ∆θ, serves as the input of the controller, the coefficients of which are determined by trial and error.
where v is the minimum velocity of the AUV and K +v is the maximum velocity. K is set as v so that the maximum velocity is twice the minimum velocity.
In conditions where the two cameras cannot capture all three lamps, the AUV is either within short range of the docking station or has a large deviation from the docking station. In this case, it is desirable that the AUV moves slowly, and hence we let Z be equal to 0 m. The control strategy scheme on the horizontal plane is shown in Figure 14.

Pool Experiment
The aim of this research is to develop a method for a platform-based docking station such that AUVs only need to be laid on the platform without the requirement of extremely high precision. In this case, only the three-dimensional position is required to control the AUV to complete its docking process. We initially tested the localisation accuracy of the distance in the heading direction, namely the localisation accuracy of the proposed binocular vision system. A verification experiment was then performed in the laboratory pool to validate the feasibility of the entire system.

Experiment Platform
The experiment aims to verify the feasibility of the presented localisation method in which the impact of the experimental object size turns out to be small. The ship model used in this experiment is shown in Figure 15. The binocular cameras are installed about twenty centimetres below the ship model, while the navigation computer and control unit are contained in the inner cavity. The motor and the rudders are located at its tail.
The experiment was conducted in an indoor laboratory pool with dimensions of 4.2 × 2.4 × 1.2 m , as shown in Figure 16. A simple docking station with three navigation lamps was positioned on one side of the water pool. Throughout the entire experiment, the ship model moves automatically, Moreover, it is possible for AUV to obtain its six-dimensional pose relative to the docking station when the binocular vision algorithm works. Therefore, except for the heading direction, the AUV can control its speed as well. When the target position in Z direction P A (z) is larger than the reference Z-distance Z ref , which is set as 5 m, it is suggested that the AUV moves at a relatively high velocity. When Z A is smaller than Z ref , it is suggested that the AUV moves at a relatively low velocity to avoid strong collision with the docking station. The relationship between the deviation ∆Z and the reference velocity v ref is established via the sigmoid function: where v 0 is the minimum velocity of the AUV and K v + v 0 is the maximum velocity. K v is set as v 0 so that the maximum velocity is twice the minimum velocity. In conditions where the two cameras cannot capture all three lamps, the AUV is either within short range of the docking station or has a large deviation from the docking station. In this case, it is desirable that the AUV moves slowly, and hence we let Z A be equal to 0 m. The control strategy scheme on the horizontal plane is shown in Figure 14.

Pool Experiment
The aim of this research is to develop a method for a platform-based docking station such that AUVs only need to be laid on the platform without the requirement of extremely high precision. In this case, only the three-dimensional position is required to control the AUV to complete its docking process. We initially tested the localisation accuracy of the distance in the heading direction, namely the localisation accuracy of the proposed binocular vision system. A verification experiment was then performed in the laboratory pool to validate the feasibility of the entire system.

Experiment Platform
The experiment aims to verify the feasibility of the presented localisation method in which the impact of the experimental object size turns out to be small. The ship model used in this experiment is shown in Figure 15. The binocular cameras are installed about twenty centimetres below the ship model, while the navigation computer and control unit are contained in the inner cavity. The motor and the rudders are located at its tail. without connecting to any wire cables. The starting and stopping of the ship are controlled via Wi-Fi signals, while the movement is autonomous. The cameras installed below the ship model capture images of the lamps and the navigation computer calculates the three-dimensional position of the ship model relative to the docking station. This data is input into the control unit to control the yaw angle of the ship on the horizontal plane, and to finally lead it to the docking station. The positions are then recorded in the navigation computer.

Test of Localisation Accuracy
To validate the accuracy of the binocular vision algorithm with the proposed feature matching method, we performed a test on the localisation accuracy of the distances in the heading direction at fixed points. The distance in the heading direction instead of the deviation direction was selected, because the former had a larger test range in the pool.
The binocular cameras were positioned at fixed points underwater within a range of 1.2 m to 3.6 m away from the simple docking station. The computed distances from the binocular vision algorithm are compared with the actual distances measured by the tapeline, which is illustrated in Table 3. It

Test of Localisation Accuracy
To validate the accuracy of the binocular vision algorithm with the proposed feature matching method, we performed a test on the localisation accuracy of the distances in the heading direction at fixed points. The distance in the heading direction instead of the deviation direction was selected, because the former had a larger test range in the pool.
The binocular cameras were positioned at fixed points underwater within a range of 1.2 m to 3.6 m away from the simple docking station. The computed distances from the binocular vision algorithm are compared with the actual distances measured by the tapeline, which is illustrated in Table 3. It

Test of Localisation Accuracy
To validate the accuracy of the binocular vision algorithm with the proposed feature matching method, we performed a test on the localisation accuracy of the distances in the heading direction at fixed points. The distance in the heading direction instead of the deviation direction was selected, because the former had a larger test range in the pool.
The binocular cameras were positioned at fixed points underwater within a range of 1.2 m to 3.6 m away from the simple docking station. The computed distances from the binocular vision algorithm are compared with the actual distances measured by the tapeline, which is illustrated in Table 3. It can be obtained that the average localisation error is approximately 5 cm and the average relative localisation error is close to 2%, which satisfies the requirement for the tracking control of the AUV.

Docking Experiment
To verify the feasibility of the entire visual system, the ship model was positioned approximately 3.1 m from the docking station at different initial lateral deviations. The experiment results are shown in Figures 17-19

Docking Experiment
To verify the feasibility of the entire visual system, the ship model was positioned approximately 3.1 m from the docking station at different initial lateral deviations. The experiment results are shown in Figures 17-19.
The blue line in Figure 17 illustrates the motion displacement computed by the binocular vision algorithm on the horizontal plane when the ship model starts at the left side of the docking station. Its path is in reverse at the beginning, because it is in a slant direction. It takes time for the ship model to move towards the docking station by adjusting its yaw angles at a small P coefficient, as determined by the control strategy. The red line illustrates the condition in which the ship model starts at the right side of the docking station.
When the ship is approximately 1.1 m away from the docking station, the cameras cannot capture all the lamps, so the ship cannot obtain its three-dimensional position relative to the docking station. In such circumstances, the ship model keeps moving to the docking station according to the computed relative two-dimensional position in pixel units. Figures 18 and 19 show that the ship model can successfully achieve docking irrespective of its initial position relative to the docking station.  The blue line in Figure 17 illustrates the motion displacement computed by the binocular vision algorithm on the horizontal plane when the ship model starts at the left side of the docking station. Its path is in reverse at the beginning, because it is in a slant direction. It takes time for the ship model to move towards the docking station by adjusting its yaw angles at a small P coefficient, as determined by the control strategy. The red line illustrates the condition in which the ship model starts at the right side of the docking station.

Conclusions
This study presents a fast binocular localisation method in combination with a relative pose estimation method for the instance when two cameras cannot capture all the lamps. A detailed description of the system configuration is proposed. Through image processing, including image filtering, image segmentation using an adaptively weighted OTSU method and feature extraction, the mass centres of the lamps are obtained as the matching features of the binocular algorithm. A control strategy based on this vision algorithm is then provided. The test at fixed points shows that the relative localisation error in the heading direction within 3.6 m is approximately 2%, which satisfies the requirements for tracking control. A verification experiment was conducted in the laboratory pool using a ship model to evaluate the feasibility of the entire system. The ship model can achieve docking no matter whether it starts at the right side or the left side of the docking station.

Conclusions
This study presents a fast binocular localisation method in combination with a relative pose estimation method for the instance when two cameras cannot capture all the lamps. A detailed description of the system configuration is proposed. Through image processing, including image filtering, image segmentation using an adaptively weighted OTSU method and feature extraction, the mass centres of the lamps are obtained as the matching features of the binocular algorithm. A control strategy based on this vision algorithm is then provided. The test at fixed points shows that the relative localisation error in the heading direction within 3.6 m is approximately 2%, which satisfies the requirements for tracking control. A verification experiment was conducted in the laboratory pool using a ship model to evaluate the feasibility of the entire system. The ship model can achieve docking no matter whether it starts at the right side or the left side of the docking station. When the ship is approximately 1.1 m away from the docking station, the cameras cannot capture all the lamps, so the ship cannot obtain its three-dimensional position relative to the docking station. In such circumstances, the ship model keeps moving to the docking station according to the computed relative two-dimensional position in pixel units. Figures 18 and 19 show that the ship model can successfully achieve docking irrespective of its initial position relative to the docking station.

Conclusions
This study presents a fast binocular localisation method in combination with a relative pose estimation method for the instance when two cameras cannot capture all the lamps. A detailed description of the system configuration is proposed. Through image processing, including image filtering, image segmentation using an adaptively weighted OTSU method and feature extraction, the mass centres of the lamps are obtained as the matching features of the binocular algorithm. A control strategy based on this vision algorithm is then provided. The test at fixed points shows that the relative localisation error in the heading direction within 3.6 m is approximately 2%, which satisfies the requirements for tracking control. A verification experiment was conducted in the laboratory pool using a ship model to evaluate the feasibility of the entire system. The ship model can achieve docking no matter whether it starts at the right side or the left side of the docking station.
Compared with the other vision guidance systems mentioned in the introduction, the vision guidance system presented in this article has the following advantages. Firstly, an adaptively weighted OTSU method is proposed to segment the captured image with good performance and a relatively low cost of computation compared with other TSMs. Secondly, the frequency of the proposed binocular localisation method can reach up to 10 Hz, and the entire algorithm can achieve 5 Hz. Furthermore, the vision algorithm includes a relative pose estimation method in the case when there are one or more lamps out of the viewing field of the cameras. The vision guidance algorithm will work even if only one of the cameras captures lamps, greatly extending its working range.
However, some defects still exist in this article and can be optimised in the future. For example, the pool experiment conducted in this article uses a model ship to validate the proposed algorithm. In future work, experiments using a full-sized AUV will be conducted in a deep-water area to evaluate the effectiveness of the proposed system in an actual sea environment. The motion in the vertical plane will be taken into consideration. Moreover, to make this vision guidance algorithm more practical, a dead reckoning algorithm could be used as an assisted algorithm for circumstances when the target is lost from the sight of both cameras during the docking stage. In addition, a laser source may be considered in the future to extend the effective range of the vision guidance system.