Article

Underwater Accompanying Robot Based on SSDLite Gesture Recognition

Key Laboratory of Underwater Acoustic Communication and Marine Information Technology, Xiamen University, Xiamen 361005, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2022, 12(18), 9131; https://doi.org/10.3390/app12189131
Submission received: 3 August 2022 / Revised: 23 August 2022 / Accepted: 6 September 2022 / Published: 11 September 2022
(This article belongs to the Special Issue Underwater Robot)

Abstract

Underwater robots are often used in marine exploration and development to assist divers in underwater tasks. However, the underwater robots currently on the market have several shortcomings: they typically provide only a single function such as object detection or tracking, they rely on traditional algorithms with low accuracy and robustness, and they lack effective interaction with divers. To this end, we designed an underwater accompanying robot (UAR) that uses gesture recognition for interaction and person tracking as an auxiliary step. We trained and tested the SSDLite detection algorithm on self-labeled underwater datasets, and combined the kernelized correlation filters (KCF) tracking algorithm with an “Active Control” target tracking rule to continuously track the underwater human body. Our experiments show that using underwater datasets and target tracking improves gesture recognition accuracy by 40–105%. In the outfield experiment, the algorithm performed well: it achieved target tracking and gesture recognition at 29.4 FPS on a Jetson Xavier NX, and the UAR executed the corresponding actions according to the diver’s gesture commands.

1. Introduction

With population growth and economic development, human beings increasingly need to exploit marine resources. In recent years, underwater robots have begun to play an increasingly important role in marine exploration and development. Underwater robots are mainly divided into remotely operated vehicles (ROVs) and autonomous underwater vehicles (AUVs) [1,2]. ROVs can complete complex underwater tasks under the control of operators on the surface, and are widely used in fields such as underwater mining, hull cleaning, and pipeline monitoring [3,4]. AUVs, with their strong autonomy and large diving depth, are mainly used in marine resource exploration, mineral resource development, and other fields [5].
Research on and applications of underwater robots have matured steadily and achieved remarkable results in marine exploration. However, for important scenarios such as exploration of complex terrain, harvesting of aquatic resources, underwater archaeology, and marine biological censuses, divers are still the main workforce. The marine environment is dangerous, and real-time monitoring of the underwater environment is difficult because water strongly attenuates electromagnetic waves. Moreover, human diving depth is very limited, and divers depend on equipment to breathe. All of these factors threaten the safety of divers.
For years, researchers have continuously improved autonomous underwater vehicles (AUVs) to help divers complete underwater work. Effective communication between the diver and the AUV is critical in a human–machine co-diving environment: a good interaction mode maximizes the usefulness of AUVs and promotes human–machine cooperation in underwater tasks [6]. Meanwhile, underwater optical images have been widely used in underwater archaeology [7], seabed resource exploration [8], underwater environment detection [9], and other fields because they present targets intuitively and offer high resolution and high information content.
At present, wireless communication dominated by acoustics is often used underwater because cabled communication is severely constrained. However, underwater acoustic communication still has two major drawbacks: it depends on expensive acoustic equipment, and the bandwidth and transmission rate of underwater acoustic signals are low [10]. Therefore, to realize communication and interaction between the diver and the robot, many researchers [11,12,13] recognize and analyze captured diver gestures to obtain the diver's commands and intentions. In this way, information can be transmitted effectively between human and machine without expensive acoustic equipment [14].
Due to the particularity of the underwater environment and the absorption of light by water, underwater gesture interaction often suffers from failures to capture gestures correctly and from poor recognition performance [15]. To solve these problems, we add underwater image enhancement and human tracking before gesture recognition. Underwater image enhancement performs color correction and image restoration, recovering the real scene information of the underwater image. Underwater human tracking not only enables the AUV to track the diver continuously so that the diver always stays in its field of view, but also focuses the search for preset gestures on the human body region instead of the whole image, improving the accuracy of gesture recognition.
In addition, because AUVs operate underwater, there are constraints on the weight and volume of the equipment they carry, so they cannot accommodate high-power, large-volume desktop-class computing hardware. It is therefore necessary to choose embedded devices with small volume, low power consumption, and high performance, combined with low-complexity algorithms, to complete underwater tracking and detection tasks.
On the basis of previous studies, we propose an underwater accompanying robot (UAR) that uses gesture recognition for interaction and human tracking as an auxiliary means. The visual processing algorithm is deployed on the embedded AI device Jetson Xavier NX to realize real-time tracking and gesture recognition of underwater targets. The rest of the paper is organized as follows: Section 2 introduces existing related work. Section 3 introduces the overall framework of the system and the designed algorithms. Section 4 describes the datasets, experimental process, and results. Section 5 presents conclusions and prospects for follow-up work.

2. Related Work

At present, gesture-based robot control is widely used in daily life, industrial production, and other fields because of its ease of use, high recognition accuracy, and low learning cost. In [16], a gesture-based indoor mobile robot was proposed whose movement was controlled by gestures captured with a Kinect camera. To improve a robot's gesture recognition ability, the authors of [17] fused features from a visual system and an inertial system and evaluated the influence of four commonly used algorithms on gesture recognition. Gesture-based human–machine interaction was reviewed in detail in [18], but the methods involved are mostly applicable to the in-air environment. In the underwater environment, the robustness and repeatability of these methods are questionable due to color distortion, low light, bubble occlusion, and other problems [6].
Gesture interaction in the underwater environment was studied extensively in [19]. Considering the influence of complex scenes, divers' clothing, and harsh lighting conditions, a more complex Caddian language was designed to support communication between divers and AUVs. This language is built upon consolidated and standardized underwater gestures, which are combined into control commands to accomplish complex tasks such as ascending a fixed distance or taking photos at fixed points. Although such a language can enrich the information conveyed by underwater gestures, it does not consider the practical physical constraints of implementation, so the practicality and comprehensibility of the gestures are limited.
In recent years, in view of the strong recognition performance of neural networks, many studies [20,21,22] have adopted convolutional neural networks for gesture-based underwater human–machine interaction, achieving good recognition and control results. However, these schemes were evaluated only on laboratory test data and were not verified on real hardware in real scenes.
For high-performance mobile computing, embedded AI devices such as the NVIDIA Jetson series are widely used. Jetson boards not only provide powerful AI computing and parallel processing, but also feature small size, low power consumption, and rich expansion interfaces. Moreover, the Jetson platform is compatible with the JetPack software development kit (SDK), so it can run deep learning, computer vision, and accelerated computing software libraries. These features enable Jetson to be used in various scenarios, including autonomous driving [23,24,25], industrial quality monitoring [26], medical testing [27,28], and others [29,30,31].
Relatively little research has addressed underwater human–machine gesture interaction. Work on underwater robot systems based on gesture recognition and assisted by target tracking is especially scarce, and many aspects remain to be examined.

3. Framework and Methodology

Due to the special nature of the underwater environment and the absorption of light by water, a simple gesture recognition algorithm may not achieve good results. Therefore, we combine underwater image enhancement and human tracking and propose a gesture-based interactive underwater accompanying robot system. The system block diagram is shown in Figure 1. First, the underwater target person is identified in the enhanced underwater image as the target for subsequent continuous tracking. Then, the gesture recognition algorithm obtains the gestures made by the tracked target. Finally, according to the target coordinates and gesture recognition results, the processing results are transmitted to the STM32 control board as instructions. The results of each step can also be output individually.

3.1. Image Enhancement

Due to the effects of absorption and scattering, underwater images suffer from severe color distortion. The farther the camera is from the target, the longer the light propagation path and the more serious the distortion. Attenuation depends on the wavelength of light, and images collected in deep water are usually blue-green because red and purple light are attenuated almost completely. Therefore, color correction and image restoration are very important for underwater images, as they recover the real scene information.
The scattering model for the underwater image [32] can be simplified and equated to the following Equation (1):
$I_c = J_c\,T_c^D + B_c\left(1 - T_c^B\right)$  (1)
where $I_c$ is the observed image, $J_c$ denotes the recovered image, $T_c^D$ is the direct light transmission map (DLTM), $B_c$ is the veiling light, i.e., the background light at infinity under underwater imaging conditions, and $T_c^B$ is the back-scattered light transmission map (BLTM). The recovered image can thus be calculated using Equation (2).
$J_c = \dfrac{I_c - B_c\left(1 - T_c^B\right)}{T_c^D}$  (2)
In our project, to obtain the recovered image $J_c$, we need to know $T_c^D$, $T_c^B$, and $B_c$, but it is difficult to obtain these parameters from a single image. For this purpose, we propose an estimation network, shown in Figure 2, consisting mainly of an encoder and four decoders that estimate the parameters $J_c$, $T_c^D$, $T_c^B$, and $B_c$ of the scattering model. The encoder is based on an improved DenseNet [33] and extracts multi-scale features of the underwater image as input to the decoders. The four decoders have roughly the same structure and are responsible for estimating different imaging parameters: $\mathrm{Decoder}_{T_c^D}$ estimates the DLTM, $\mathrm{Decoder}_{T_c^B}$ estimates the BLTM, and $\mathrm{Decoder}_{B_c}$ estimates the veiling light. $\mathrm{Decoder}_{J_c}$ is a bootstrap decoder that we designed to guide network training and improve performance. When an underwater image is input, the network estimates $T_c^D$, $T_c^B$, and $B_c$, and the recovered underwater image $J_c$ is then obtained by Equation (2).
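As a concrete illustration of how Equation (2) is applied once the network has produced its outputs, the following minimal NumPy sketch recovers $J_c$ from an observed image and the three estimated parameter maps; the array names and the small epsilon guard are illustrative assumptions, not part of the paper.

```python
import numpy as np

def recover_image(I_c, B_c, T_c_B, T_c_D, eps=1e-6):
    """Apply Equation (2): J_c = (I_c - B_c * (1 - T_c_B)) / T_c_D.

    I_c, T_c_B and T_c_D are H x W x 3 float arrays in [0, 1]; B_c may be a
    per-channel vector broadcast over the image. eps (an assumption) guards
    against division by near-zero transmission values.
    """
    J_c = (I_c - B_c * (1.0 - T_c_B)) / np.maximum(T_c_D, eps)
    return np.clip(J_c, 0.0, 1.0)
```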

3.2. Underwater Human Tracking

In order to realize continuous tracking of underwater humans by the UAR, we combine a target detection network with a target tracking algorithm to continuously track a specific target in the image, providing support for the subsequent gesture recognition. Considering the limited memory and computing power of embedded devices, we chose the SSDLite detection network for its favorable speed and model size. The SSD (Single Shot MultiBox Detector) algorithm [34] is one of the most widely used object detection algorithms; it is an end-to-end, single-shot deep neural network for real-time detection that integrates the regression idea of YOLO [35] with the candidate box mechanism of Faster R-CNN [36]. By using regression, SSD greatly reduces the computation of the network and improves running speed. In addition, SSD extracts local features at different positions, aspect ratios, and scales, which is more efficient than YOLO's global feature extraction for a particular location. The overall objective loss of the network is the weighted sum of the localization loss $L_{loc}$ and the confidence loss $L_{conf}$.
$L(x, c, l, g) = \dfrac{1}{N}\left( L_{conf}(x, c) + \alpha L_{loc}(x, l, g) \right)$  (3)
where $N$ is the number of matched default boxes, $L_{loc}$ is the Smooth L1 loss [36] between the parameters of the predicted box ($l$) and the ground-truth box ($g$), $L_{conf}$ is the softmax loss over the multi-class confidences ($c$), and the weight term $\alpha$ is set to 1 through cross-validation.
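For readers who prefer code, a minimal PyTorch sketch of Equation (3) for a single image is given below; it omits SSD's hard-negative mining and box-encoding details, and the tensor layout is an assumption rather than the authors' implementation.

```python
import torch.nn.functional as F

def multibox_loss(cls_logits, loc_preds, cls_targets, loc_targets, alpha=1.0):
    """Weighted sum of confidence and localization losses as in Equation (3).

    cls_logits : (num_boxes, num_classes) raw class scores per default box
    loc_preds  : (num_boxes, 4) predicted box offsets
    cls_targets: (num_boxes,) target class per default box, 0 = background
    loc_targets: (num_boxes, 4) encoded ground-truth offsets
    All inputs are torch tensors.
    """
    pos = cls_targets > 0                           # default boxes matched to an object
    n = pos.sum().clamp(min=1).float()              # N in Equation (3)

    l_conf = F.cross_entropy(cls_logits, cls_targets, reduction="sum")          # softmax loss
    l_loc = F.smooth_l1_loss(loc_preds[pos], loc_targets[pos], reduction="sum")  # Smooth L1

    return (l_conf + alpha * l_loc) / n
```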
Based on SSD, SSDLite replaces all of the regular convolutions in the detection layers with separable convolutions (a depthwise convolution followed by a 1 × 1 projection) [37]. Compared with the conventional SSD model, SSDLite has fewer parameters and a lower computational cost, so it is better suited to embedded devices.
For real-time detection on mobile embedded platforms, a backbone with good accuracy and real-time performance is needed for SSDLite. MobileNet [38] is a lightweight CNN proposed by Google that replaces ordinary convolutions with depthwise separable convolutions, improving detection accuracy and speed while reducing model size.
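The following PyTorch sketch shows the depthwise separable block that MobileNet-style backbones and SSDLite's prediction layers rely on; the layer ordering (BatchNorm plus ReLU6 after each convolution) follows common practice and is an assumption rather than the exact configuration used in the paper.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A 3x3 depthwise convolution followed by a 1x1 pointwise projection,
    the substitute for a regular 3x3 convolution in MobileNet/SSDLite."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x):
        x = self.act(self.bn1(self.depthwise(x)))      # per-channel spatial filtering
        return self.act(self.bn2(self.pointwise(x)))   # cross-channel mixing
```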
MobileNetV3 [39] preserves the depthwise separable convolutions of V1 and the inverted residual blocks of V2, and additionally introduces a lightweight attention module based on the squeeze-and-excitation structure, which exploits channel relationships to enhance the learning ability of the network. Meanwhile, the h-swish and h-sigmoid functions shown in Equations (4) and (5) replace the swish and sigmoid functions, respectively, further reducing the computation of the network and making the model more suitable for mobile embedded platforms.
$\mathrm{h\text{-}swish}(x) = x \cdot \dfrac{\mathrm{ReLU6}(x + 3)}{6}$  (4)
$\mathrm{h\text{-}sigmoid}(x) = \dfrac{\mathrm{ReLU6}(x + 3)}{6}$  (5)
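Equations (4) and (5) translate directly into a few lines of PyTorch; this sketch is only an illustration of the two activation functions (torch.nn also ships Hardswish/Hardsigmoid modules implementing the same idea).

```python
import torch
import torch.nn.functional as F

def h_sigmoid(x: torch.Tensor) -> torch.Tensor:
    # Equation (5): ReLU6(x + 3) / 6
    return F.relu6(x + 3.0) / 6.0

def h_swish(x: torch.Tensor) -> torch.Tensor:
    # Equation (4): x * ReLU6(x + 3) / 6
    return x * h_sigmoid(x)
```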
Since the underwater tracking scenario has strict real-time requirements, we discarded neural-network-based tracking algorithms with poor speed performance and chose the faster KCF algorithm for underwater human tracking. The human body in the underwater frame is detected with SSDLite, and the kernelized correlation filters (KCF) tracker is initialized with the detection result. In the subsequent tracking stage, the tracker densely samples candidate patches centered on the target position in the previous frame and feeds all of them to the trained ridge regression classifier, whose output gives the new target position. The overall algorithm flow is shown in Figure 3. In addition, KCF exploits the fact that circulant matrices are diagonalized in the frequency domain to move the computation from the time domain to the frequency domain; combined with the fast Fourier transform, this greatly increases tracking speed.
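To make the frequency-domain trick concrete, here is a stripped-down, single-channel, linear-kernel sketch of the correlation-filter core (the full KCF additionally uses a Gaussian kernel, a cosine window, and online model updates); the normalization constants are folded into the regularization term, so this is illustrative rather than a faithful reimplementation.

```python
import numpy as np

def cf_train(x, y, lam=1e-4):
    """Solve the ridge regression in the Fourier domain:
    alpha_hat = y_hat / (x_hat * conj(x_hat) + lambda),
    where x is the training patch and y the desired Gaussian-shaped response."""
    x_hat, y_hat = np.fft.fft2(x), np.fft.fft2(y)
    return y_hat / (x_hat * np.conj(x_hat) + lam)

def cf_detect(alpha_hat, x, z):
    """Correlate a new patch z against the model learned on x and return the
    location of the response peak, i.e. the estimated target shift."""
    k_hat = np.conj(np.fft.fft2(x)) * np.fft.fft2(z)
    response = np.real(np.fft.ifft2(alpha_hat * k_hat))
    return np.unravel_index(np.argmax(response), response.shape)
```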
At the same time, due to the peculiarities of the underwater environment, the UAR needs to maintain a very close formation with the target. To ensure that the UAR can track the target effectively, the visual center of the tracked target should stay as close to the image center as possible. To this end, we designed a target tracking rule called "Active Control" to adjust the UAR's position and attitude in time. The principle of the "Active Control" rule is to always keep the distance between the target center and the frame center within a certain range, as shown in Figure 4. A limit box LBOX is constructed around the frame center point FCP with a default side length thre, and the target center point TCP must remain inside this square at all times. The area outside LBOX is divided into eight blocks: UpperLeft, Upper, UpperRight, Left, Right, LowerLeft, Lower, and LowerRight. When the target leaves the limit box, the next control command is sent to the STM32 control board according to the block in which TCP lies and its position.
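A small Python sketch of the "Active Control" rule is given below: it compares the target center point (TCP) with the frame center point (FCP) and returns the block name when the target leaves the limit box. The command strings and the coordinate convention (image y axis pointing down) are illustrative assumptions; the real system maps these blocks to thruster commands on the STM32 board.

```python
def active_control(tcp, fcp, thre):
    """Return which block of Figure 4 the target center lies in, or "Hold"
    when it is still inside the limit box LBOX of side length `thre`."""
    dx, dy = tcp[0] - fcp[0], tcp[1] - fcp[1]
    half = thre / 2.0

    horiz = "Left" if dx < -half else ("Right" if dx > half else "")
    vert = "Upper" if dy < -half else ("Lower" if dy > half else "")

    if not horiz and not vert:
        return "Hold"              # TCP inside LBOX: no correction needed
    return (vert + horiz) if (vert and horiz) else (vert or horiz)

# e.g. active_control((500, 150), (320, 240), thre=120) -> "UpperRight"
```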
Finally, in the underwater tracking process of this project, the target position obtained by SSDLite is used to initialize the KCF tracker when tracking starts and whenever tracking fails. The overall algorithm flow is shown as pseudo-code in Algorithm 1, where cmd denotes the control command passed to the STM32 control board.
Algorithm 1: Underwater human tracking algorithm
  Input: video stream after image enhancement I
  Output: control command cmd
  Load the trained SSDLite model, initialize thre, set tracking = False
  while (get frame F from I) != None do
    if not tracking then
      Detect the person in frame F, initialize the tracker, set tracking = True
    end if
    Update the tracker to obtain the confidence score conf and the person position rect
    if conf > thre then
      Calculate cmd from rect according to the Active Control rule
    else
      tracking = False
    end if
  end while
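A runnable approximation of Algorithm 1, built on OpenCV's KCF tracker and reusing the active_control helper sketched above, is shown next. The detect_person callback stands in for the trained SSDLite model (it should return an (x, y, w, h) box or None), and OpenCV reports a boolean tracking-success flag instead of the raw confidence score conf used in Algorithm 1; the box threshold value is likewise an assumption.

```python
import cv2

def run_tracking(video_source, detect_person, box_thre=80):
    """Detect-then-track loop: the detector (re-)initializes a KCF tracker
    whenever tracking is lost, and the Active Control rule turns the tracked
    position into a command for the STM32 board."""
    cap = cv2.VideoCapture(video_source)
    tracker, tracking = None, False

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if not tracking:
            rect = detect_person(frame)               # placeholder for the SSDLite detector
            if rect is None:
                continue
            tracker = cv2.TrackerKCF_create()         # cv2.legacy.TrackerKCF_create on some builds
            tracker.init(frame, rect)
            tracking = True
        ok, rect = tracker.update(frame)              # per-frame KCF update
        if ok:
            tcp = (rect[0] + rect[2] / 2, rect[1] + rect[3] / 2)
            fcp = (frame.shape[1] / 2, frame.shape[0] / 2)
            cmd = active_control(tcp, fcp, box_thre)  # map the position error to a command
            # here cmd would be sent to the STM32 control board over UART
        else:
            tracking = False                          # tracking lost: detect again next frame
    cap.release()
```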
We tested the proposed human tracking algorithm on the underwater dataset, and the results are shown in Figure 5. Column 1 of the figure is the initial frame, where the human position information is obtained using underwater target detection, which is used to initialize the tracker, and columns 2 to 4 show the subsequent tracking results of the tracker.

3.3. Gesture Recognition

There are several problems with the currently available gesture datasets that make it impossible to use them directly in this project. On the one hand, most of these datasets were collected in open-air environments, which differ considerably from the underwater environment. On the other hand, the gestures used in these datasets are not very intuitive and were not designed with robot control in mind. For these reasons, we designed eight simple and practical gestures for the UAR. Each gesture corresponds to an action of the UAR, such as an open hand (five fingers spread) to turn on the light and a closed fist to stop, as detailed in Table 1.
We used the SSDLite network as the gesture recognition network and adopted the MobileNetV3 mentioned above as its backbone. We used self-labeled datasets to train and test the SSDLite gesture recognition network. The network was implemented in the PyTorch framework, and all of the above models were trained and tested on an NVIDIA GeForce RTX 3080 Ti. The training batch size was 132 with a learning rate of $1 \times 10^{-3}$, training lasted 72,000 iterations, and the learning rate was reduced by a factor of 10 after the 48,000th and 60,000th iterations.
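The schedule described above maps onto a standard iteration-based PyTorch training loop, sketched below; the optimizer choice, momentum, and weight decay are assumptions not stated in the paper, and `model` is assumed to return the multibox loss when called in training mode.

```python
import torch

def train_ssdlite(model, train_loader, total_iters=72000):
    """Iteration-based schedule from the paper: lr 1e-3, decayed by 10x after
    the 48,000th and 60,000th iterations, 72,000 iterations in total."""
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                                momentum=0.9, weight_decay=5e-4)  # assumed values
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[48000, 60000], gamma=0.1)

    step = 0
    while step < total_iters:
        for images, targets in train_loader:   # batches of 132 images in the paper
            loss = model(images, targets)      # forward pass returning the multibox loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()                   # stepped per iteration, not per epoch
            step += 1
            if step >= total_iters:
                break
```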

4. Algorithm and System Experiment

4.1. Datasets

In our project, the targets are underwater humans, including scuba divers, free divers, and swimmers in pools. To ensure that SSDLite has good underwater detection performance, we discarded land datasets such as COCO [40] and PASCAL VOC [41] and built the Diver underwater human dataset from our own footage and images collected from the Internet. The Diver dataset contains 1398 images in total, covering underwater human images with different color degradation types, different underwater scenes, different degrees of clarity, and different costumes. The dataset is diverse; some images are shown in Figure 6.
For the gesture recognition dataset, according to the gesture types in Table 1, we collected about 600 in-air and underwater images for each gesture, 5541 images in total, and used random cropping and affine transformations to expand the data, which was then split into training and test sets at a 4:1 ratio; some images are shown in Figure 7.
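The expansion and split step can be sketched as follows; the exact augmentation parameters are assumptions, and for a detection dataset the bounding-box annotations would of course have to be transformed consistently with the images (the sketch shows the image transforms and the 4:1 split only).

```python
import random
from torchvision import transforms

# Augmentations in the spirit of the paper's expansion step (parameters assumed):
# a random crop-and-resize plus a mild random affine warp.
augment = transforms.Compose([
    transforms.RandomResizedCrop(300, scale=(0.7, 1.0)),
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1), scale=(0.9, 1.1)),
])

def split_4_to_1(samples, seed=0):
    """Shuffle the sample list and split it 80% / 20% into train and test sets."""
    rng = random.Random(seed)
    samples = list(samples)
    rng.shuffle(samples)
    cut = int(0.8 * len(samples))
    return samples[:cut], samples[cut:]
```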

4.2. Performance Evaluation

4.2.1. Image Enhancement Performance

We compared and analyzed the proposed Estimation Network image enhancement algorithm with other methods, including simple white balance, underwater dark channel, multi-scale fusion [42], DN-based [43], and UGAN [44]. The experimental results are shown in Figure 8.
Then, we used the objective image quality metrics peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) to compare the performance of the proposed method with other methods. Table 2 lists the PSNR and SSIM values of different methods with respect to the original image (a) in Figure 9, where (b) is the underwater image generated by the method in [45], and (c–f) are the outputs of different methods applied to that underwater image. The larger the PSNR and SSIM values, the closer the output image is to the ground truth, i.e., the higher the accuracy. The evaluation results show that the method proposed in this paper performs well on images with different degradation types and is the best among the compared methods.
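For reference, the two metrics in Table 2 can be computed with scikit-image as sketched below; on older scikit-image versions the SSIM call takes multichannel=True instead of channel_axis.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(reference, restored):
    """PSNR and SSIM between a ground-truth image and an enhanced output,
    both given as H x W x 3 uint8 arrays."""
    psnr = peak_signal_noise_ratio(reference, restored)
    ssim = structural_similarity(reference, restored, channel_axis=-1)
    return psnr, ssim
```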

4.2.2. Recognition Performance

We trained different detection methods and compared their performance on the Diver test set. The test conditions were as follows: NVIDIA Jetson Xavier NX, Jetpack 4.6, and MODE_10W_DESKTOP. The results are shown in Table 3. As can be seen from the table, YOLOv3 achieved the highest mAP value in terms of detection accuracy. However, in terms of speed performance, except for SSDLite, the other methods had poor real-time performance and could not realize real-time detection on embedded devices. Considering the accuracy, speed, and model size, the SSDLite detection algorithm with MobileNetv3 as the backbone performed best.
In addition, to demonstrate the advantage of the Diver dataset, we used an SSDLite model trained on the PASCAL VOC 07+12 dataset to detect a person in an underwater environment; the mAP obtained was only 0.4879, far lower than the 0.882 obtained with training on the Diver dataset, and a large number of images in the test set could not be detected correctly. As shown in Figure 10, the first row shows the test results using the in-air training set, and the second row shows the test results using the Diver dataset. In terms of false detection rate, missed detection rate, and detection accuracy, the model trained on the Diver dataset performed better.
Similarly, we tested the trained gesture recognition network on the test set and compared the underwater results obtained with different backbones, as shown in Table 4. The MobileNetV3-Small model we used achieved 99.94% mAP and a recognition speed of 23.5 FPS on Jetson Xavier NX with a model size of only 5.71 MB, which was the best overall performance.

4.2.3. Comparison of Performance before and after Human Tracking

To evaluate the effect of adding underwater human tracking, we used part of the underwater dataset as the test set. Taking MobileNetV3 (Small) as the test model, we compared the recognition results with and without tracking. Some gesture recognition results are shown in Figure 11. Subfigures (a.1), (b.1), and (c.1) in Figure 11 show the results of using gesture recognition alone, while subfigures (a.2), (b.2), and (c.2) show the results of adding tracking before gesture recognition. Comparing (a.1) and (a.2), the detection score of gesture recognition improved greatly, from 0.76 to 0.99. In addition, many samples in the test set could not be detected by gesture recognition alone, as shown in (c.1); after adding underwater human tracking, better performance was obtained, as shown in (c.2).
We used the trained MobileNetV3-small model to test the underwater gesture recognition accuracy before and after tracking. Table 5 shows the mean average precision (mAP) index of gesture recognition before and after the addition of tracking in detail. It can be seen from the table that the mAP index improved greatly after the addition of underwater human tracking. With the increase of the score threshold (i.e., removing network detection results below the threshold), the improvement rate increased from 40% to 105%.
Additionally, from Figure 12 it can be seen that with an increase of the set score threshold, after adding underwater human tracking, the performance of the gesture recognition was more stable, and it continued to maintain a high recognition accuracy. The reason is that the tracking removes a large number of interfering pixels and makes the gesture recognition focus on finding preset gestures from the person body image rather than the whole image. To sum up, adding the step of person tracking before underwater gesture recognition can not only improve the detection speed (due to the smaller input image), but also greatly improve the accuracy of recognition results.

4.3. System Experiment

To test the overall performance of the system, we conducted outfield experiments on the UAR in a swimming pool. The experimental scene is shown in Figure 13.
The whole UAR is shown in Figure 14. The ROV used was the Trench Wanderer 110ROV, which is equipped with four thrusters giving it three degrees of freedom (surge, heave, and yaw). For the underwater camera, we used a GoPro7 because of its wide viewing angle and enhanced anti-shake function; it connects to the processing unit through a USB interface to provide a clear picture for the UAR. As the core processing unit, we chose the NVIDIA Jetson Xavier NX to run the recognition networks. With computing power of 21 TOPS, memory bandwidth of more than 51 GB/s, low power consumption (10 W), and comprehensive expansion interfaces, the Jetson Xavier NX is well suited to perception and detection in complex underwater environments.
We chose an STM32 microcontroller (an ARM-based STM32F407) to work with the NX3 flight controller to control the underwater robot: on the one hand, the attitude of the robot is monitored through an on-board gyroscope and accelerometer to avoid excessive inclination or rollover; on the other hand, control information sent by the Jetson Xavier NX is received over UART, and 8 PWM channels drive the robot's motors to carry out the commanded underwater actions. The overall control flow of the system is shown in Figure 15. With the exception of the floodlight, all other devices were installed in a waterproof shell made of transparent polycarbonate (PC). Finally, the built-in battery pack supplied power to all internal electronics through various voltage step-up/step-down modules.
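On the Jetson side, forwarding a command to the STM32 board over UART can be sketched with pySerial as below; the serial device path, baud rate, and one-byte packet framing are illustrative assumptions, since the paper does not specify the link protocol.

```python
import serial

# Hypothetical command codes; the real mapping lives in the STM32 firmware.
COMMANDS = {"Hold": 0x00, "Forward": 0x01, "Left": 0x02, "Right": 0x03,
            "Upper": 0x04, "Lower": 0x05}

def send_command(cmd, port="/dev/ttyTHS0", baudrate=115200):
    """Send one framed control byte (header, payload, footer) to the STM32 board."""
    with serial.Serial(port, baudrate=baudrate, timeout=0.1) as link:
        link.write(bytes([0xAA, COMMANDS[cmd], 0x55]))

# e.g. send_command("Forward") after the Active Control rule selects a block
```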
During the experiment, the diver entered the water with the UAR, initialized it, and designated the main tracking target through the preset gestures. The UAR then engaged its thrusters and continuously tracked the target diver, always keeping the diver near the center of the frame. At the same time, the UAR performed gesture recognition on the cropped video stream (cropped according to the target position). Because the complex environmental background was removed and the recognition area narrowed, missed and false recognitions were greatly reduced, and blurred images of a moving diver were also recognized well, as shown in Figure 16.
In terms of recognition speed, the target tracking time was 33.0 ms, and the model inference time was 31.9 ms. By setting the gesture recognition frequency to once a second, our algorithm was able to achieve 29.4 fps on Jetson Xavier NX, satisfying the real-time requirements.
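One plausible way to reconcile these timing figures, assuming the 31.9 ms gesture inference replaces one tracking update per second, is the rough budget

$\text{FPS} \approx \dfrac{1000\ \text{ms} - 31.9\ \text{ms}}{33.0\ \text{ms/frame}} \approx 29.3,$

which is consistent with the measured 29.4 fps on the Jetson Xavier NX.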

5. Conclusions

In this paper, we proposed an underwater accompanying robot system based on gesture recognition, and made a prototype for experimental verification. The work involved mainly focuses on underwater visual processing and gesture interaction. By experimental verification, we found that using an underwater training dataset and adding human tracking before recognition can greatly improve the recognition effect. In the field tests, our algorithm achieved a 99.92% mAP score on Jetson Xavier NX and ran stably at 29.4 fps, meeting our requirements for accuracy and speed of underwater gesture recognition.
However, our work still needs improvement, especially the hardware design and control part of the robot. A complete underwater robot system needs a reliable and effective hardware structure as support. Therefore, we will gradually improve the hardware structure in future work and fully exploit the joint role of the hardware and algorithms. In addition, although we have verified the algorithm and system in a swimming pool, we have not yet verified their performance in the open marine environment.
Next, on the basis of this experiment, we will continue to optimize the hardware structure and algorithm design to improve the overall robustness of the system. In addition, test experiments in the outdoor marine environment will be conducted to continuously improve the stability and accuracy of the system in complex environments according to the results obtained from the experiment.

Author Contributions

Conceptualization, F.Y.; Investigation, Y.Z.; Methodology, F.Y.; Project administration, Y.Z.; Resources, Y.Z.; Software, T.L.; Supervision, F.Y.; Validation, T.L.; Visualization, K.W.; Writing–original draft, K.W.; Writing–review and editing, T.L. All authors have read and agreed to the published version of the manuscript.

Funding

This study was sponsored by the National Natural Science Foundation of China (62071401, 62001404) and Xiamen Ocean and Fishery Development Special Fund project (21CZB015HJ10).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Acknowledgments

The authors would like to thank the National Natural Science Foundation of China (62071401, 62001404) and the Xiamen Ocean and Fishery Development Special Fund project (21CZB015HJ10). We are also grateful to the editors and reviewers for their insightful comments and suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. An, E.; Dhanak, M.R.; Shay, L.K.; Smith, S.; Van Leer, J. Coastal oceanography using a small AUV. J. Atmos. Ocean. Technol. 2001, 18, 215–234. [Google Scholar] [CrossRef]
  2. Pazmiño, R.S.; Cena, C.E.G.; Arocha, C.A.; Santonja, R.A. Experiences and results from designing and developing a 6 DoF underwater parallel robot. Robot. Auton. Syst. 2011, 59, 101–112. [Google Scholar] [CrossRef]
  3. Zhang, Q.; Zhang, J.; Chemori, A.; Xiang, X. Virtual submerged floating operational system for robotic manipulation. Complexity 2018, 2018, 9528313. [Google Scholar] [CrossRef]
  4. Shojaei, K.; Dolatshahi, M. Line-of-sight target tracking control of underactuated autonomous underwater vehicles. Ocean Eng. 2017, 133, 244–252. [Google Scholar] [CrossRef]
  5. Ridao, P.; Carreras, M.; Ribas, D.; Sanz, P.J.; Oliver, G. Intervention AUVs: The next challenge. Annu. Rev. Control 2015, 40, 227–241. [Google Scholar] [CrossRef]
  6. Zhang, H.; Yuan, F.; Chen, J.; He, X.; Zhu, Y. Gesture-Based Autonomous Diving Buddy for Underwater Photography. In International Conference on Image and Graphics; Springer: Cham, Switzerland, 2021; pp. 195–206. [Google Scholar]
  7. Ura, T. Observation of deep seafloor by autonomous underwater vehicle. Indian J. Geo-Mar. Sci. 2013, 42, 1028–1033. [Google Scholar]
  8. Nishida, Y.; Nagahashi, K.; Sato, T.; Bodenmann, A.; Thornton, B.; Asada, A.; Ura, T. Development of an autonomous underwater vehicle for survey of cobalt-rich manganese crust. In Proceedings of the OCEANS 2015-MTS/IEEE, Washington, DC, USA, 19–22 October 2015; pp. 1–5. [Google Scholar]
  9. Nakatani, T.; Ura, T.; Ito, Y.; Kojima, J.; Tamura, K.; Sakamaki, T.; Nose, Y. AUV “TUNA-SAND” and its Exploration of hydrothermal vents at Kagoshima Bay. In Proceedings of the OCEANS 2008-MTS/IEEE Kobe Techno-Ocean, Kobe, Japan, 8–11 April 2008; pp. 1–5. [Google Scholar]
  10. Kilfoyle, D.; Baggeroer, A. The state of the art in underwater acoustic telemetry. IEEE J. Ocean Eng. 2000, 25, 4–27. [Google Scholar] [CrossRef]
  11. Chiarella, D.; Bibuli, M.; Bruzzone, G.; Caccia, M.; Ranieri, A.; Zereik, E.; Marconi, L.; Cutugno, P. Gesture-based language for diver-robot underwater interaction. In Proceedings of the Oceans 2015—Genova, Genova, Italy, 18–21 May 2015; pp. 1–9. [Google Scholar]
  12. Mišković, N.; Bibuli, M.; Birk, A.; Caccia, M.; Egi, M.; Grammer, K.; Marroni, A.; Neasham, J.; Pascoal, A.; Vasilijević, A.; et al. Caddy—cognitive autonomous diving buddy: Two years of underwater human-robot interaction. Mar. Technol. Soc. J. 2016, 50, 54–66. [Google Scholar] [CrossRef]
  13. Gustin, F.; Rendulic, I.; Miskovic, N.; Vukic, Z. Hand gesture recognition from multibeam sonar imagery. IFAC-PapersOnLine 2016, 49, 470–475. [Google Scholar] [CrossRef]
  14. Buelow, H.; Birk, A. Gesture-recognition as basis for a Human Robot Interface (HRI) on a AUV. In Proceedings of the OCEANS’11 MTS/IEEE KONA, Waikoloa, HI, USA, 19–22 September 2011. [Google Scholar]
  15. Gordon, H.R. Can the Lambert-Beer law be applied to the diffuse attenuation coefficient of ocean water? Limnol. Oceanogr. 1989, 34, 1389–1409. [Google Scholar] [CrossRef]
  16. Hsu, R.C.; Su, P.C.; Hsu, J.L.; Wang, C.Y. Real-Time Interaction System of Human-Robot with Hand Gestures. In Proceedings of the 2020 IEEE Eurasia Conference on IOT, Communication and Engineering (ECICE), Yunlin, Taiwan, 23–25 October 2020; pp. 396–398. [Google Scholar]
  17. Fiorini, L.; Loizzo FG, C.; Sorrentino, A.; Kim, J.; Rovini, E.; Di Nuovo, A.; Cavallo, F. Daily gesture recognition during human-robot interaction combining vision and wearable systems. IEEE Sens. J. 2021, 21, 23568–23577. [Google Scholar] [CrossRef]
  18. Rautaray, S.S.; Agrawal, A. Vision Based Hand Gesture Recognition for Human Computer Interaction: A Survey. Artif. Intell. Rev. 2015, 43, 1–54. [Google Scholar] [CrossRef]
  19. Chiarella, D.; Bibuli, M.; Bruzzone, G.; Caccia, M.; Ranieri, A.; Zereik, E.; Marconi, L.; Cutugno, P. A Novel Gesture-Based Language for Underwater Human—Robot Interaction. J. Mar. Sci. Eng. 2018, 6, 91. [Google Scholar] [CrossRef]
  20. Xu, P. Gesture-based Human-robot Interaction for Field Programmable Autonomous Underwater Robots. arXiv 2017, arXiv:1709.08945. [Google Scholar]
  21. Jiang, Y.; Zhao, M.; Wang, C.; Wei, F.; Wang, K.; Qi, H. Diver’s hand gesture recognition and segmentation for human–robot interaction on AUV. Signal Image Video Process. 2021, 15, 1899–1906. [Google Scholar] [CrossRef]
  22. Jiang, Y.; Peng, X.; Xue, M.; Wang, C.; Qi, H. An underwater human–robot interaction using hand gestures for fuzzy control. Int. J. Fuzzy Syst. 2021, 23, 1879–1889. [Google Scholar] [CrossRef]
  23. Cooper, N.; Lindsey, E.; Chapman, R.; Biaz, S. GPU Based Monocular Vision for Obstacle Detection; Techinical Report; Auburn University: Auburn, AL, USA, 2017. [Google Scholar]
  24. Shustanov, A.; Yakimov, P. A method for traffic sign recognition with CNN using GPU. In Proceedings of the 14th International Joint Conference on e-Business and Telecommunications–ICETE 2017, Madrid, Spain, 24–26 July 2017; Volume 5, pp. 42–47. [Google Scholar]
  25. Otterness, N.; Yang, M.; Rust, S.; Park, E.; Anderson, J.H.; Smith, F.D.; Berg, A.; Wang, S. An Evaluation of the NVIDIA TX1 for Supporting Real-time Computer-Vision Workloads. In Proceedings of the 2017 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), Pittsburgh, PA, USA, 18–21 April 2017; pp. 355–365. [Google Scholar]
  26. Elkaseer, A.; Salama, M.; Ali, H.; Scholz, S. Approaches to a Practical Implementation of Industry 4.0. In Proceedings of the Eleventh International Conference on Advances in Computer-Human Interactions, Rome, Italy, 25–29 March 2018; pp. 141–146. [Google Scholar]
  27. Chang, W.J.; Chen, L.B.; Hsu, C.H.; Lin, C.P.; Yang, T.C. A Deep Learning-Based Intelligent Medicine Recognition System for Chronic Patients. IEEE Access 2019, 7, 44441–44458. [Google Scholar] [CrossRef]
  28. Sancho Aragón, J. Energy and Performance Modeling of NVIDIA Jetson TX1 Embedded GPU in Hyperspectral Image Classification Tasks for Cancer detection Using Machine Learning. Master’s Thesis, Escuela Técnica Superior de Ingeniería y Sistemas de Telecomunicación, Madrid, Spain, 2018. [Google Scholar]
  29. Tang, J.; Ren, Y.; Liu, S. Real-Time Robot Localization Vision and Speech Recognition on Nvidia Jetson TX1. arXiv 2017, arXiv:1705.10945. [Google Scholar]
  30. Wu, Y.; Gao, L.; Zhang, B.; Yang, B.; Chen, Z. Embedded GPU implementation of anomaly detection for hyperspectral images. High-Perform. Comput. Remote Sens. V 2015, 9646, 964608. [Google Scholar]
  31. Xing, H.; Guo, S.; Shi, L.; Hou, X.; Liu, Y.; Hu, Y.; Xia, D.; Li, Z. Quadrotor Vision-based Localization for Amphibious Robots in Amphibious Area. In Proceedings of the 2019 IEEE International Conference on Mechatronics and Automation (ICMA), Tianjin, China, 4–7 August 2019; pp. 2469–2474. [Google Scholar]
  32. Jaffe, J.S. Computer modeling and the design of optimal underwater imaging systems. IEEE J. Ocean. Eng. 1990, 15, 101–111. [Google Scholar] [CrossRef]
  33. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  34. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  35. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  36. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar] [CrossRef] [PubMed]
  37. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  38. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  39. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  40. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In ECCV; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  41. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2009, 88, 303–338. [Google Scholar] [CrossRef]
  42. Ancuti, C.O.; Ancuti, C.; De Vleeschouwer, C.; Bekaert, P. Color balance and fusion for underwater image enhancement. IEEE Trans. Image Process. 2017, 27, 379–393. [Google Scholar] [CrossRef] [PubMed]
  43. Pan, P.; Yuan, F.; Cheng, E. Underwater image de-scattering and enhancing using dehaze net and HWD. J. Mar. Sci. Technol. 2018, 26, 531–540. [Google Scholar]
  44. Fabbri, C.; Islam, M.J.; Sattar, J. Enhancing underwater imagery using generative adversarial networks. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018. [Google Scholar]
  45. Lu, J.; Yuan, F.; Yang, W.; Cheng, E. An Imaging Information Estimation Network for Underwater Image Color Restoration. IEEE J. Ocean. Eng. 2021, 46, 1228–1239. [Google Scholar] [CrossRef]
Figure 1. Gesture-based control framework.
Figure 2. Estimation network structure.
Figure 3. Overall flow of KCF tracking algorithm.
Figure 4. "Active Control" schematic. The UAR movement is controlled according to the position of the target center point TCP, which is kept inside the limit box LBOX at the center of the frame.
Figure 5. Underwater human body tracking test results. (a) The initial frame; (b–d) the subsequent tracking results of the tracker.
Figure 6. Different types of images in the Diver dataset are shown. (a) Green. (b) Blue. (c) Indoor. (d) Outdoor.
Figure 7. Display of images in the air and underwater in the gesture recognition dataset. (a) Air gesture dataset. (b) Underwater gesture dataset.
Figure 8. Comparison of different underwater image enhancement methods. (a) Original. (b) Simple white balance. (c) Underwater dark channel. (d) Multi-scale fusion. (e) DN-based. (f) UGAN. (g) Proposed.
Figure 9. Output results of different methods. (a) Original. (b) Generated underwater image. (c) Retinex-based. (d) Multi-scale fusion. (e) WaterGAN. (f) Proposed.
Figure 10. Comparison of test results between air dataset and diver dataset. (a) The results of the model trained by the air dataset. (b) The results of the model trained by the Diver dataset.
Figure 11. Performance before and after human tracking on the underwater test set. (a.1,b.1,c.1) show the results of using only gesture recognition. (a.2,b.2,c.2) show the results of adding tracking before gesture recognition.
Figure 12. With the increase of the score threshold, the gesture recognition accuracy before and after human tracking. Blue and green lines represent the results before tracking, red and orange lines represent the results after tracking.
Figure 13. The outfield experimental scene.
Figure 14. Overall UAR structure. It consists of the ROV, Jetson Xavier NX (core processing unit), a GoPro7 (underwater camera), an STM32 control board, and other components.
Figure 15. Hardware workflow.
Figure 16. Outfield experiment results. (a) The gesture interaction scene. (b) The results of gesture recognition.
Table 1. Gestures corresponding to robot actions.
No.   Action
1     Take photo
2     Stop
3     Light up
4     Forward
5     Float up
6     Dive
7     Turn right
8     Turn left
(The example gesture image for each action is shown in the original article.)
Table 2. PSNR and SSIM values between the output of different methods and the original image.
Method               PSNR      SSIM
Retinex-based        14.6280   0.3658
Multi-scale fusion   12.1468   0.2349
WaterGAN             16.7906   0.3649
Proposed             24.0392   0.6609
Bold indicates the best result.
Table 3. Performance of different detection methods on the Diver test set.
Method      Backbone      mAP     Speed      Model Size
YOLOv3      DarkNet-53    0.940   7.8 fps    117 MB
RefineDet   VGG-16        0.849   4.6 fps    129 MB
RetinaNet   ResNet-50     0.861   3.5 fps    138 MB
SSD         VGG-16        0.821   3.7 fps    90.6 MB
SSDLite     MobileNetV3   0.882   17.2 fps   14.0 MB
Bold indicates the best result.
Table 4. Performance of different backbones on gesture recognition dataset.
Backbone   mAP      Speed      Model Size
VGG-16     0.9998   10.8 fps   94.1 MB
V2         0.9990   21.3 fps   12.7 MB
V3-Small   0.9994   23.5 fps   5.71 MB
V3-Large   0.9998   21.2 fps   14.4 MB
Bold indicates the best result.
Table 5. Gesture recognition accuracy before and after person tracking with MobileNetV3 Small model.
Score Threshold   mAP (before Tracking)   mAP (after Tracking)   Improvement Rate
0.5               0.527                   0.735                  39.5%
0.6               0.491                   0.735                  49.7%
0.7               0.418                   0.720                  72.2%
0.8               0.364                   0.671                  84.3%
0.9               0.327                   0.670                  104.9%
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
