Underwater Accompanying Robot Based on SSDLite Gesture Recognition

Abstract: Underwater robots are often used in marine exploration and development to assist divers in underwater tasks. However, underwater robots on the market have several problems: many offer only a single function of object detection or tracking, use traditional algorithms with low accuracy and robustness, and lack effective interaction with divers. To this end, we designed a gesture-recognition-based interaction scheme, with person tracking as an auxiliary means, for an underwater accompanying robot (UAR). We trained and tested the SSDLite detection algorithm on self-labeled underwater datasets, and combined the kernelized correlation filters (KCF) tracking algorithm with an "Active Control" target tracking rule to continuously track the underwater human body. Our experiments show that using underwater datasets and target tracking can effectively improve gesture recognition accuracy by 40–105%. In the outfield experiment, the algorithm performed well: it achieved target tracking and gesture recognition at 29.4 FPS on a Jetson Xavier NX, and the UAR carried out the corresponding actions according to the diver's gesture commands.


Introduction
With population growth and economic development, human beings need to develop marine resources to meet growing economic needs. In recent years, underwater robots have begun to play an increasingly important role in marine exploration and development. Underwater robots are mainly divided into remotely operated vehicles (ROVs) and autonomous underwater vehicles (AUVs) [1,2]. ROVs can complete complex underwater tasks through the manipulation of personnel on shore, and are widely used in many fields such as underwater mining, hull cleaning, and pipeline monitoring [3,4]. AUVs are mainly used in marine resource exploration, mineral resource development, and other fields due to their strong autonomy and large diving depth [5].
The research and application of underwater robots has become increasingly mature and has produced remarkable achievements in the field of marine exploration. However, important scenarios such as exploration under complex terrain, aquatic resource fishing, underwater archaeology, and marine biological censuses still rely mainly on divers. The marine environment is dangerous, and real-time monitoring of the underwater environment is difficult because water bodies strongly attenuate electromagnetic waves. Moreover, human diving depth is very limited and depends on relevant equipment to maintain necessary breathing. All of these factors threaten the life safety of divers.
For years, researchers have continuously improved autonomous underwater vehicles (AUVs) to help divers complete underwater work. Effective communication between the diver and the AUV is critical in a human-machine co-diving environment. A good interaction mode can maximize the role of AUVs and promote human-machine cooperation in underwater tasks [6]. During this period, underwater optical images have been widely used in underwater archaeology [7], seabed resource exploration [8], underwater environment detection [9], and other fields due to their intuitive presentation of targets, high imaging resolution, and high information content.
At present, wireless communication dominated by acoustics is often used underwater, because cable communication is severely limited there. However, underwater acoustic communication still has two major drawbacks: dependence on expensive acoustic equipment, and the narrow bandwidth and low rate of underwater acoustic signal transmission [10]. Therefore, to realize communication and interaction between the diver and the robot, many researchers [11][12][13] have adopted the method of recognizing and analyzing the captured diver's gestures to obtain the diver's commands and intentions. In this way, information can be effectively transmitted between human and machine, and expensive acoustic equipment can be avoided [14].
Due to the particularities of the underwater environment and the absorption of light by water bodies, underwater gesture interaction often suffers from problems such as the inability to correctly capture gestures and poor gesture recognition performance [15]. To solve these problems, we added underwater image enhancement and human tracking before gesture recognition. Underwater image enhancement performs color correction and image restoration, recovering the real scene information of the underwater image. Underwater human tracking not only enables the AUV to continuously follow divers, ensuring that they always stay in its field of vision, but also lets the system search for preset gestures within the human body region instead of the whole image, improving the accuracy of gesture recognition.
In addition, because they work underwater, AUVs have strict requirements on the weight and volume of the equipment they carry, so they cannot load high-power, large-volume desktop-level computing equipment. Therefore, it is necessary to select embedded devices with small volume, low power consumption, and high performance, combined with low-complexity algorithms, to complete underwater tracking and detection tasks.
On the basis of previous studies, we propose a gesture-based interaction scheme, assisted by human tracking, for an underwater accompanying robot (UAR). The visual processing algorithm is deployed on the embedded AI device Jetson Xavier NX to realize real-time tracking and gesture recognition of underwater targets. The structure of this work is as follows: Section 2 introduces existing related research. Section 3 introduces the overall framework of the system and the designed algorithms. Section 4 describes the datasets, experimental process, and results. Section 5 presents conclusions and prospects for follow-up work.

Related Work
At present, gesture-based robot control is widely used in daily life, industrial production, and other fields because of its ease of use, high recognition accuracy, and low learning cost. In [16], a gesture-based indoor mobile robot was proposed, and the movement of the robot was controlled by gestures captured by a Kinect camera. To improve the robot's gesture recognition ability, [17] fused features from a visual system and an inertial system, and evaluated the influence of four commonly used algorithms on gesture recognition. Gesture-based human-machine interaction was introduced in detail in [18], but the methods involved are mostly applicable to the air environment. In the underwater environment, the robustness and repeatability of these methods are questionable due to color distortion, low light, bubble occlusion, and other problems [6].
In [19], gesture interaction in the underwater environment was studied in depth. Considering the influence of complex scenes, divers' clothing, and harsh lighting conditions, a more complex Caddian language was designed for communication between divers and AUVs. This language is built upon consolidated and standardized underwater gestures, which are combined into control commands to accomplish complex tasks such as ascending a fixed distance or taking photos at fixed points. While such enhanced underwater gestures can communicate information, the practicality and understandability of the gestures is limited, because the actual physical problems of implementation were not considered.
In recent years, in view of the strong recognition performance of neural networks, many studies [20][21][22] have adopted convolutional neural networks to complete gesture-based underwater human-machine interaction, achieving good recognition and control effects. However, in terms of experimental implementation, the above schemes are all based on test data collected in the laboratory, without combining with actual hardware equipment to obtain experimental verification results in a real scene.
For the choice of high-performance mobile devices, embedded AI devices such as the NVIDIA Jetson series are widely used. The reason is not only that Jetson provides powerful AI computing power and parallel computing, but also that it offers small size, low power consumption, and rich external expansion. At the same time, the Jetson platform is compatible with the JetPack software development kit (SDK), so it can run applications such as deep learning, computer vision, and accelerated computing software libraries. These features enable Jetson to be used in various scenarios, including autonomous driving [23][24][25], industrial quality monitoring [26], medical testing [27,28], etc. [29][30][31].
Less research has been done on underwater human-machine gesture interaction. Works on underwater robot systems based on gesture recognition and assisted by target tracking are very few, and many aspects remain to be examined.

Framework and Methodology
Due to the special nature of the underwater environment and the absorption of light by water, a simple gesture recognition algorithm may not achieve good results. Therefore, we combine underwater image enhancement and human tracking, proposing an interactive, gesture-based underwater accompanying robot system. The system block diagram is shown in Figure 1. First, we identify the underwater target person in the enhanced underwater image as the target for subsequent continuous tracking. Then, the gesture recognition algorithm obtains the gestures made by the tracked target. Finally, according to the target coordinates and gesture recognition results, the processing results are transmitted as instructions to the STM32 control board. The results of each step can also be output individually.

Image Enhancement
Due to the effects of absorption and scattering, underwater images suffer from very serious color distortion. Especially when the camera is far away from the target, the longer the light propagation path, the more serious the color distortion. The attenuation depends on the wavelength of light, and images collected in deep water are usually blue-green because the red and purple light has been completely attenuated. Therefore, it is very important to perform color correction and image restoration for underwater images, which can recover the real scene information.
The scattering model for the underwater image [32] can be simplified to the following Equation (1):

I(x) = J(x) t(x) + B (1 − t(x)),    (1)

where I(x) is the observed underwater image, J(x) is the scene radiance to be recovered, t(x) is the transmission map, and B is the background light.
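As a concrete illustration, the restoration step amounts to a per-channel inversion of this model. The following is a minimal pure-Python sketch, not the paper's Estimation Network: it assumes the transmission t and the background light B for a pixel have already been estimated, and the clamping threshold t_min is our own choice to avoid amplifying noise where transmission is near zero.

```python
def restore_pixel(i, t, b, t_min=0.1):
    """Invert the simplified scattering model I = J*t + B*(1 - t)
    for one channel value: J = (I - B*(1 - t)) / t.
    The transmission t is clamped to t_min so that near-opaque
    regions do not blow up the restored value."""
    t = max(t, t_min)
    return (i - b * (1.0 - t)) / t
```

Running the forward model and then this inversion recovers the original radiance, which is the sanity check any estimated (t, B) pair should approximately satisfy.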

Underwater Human Tracking
To realize the continuous tracking of underwater humans by the UAR, we use a combination of a target detection network and a target tracking algorithm to continuously track specific targets in the image, providing support for the subsequent gesture recognition. Considering the limited memory and computing power of embedded devices, we chose the SSDLite detection network for its favorable speed and size. The SSD (Single Shot MultiBox Detector) algorithm [34] is one of the most widely used algorithms in object detection; it is an end-to-end, single-shot, real-time detection deep neural network. It integrates the regression idea of YOLO [35] and the candidate box mechanism of Faster R-CNN [36]. By using the idea of regression, SSD greatly reduces the computation of the neural network and improves running speed. In addition, SSD uses local feature extraction to obtain features at different positions, aspect ratios, and sizes, which is more efficient for a particular location than YOLO's global feature extraction. The overall objective loss function of the network is the weighted sum of the localization loss L_loc and the confidence loss L_conf.
The loss is given by L(x, c, l, g) = (1/N) (L_conf(x, c) + α L_loc(x, l, g)), where N is the number of matched default boxes, L_loc is the Smooth L1 loss [36] between the parameters of the predicted box (l) and the ground truth box (g), L_conf is the softmax loss over the multi-class confidences (c), and the weight term α is set to 1 by cross-validation.
Based on SSD, SSDLite replaces all of the regular convolutions in the detection layers with separable convolutions (a depthwise convolution followed by a 1 × 1 projection) [37]. Compared with the conventional SSD model, SSDLite has fewer parameters and lower computational cost, so it is better suited to embedded devices.
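The parameter saving from this replacement is easy to quantify. The following sketch (our own illustration, with a hypothetical layer size) compares the weight counts of a regular k × k convolution and its depthwise-separable counterpart, ignoring biases and batch-norm parameters:

```python
def conv_params(k, c_in, c_out):
    # regular convolution: one k x k x c_in kernel per output channel
    return k * k * c_in * c_out

def separable_params(k, c_in, c_out):
    # depthwise: one k x k kernel per input channel,
    # followed by a 1 x 1 projection from c_in to c_out channels
    return k * k * c_in + c_in * c_out

# e.g. a hypothetical 3 x 3 layer with 256 inputs and 512 outputs
regular = conv_params(3, 256, 512)        # 1,179,648 weights
separable = separable_params(3, 256, 512) # 133,376 weights
```

For this example layer the separable form uses roughly 8.8× fewer weights, which is why SSDLite fits embedded devices more comfortably.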
For real-time detection on mobile embedded platforms, a backbone with good accuracy and real-time performance is needed for SSDLite. The MobileNet [38] network is a lightweight CNN proposed by Google, which replaces ordinary convolution with depthwise separable convolution to improve the accuracy and speed of detection while reducing the model size.
MobileNetV3 [39], while preserving the depthwise separable convolution of V1 and the inverted residual block of V2, introduces a lightweight attention module based on the squeeze-and-excitation structure, exploiting channel relationships to enhance the learning ability of the network. Meanwhile, the h-swish and h-sigmoid functions shown in Equations (4) and (5) replace the network's swish and sigmoid functions, respectively:

h-swish(x) = x · ReLU6(x + 3) / 6,    (4)

h-sigmoid(x) = ReLU6(x + 3) / 6.    (5)

Thus, the computation of the network is further reduced and the model is made more suitable for mobile embedded platforms.
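Equations (4) and (5) are simple enough to express directly; a minimal pure-Python sketch (our own illustration, scalar-valued for clarity) is:

```python
def relu6(x):
    # ReLU capped at 6, the building block of both approximations
    return min(max(x, 0.0), 6.0)

def h_sigmoid(x):
    # Equation (5): piecewise-linear approximation of the sigmoid
    return relu6(x + 3.0) / 6.0

def h_swish(x):
    # Equation (4): x scaled by the hard sigmoid gate
    return x * h_sigmoid(x)
```

Both functions avoid the exponential in swish/sigmoid, which is what makes them cheaper on embedded hardware.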
Since the underwater tracking scenario demands real-time performance, we discarded neural network tracking algorithms with poor speed and chose the faster KCF algorithm for underwater human tracking. The human body in the underwater frame is detected using SSDLite, and the kernelized correlation filters (KCF) tracker is initialized with the detection result. In the subsequent tracking stage, the tracker densely samples around the target position of the previous video frame, and all of the obtained samples are input to the trained ridge regression classifier, which yields the position of the target. The overall algorithm flow is shown in Figure 3. In addition, KCF uses the diagonalization property of circulant matrices in the frequency domain to convert the computation from the time domain to the frequency domain; combined with the fast Fourier transform, this greatly improves tracking speed.
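The division of labor between detector and tracker can be sketched as the following control loop. This is a structural sketch only: `detector` and `tracker` are hypothetical stand-ins for the SSDLite network and the KCF tracker, with interfaces loosely modeled on OpenCV's tracker API.

```python
def track_sequence(frames, detector, tracker):
    """Detect-then-track loop: the detector (SSDLite in the paper)
    initializes the tracker (KCF), which is re-initialized whenever
    tracking fails or no target has been found yet."""
    boxes = []
    box = None
    for frame in frames:
        if box is None:
            # no target yet (or tracking lost): fall back to detection
            box = detector.detect(frame)   # returns a box or None
            if box is not None:
                tracker.init(frame, box)
        else:
            ok, box = tracker.update(frame)
            if not ok:
                box = None                 # re-detect on the next frame
        boxes.append(box)
    return boxes
```

The expensive detector thus runs only when needed, while the cheap KCF update carries most frames; this is the same re-initialization rule as in Algorithm 1.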

At the same time, due to the peculiarities of the underwater environment, the UAR needs to maintain a very close formation with the target. To ensure that the UAR can effectively track the target, the visual center of the tracked target should be kept as close to the image center as possible. To this end, we designed a target tracking rule called "Active Control" to adjust the UAR's position and attitude in time. The principle of the "Active Control" algorithm is to always keep the distance between the target center and the frame center within a certain range, as shown in Figure 4. A limited box LBOX is constructed around the frame center point FCP, with a default side length thre. The target center point TCP must remain inside this square at all times. The area outside the LBOX is divided into 8 blocks: Upper Left, Upper, Upper Right, Left, Right, Lower Left, Lower, and Lower Right. When the target leaves the restriction square, the next control command is sent to the STM32 control board according to the area and position of the TCP. Finally, in the underwater tracking process of this project, we used the target position obtained by SSDLite to initialize the KCF tracker when tracking starts and whenever tracking fails. The overall algorithm flow pseudo-code is shown in Algorithm 1, where cmd denotes the control command passed to the STM32 control board.
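The region test at the heart of "Active Control" can be sketched as follows. This is our own minimal illustration, using image coordinates (y increasing downward); the mapping from region to thruster command is left to the STM32 side.

```python
def region_of(tcp, fcp, thre):
    """Classify the target center point TCP relative to the LBOX,
    a square of side `thre` centered on the frame center point FCP.
    Returns 'Inside' or one of the eight outer regions."""
    half = thre / 2.0
    dx = tcp[0] - fcp[0]
    dy = tcp[1] - fcp[1]
    horiz = 'Left' if dx < -half else 'Right' if dx > half else ''
    vert = 'Upper' if dy < -half else 'Lower' if dy > half else ''
    if not horiz and not vert:
        return 'Inside'                  # no correction needed
    return (vert + ' ' + horiz).strip()  # e.g. 'Upper Left', 'Right'
```

When the result is anything other than 'Inside', the UAR issues the correcting command that pushes the TCP back toward the frame center.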

Gesture Recognition
There are several problems with currently available gesture datasets that prevent their direct use in this project. On the one hand, most of these datasets were collected in the open-air environment, which differs substantially from the underwater environment. On the other hand, the comprehensibility of the gestures in these datasets is poor, and robot control was not specifically considered. For these reasons, we designed eight simple and practical gestures for the UAR. Each gesture corresponds to an action of the UAR, such as an open hand (five fingers) for the light and a closed fist for stop, as detailed in Table 1. We used the SSDLite network as the gesture recognition network and adopted the MobileNetV3 mentioned above as its backbone. We used self-labeled datasets for training and testing the SSDLite gesture recognition network. The network is implemented on the PyTorch framework, and all of the above models were trained and tested on an NVIDIA GeForce RTX 3080 Ti. Our training batch size was 132 with a learning rate of 1 × 10^−3; training lasted 72,000 iterations, and the learning rate was divided by 10 after the 48,000th and 60,000th iterations.
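The step schedule described above can be sketched as a small helper; this is our own illustration of the stated hyperparameters, not the actual training script:

```python
def learning_rate(iteration, base_lr=1e-3,
                  milestones=(48000, 60000), gamma=0.1):
    """Learning rate at a given iteration: divided by 10 after
    each milestone, matching the schedule described in the text."""
    lr = base_lr
    for m in milestones:
        if iteration >= m:
            lr *= gamma
    return lr
```

In PyTorch the same behavior is typically obtained with a multi-step scheduler; the helper above just makes the resulting rate at any iteration explicit.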

Datasets
In our project, we target underwater humans, including scuba divers, free divers, and swimmers in pools. To ensure that SSDLite achieves a good underwater detection effect, we discarded land datasets such as COCO [40] and PASCAL VOC [41]. The Diver underwater human dataset was produced through self-shooting and network search. It contains a total of 1398 images, covering underwater humans under different color degradation models, in different underwater scenes, with different degrees of clarity, and in different costumes. The dataset is diverse, and some images are shown in Figure 6. For the gesture recognition dataset, according to the gesture types in Table 1, we collected about 600 air and underwater images for each gesture, reaching a total of 5541 images, and used random cropping and affine transformation to expand the data. A 4:1 ratio was used to split the training and test datasets; some images are shown in Figure 7.

Performance Evaluation

Image Enhancement Performance
We compared the proposed Estimation Network image enhancement algorithm with other methods, including simple white balance, underwater dark channel, multi-scale fusion [42], DN-based [43], and UGAN [44]. The experimental results are shown in Figure 8. Then, we used the objective image quality indicators peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) to evaluate and compare the performance of the proposed method against the others. Table 2 shows the PSNR and SSIM values of different methods for the original image (a) in Figure 9, where (b) is the underwater image generated by the method in [45], and (c-f) represent the outputs of the different methods on the underwater image. The larger the SSIM and PSNR values, the closer the output image is to the ground truth, that is, the higher the accuracy. The evaluation results show that the method proposed in this paper performs well on images of different degradation types, and performs best among the compared methods.
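PSNR, one of the two indicators used here, is straightforward to compute. The sketch below is our own illustration of its definition, operating on flat pixel sequences rather than 2-D images:

```python
import math

def psnr(reference, output, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length
    pixel sequences; larger values mean the output is closer to
    the reference (ground truth) image."""
    assert len(reference) == len(output)
    mse = sum((r - o) ** 2 for r, o in zip(reference, output)) / len(reference)
    if mse == 0:
        return float('inf')   # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)
```

SSIM is structurally more involved (local means, variances, and covariances over sliding windows), so in practice both metrics are usually taken from an image-processing library rather than implemented by hand.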

Recognition Performance
We trained different detection methods and compared their performance on the Diver test set. The test conditions were as follows: NVIDIA Jetson Xavier NX, JetPack 4.6, and MODE_10W_DESKTOP. The results are shown in Table 3. As can be seen from the table, YOLOv3 achieved the highest mAP value in terms of detection accuracy. However, in terms of speed, all methods except SSDLite had poor real-time performance and could not realize real-time detection on embedded devices. Considering accuracy, speed, and model size together, the SSDLite detection algorithm with MobileNetV3 as the backbone performed best. In addition, to demonstrate the superiority of the Diver dataset, an SSDLite model trained on the PASCAL VOC 07 + 12 dataset was used to detect a person in an underwater environment; the mAP obtained was only 0.4879, far lower than the 0.882 obtained by training on the Diver dataset. A large number of images in the test set could not be correctly detected. As shown in Figure 10, the first row shows the test results using the air training set, and the second row shows the test results using the Diver dataset. In terms of false detection rate, missed detection rate, and detection accuracy, the Diver dataset showed better performance. Similarly, we tested the trained gesture recognition network model on the test set and compared the underwater results with different backbones, as shown in Table 4. The MobileNetV3 Small model we used achieved 99.94% mAP on Jetson Xavier NX at a recognition speed of 23.5 FPS, with a model size of only 5.71 MB, performing best overall.

Comparison of Performance before and after Human Tracking
To test the effect of adding underwater human tracking, we used part of the underwater dataset as the test set. Taking MobileNetV3 (small) as the test model, we compared the recognition results with and without tracking. Some gesture recognition results are shown in Figure 11. Subfigures (a.1), (b.1), and (c.1) in Figure 11 show the results of using gesture recognition alone. Subfigures (a.2), (b.2), and (c.2) show the results of adding tracking before gesture recognition. The comparison of (a.1) and (a.2) shows that the detection score of gesture recognition improved greatly, from 0.76 to 0.99. In addition, a large number of samples in the test set could not be detected by gesture recognition alone, as shown in (c.1). After adding underwater human tracking, better performance was obtained, as shown in (c.2).
We used the trained MobileNetV3-Small model to test the underwater gesture recognition accuracy before and after tracking. Table 5 details the mean average precision (mAP) of gesture recognition before and after adding tracking. As the table shows, the mAP improved greatly after adding underwater human tracking. As the score threshold increased (i.e., removing network detection results below the threshold), the improvement rate increased from 40% to 105%.
Additionally, Figure 12 shows that as the score threshold increased, gesture recognition with underwater human tracking remained more stable and continued to maintain high recognition accuracy. The reason is that tracking removes a large number of interfering pixels and makes the gesture recognition focus on finding preset gestures within the human body image rather than the whole image. In summary, adding person tracking before underwater gesture recognition not only improves the detection speed (due to the smaller input image), but also greatly improves the accuracy of the recognition results.
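The cropping step that produces this smaller input can be sketched as follows; the 10% margin is our own assumed value, added so that hands raised near the edge of the person box are not cut off:

```python
def crop_region(box, frame_w, frame_h, margin=0.1):
    """Expand the tracked person box by a relative margin and clamp
    it to the frame; the returned region is what gesture recognition
    sees instead of the whole image."""
    x1, y1, x2, y2 = box
    mx = (x2 - x1) * margin
    my = (y2 - y1) * margin
    return (max(0, int(x1 - mx)), max(0, int(y1 - my)),
            min(frame_w, int(x2 + mx)), min(frame_h, int(y2 + my)))
```

The gesture network then runs on this sub-image only, which removes background clutter and shrinks the input, explaining both the accuracy and the speed gains reported above.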

System Experiment
To test the overall performance of the system, we conducted outfield experiments on the UAR in a swimming pool. The experimental scene is shown in Figure 13, and the whole UAR is shown in Figure 14. The ROV used was the Trench Wanderer 110 ROV, which is equipped with four thrusters giving it three degrees of freedom (surge, heave, and yaw). For the hardware, the underwater camera was a GoPro7, chosen for its wide viewing angle and enhanced anti-shake function; it connects to the processing unit through a USB interface to provide a better picture for the UAR. As the core processing unit, we chose the NVIDIA Jetson Xavier NX to carry and run the recognition network. With computing power of 21 TOPS, memory bandwidth of more than 51 GB/s, low power consumption (10 W), and a comprehensive expansion interface, the Jetson Xavier NX is suitable for the perception and detection of complex underwater environments. We chose an STM32 microcontroller (STM32F407, ARM) to cooperate with the NX3 flight control to complete the control of the underwater robot: on the one hand, the attitude change of the robot was monitored through an on-board gyroscope and accelerometer to avoid excessive inclination or rollover; on the other hand, the control information sent by the Jetson Xavier NX was received through UART, and 8 PWM channels were used to drive the robot's motors to complete the set underwater work. The overall control flow of the system is shown in Figure 15. With the exception of the floodlight, all other devices were installed in a waterproof shell made of transparent polycarbonate (PC). Finally, the built-in battery pack supplied power to all internal electronics through various voltage conversion modules. During the experiment, the diver entered the water with the UAR, initialized it, and designated the main tracking target through the preset gestures. Then, the UAR engaged its thrusters and continuously tracked the target diver, always
keeping the target diver in the center of the frame. At the same time, the UAR performed gesture recognition on the cropped video stream (cropped according to the position of the target). Because the complex environmental background was eliminated, the recognition area was narrowed, and missed and false recognitions were greatly reduced. Moreover, it also recognized blurred moving images well, as shown in Figure 16.
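The command link from the Jetson to the STM32 can be illustrated with a small framing helper. The frame layout below (header byte, command id, signed target offsets, additive checksum) is entirely hypothetical; the paper does not specify the actual UART protocol.

```python
import struct

def pack_command(cmd_id, dx, dy):
    """Pack a hypothetical UART frame for the STM32 control board:
    a 0xAA header, a command id, signed 16-bit x/y offsets of the
    target from the frame center, and a one-byte additive checksum."""
    payload = struct.pack('<BBhh', 0xAA, cmd_id, dx, dy)
    checksum = sum(payload) & 0xFF
    return payload + bytes([checksum])
```

On the microcontroller side, a matching checksum lets corrupted frames be dropped rather than acted on, which matters on a noisy serial link inside a thruster-driven hull.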

In terms of recognition speed, the target tracking time was 33.0 ms and the model inference time was 31.9 ms. By setting the gesture recognition frequency to once per second, our algorithm was able to achieve 29.4 FPS on the Jetson Xavier NX, satisfying the real-time requirements.

Conclusions
In this paper, we proposed an underwater accompanying robot system based on gesture recognition, and built a prototype for experimental verification. The work mainly focuses on underwater visual processing and gesture interaction. Through experimental verification, we found that using an underwater training dataset and adding human tracking before recognition can greatly improve the recognition effect. In the field tests, our algorithm achieved a 99.92% mAP score on Jetson Xavier NX and ran stably at 29.4 FPS, meeting our requirements for the accuracy and speed of underwater gesture recognition.
However, our work still needs improvement, especially the hardware design and control parts of the robot. A complete underwater robot system needs a reliable and effective hardware structure as support. Therefore, we will gradually improve the hardware structure in future work to give full play to the joint role of hardware and algorithms. In addition, although we have verified the algorithm and system in a swimming pool, we have not yet verified performance in the outdoor marine environment. Next, on the basis of this experiment, we will continue to optimize the hardware structure and algorithm design to improve the overall robustness of the system, and will conduct test experiments in the outdoor marine environment to continuously improve the stability and accuracy of the system in complex environments according to the results obtained.

Figure 4 .
Figure 4. "Active Control" schematic. The UAR's movement is controlled according to the position of the target center point TCP, keeping the TCP inside the limited box LBOX at the center of the frame.

Figure 5 .
Figure 5. Underwater human body tracking test results. (a) The initial frame; (b-d) the subsequent tracking results of the tracker.

Figure 6 .
Figure 6. Different types of images in the Diver dataset. (a) Green. (b) Blue. (c) Indoor. (d) Outdoor.

Figure 7 .
Figure 7. Images in the air and underwater parts of the gesture recognition dataset. (a) Air gesture dataset. (b) Underwater gesture dataset.

Figure 10 .
Figure 10. Comparison of test results between the air dataset and the Diver dataset. (a) The results of the model trained on the air dataset. (b) The results of the model trained on the Diver dataset.

Figure 11 .
Figure 11. Performance before and after human tracking on the underwater test set. (a.1,b.1,c.1) show the results of using only gesture recognition. (a.2,b.2,c.2) show the results of adding tracking before gesture recognition.

Figure 12 .
Figure 12. Gesture recognition accuracy before and after human tracking as the score threshold increases. Blue and green lines represent the results before tracking; red and orange lines represent the results after tracking.

Figure 14 .
Figure 14. Overall UAR structure. It consists of the ROV, a Jetson Xavier NX (core processing unit), a GoPro7 (underwater camera), an STM32 control board, and other components.

Figure 16 .
Figure 16. Outfield experiment results. (a) The gesture interaction scene. (b) The results of gesture recognition.

Table 1 .
Gestures corresponding to robot actions.

Table 2 .
PSNR and SSIM values between the output of different methods and the original image.
Bold indicates the best result.

Table 3 .
Performance of different detection methods on the Diver test set.

Table 4 .
Performance of different backbones on gesture recognition dataset.

Table 5 .
Gesture recognition accuracy before and after person tracking with MobileNetV3 Small model.