Robust Gesture-Based Communication for Underwater Human-Robot Interaction in the context of Search and Rescue Diver Missions

We propose a robust gesture-based communication pipeline for divers to instruct an Autonomous Underwater Vehicle (AUV) to assist them in performing high-risk tasks and helping in case of emergency. A gesture communication language (CADDIAN) is developed, based on consolidated and standardized diver gestures, including an alphabet, syntax and semantics, ensuring a logical consistency. A hierarchical classification approach is introduced for hand gesture recognition based on stereo imagery and multi-descriptor aggregation to specifically cope with underwater image artifacts, e.g. light backscatter or color attenuation. Once the classification task is finished, a syntax check is performed to filter out invalid command sequences sent by the diver or generated by errors in the classifier. Throughout this process, the diver receives constant feedback from an underwater tablet to acknowledge or abort the mission at any time. The objective is to prevent the AUV from executing unnecessary, infeasible or potentially harmful motions. Experimental results under different environmental conditions in archaeological exploration and bridge inspection applications show that the system performs well in the field.


I. INTRODUCTION
From the robotics perspective, underwater environments present numerous technological challenges for communication, navigation, image processing and other areas. Sensing is particularly difficult because electromagnetic waves are strongly attenuated in water: there is no GPS-based localization, no radio communication, and only limited use of visible light. Acoustic sensors are mostly used instead, but they offer low-bandwidth, high-latency transmission. It is therefore no surprise that most applications, such as biological sample acquisition, archaeological site exploration and industrial panel manipulation, still require human intervention for their successful completion.
Author affiliations: 1 Robotics Group of the Computer Science & Electrical Engineering Department, Jacobs University Bremen gGmbH, Germany. {a.gomezchavez, a.birk}@jacobs-university.de. 2 Institute of Intelligent Systems and Automation, National Research Council of Italy (CNR). 3 Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia.
For this reason, the EU FP7 CADDY project focused on diver-robot cooperation, in which an AUV monitored the divers' activities while communicating with them and performing multiple tasks on command [1]. Over the course of this project, a unique set of data was recorded covering diver gesture recognition based on the CADDIAN sign language [2], and diver pose estimation using stereo images and inertial sensor measurements from a purpose-built diver suit called DiverNet [3].
To the best of our knowledge, this is the first underwater dataset focusing on human-robot interaction between an AUV and a diver. The data mainly consist of precisely rectified stereo images suited for algorithms using 2D or 3D information, or a fusion of both; the number of samples and their variability allow for testing both feature-based and deep learning methods.
Although there are exhaustive surveys of underwater field trials and performance evaluations of object recognition and stereo systems [4] [5], the difficulty and cost of underwater data collection force authors to benchmark their methods across whatever application datasets are available, or to use very constrained ones. The work presented here aims to address this with a sufficiently large and representative dataset.
It is important to note that the use of these data is not limited to recognition and pose estimation tasks; they can serve as a basis for any vision-based algorithm in underwater applications. This is because the data were recorded under different environmental conditions that cause image distortions unique to underwater scenarios, such as low contrast, color distortion and haze [6], which cannot be easily replicated from on-land recordings or in simulation.

A. BUDDY AUV
For data collection the BUDDY-AUV was specifically designed by the University of Zagreb during the CADDY project [7]. The vehicle is fully actuated and is equipped with navigation sensors: Doppler Velocity Log (DVL), Ultra Short Baseline (USBL), GPS; and perception sensors: multibeam sonar and Bumblebee XB3 stereo camera. In addition, it has a tablet in an underwater housing to enable human-robot interaction capabilities (see Fig. 1), i.e., output feedback to the diver.

B. Stereo camera and underwater image rectification
For image collection, a Point Grey Bumblebee XB3 color stereo camera (model BBX3-13S2C) was used. It provides raw images with a resolution of 1280 × 960 pixels at 16 Hz, and has a nominal focal length of 3.8 mm and a wide baseline of B = 23.9 cm. After rectification, all images are scaled to 640 × 480 pixels; these are the dimensions of all stereo image pairs in this dataset. The camera intrinsic parameter matrix is: K = [710 0 320; 0 710 240; 0 0 1].
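Since the stereo pairs are rectified, the values of K and B above are enough to turn a disparity into metric depth through Z = f·B/d. The following is a minimal sketch of that relation, not part of the dataset tools; it assumes the disparity is measured in pixels on the 640 × 480 rectified pair with the left image as reference, and the function names are illustrative only.

```python
# Intrinsics of the rectified 640x480 images, taken from the matrix K above.
FOCAL_PX = 710.0
CX, CY = 320.0, 240.0
BASELINE_M = 0.239  # stereo baseline B = 23.9 cm


def depth_from_disparity(disparity_px):
    """Return the depth Z in metres for a disparity given in pixels."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return FOCAL_PX * BASELINE_M / disparity_px


def back_project(u, v, disparity_px):
    """Back-project pixel (u, v) of the reference image to a point (X, Y, Z)."""
    z = depth_from_disparity(disparity_px)
    x = (u - CX) * z / FOCAL_PX
    y = (v - CY) * z / FOCAL_PX
    return x, y, z
```

For example, a disparity of 71 pixels corresponds to a depth of 710 × 0.239 / 71 = 2.39 m.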
The camera was enclosed in a watertight housing with a flat glass panel (see Fig. 1). With such a housing, the light is refracted twice: first at the water-glass interface and then at the glass-air interface. These refraction effects distort the image; as discussed in [8], a camera behind a flat glass panel underwater does not possess a single viewpoint, and therefore the classic pinhole model is not valid. This problem was addressed in [9] by proposing a new Pinax (PINhole-AXial) camera model that allows for rectification correction by "translating" the image to a rectified, pinhole camera viewpoint. This method was tested on multiple types of cameras, including the Bumblebee XB3, and yielded higher quality results than direct underwater calibration, i.e., recording a calibration pattern underwater. One of the main reasons is that pattern detection is commonly less accurate on raw, distorted underwater images, which have low contrast and radial distortions due to magnification artifacts. Instead, Pinax uses the physical camera model, water salinity and glass thickness to map the air-rectified image to its underwater model.
Examples of this rectification process are shown in Fig. 2; in-air intrinsic calibration was done using the CamOdoCal software package [10] with the camera model from [11]. The calibration files obtained for the Bumblebee XB3 instances used are provided for the user's inspection, along with the CamOdoCal and Pinax packages in the form of a Docker container [12], which can be used for underwater camera housings with flat glass panels (git@github.com:jacobs-robotics/uw-calibration-pinax.git).

C. DiverNet
During body pose recordings, divers used the DiverNet; its hardware, software and data acquisition modules are described in detail in [3]. In summary, DiverNet is a network of 17 Pololu MinIMU-9 Inertial Measurement Units (IMUs) with 9 degrees of freedom (DoFs). They are distributed and mounted as shown in Fig. 3a, 3b: 3 on each arm and leg, 1 on each shoulder, and 1 each on the head, torso and lower back. Since it is practically impossible to keep the sensors perfectly aligned and firmly in place during tests, a calibration procedure is performed by asking the diver to hold a T-posture and rotating each sensor's coordinate frame to the expected pose. Raw and filtered orientation for each sensor is then computed as follows:
• Raw orientation is acquired based on the magnetometer data and the gravity distribution along each of the accelerometer axes.
• Filtered orientation is computed by fusing the raw orientation with the gyroscope data through a Madgwick-Mahony filter [13].
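The raw orientation step can be illustrated with a standard tilt-compensated compass computation: roll and pitch follow from how gravity is distributed over the accelerometer axes, and yaw from the magnetometer reading rotated back into the horizontal plane. This is only a sketch of the general technique, not the DiverNet implementation described in [3]; the axis convention (x forward, y right, z down, with a level sensor reading gravity on +z) is an assumption, and the gyroscope fusion of the filtered orientation is not shown.

```python
import math


def raw_orientation(accel, mag):
    """Estimate (roll, pitch, yaw) in degrees from one accelerometer and
    one magnetometer sample, each given as an (x, y, z) tuple.

    Assumed convention: x forward, y right, z down; a level sensor
    measures gravity along +z.
    """
    ax, ay, az = accel
    mx, my, mz = mag
    # Roll and pitch from the gravity distribution over the three axes.
    roll = math.atan2(ay, az)
    pitch = math.atan2(-ax, math.hypot(ay, az))
    # Rotate the magnetic field vector back into the horizontal plane.
    mx_h = (mx * math.cos(pitch)
            + my * math.sin(roll) * math.sin(pitch)
            + mz * math.cos(roll) * math.sin(pitch))
    my_h = my * math.cos(roll) - mz * math.sin(roll)
    yaw = math.atan2(-my_h, mx_h)
    return tuple(math.degrees(angle) for angle in (roll, pitch, yaw))
```

Units cancel in the ratios, so the accelerometer and magnetometer samples can be passed in any consistent scale.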
For data collection, all IMU sensors operate at their maximum range, i.e., accelerometer ±16 g, magnetometer ±1.2 mT, and gyroscope ±2000 deg/s. Values are recorded at 50 Hz through an optical fiber connecting the DiverNet data acquisition unit to an on-land workstation.
The filtered orientation consists of absolute values, globally referenced to the map frame M; it is therefore expressed in the BUDDY-AUV frame R through the transformation ^R_M T (from frame M to frame R). The diver orientation is taken as the average of the torso and lower back IMU outputs, and its angle in the XY plane, denoted as the heading φ, is the one reported in this dataset (see Fig. 3c). As in the EU FP7 CADDY trials, the data can be exploited to obtain the diver's swimming direction (heading) and to test AUV tracking algorithms based on stereo imagery.

A. Data collection
The recordings for both the underwater gesture and the diver pose database took place at 3 different locations covering open sea, indoor pool and outdoor pool conditions: Biograd na Moru and the Brodarski Institute, Croatia, and Genova, Italy, respectively. The collected data was then further divided into 8 different scenarios according to the type of conditions that affect image quality. Underwater gestures were recorded in all of them, whereas diver pose/heading was recorded in only 3 of them. Table I presents a description of these scenarios, including their dynamic and environmental properties. Dynamics refer to the relative motion between the AUV and the diver caused by currents or by the diver swimming; environmental characteristics mainly cover the type of illumination (bright, fairly dim or strongly dim) and other distortions, i.e., blur, haze and color absorption (high blue component). Figure 4 shows a sample image from each setting. All of the described data and the software tools for its analysis and visualization are hosted at http://caddy-underwater-datasets.ge.issia.cnr.it/

B. Underwater gestures
The underwater gesture database is a collection of annotated rectified stereo images of divers using the CADDIAN gesture-based language [2]. It is important to mention that the number of samples and the class distribution vary significantly between scenarios (see Figs. 5, 6 and 7). This is because recordings were made at different development stages of the EU FP7 CADDY project.
The Biograd-A, Biograd-B and Genova-A trials were done mainly for data collection, hence their high number of samples; the rest of the data was collected during test experiments and real diver missions. However, since all of these scenarios have different environmental conditions and image quality levels, they are useful to:
• Test the robustness of algorithms and/or image features across different unseen settings, i.e., image distortions.
• Investigate which underwater image distortions have greater impact on classification methods.
• Balance the number of training samples used per scenario to achieve better performance.
• Find approaches that fuse 2D and 3D information from the stereo pairs to boost performance.
• Test not only object classification methods but also object detectors, i.e., methods that locate the diver's hand.
For reference, the diver's gloves have a circle of 2.5 cm radius on the forehand and a 5 cm square on the backhand, both with a 1 cm white border. Likewise, each finger has a color stripe; HSV colors and fingers are associated as follows:  (70, 60, 100)}. The main goal was to provide a defined texture for the diver's hands (the target object) to help classification and disparity calculation; at the same time, the different finger colors help in studying underwater color absorption.

1) Data description:
From all the scenarios mentioned in the previous section, 9231 annotated stereo pairs were gathered for 15 classes (gesture types), i.e., 18462 samples in total. Fig. 6 shows the distribution of samples per class. The number of samples is considerably higher for some classes; since the data was acquired from real diver missions, it is representative of the CADDIAN language distribution, in the same way that we use some words more frequently than others in our daily speech. Because of this, and because each recording scenario has a different class distribution, we provide Figs. 5, 6 and 7 so that users can decide how to build training/test sets that suit their application, for example by following the original distribution or by building a balanced dataset. Likewise, we include 7190 true negative stereo pairs (14380 samples) that contain background scenery and divers who are not gesturing; these follow the same distribution per scenario as the true positives in Fig. 5.
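One way to build the balanced dataset mentioned above is to undersample every class to the size of the smallest one. The sketch below shows this with the Python standard library only; the column name "label id" and the toy rows are assumptions for illustration, and real rows would be read from the true positives *.csv file.

```python
import random
from collections import defaultdict


def balanced_subset(rows, label_key="label id", seed=0):
    """Undersample every class to the size of the smallest class."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    n_min = min(len(items) for items in by_class.values())
    rng = random.Random(seed)  # fixed seed for a reproducible split
    subset = []
    for label in sorted(by_class):
        subset.extend(rng.sample(by_class[label], n_min))
    return subset


# Hypothetical toy rows standing in for entries of the tabular file.
rows = [{"label id": 0}] * 5 + [{"label id": 1}] * 2 + [{"label id": 2}] * 3
subset = balanced_subset(rows)
```

Undersampling discards data from frequent classes; following the original distribution instead keeps all samples but preserves the class imbalance.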

2) Data parsing:
All of these data are compiled in tabular form in *.csv files, as these do not require any extra software to be handled and most data analysis packages have built-in methods to process them. One file contains the true positives data and the other the true negatives. Table II shows the header/column fields in these files; the row with index 0 contains a brief description of the field data, row index 1 shows an example of how a true positive image is referenced, and row index 2 shows an example of a true negative. An explanation of these fields is also given below:
• Scenario: Name corresponding to a location that encompasses particular settings affecting the quality of the image according to Table I.
• Label name: String that identifies the gesture class.
• Label id: Integer that identifies the gesture class.
• Roi left/right: Arrays that describe the regions of interest in the left/right image i.e., where the hand-gesture is located. Each array element is separated by a comma. When 2 instances of the target object are present in the image, each array is separated by a semicolon (this is only true for the mosaic gesture).
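The ROI fields described above can be split with plain string operations. The following sketch assumes only what the field description states (values separated by commas, instances separated by semicolons) plus optional enclosing square brackets, which is an assumption; the number and meaning of the values per ROI are not interpreted here, and `parse_rois` is an illustrative name rather than one of the provided scripts.

```python
def parse_rois(field):
    """Parse one ROI field into a list of ROIs, each a list of integers.

    Assumed format: values separated by commas, ROIs separated by
    semicolons, each ROI optionally wrapped in square brackets.
    """
    rois = []
    for chunk in field.split(";"):
        chunk = chunk.strip().strip("[]")
        if not chunk:
            continue  # empty field, e.g. for a true negative
        rois.append([int(value) for value in chunk.split(",")])
    return rois
```

A field with a single hand yields a list with one ROI, while a field with two instances of the target object yields two.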
To augment and complement this database, we added 4 different types of distortions to the original stereo pairs: blur, contrast reduction, channel noise and compression. These standard distortions are the ones most commonly present while collecting and transmitting underwater data. Users can utilize these synthetic images or choose to apply these or other distortions themselves; nonetheless, the distortions are briefly described here because some extra header/column fields are reserved to describe them. Users can then take advantage of this format, the directory structure presented in Fig. 8 and the provided software scripts to log their own synthetic images. Table III shows the additional columns used to reference synthetic images; the param columns store key values of the applied distortions, e.g., for blur, param 1 represents the kernel size, and for compression, param 1 and param 2 refer to the compression scheme and quality level.
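To make the role of the param columns concrete, the sketch below implements two of the four distortion types on a small grayscale image stored as a list of rows. It is a generic illustration and not the code used to generate the synthetic images: the mean-based contrast reduction and the box (averaging) kernel are assumptions, as the exact variants used for the dataset are not specified here.

```python
def reduce_contrast(image, factor):
    """Scale each pixel's distance from the image mean by `factor` (0..1)."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return [[mean + factor * (p - mean) for p in row] for row in image]


def box_blur(image, kernel_size):
    """Average each pixel over a kernel_size x kernel_size window.

    Windows are clipped at the image border; kernel_size must be odd.
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    half = kernel_size // 2
    height, width = len(image), len(image[0])
    blurred = []
    for r in range(height):
        out_row = []
        for c in range(width):
            window = [image[rr][cc]
                      for rr in range(max(0, r - half), min(height, r + half + 1))
                      for cc in range(max(0, c - half), min(width, c + half + 1))]
            out_row.append(sum(window) / len(window))
        blurred.append(out_row)
    return blurred
```

In this sketch, `kernel_size` plays the role that param 1 plays for blur in Table III.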

3) Database directory and provided software:
The provided image files follow the directory structure shown in Fig. 8. As stated in Section III-A, the dataset is first divided by scenario; each scenario contains a true positives and a true negatives folder. Each of these folders in turn contains a raw directory with all the original rectified stereo pairs, plus a directory for each image distortion applied to them. Since a particular image distortion can be applied with different parameters, a subdirectory named dir_## is created for each set of parameters used. The correspondence between these subdirectories and the distortion parameters can be checked in the database tabular file (see Tables II and III). Finally, an extra folder named visualization is available for each scenario, where images with highlighted regions of interest (ROIs) or hand gesture patches are saved.
In summary, we provide the following tools/scripts for parsing and visualizing the described data, as well as files that the user can utilize as sensor reference. Their usage is explained in http://caddy-underwater-datasets.ge.issia.cnr.it/.
• Parse and visualization scripts to:
  - parse by label ID, label name and/or recording scenario.
  - apply the mentioned image distortions with user-defined parameters.

C. Diver pose estimation
The diver pose/heading database is likewise a collection of annotated rectified stereo images, extracted from video sequences showing the diver free-swimming; each stereo pair is associated with a diver heading as explained in Section II-C. In the CADDY project, the primary objective was to track the diver and position the AUV in front of them so that the diver always faces the camera. In this way, the AUV can monitor the diver's activities and communicate through gestures or the underwater tablet shown in Fig. 1.
Thus, the dataset can be used to test human pose estimation, segmentation or scene geometry understanding methods in this particular context e.g., our work in [14]. For one-shot or frame-by-frame algorithms we offer the rectified stereo pairs while for methods that consider the input history (i.e., diver's previous movements) we provide a sequence number explained in the next section (see Table IV). Data was collected from scenarios Biograd-B, Brodarski-A and Brodarski-C.

1) Data description:
To collect the data, divers were asked to perform three tasks in front of the AUV: (1) turn 360 deg horizontally (chest pointing downwards, towards the floor), (2) turn 360 deg vertically, both clockwise and anticlockwise, and (3) swim freely. For the latter, the AUV was operated manually to follow the diver. In total, 12708 rectified stereo pairs are provided, from which 3D representations can be computed as well.
The collected measurements have passed through a noise (median) filter with a buffer size of 5, and through an offset correction step (sensor bias) done manually before each test. As mentioned, φ is the diver's angle in the XY plane relative to the AUV (frame R), and 0 deg is defined as the diver facing the camera (see Fig. 3c). Hence, the values range from −180 deg to 180 deg.
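The two processing steps named above, a median filter with a buffer of 5 samples and a heading confined to the ±180 deg range, can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the project's code: the filter here is a causal running median over the most recent samples, and `wrap_deg` maps +180 deg to −180 deg so that the interval is half-open.

```python
from collections import deque
from statistics import median


def wrap_deg(angle):
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def median_filter(samples, buffer_size=5):
    """Running median over the last `buffer_size` samples."""
    window = deque(maxlen=buffer_size)
    filtered = []
    for sample in samples:
        window.append(sample)
        filtered.append(median(window))
    return filtered
```

Note that a plain median does not account for the wrap-around at ±180 deg, so samples that straddle this boundary would need special handling.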

2) Data parsing:
This dataset is also presented in tabular *.csv form, as in Table IV. Its header fields are explained as follows:
• Scenario: Name corresponding to the recording location and specific settings as in Table I.
• Sequence: Integer that identifies the sequence to which the stereo pair belongs. An image only belongs to a sequence if it is from the same scenario and forms part of a set that is continuous in time.
• Stereo left/right: cf. Table II.
• Heading: Float number in degrees that indicates the diver heading.

3) Database directory and provided software:
The provided video sequences are simply split into directories per scenario, as in the first level of the directory structure in Fig. 8. We also offer software tools to:
• Extract a stereo pair given a scenario name, a sequence or a combination of both.
• Extract all stereo pairs associated with a range of heading values.
• Output a sequence as video file for visualization purposes.
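The first two selections listed above amount to filtering rows of the tabular file. The sketch below shows one way to do this with the Python standard library; it is not one of the provided tools, and the column names, file names and in-memory sample rows are hypothetical placeholders (the actual headers are those of Table IV).

```python
import csv
import io

# Hypothetical rows; column names and file names are placeholders.
SAMPLE_CSV = """scenario,sequence,stereo left,stereo right,heading
Biograd-B,0,left_0.jpg,right_0.jpg,-12.5
Biograd-B,0,left_1.jpg,right_1.jpg,35.0
Brodarski-A,1,left_2.jpg,right_2.jpg,170.0
"""


def select_pairs(rows, scenario=None, sequence=None, heading_range=None):
    """Return (left, right) file pairs that match all given criteria."""
    selected = []
    for row in rows:
        if scenario is not None and row["scenario"] != scenario:
            continue
        if sequence is not None and int(row["sequence"]) != sequence:
            continue
        if heading_range is not None:
            low, high = heading_range
            if not low <= float(row["heading"]) <= high:
                continue
        selected.append((row["stereo left"], row["stereo right"]))
    return selected


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
# Stereo pairs in which the diver roughly faces the camera.
frontal = select_pairs(rows, heading_range=(-45.0, 45.0))
```

Criteria left as `None` are ignored, so the same function covers selection by scenario, by sequence, by heading range, or by any combination of them.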