3.1. Target Scenario
Figure 3 describes the pipeline of the proposed method for classifying underwater objects. In the survey area, the AUV scans the seabed using the FSS while moving in a lawnmower trajectory, one of the primary survey patterns for AUVs [17]. Following the lawnmower trajectory, the AUV covers the entire scan area and passes over the underwater objects. Then, by analyzing the sonar images captured as the AUV passes over an object, the 3D geometry of the underwater scene is restored as a point cloud. Because the sonar images have a low SNR and low resolution, the reconstructed point cloud is also noisy and sparse. Clustering removes the noise and extracts only the point cloud of the object. Finally, the NN predicts the class of the object from the extracted point cloud.
The proposed method can handle various difficulties of the sonar sensor. We address the 3D reconstruction of objects in order to classify underwater objects. Owing to the unique projection principle of the sonar sensor, the appearance of an object in the sonar image is strongly affected by the viewpoint. Sonar-image-based underwater object classification is therefore problematic because it is nearly impossible to predict the angle at which the AUV will encounter an object in an unstructured environment. If the object is restored in 3D, it can be classified more accurately by comparing the restored model with the 3D ground truth of the target object from various aspects.
For the classifier of the point cloud, we introduced an NN. The NN extracts and pools features through deep layers to enable robust classification from noisy and sparse sonar data.
Furthermore, the proposed method has a straightforward training process. When existing NN-based algorithms are applied to underwater sonar missions, the training step is one of the main challenges: the shape of the target object in the sonar images has to be predicted in advance, or a dataset consisting of many images has to be constructed through additional underwater experiments. On the other hand, because the proposed method classifies objects by restoring them to their original form, the training does not require sonar images. Instead, it is possible to synthesize the training dataset directly from a 3D model of the object by reflecting the characteristics of the sonar sensor. As a result, the proposed classifier is easy to implement.
We propose using a point cloud when reconstructing the object in 3D from sonar images. There are several ways to represent the 3D model of an object, such as voxels and meshes. A point cloud is a set of points with $(x, y, z)$ coordinate values; therefore, it does not consume much memory, which makes it a suitable data format for an AUV with limited computing power. Additionally, the point cloud is a standardized format that is widely used on land, so various terrestrial algorithms could be applicable to underwater operation.
The generated point cloud and classified object information can be utilized for AUV operations such as target detection and navigation. The remainder of this section explains the main elements of the proposed method in more detail.
3.2. Reconstruction of the 3D Point Cloud of an Object Using FSS
For underwater object classification, the proposed method first reconstructs the 3D geometry of an underwater object using a sonar sensor. By restoring the 3D geometry as a point cloud, the AUV can accurately classify underwater objects without excessive memory overhead.
We could generate the 3D point cloud of underwater objects from a series of 2D sonar images. Cho et al. [18] developed a method to specify the elevation angle of underwater terrain by analyzing highlights in sonar images captured consecutively by an AUV, as shown in Figure 4. Although the FSS can scan a region between the minimum range $r_{\min}$ and the maximum range $r_{\max}$ in a single capture, the FSS has a sweet spot in which the strength of the acoustic beams is concentrated and the SNR is highest. Therefore, when capturing an object with an FSS, it is common to place the object in the sweet spot so that the seabed appears in the background, as shown in Figure 4a. On the other hand, Cho et al. exploited an effect called highlight extension to restore the elevation angle of the object. As shown in Figure 4b, when the FSS approaches the object, the highlight extension effect occurs, in which the highlight of the object is observed before the seabed, since the object protrudes from the seabed. As the FSS approaches the object further, the length of the highlight extension increases until it reaches a critical position and does not change thereafter, as shown in Figure 4c,d.
The point cloud of the object can be generated using this highlight extension effect.
Figure 5 shows the geometry between the FSS and the object at the critical position. Once the object reaches the critical point, the elevation angle of the object is specified as $\phi = \theta_t - \theta_s/2$ by the tilt angle of the FSS $\theta_t$ and the spreading angle of the beam $\theta_s$. Thus, the global coordinates of the highlight pixel of the object $P^{G} = [x\;\; y\;\; z]^{T}$ can be calculated as

$$P^{G} = P^{G}_{FSS} + T^{G}_{FSS}\begin{bmatrix} r\cos\phi\cos\theta \\ r\cos\phi\sin\theta \\ -r\sin\phi \end{bmatrix},\tag{1}$$

where $P^{G}_{FSS}$ denotes the position of the FSS, $T^{G}_{FSS}$ is the coordinate transform matrix from the FSS coordinate to the global coordinate, and $r$ and $\theta$ are the range and azimuth of the highlight pixel for which the 3D coordinate is to be calculated. Here, $P^{G}_{FSS}$ and $T^{G}_{FSS}$ can be measured using the navigational sensors of the AUV, and $r$ and $\theta$ can be calculated from the sonar image as

$$r = r_{\min} + \frac{r_{\max} - r_{\min}}{M}\, i, \qquad \theta = \theta_{\min} + \frac{\theta_{\max} - \theta_{\min}}{N}\, j,\tag{2}$$

where the size of the sonar image is $M$ by $N$, and $(i, j)$ is the pixel coordinate of the highlight. Then, the point cloud of the object can be generated by accumulating the calculated coordinates of the extended highlights while scanning the object by utilizing the mobility of the AUV.
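As a concrete illustration, the following Python sketch maps a single highlight pixel to a global 3D point following (1) and (2) as reconstructed above; the sensor pose, image bounds, and axis conventions used here are assumed example values, not the configuration of the actual experiments.

```python
import numpy as np

def highlight_to_global(i, j, M, N, r_min, r_max, th_min, th_max,
                        tilt, spread, p_fss, R_fss_to_global):
    """Map a highlight pixel (i, j) of an M-by-N sonar image to a global
    3D point, following Equations (1) and (2). All angles in radians.
    Minimal sketch; axis conventions may differ per sensor."""
    # Equation (2): pixel indices to range and azimuth
    r = r_min + (r_max - r_min) / M * i
    theta = th_min + (th_max - th_min) / N * j
    # Elevation angle fixed at the critical position: phi = tilt - spread / 2
    phi = tilt - spread / 2.0
    # Point in the FSS frame (x forward, y starboard, z up assumed)
    p_in_fss = np.array([r * np.cos(phi) * np.cos(theta),
                         r * np.cos(phi) * np.sin(theta),
                         -r * np.sin(phi)])
    # Equation (1): transform to the global frame
    return p_fss + R_fss_to_global @ p_in_fss

# Example with assumed sensor parameters (FSS 3 m above the seabed plane)
R = np.eye(3)
p = np.array([0.0, 0.0, 3.0])
pt = highlight_to_global(i=250, j=128, M=500, N=256,
                         r_min=1.0, r_max=10.0,
                         th_min=np.deg2rad(-15), th_max=np.deg2rad(15),
                         tilt=np.deg2rad(30), spread=np.deg2rad(20),
                         p_fss=p, R_fss_to_global=R)
```

Accumulating such points over consecutive images as the AUV moves yields the raw point cloud of the scene.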
We pre-processed the point cloud for robust object classification. The pre-processing selects only the points that belong to the object from the generated point cloud, and it consists of seabed removal and noise removal.
The points belonging to the seabed are removed first because the seabed is independent of the shape of the object. The raw point cloud is generated by analyzing the highlights of consecutively captured sonar images, and the seabed itself also produces highlights, so points corresponding to the seabed are generated as well. However, according to (1), the $z$-coordinate of the seabed is calculated as near zero. We eliminated the seabed from the point cloud by filtering out the points whose $z$ values were smaller than a small threshold value.
Next, noise points that do not belong to the scanning target object are removed. Because the SNR of the sonar image is low, there is much noise in the point cloud generated from the sonar images. Additionally, even after points with small $z$ values are filtered out, natural features with height, such as rocks and seaweed, may remain. These points should be recognized as noise and removed.
To remove noise, we used density-based spatial clustering of applications with noise (DBSCAN) [19]. DBSCAN is a clustering algorithm that groups points based on whether there are enough neighboring points within a certain radius. The noise in the point cloud originates from scattered highlight noise pixels in the sonar image, so points that do not have enough neighboring points can be regarded as noise.
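A minimal sketch of both pre-processing steps is given below, using scikit-learn's DBSCAN implementation; the z threshold, neighborhood radius (`eps`), and minimum neighbor count are placeholder values that would have to be tuned for real sonar point clouds, and keeping only the largest cluster is an assumption about how the object cluster is selected.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def extract_object_points(points, z_threshold=0.1, eps=0.3, min_samples=10):
    """Keep only the points belonging to the object.
    points: (N, 3) array of reconstructed (x, y, z) coordinates."""
    # 1) Seabed removal: drop points whose z is near zero (below threshold)
    elevated = points[points[:, 2] > z_threshold]
    # 2) Noise removal: DBSCAN marks sparse points (label -1) as noise
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(elevated)
    valid = labels[labels != -1]
    if valid.size == 0:
        return np.empty((0, 3))
    # Keep the largest cluster, assumed here to be the scanned object
    largest = np.bincount(valid).argmax()
    return elevated[labels == largest]
```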
This method can generate point clouds of underwater objects in real time with little computation. Classically, 3D reconstruction from 2D images uses stereo vision. Stereo vision compares multiple images to find common areas or features, so it requires more memory and computation. On the other hand, this method generates the point cloud by calculating the height of the object slice that intersects the scan line whose elevation angle is $\theta_t - \theta_s/2$ while the AUV passes over the object. This approach does not require feature extraction, remembering previous images, or complicated path planning for the AUV. Therefore, this method is suitable for AUVs that operate in an unstructured environment and have limited computing power and battery capacity.
3.3. Object Classification Based on a Point Cloud Using PointNet
We introduced an NN to classify underwater objects with the generated point cloud of an object. As GPU performance has improved in recent years, NNs have been actively developed and exhibit outstanding performance in object classification. Furthermore, a well-trained NN is robust to noise and environmental variation [20].
A point cloud is more advantageous than a voxel grid or mesh in terms of memory [21], especially when reconstructing and storing a relatively small object in 3D using a sonar sensor with a large field of view and low resolution; however, it is difficult to extract meaningful information from it, since it is an unordered and unstructured set. An NN for classifying objects from a point cloud should therefore have the following characteristics. The first is permutation invariance. Because a point cloud is a set of points that comprise an object, the NN should output the same result regardless of the order of those points. The second is rigid motion invariance. The essential information of the point cloud is the overall shape and the distances among points. Therefore, transformations of the entire point cloud, such as translation or rotation, should not change the result.
Among the NNs with these characteristics, we adopted PointNet [22]. Figure 6 illustrates the PointNet pipeline. The $(x, y, z)$ coordinate values of $n$ points are input to the NN. For each point, PointNet extracts local features of 1024 channels through multi-layer perceptrons (MLPs). Then, by applying max pooling to the extracted local features, a global feature vector representing the 3D shape of the point cloud is created. Finally, using this global feature vector, the NN can classify objects by predicting the score for each class through two fully connected layers.
In this pipeline, the MLPs extract local features independently for each point. Additionally, the max pooling operation is a symmetric function satisfying $f(x_1, \ldots, x_n) = f(x_{\pi(1)}, \ldots, x_{\pi(n)})$ for any permutation $\pi$, so it is not affected by the input order. Therefore, PointNet satisfies permutation invariance.
Furthermore, PointNet applies transformations to the input point cloud and local features to meet the rigid motion invariance. The input and local features are transformed into canonical space by predicting the affine transformation matrix using mini-networks, which have an architecture analogous to the entire PointNet, and multiplying by the predicted matrix. These transformation steps can align inputs and extracted features, so the point clouds are classified as the same objects even if points are rotated and translated.
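The following PyTorch sketch captures the core of this pipeline, namely shared per-point MLPs, symmetric max pooling, and two fully connected layers; the input and feature transformation mini-networks are omitted for brevity, so it is a simplified stand-in rather than the full PointNet used in this work.

```python
import torch
import torch.nn as nn

class SimplePointNet(nn.Module):
    """Simplified PointNet-style classifier (T-Nets omitted).
    Input: (batch, 3, n) point coordinates. Output: (batch, k) class scores."""
    def __init__(self, num_classes):
        super().__init__()
        # Shared per-point MLPs as 1x1 convolutions: 3 -> 64 -> 128 -> 1024
        self.mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU(),
        )
        # Two fully connected layers mapping the global feature to class scores
        self.fc = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        local_features = self.mlp(x)                    # (batch, 1024, n)
        global_feature = local_features.max(dim=2)[0]   # symmetric max pooling
        return self.fc(global_feature)                  # per-class scores

# Example: classify a batch of 8 clouds of 1024 points into 4 classes
scores = SimplePointNet(num_classes=4)(torch.randn(8, 3, 1024))
```

Because the max over the point dimension is unchanged by reordering the input points, this reduced model already exhibits the permutation invariance discussed above.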
PointNet is suitable for sonar-based underwater object classification. Because the sonar image is noisy and has low resolution, the point cloud generated from the sonar image is also noisy and sparse. However, PointNet can accurately classify sonar-based point clouds by extracting high-dimensional features through multiple layers. Furthermore, PointNet has a simple and unified architecture consisting of MLPs with few layers, so inference is fast and efficient. Therefore, it is also suitable for use on AUVs.
Using PointNet, underwater objects are classified as follows: the AUV generates a point cloud while it passes over an underwater object. For the input of PointNet, $n$ points are randomly sampled from the generated point cloud. These points are then normalized to fit inside a sphere whose center is the origin and whose radius is one. Then, the object is classified by the inference of PointNet.
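A small sketch of this input preparation step is shown below; uniform random sampling and centering on the centroid are assumptions, as the exact sampling and normalization details are not specified here.

```python
import numpy as np

def prepare_pointnet_input(points, n=1024):
    """Randomly sample n points and normalize them into the unit sphere
    centered at the origin."""
    idx = np.random.choice(len(points), size=n, replace=len(points) < n)
    sampled = points[idx]
    centered = sampled - sampled.mean(axis=0)          # move centroid to origin
    radius = np.linalg.norm(centered, axis=1).max()    # farthest point distance
    return centered / radius                           # fit inside unit sphere
```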
3.4. Training Point Cloud Synthesis
Finally, we addressed a method to construct data for training the proposed underwater object classifier. A point cloud generated using a sonar sensor has two characteristics that differ from those of points sampled directly from the polygons of the 3D shape of an object. We analyzed these two characteristics to synthesize training point clouds from the 3D model of the object.
The first characteristic is the front slope, as shown in Figure 7. After the FSS reaches the critical point, the elevation angle is specified as $\theta_t - \theta_s/2$. However, from the beginning of the highlight extension until the critical position is reached, the frontmost and uppermost point of the object, whose elevation angle is not yet $\theta_t - \theta_s/2$, is approximated as a point on the front surface. This approximation of the elevation angle causes the slope of the front face.
The front slope can be modeled by considering the displacement of the FSS between two consecutive sonar images when generating the point cloud [23], as shown in Figure 8. The sonar image is originally generated by projecting the points along the arc to the image plane, but it can be approximated that the points are projected orthogonally onto the center plane of the sonar beam when the beam angle is sufficiently small. Then, if the FSS moves by $\Delta d$, the difference of the ranges to the highlight pixel in two consecutive sonar images $\Delta r$ is approximated as

$$\Delta r \approx \Delta d \cos\theta_t,\tag{3}$$

where $\theta_t$ is the tilt angle of the FSS. On the $x$–$z$ plane of the FSS, the points of the front face in two consecutive sonar images $i$ and $i+1$ are calculated as follows:

$$\begin{bmatrix} x_i \\ z_i \end{bmatrix} = \begin{bmatrix} r_i\cos\phi \\ -r_i\sin\phi \end{bmatrix}, \qquad \begin{bmatrix} x_{i+1} \\ z_{i+1} \end{bmatrix} = \begin{bmatrix} \Delta d + (r_i - \Delta r)\cos\phi \\ \Delta z - (r_i - \Delta r)\sin\phi \end{bmatrix}, \qquad \phi = \theta_t - \frac{\theta_s}{2}.\tag{4}$$

Assuming that the FSS maintains altitude, $\Delta z$ is negligible. Then, from (3) and (4), the front slope is derived as

$$\frac{z_{i+1}-z_i}{x_{i+1}-x_i} = \frac{\Delta r \sin\phi}{\Delta d - \Delta r\cos\phi} = \frac{\cos\theta_t \sin\phi}{1 - \cos\theta_t \cos\phi}.\tag{5}$$
As a result, the front slope could be estimated from the tilt angle of the sonar. The generated point cloud can be corrected using the calculated slope. Alternatively, the network can be trained more robustly by adding the modeled front slope.
The second characteristic is that the generated point cloud has limited surface information. The proposed method scans an object in a single direction to avoid an overhead in the operation time of the AUV. Additionally, elevation angles can be specified only for the points that reach the critical position. Therefore, the points are reconstructed from limited surfaces of the object.
We proposed a method to detect the hidden surfaces that are not scanned according to the movement of the FSS, as shown in Figure 9. When the 3D model of an object is given, the polygons facing away from the FSS are culled first. The scan direction vector of the FSS $\vec{v}$ is specified as $\vec{v} = [\cos\psi\cos(\theta_t - \theta_s/2),\ \sin\psi\cos(\theta_t - \theta_s/2),\ -\sin(\theta_t - \theta_s/2)]^{T}$ by the heading of the AUV $\psi$, the tilt angle of the FSS $\theta_t$, and the beam spreading angle $\theta_s$. When the normal vector of a polygon of the 3D model is $\vec{n}$, if $\vec{v}\cdot\vec{n} \geq 0$, the acoustic beam is not reflected from that polygon. Therefore, we removed those polygons from the given model.
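This culling step can be sketched as follows, assuming the mesh is given as per-triangle unit normals and using the scan direction vector written above; the angle values in the usage comment are illustrative only.

```python
import numpy as np

def cull_back_faces(normals, heading, tilt, spread):
    """Return a mask of polygons whose normals face the incoming beam.
    normals: (F, 3) unit normals of the mesh triangles; angles in radians."""
    phi = tilt - spread / 2.0                      # elevation of the scan line
    v = np.array([np.cos(heading) * np.cos(phi),   # scan direction vector
                  np.sin(heading) * np.cos(phi),
                  -np.sin(phi)])
    # Keep a face only if the beam hits its front side (v . n < 0)
    return normals @ v < 0.0

# Example usage with illustrative angles:
# mask = cull_back_faces(mesh_normals, heading=np.deg2rad(90),
#                        tilt=np.deg2rad(30), spread=np.deg2rad(20))
```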
Furthermore, points on a polygon could be generated when the polygon meets the scan line whose elevation angle is $\theta_t - \theta_s/2$ from the FSS. Hidden surfaces blocked by other surfaces could be removed by inspecting the collision between polygons and scan lines as the FSS moves.
We could construct a training dataset through the following process. When a 3D computer-aided design (CAD) model of an object is given, we first set a scan direction and remove the hidden surfaces. Then, points are randomly sampled from the remaining surface. By adding the estimated front slope through shear transformation, we could synthesize realistic training point clouds of target objects. In this way, because it is unnecessary to conduct actual underwater experiments to obtain the training data, the training process of the proposed object classifier becomes straightforward.
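Putting these steps together, a possible synthesis routine is sketched below; it reuses the hypothetical `cull_back_faces` from the previous sketch, samples points uniformly over the remaining triangles, and applies the front slope as a shear along the x-axis under the assumption that the scan direction is aligned with x and that the slope value (for example, from (5)) is supplied as a parameter.

```python
import numpy as np

def sample_on_triangles(vertices, faces, num_points):
    """Sample points uniformly from the surface defined by the kept faces."""
    tri = vertices[faces]                                  # (F, 3, 3)
    a, b, c = tri[:, 0], tri[:, 1], tri[:, 2]
    areas = 0.5 * np.linalg.norm(np.cross(b - a, c - a), axis=1)
    pick = np.random.choice(len(faces), size=num_points, p=areas / areas.sum())
    u, v = np.random.rand(num_points, 1), np.random.rand(num_points, 1)
    flip = (u + v) > 1.0                                   # fold into the triangle
    u, v = np.where(flip, 1.0 - u, u), np.where(flip, 1.0 - v, v)
    return a[pick] + u * (b[pick] - a[pick]) + v * (c[pick] - a[pick])

def synthesize_training_cloud(vertices, faces, normals, heading, tilt, spread,
                              front_slope, num_points=1024):
    """Synthesize a training point cloud: cull hidden faces, sample points,
    then shear the front face according to the modeled front slope."""
    kept = cull_back_faces(normals, heading, tilt, spread)  # previous sketch
    pts = sample_on_triangles(vertices, faces[kept], num_points)
    # Shear along the scan (x) direction: higher points are pushed forward,
    # approximating the front slope observed in sonar-generated point clouds.
    pts[:, 0] += pts[:, 2] / front_slope
    return pts
```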