Author Contributions
Conceptualization, C.V. and S.K.; methodology, C.V.; software, C.V.; validation, C.V., S.K. and M.O.; formal analysis, C.V.; investigation, C.V.; resources, M.O.; data curation, C.V.; writing—original draft preparation, C.V.; writing—review and editing, C.V., S.K. and M.O.; visualization, C.V.; supervision, S.K. and M.O.; project administration, M.O.; funding acquisition, M.O. All authors have read and agreed to the published version of the manuscript.
Figure 1. Training and classification data flows.
Figure 2. Training-dataset augmentation: each object is rotated about the Z axis by 0°, 45°, 90°, 135°, 180°, 225°, 270° and 315°.
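The eight-way rotation augmentation can be sketched as follows; the paper does not publish code, so `rotate_z`, `augment_by_rotation` and the N × 3 point-array layout are our assumptions:

```python
import numpy as np

def rotate_z(points, angle_deg):
    """Rotate an N x 3 point cloud about the Z axis by angle_deg degrees."""
    a = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(a), -np.sin(a), 0.0],
                    [np.sin(a),  np.cos(a), 0.0],
                    [0.0,        0.0,       1.0]])
    return points @ rot.T

def augment_by_rotation(points, step_deg=45):
    """Return one rotated copy per step: 0, 45, ..., 315 degrees."""
    return [rotate_z(points, a) for a in range(0, 360, step_deg)]
```

Applied to every training object, this multiplies the dataset size by eight, which matches the augmented counts in Table 1 (e.g. 2580 × 8 = 20,640).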
Figure 3. Training-dataset augmentation: each object is rotated about the Z axis by 0°, 45°, 90°, 135°, 180°, 225°, 270° and 315°, keeping only the frontal projection of the object with respect to the camera position.
Figure 4. Evaluation setup: RGB image, depth map, and voxel representation of the object.
Figure 5. Real-dataset augmentation: each object is rotated about the Z axis by 0°, 45°, 90°, 135°, 180°, 225°, 270° and 315°.
Figure 6. Point-cloud data alignment by principal component analysis (PCA) of the estimated ground plane (GP).
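The ground-plane alignment step can be sketched with plain NumPy, assuming the ground-plane points have already been segmented out; the function names and the Rodrigues-rotation formulation are our choices, not the authors' implementation:

```python
import numpy as np

def ground_plane_normal(ground_points):
    """PCA of the segmented ground-plane points: the eigenvector with
    the smallest eigenvalue of the covariance matrix is the plane normal."""
    centered = ground_points - ground_points.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))
    return eigvecs[:, 0]           # eigh sorts eigenvalues in ascending order

def align_to_z(points, normal):
    """Rotate the cloud so `normal` maps onto +Z (Rodrigues' formula)."""
    n = normal / np.linalg.norm(normal)
    if n[2] < 0:                   # make the normal point upward
        n = -n
    z = np.array([0.0, 0.0, 1.0])
    v = np.cross(n, z)             # rotation axis (unnormalized)
    s, c = np.linalg.norm(v), float(n @ z)
    if s < 1e-12:                  # already aligned
        return points.copy()
    vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])
    rot = np.eye(3) + vx + vx @ vx * ((1 - c) / s**2)
    return points @ rot.T
```

Rotating the estimated ground-plane normal onto +Z leaves objects standing upright in the voxel grid regardless of the camera tilt.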
Figure 7. Intel RealSense D435 depth camera.
Figure 8. Experiment 1: recognition accuracy on the synthetic dataset without preprocessing.
Figure 9. Experiment 2: recognition accuracy on the synthetic dataset with pose normalization.
Figure 10. Experiment 3: average recognition accuracy on the real classification dataset without preprocessing.
Figure 11. Experiment 4: average recognition accuracy on the real classification dataset with pose normalization.
Figure 12. Experiment 5: average recognition accuracy on the real dataset using a synthetic training dataset with dataset-augmentation preprocessing.
Figure 13. Experiment 6: average recognition accuracy on the real dataset using a synthetic training dataset with frontal-projection and dataset-augmentation preprocessing.
Figure 14. Experiment 7: average recognition accuracy with dataset augmentation, frontal projection, and a subset of synthetic training objects.
Figure 15. Example objects of class 8 (Monitor) and class 10 (Stool) captured by the Intel RealSense D435.
Table 1. ModelNet40 object classes selected and number of objects per class for the synthetic and real datasets.

| Dataset | Bottle | Bowl | Chair | Cup | Keyboard | Lamp | Laptop | Monitor | Plant | Stool | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Training | 335 | 64 | 889 | 79 | 145 | 124 | 149 | 465 | 240 | 90 | 2580 |
| Training (augmentation) | 2680 | 512 | 7112 | 632 | 1160 | 992 | 1192 | 3720 | 1920 | 720 | 20,640 |
| Training (subset + augmentation) | 816 | 448 | 1096 | 288 | 560 | 472 | 536 | 712 | 648 | 528 | 6104 |
| Test (synthetic) | 99 | 19 | 99 | 19 | 19 | 19 | 19 | 99 | 99 | 19 | 510 |
| Test (camera) | 64 | 48 | 64 | 64 | 48 | 48 | 48 | 48 | 48 | 48 | 528 |
Table 2. 3D histogram-of-oriented-gradients (3DHOG) feature dimensionality for the three voxel-grid sizes evaluated.

| Voxel Grid | | | |
|---|---|---|---|
| Blocks | 1 | 8 | 27 |
| Cells per block | 8 | 8 | 8 |
| Feature dimensionality | 1296 | 10,368 | 34,992 |
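The dimensionalities in Table 2 factor as blocks × cells-per-block × orientation bins, where 18 azimuth bins × 9 elevation bins (the first two values in Table 3) give 162 bins per cell; this decomposition is our reading of the numbers, not stated explicitly in the tables:

```python
# Assumed factorization: dimensionality = blocks * cells_per_block * bins_per_cell
azimuth_bins, elevation_bins = 18, 9            # bin counts from Table 3
bins_per_cell = azimuth_bins * elevation_bins   # 162 orientation bins per cell
cells_per_block = 2 * 2 * 2                     # 2-cell blocks in each axis

dims = [blocks * cells_per_block * bins_per_cell for blocks in (1, 8, 27)]
print(dims)  # [1296, 10368, 34992], matching Table 2
```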
Table 3. Descriptor configuration parameters.

| Parameter | Value |
|---|---|
| | 18 |
| | 9 |
| | 6 |
| | 2 |
| | 2 |
Table 4. Support vector machine (SVM) classifier configuration parameters.

| Parameter | SVM Classifier |
|---|---|
| Type | Multiclass |
| Method | Error-correcting output codes (ECOC) |
| Kernel function | Radial basis function |
| Optimization | Iterative single data algorithm (ISDA) |
| Data division | Holdout partition, 15% |
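The terminology in Table 4 (ECOC, ISDA) matches MATLAB's `fitcecoc`; a rough scikit-learn analogue, shown here only as a sketch on toy data standing in for the PCA-reduced 3DHOG features, combines `OutputCodeClassifier` with RBF-kernel `SVC` learners and a 15% holdout:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OutputCodeClassifier
from sklearn.svm import SVC

# Toy stand-in for PCA-reduced 3DHOG feature vectors.
X, y = make_classification(n_samples=400, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)

# "Holdout partition, 15%" from Table 4.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15, random_state=0)

# Error-correcting-output-codes ensemble of RBF-kernel SVMs.
clf = OutputCodeClassifier(SVC(kernel="rbf", gamma="scale"),
                           code_size=2.0, random_state=0)
clf.fit(X_tr, y_tr)
print(f"holdout accuracy: {clf.score(X_te, y_te):.2f}")
```

scikit-learn has no ISDA solver (its SVC uses libsvm's SMO), so this is a functional analogue of the configuration, not a reproduction of it.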
Table 5. Definition of the experiments (Exp.), experimental flow, and maximum recognition accuracy (Acc.) achieved.

| Exp. | Training Dataset | Training Preprocessing | Test Dataset | Test Preprocessing | Acc. |
|---|---|---|---|---|---|
| 1 | Synthetic | None | Synthetic | None | 91.5% |
| 2 | Synthetic | PCA-STD alignment | Synthetic | PCA-STD alignment | 90% |
| 3 | Synthetic | None | Real | None | 65% |
| 4 | Synthetic | PCA-STD alignment | Real | PCA-STD alignment | 21% |
| 5 | Synthetic | Dataset augmentation | Real | None | 75.7% |
| 6 | Synthetic | Dataset augmentation + frontal view | Real | None | 73.7% |
| 7 | Synthetic | Dataset augmentation + frontal view + subset | Real | None | 81.5% |
Table 6. Intel RealSense D435 active stereo-camera specifications.

| Technology | Active IR Stereo |
|---|---|
| Sensor technology | Global shutter, 3 µm × 3 µm pixel size |
| Depth field of view | 86° × 57° |
| Depth resolution | Up to 1280 × 720 pixels |
| RGB resolution | 1920 × 1080 pixels |
| Depth frame rate | Up to 90 fps |
| RGB frame rate | 30 fps |
| Min depth range | 0.1 m |
| Max depth range | 10 m |
| Dimensions | 90 mm × 25 mm × 25 mm |
Table 7. Normalized confusion matrix using 300 principal components (PCs), the synthetic training dataset, and a voxel grid with frontal view and a subset of objects. Columns are input classes; rows are estimated classes.

| Estimated Class | Input 1 | Input 2 | Input 3 | Input 4 | Input 5 | Input 6 | Input 7 | Input 8 | Input 9 | Input 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.98 | 0 | 0 | 0.04 | 0.02 | 0 | 0.02 | 0.02 | 0 | 0 |
| 2 | 0 | 0.98 | 0.03 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0.86 | 0 | 0 | 0 | 0 | 0 | 0.02 | 0.2 |
| 4 | 0.01 | 0 | 0 | 0.86 | 0 | 0.14 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0.95 | 0 | 0 | 0 | 0 | 0.02 |
| 6 | 0 | 0 | 0.01 | 0.04 | 0 | 0.85 | 0.04 | 0.12 | 0 | 0.04 |
| 7 | 0 | 0.02 | 0 | 0 | 0.03 | 0 | 0.73 | 0.18 | 0 | 0.04 |
| 8 | 0 | 0 | 0.09 | 0.05 | 0 | 0 | 0.15 | 0.60 | 0 | 0.40 |
| 9 | 0.01 | 0 | 0.01 | 0.01 | 0 | 0.01 | 0.06 | 0.06 | 0.85 | 0 |
| 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.13 | 0.30 |
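Because the confusion matrix is normalized per input class, each column should sum to 1 up to two-decimal rounding, and the diagonal gives the per-class accuracies. A quick sanity check (the matrix below is retyped from Table 7):

```python
import numpy as np

# Confusion matrix from Table 7: rows = estimated class, columns = input class.
M = np.array([
    [0.98, 0,    0,    0.04, 0.02, 0,    0.02, 0.02, 0,    0   ],
    [0,    0.98, 0.03, 0,    0,    0,    0,    0,    0,    0   ],
    [0,    0,    0.86, 0,    0,    0,    0,    0,    0.02, 0.2 ],
    [0.01, 0,    0,    0.86, 0,    0.14, 0,    0,    0,    0   ],
    [0,    0,    0,    0,    0.95, 0,    0,    0,    0,    0.02],
    [0,    0,    0.01, 0.04, 0,    0.85, 0.04, 0.12, 0,    0.04],
    [0,    0.02, 0,    0,    0.03, 0,    0.73, 0.18, 0,    0.04],
    [0,    0,    0.09, 0.05, 0,    0,    0.15, 0.60, 0,    0.40],
    [0.01, 0,    0.01, 0.01, 0,    0.01, 0.06, 0.06, 0.85, 0   ],
    [0,    0,    0,    0,    0,    0,    0,    0,    0.13, 0.30],
])

# Columns sum to 1 within the rounding error of the two-decimal entries.
assert np.allclose(M.sum(axis=0), 1.0, atol=0.03)

# Unweighted mean of the diagonal (average per-class accuracy).
print(round(float(np.diag(M).mean()), 3))  # 0.796
```

The unweighted diagonal mean (≈0.80) is roughly consistent with the 81.5% reported in Table 5, the gap being attributable to rounding and the unequal per-class test counts in Table 1.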
Table 8. Response times for a voxel grid (1) using the synthetic dataset and 100 PCs, (2) using a real dataset and 300 PCs.

| Test Dataset | Feature Dim. | PCs | Acc. (%) | (ms) | (ms) | (ms) | (ms) | Total (ms) |
|---|---|---|---|---|---|---|---|---|
| Synthetic | 34,992 | 100 | 90 | 9 | 11.5 | 6 | 20.5 | 38 |
| Real | 34,992 | 300 | 81.5 | — | 12 | 4 | 180 | 196 |
Table 9. Summary of the 3DHOG descriptor and comparison of results with other 3D object-recognition approaches.

| Approach | Method | Training Dataset | Test Dataset | 3D Sensor | Accuracy | GPU | Time |
|---|---|---|---|---|---|---|---|
| 3DHOG | Global | ModelNet40 | Real Custom | Stereo | 81.5% | No | 196 ms |
| 3DHOG [32] | Global | ModelNet10 | ModelNet10 | – | 84.91% | No | 21.6 ms |
| VoxNet [11] | CNN | ModelNet10 | ModelNet10 | – | 92% | Yes | 3 ms |
| PointNet [10] | CNN | ModelNet10 | ModelNet10 | – | 77.6% | Yes | 24.6 ms |
| 3DYolo [12] | CNN | KITTI | KITTI | Lidar | 75.7% | Yes | 100 ms |
| SPNet [13] | CNN | ModelNet10 | ModelNet10 | – | 97.25% | Yes | – |
| VFH [23] | Global | Real Custom | Real Custom | Stereo | 98.52% | – | – |
| SI [22,23] | Global | Real Custom | Real Custom | Stereo | 75.3% | – | – |
| VFD [6] | Global | ModelNet10 | ModelNet10 | – | 92.84% | No | – |
| RCS [17] | Local | UWA | UWA | – | 97.3% | No | 10–40 s |
| TriLCI [18] | Local | BL | BL | – | 97.2% | No | 10 s |