# An Efficient Three-Dimensional Convolutional Neural Network for Inferring Physical Interaction Force from Video

Dongyi Kim, Hyeon Cho, Hochul Shin, Soo-Chul Lim and Wonjun Hwang

## Abstract


## 1. Introduction

## 2. Related Works

## 3. Proposed 3D CNN-Based Method

### 3.1. Two-Dimensional Depthwise Separable Convolution (DSC) Neural Network

### 3.2. Three-Dimensional Convolutional Neural Network

### 3.3. Proposed 3D Depthwise Separable Convolution Neural Network

## 4. Database and Evaluation Protocol for Interaction Force Estimation

### 4.1. Automatic Dataset Collection System

### 4.2. Evaluation Protocol

### 4.3. Implementation Details

## 5. Experimental Results and Discussion

### 5.1. Overall Performances

### 5.2. Investigation of Different Material-Based Objects

#### 5.2.1. Sponge

#### 5.2.2. Paper Cup

#### 5.2.3. Stapler

#### 5.2.4. Tube

### 5.3. Discussion

## 6. Conclusions and Future Works

## Author Contributions

## Funding

## Conflicts of Interest

## References

1. Grosu, V.; Grosu, S.; Vanderborght, B.; Lefeber, D.; Rodriguez-Guerrero, C. Multi-axis force sensor for human–robot interaction sensing in a rehabilitation robotic device. Sensors **2017**, 17, 1294.
2. Cirillo, A.; Ficuciello, F.; Natale, C.; Pirozzi, S.; Villani, L. A conformable force/tactile skin for physical human–robot interaction. IEEE Robot. Autom. Lett. **2016**, 1, 41–48.
3. Landi, C.T.; Ferraguti, F.; Sabattini, L.; Secchi, C.; Fantuzzi, C. Admittance control parameter adaptation for physical human–robot interaction. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017.
4. Lim, S.; Lee, H.; Park, J. Role of combined tactile and kinesthetic feedback in minimally invasive surgery. Int. J. Med. Robot. Comput. Assist. Surg. **2015**, 11, 360–374.
5. Liu, Y.; Han, H.; Liu, T.; Yi, J.; Li, Q.; Inoue, Y. A novel tactile sensor with electromagnetic induction and its application on stick-slip interaction detection. Sensors **2016**, 16, 430.
6. Zhang, H.; Wu, R.; Li, C.; Zang, X.; Zhang, X.; Jin, H.; Zhao, J. A force-sensing system on legs for biomimetic hexapod robots interacting with unstructured terrain. Sensors **2017**, 17, 1514.
7. Lazebnik, S.; Schmid, C.; Ponce, J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New York, NY, USA, 17–22 June 2006.
8. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. **2012**, 25, 1097–1105.
9. Donahue, J.; Anne Hendricks, L.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Saenko, K.; Darrell, T. Long-term recurrent convolutional networks for visual recognition and description. IEEE Trans. Pattern Anal. Mach. Intell. **2017**, 39, 2625–2634.
10. Ernst, M.O.; Banks, M.S. Humans integrate visual and haptic information in a statistically optimal fashion. Nature **2002**, 415, 429.
11. Newell, F.N.; Ernst, M.O.; Tjan, B.S.; Bülthoff, H.H. Viewpoint dependence in visual and haptic object recognition. Psychol. Sci. **2001**, 12, 37–42.
12. Tiest, W.M.B.; Kappers, A.M. Physical aspects of softness perception. In Multisensory Softness; Luca, M.D., Ed.; Springer: London, UK, 2014; pp. 3–15.
13. Hwang, W.; Lim, S.C. Inferring interaction force from visual information without using physical force sensors. Sensors **2017**, 17, 2455.
14. Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. **2013**, 35, 221–231.
15. Cho, H.; Kim, D.; Park, J.; Roh, K.; Hwang, W. 2D barcode detection using images for drone-assisted inventory management. In Proceedings of the 15th International Conference on Ubiquitous Robots and Ambient Intelligence (URAI), Honolulu, HI, USA, 26–30 June 2018; pp. 467–471.
16. Mattioli, T.; Vendittelli, M. Interaction force reconstruction for humanoid robots. IEEE Robot. Autom. Lett. **2017**, 2, 282–289.
17. Li, Y.; Hannaford, B. Gaussian process regression for sensorless grip force estimation of cable-driven elongated surgical instruments. IEEE Robot. Autom. Lett. **2017**, 2, 1312–1319.
18. Aviles, A.I.; Alsaleh, S.M.; Hahn, J.K.; Casals, A. Towards retrieving force feedback in robotic-assisted surgery: A supervised neuro-recurrent-vision approach. IEEE Trans. Haptics **2016**, 10, 431–443.
19. Zhu, Y.; Jiang, C.; Zhao, Y.; Terzopoulos, D.; Zhu, S.C. Inferring forces and learning human utilities from videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3823–3833.
20. Pham, T.H.; Kheddar, A.; Qammaz, A.; Argyros, A.A. Towards force sensing from vision: Observing hand–object interactions to infer manipulation forces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2810–2819.
21. Lee, D.-H.; Hwang, W.; Lim, S.-C. Interaction force estimation using camera and electrical current without force/torque sensor. IEEE Sens. J. **2018**, 18, 8863–8872.
22. Fermüller, C.; Wang, F.; Yang, Y.; Zampogiannis, K.; Zhang, Y.; Barranco, F.; Pfeiffer, M. Prediction of manipulation actions. Int. J. Comput. Vis. **2018**, 126, 358–374.
23. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv **2017**, arXiv:1704.04861.
24. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856.
25. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
26. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015.
27. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size. arXiv **2016**, arXiv:1602.07360.
28. Chollet, F. Xception: Deep learning with depthwise separable convolutions. arXiv **2017**, arXiv:1610.02357.
29. Ye, R.; Liu, F.; Zhang, L. 3D depthwise convolution: Reducing model parameters in 3D vision tasks. arXiv **2018**, arXiv:1808.01556.
30. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520.
31. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010; Volume 9, pp. 249–256.
32. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv **2014**, arXiv:1412.6980.
33. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 2048–2057.

**Figure 1.** (**a**) Heterogeneous network structure of the Convolutional Neural Network (CNN) + Long Short-Term Memory (LSTM) method and (**b**) homogeneous network structure of the 3D CNN method. Both methods predict the interaction force from the input video.
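For a concrete picture of the two structures in Figure 1, the following minimal PyTorch sketch contrasts a heterogeneous CNN + LSTM regressor with a homogeneous 3D CNN. All layer widths, the `feat_dim` parameter, and both class names are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class CnnLstmForce(nn.Module):
    """Heterogeneous pipeline: a 2D CNN encodes each frame, an LSTM models time."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.lstm = nn.LSTM(feat_dim, 64, batch_first=True)
        self.head = nn.Linear(64, 1)           # scalar force estimate

    def forward(self, video):                  # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        feats = self.encoder(video.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])           # force from the last time step

class Cnn3dForce(nn.Module):
    """Homogeneous pipeline: 3D convolutions handle space and time jointly."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(32, 1))

    def forward(self, video):                  # video: (B, 3, T, H, W)
        return self.net(video)
```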

**Figure 3.** The depthwise separable convolution consists of a depthwise convolution (e.g., $k=3$) followed by a pointwise convolution.
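As a minimal PyTorch illustration of the operation in Figure 3 (a sketch, not the authors' implementation), the factorization can be expressed as a grouped spatial convolution followed by a 1 × 1 pointwise convolution; the helper name `depthwise_separable_2d` and the channel sizes are assumptions.

```python
import torch
import torch.nn as nn

def depthwise_separable_2d(c_in, c_out, k=3):
    """Depthwise separable convolution: one k x k filter per input channel
    (depthwise), then a 1 x 1 convolution to mix channels (pointwise)."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in),  # depthwise
        nn.Conv2d(c_in, c_out, 1),                              # pointwise
    )

x = torch.randn(1, 16, 32, 32)
print(depthwise_separable_2d(16, 32)(x).shape)  # torch.Size([1, 32, 32, 32])
```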

**Figure 5.** Concepts of (**a**) the 3D CNN and (**b**) the proposed 3D depthwise separable CNN. The rectangles and circles represent the video frames and their corresponding channels, respectively. In (**a**), the input is a video of three frames, each consisting of three channels, and the output has two channels because the 3D CNN has two filters. In (**b**), the three channels of each frame are first convolved independently with three corresponding 2D filters, and the final two-channel output is then produced by two 3D pointwise convolution (PWC) filters.
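A minimal PyTorch sketch of one reading of Figure 5b: the depthwise step applies per-channel, spatial-only kernels, and the 3D PWC step mixes channels across $d$ neighboring frames. The function name `dsc3d` and all sizes are illustrative; the parameter counts in the comments match the formulas in Table 1.

```python
import torch
import torch.nn as nn

def dsc3d(c_in, c_out, k=3, d=3):
    """3D depthwise separable convolution: per-channel 2D spatial filtering,
    then c_out pointwise filters of temporal depth d (the 3D PWC step)."""
    return nn.Sequential(
        # depthwise: spatial-only kernels, one per channel -> c*(k*k+1) params
        nn.Conv3d(c_in, c_in, (1, k, k), padding=(0, k // 2, k // 2),
                  groups=c_in),
        # 3D pointwise: depth-d, 1x1 spatial kernels -> n*(c*d+1) params
        nn.Conv3d(c_in, c_out, (d, 1, 1), padding=(d // 2, 0, 0)),
    )

video = torch.randn(1, 3, 20, 128, 128)  # (batch, channels, frames, H, W)
print(dsc3d(3, 2)(video).shape)          # torch.Size([1, 2, 20, 128, 128])
```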

**Figure 6.** Proposed bottleneck 3D modules based on the (**a**) inverted residual block and (**b**) linear block (DwConv: depthwise convolutional filter).
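Following the MobileNetV2-style design that Figure 6 builds on, a hedged sketch of the inverted residual variant (a) might look as follows; the `Bottleneck3D` name, the ReLU6 choice, and the normalization placement are assumptions rather than the paper's exact module. The linear variant (b) corresponds to dropping the residual connection, which happens automatically here whenever the block is strided or changes channel count.

```python
import torch
import torch.nn as nn

class Bottleneck3D(nn.Module):
    """Sketch of a MobileNetV2-style inverted residual block lifted to 3D:
    1x1 expand -> depthwise (DwConv) -> 1x1 linear projection."""
    def __init__(self, c_in, c_out, expand, k=3, d=1, stride=(1, 1, 1)):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(c_in, expand, 1), nn.BatchNorm3d(expand), nn.ReLU6(),
            nn.Conv3d(expand, expand, (d, k, k), stride=stride,
                      padding=(d // 2, k // 2, k // 2), groups=expand),  # DwConv
            nn.BatchNorm3d(expand), nn.ReLU6(),
            nn.Conv3d(expand, c_out, 1), nn.BatchNorm3d(c_out),  # linear proj.
        )
        self.use_residual = stride == (1, 1, 1) and c_in == c_out

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_residual else y
```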

**Figure 7.** Interaction force measurement system with a load cell and a translation stage driven by an electric motor.

**Figure 8.** Collected sample images. From top to bottom, the rows show the paper cup, stapler, tube, and sponge; the columns show angle variations from left to right. The first- and third-row images were captured under 350-lux illumination, whereas the second- and fourth-row images were taken under 550- and 750-lux illumination, respectively.

**Figure 9.** Cropped and normalized sample images: (**a**) sponge, (**b**) paper cup, (**c**) stapler, and (**d**) tube.

**Figure 11.** Overall performance of the 2D CNN-based ResNet with 18 layers, the CNN + LSTM method, the traditional 3D CNN-based methods (i.e., the 18-layer VGG-style and the 19-layer ResNet-style methods), and the proposed method.

**Figure 14.** Inferred interaction forces for the sponge together with the ground truth (GT). The blue and red dotted lines represent the results of the proposed method and the GT measured by the load cell, respectively. The maximum external forces were randomly generated.

**Figure 16.** Inferred interaction forces for the paper cup together with the GT. The blue and red dotted lines represent the results of the proposed method and the GT measured by the load cell, respectively. Note that the durations for which the maximum force was applied were randomly generated.

**Figure 18.** Inferred interaction forces for the stapler together with the GT. The blue and red dotted lines represent the results of the proposed method and the GT measured by the load cell, respectively.

**Figure 20.** Inferred interaction forces for the tube together with the GT. The blue and red dotted lines represent the results of the proposed method and the GT measured by the load cell, respectively.

**Figure 21.** The proposed method achieved the best accuracy in inferring the interaction force from video alone; its weight parameter size was also the smallest among the compared methods.

**Figure 22.** Average RMSE of the proposed method for the different moving directions of the steel rod. The average RMSE is obtained by averaging the RMSE values of the four objects for each angle.

**Figure 23.** Average RMSE of the proposed method for the different illumination conditions. The average RMSE is obtained by averaging the RMSE values of the four objects for each illumination condition.

**Figure 24.** Attention maps of the proposed method for the different object materials.

**Table 1.** Number of weight parameters of the standard 3D convolution and the proposed 3D depthwise separable convolution (n: number of filters; c: number of input channels; d: kernel depth; k: spatial kernel size).

| | 3D CNN | Proposed Method |
|---|---|---|
| No. of Parameters | $n\times c\times d\times \left(k\times k+1\right)$ | $n\times \left(c\times d+1\right)+c\times \left(k\times k+1\right)$ |
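To make the formulas concrete, here is a worked example; the specific values of n, c, d, and k are illustrative assumptions, not sizes taken from the paper.

```python
# Parameter counts from Table 1 for n = 32 filters, c = 16 input channels,
# kernel depth d = 3, and spatial kernel size k = 3.
n, c, d, k = 32, 16, 3, 3

standard_3d = n * c * d * (k * k + 1)            # 32*16*3*10 = 15,360
proposed    = n * (c * d + 1) + c * (k * k + 1)  # 32*49 + 16*10 = 1,728

print(standard_3d, proposed, standard_3d / proposed)  # roughly 8.9x fewer
```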

**Table 2.** Layer configuration of the proposed network; the bottleneck types (a) and (b) correspond to the modules in Figure 6.

| Input Size <d, h, w, c> | Layer (Type) | Expand Channels | Output Channels | Spatial Stride | Kernel Depth | Depth Stride |
|---|---|---|---|---|---|---|
| <20, 128, 128, 3> | Conv2D 3 × 3 | - | 32 | 1 | 1 | 1 |
| <20, 64, 64, 32> | Bottleneck 3D 3 × 3 (a) | 32 | 16 | 1 | 1 | 1 |
| <20, 64, 64, 16> | Bottleneck 3D 3 × 3 (a) | 64 | 24 | 1 | 1 | 1 |
| <20, 32, 32, 24> | Bottleneck 3D 3 × 3 (a) | 96 | 32 | 1 | 1 | 1 |
| <20, 16, 16, 32> | Bottleneck 3D 3 × 3 (b) | 128 | 64 | 2 | 3 | 2 |
| <10, 8, 8, 64> | Bottleneck 3D 3 × 3 (b) | 192 | 92 | 2 | 3 | 2 |
| <5, 8, 8, 92> | Bottleneck 3D 3 × 3 (b) | 384 | 128 | 2 | 3 | 2 |
| <3, 4, 4, 128> | Bottleneck 3D 3 × 3 (b) | 448 | 192 | 2 | 3 | 2 |
| <2, 4, 4, 192> | Conv2D 1 × 1 | - | 1280 | 2 | 2 | 2 |
| <1, 4, 4, 1280> | Avg. Pool. 4 × 4 | - | - | 1 | 1 | - |
| <1280> | FC 1 | - | 1 | - | - | - |

**Table 3.** Details of the training and test protocols. One set consists of four touches with approximately 520 sequential images; the numbers in parentheses are the total numbers of sequential images. The overall total is 387,473 images.

| Object | Training Set | Test Set |
|---|---|---|
| Sponge | 144 sets (77,097) | 36 sets (19,474) |
| Paper cup | 144 sets (76,966) | 36 sets (19,133) |
| Stapler | 144 sets (77,941) | 36 sets (19,533) |
| Tube | 144 sets (77,849) | 36 sets (19,480) |

**Table 4.** Quantitative comparison of performances of various methods with the sponge as the target object.

| Method | | Average RMSE |
|---|---|---|
| 2D CNN-based methods | 2D CNN-based ResNet [25] | 0.1844 |
| | CNN + LSTM | 0.0925 |
| | CNN + LSTM + EI [21] | 0.0882 |
| 3D CNN-based methods | VGG-like 3D CNN | 0.0422 |
| | ResNet-like 3D CNN | 0.0757 |
| Proposed Method | | 0.0405 |
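For reference, the RMSE reported in these tables is presumably the standard root-mean-square error between the inferred force sequence and the load-cell ground truth. A minimal sketch follows; the aggregation over test sequences is an assumption.

```python
import numpy as np

def rmse(predicted, ground_truth):
    """Root-mean-square error between an inferred force sequence and the
    load-cell ground truth, both given as 1D arrays over time."""
    predicted = np.asarray(predicted)
    ground_truth = np.asarray(ground_truth)
    return np.sqrt(np.mean((predicted - ground_truth) ** 2))

# A per-object score would then be the RMSE averaged over all test
# sequences of that object (aggregation details are assumptions).
```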

**Table 5.** Quantitative comparison of performances of various methods with the paper cup as the target object.

| Method | | Average RMSE |
|---|---|---|
| 2D CNN-based methods | 2D CNN-based ResNet [25] | 0.2049 |
| | CNN + LSTM | 0.1007 |
| | CNN + LSTM + EI [21] | - |
| 3D CNN-based methods | VGG-like 3D CNN | 0.0580 |
| | ResNet-like 3D CNN | 0.0737 |
| Proposed Method | | 0.0560 |

**Table 6.** Quantitative comparison of performances of various methods with the stapler as the target object.

| Method | | Average RMSE |
|---|---|---|
| 2D CNN-based methods | 2D CNN-based ResNet [25] | 0.7526 |
| | CNN + LSTM | 0.3385 |
| | CNN + LSTM + EI [21] | 0.2702 |
| 3D CNN-based methods | VGG-like 3D CNN | 0.1554 |
| | ResNet-like 3D CNN | 0.1933 |
| Proposed Method | | 0.1602 |

**Table 7.** Quantitative comparison of performances of various methods with the tube as the target object.

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
