# AR3D: Attention Residual 3D Network for Human Action Recognition


## Abstract


## 1. Introduction

## 2. Related Works

#### 2.1. Human Action Recognition

#### 2.2. Attention Mechanism

#### 2.3. Residual Learning

## 3. Materials and Methods

#### 3.1. Data and Pre-Processing

#### 3.2. 3D Deeper Residual Network (R3D)

#### 3.2.1. Three-Dimensional SFE-Module

#### 3.2.2. Three-Dimensional Residual Module

1. Calculate the mean of each input batch of data. Assuming that the batch input data are $x\in \{{x}_{1},{x}_{2},\cdots ,{x}_{n}\}$ and the obtained mean value is $\mu$, the mean is calculated by
$$\mu =\frac{1}{n}\sum _{i=1}^{n}{x}_{i}$$
2. Solve the variance ${\sigma}^{2}$ of each input batch of data:
$${\sigma}^{2}=\frac{1}{n}\sum _{i=1}^{n}{\left({x}_{i}-\mu \right)}^{2}$$
3. Use the mean $\mu$ and the variance ${\sigma}^{2}$ obtained in steps (1) and (2) to normalize the data so that it follows a zero-mean, unit-variance distribution, where $\epsilon$ is a small constant added for numerical stability:
$$\hat{{x}_{i}}=\frac{{x}_{i}-\mu}{\sqrt{{\sigma}^{2}+\epsilon}}$$
4. Perform a scale transformation and translation on the normalized samples $\hat{{x}_{i}}\in \{\hat{{x}_{1}},\hat{{x}_{2}},\cdots ,\hat{{x}_{n}}\}$ obtained in step (3), and finally obtain the output of the normalization layer:
$${y}_{i}=\gamma \hat{{x}_{i}}+\beta$$
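The four batch-normalization steps above can be sketched in a few lines of NumPy (a minimal illustration, not the authors' implementation; `gamma` and `beta` stand for the learnable parameters $\gamma$ and $\beta$, and `eps` for the constant $\epsilon$):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization following steps (1)-(4)."""
    mu = x.mean(axis=0)                     # (1) batch mean
    var = ((x - mu) ** 2).mean(axis=0)      # (2) batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # (3) normalize to zero mean, unit variance
    return gamma * x_hat + beta             # (4) learnable scale and shift

x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # toy batch of n = 3 samples
y = batch_norm(x, gamma=1.0, beta=0.0)
# with gamma = 1 and beta = 0, each output column has mean ~0 and variance ~1
```

During training, $\gamma$ and $\beta$ are updated by backpropagation, so the network can recover the original activation scale if that is optimal.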

#### 3.3. Three-Dimensional Attention Mechanism

#### 3.4. Attention Residual 3D Network (AR3D)

#### 3.4.1. AR3D_V1

#### 3.4.2. AR3D_V2

## 4. Experiment

#### 4.1. Experimental Training Process

#### 4.2. Experimental Comparison Model

## 5. Results

#### 5.1. Performance Comparison

#### 5.2. Efficiency Comparison

#### 5.3. Comparison of Three Models Proposed in this Article

## 6. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## References

- Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2014; pp. 568–576.
- Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D Convolutional Neural Networks for Automatic Human Action Recognition. U.S. Patent 8,345,984, 1 January 2013.
- Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Las Condes, Chile, 11–18 December 2015; pp. 4489–4497.
- Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Cham, Switzerland, 2016; pp. 20–36.
- Feichtenhofer, C.; Pinz, A.; Zisserman, A. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1933–1941.
- Feichtenhofer, C.; Pinz, A.; Wildes, R.P. Spatiotemporal multiplier networks for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4768–4777.
- Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal segment networks for action recognition in videos. IEEE Trans. Pattern Anal. Mach. Intell. **2018**, 41, 2740–2755.
- Tran, D.; Ray, J.; Shou, Z.; Chang, S.F.; Paluri, M. ConvNet architecture search for spatiotemporal feature learning. arXiv **2017**, arXiv:1708.05038.
- Ullah, A.; Ahmad, J.; Muhammad, K.; Sajjad, M.; Baik, S.W. Action recognition in video sequences using deep bi-directional LSTM with CNN features. IEEE Access **2017**, 6, 1155–1166.
- Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; Tang, X. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3156–3164.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
- Woo, S.; Park, J.; Lee, J.Y.; So Kweon, I. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
- Roy, A.G.; Navab, N.; Wachinger, C. Concurrent spatial and channel ‘squeeze & excitation’ in fully convolutional networks. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Granada, Spain, 16–20 September 2018; Springer: Cham, Switzerland, 2018; pp. 421–429.
- Mnih, V.; Heess, N.; Graves, A.; Kavukcuoglu, K. Recurrent models of visual attention. Adv. Neural Inf. Process. Syst. **2014**, 27, 2204–2212.
- Ba, J.; Mnih, V.; Kavukcuoglu, K. Multiple object recognition with visual attention. arXiv **2014**, arXiv:1412.7755.
- Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial transformer networks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2015; pp. 2017–2025.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
- Qiu, Z.; Yao, T.; Mei, T. Learning spatio-temporal representation with pseudo-3D residual networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5533–5541.
- Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; Serre, T. HMDB: A large video database for human motion recognition. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2556–2563.
- Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv **2012**, arXiv:1212.0402.
- Wang, H.; Schmid, C. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 3551–3558.
- Li, Z.; Gavrilyuk, K.; Gavves, E.; Jain, M.; Snoek, C.G. VideoLSTM convolves, attends and flows for action recognition. Comput. Vis. Image Underst. **2018**, 166, 41–50.
- Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308.
- Zhou, Y.; Sun, X.; Zha, Z.J.; Zeng, W. MiCT: Mixed 3D/2D convolutional tube for human action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 449–458.
- Cai, J.; Hu, J. 3D RANs: 3D residual attention networks for action recognition. Vis. Comput. **2020**, 36, 1261–1270.

**Figure 6.** Connections between consecutive multi-frames. The figure shows that the current action $i$ is affected not only by its own state, but possibly also by its predecessor action $i-1$ and successor action $i+1$. Therefore, if the connections between the frames of an image sequence can be captured, the accuracy of human action recognition may be improved.

| Block Number | Convolutional Layer Number | Convolutional Kernel Size | Convolutional Kernel Number | Pooling Layer Number | Pooling Kernel Size |
|---|---|---|---|---|---|
| Block_1 | 1 | $3\times 3\times 3$ | 64 | 1 | $1\times 2\times 2$ |
| Block_2 | 1 | $3\times 3\times 3$ | 128 | 1 | $2\times 2\times 2$ |
| Block_3 | 2 | $3\times 3\times 3$ | 256 | 1 | $2\times 2\times 2$ |
| Block_4 | 2 | $3\times 3\times 3$ | 512 | 1 | $2\times 2\times 2$ |

| Number | Convolutional Kernel Size | Feature Dimension | Activation Function |
|---|---|---|---|
| Conv_1 | $1\times 1\times 1$ | 128 | ReLU |
| Conv_2 | $1\times 3\times 3$ | 128 | ReLU |
| Conv_3 | $3\times 1\times 1$ | 128 | ReLU |
| Conv_4 | $1\times 1\times 1$ | 512 | None |
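One plausible motivation for the bottleneck design listed above (channel reduction via $1\times 1\times 1$, then factorized $1\times 3\times 3$ spatial and $3\times 1\times 1$ temporal convolutions) is parameter economy. A rough count as a sketch, assuming the block's input has 512 channels, as the final $1\times 1\times 1$ layer suggests, and ignoring biases:

```python
def conv3d_params(kt, kh, kw, c_in, c_out):
    """Weight count of a 3D convolution with kernel kt x kh x kw (biases ignored)."""
    return kt * kh * kw * c_in * c_out

# Bottleneck stack from the table: 512 -> 128 -> 128 -> 128 -> 512 channels
stack = (conv3d_params(1, 1, 1, 512, 128)    # Conv_1: channel reduction
         + conv3d_params(1, 3, 3, 128, 128)  # Conv_2: spatial convolution
         + conv3d_params(3, 1, 1, 128, 128)  # Conv_3: temporal convolution
         + conv3d_params(1, 1, 1, 128, 512)) # Conv_4: channel restoration

# A single dense 3x3x3 convolution over 512 channels, for comparison
full = conv3d_params(3, 3, 3, 512, 512)

print(stack)  # 327680
print(full)   # 7077888
```

Under these assumptions, the factorized stack uses roughly 21 times fewer weights than a dense $3\times 3\times 3$ convolution at the same channel width.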

| Model Name | Dimension | Proposal Time | Category |
|---|---|---|---|
| 3D-ConvNet [2] | 3D | 2013 | 3D CNN |
| IDT [21] | 2D | 2013 | 2D CNN |
| C3D [3] | 3D | 2015 | 3D CNN |
| Two-Stream [1] | 2D | 2015 | 2D CNN |
| VideoLSTM [22] | 2D | 2016 | LSTM + Attention |
| DB-LSTM [9] | 2D | 2017 | LSTM |
| Res3D [8] | 3D | 2017 | 3D ResNet |
| P3D-A [18] | 3D | 2017 | 3D ResNet |
| I3D [23] | 3D | 2018 | 3D CNN |
| MiCT-Net [24] | 2D, 3D | 2018 | 2D CNN + 3D CNN |
| 3D RAN (ResNet-18) [25] | 3D | 2019 | 3D ResNet + Attention |
| Method | Pretraining | UCF101 (%) | HMDB51 (%) |
|---|---|---|---|
| **Baseline** | | | |
| 3D-ConvNet [2] | – | 51.6 | 24.3 |
| C3D (1 net) [3] | – | 82.3 | 40.4 |
| C3D (3 nets) [3] | – | 85.2 | 46.2 |
| **Others** | | | |
| IDT [21] | – | 86.4 | 61.7 |
| Two-Stream [1] | ImageNet | 88.0 | 59.4 |
| VideoLSTM [22] | – | 79.6 | 43.3 |
| DB-LSTM [9] | ImageNet | 91.21 | 87.64 |
| Res3D [8] | – | 85.8 | 54.9 |
| P3D-A [18] | ImageNet | 83.7 | – |
| MiCT-Net [24] | ImageNet | 84.3 | 48.1 |
| 3D RAN (ResNet-18) [25] | – | 47.6 | 21.3 |
| I3D (RGB) [23] | – | 84.5 | 49.8 |
| I3D (RGB) [23] | ImageNet + Kinetics | 95.4 | 74.5 |
| I3D (Flow) [23] | – | 90.6 | 61.9 |
| I3D (Flow) [23] | ImageNet + Kinetics | 95.4 | 74.6 |
| I3D (Two-stream) [23] | – | 93.4 | 66.4 |
| I3D (Two-stream) [23] | ImageNet + Kinetics | 97.9 | 80.2 |
| **Ours** | | | |
| R3D | – | 87.89 (+2.69) | 50.27 (+4.07) |
| AR3D_V1 | – | 88.39 (+3.19) | 51.53 (+5.33) |
| AR3D_V2 | – | 89.28 (+4.08) | 52.51 (+6.31) |

For our models, the gains in parentheses are measured against the strongest baseline, C3D (3 nets).


© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Dong, M.; Fang, Z.; Li, Y.; Bi, S.; Chen, J.
AR3D: Attention Residual 3D Network for Human Action Recognition. *Sensors* **2021**, *21*, 1656.
https://doi.org/10.3390/s21051656
