# AR3D: Attention Residual 3D Network for Human Action Recognition

## Abstract

## 1. Introduction

## 2. Related Works

#### 2.1. Human Action Recognition

#### 2.2. Attention Mechanism

#### 2.3. Residual Learning

## 3. Materials and Methods

#### 3.1. Data and Pre-Processing

#### 3.2. 3D Deeper Residual Network (R3D)

#### 3.2.1. Three-Dimensional SFE-Module

#### 3.2.2. Three-Dimensional Residual Module

1. Calculate the mean of each input batch. Let the batch inputs be $x\in \{{x}_{1},{x}_{2},\cdots ,{x}_{n}\}$; the mean $\mu$ is given by $$\mu =\frac{1}{n}\sum _{i=1}^{n}{x}_{i}$$
2. Compute the variance ${\sigma}^{2}$ of each input batch: $${\sigma}^{2}=\frac{1}{n}\sum _{i=1}^{n}{\left({x}_{i}-\mu \right)}^{2}$$
3. Use the mean $\mu$ and the variance ${\sigma}^{2}$ obtained in steps (1) and (2) to normalize the data to zero mean and unit variance, where $\epsilon$ is a small constant that keeps the denominator away from zero: $$\hat{{x}_{i}}=\frac{{x}_{i}-\mu}{\sqrt{{\sigma}^{2}+\epsilon}}$$
4. Apply a scale and shift to the normalized samples $\hat{{x}_{i}}\in \{\hat{{x}_{1}},\hat{{x}_{2}},\cdots ,\hat{{x}_{n}}\}$ obtained in step (3), which yields the output of the normalization layer, with $\gamma$ and $\beta$ as learnable parameters: $${y}_{i}=\gamma \hat{{x}_{i}}+\beta $$
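
The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration over a one-dimensional batch, not the network's actual layer: the function name `batch_norm`, the default values of `gamma`, `beta`, and `eps`, and the sample batch are all assumptions made for the example.

```python
import math


def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch-normalize a list of floats (illustrative, 1D batch only)."""
    n = len(xs)
    mu = sum(xs) / n                                        # step 1: batch mean
    var = sum((x - mu) ** 2 for x in xs) / n                # step 2: batch variance (1/n)
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]   # step 3: normalize
    return [gamma * v + beta for v in x_hat]                # step 4: scale and shift


batch = [1.0, 2.0, 3.0, 4.0]  # hypothetical activations
normalized = batch_norm(batch)
shifted = batch_norm(batch, gamma=2.0, beta=0.5)
```

With the default `gamma` and `beta`, the output has a mean of approximately zero and a variance slightly below one (because of `eps`). In a 3D convolutional network the same statistics would typically be computed per channel over the batch, temporal, and spatial dimensions.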

#### 3.3. Three-Dimensional Attention Mechanism

#### 3.4. Attention Residual 3D Network (AR3D)

#### 3.4.1. AR3D_V1

#### 3.4.2. AR3D_V2

## 4. Experiment

#### 4.1. Experimental Training Process

#### 4.2. Experimental Comparison Model

## 5. Results

#### 5.1. Performance Comparison

#### 5.2. Efficiency Comparison

#### 5.3. Comparison of Three Models Proposed in this Article

## 6. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## References

**Figure 6.** Connections between consecutive frames. The figure shows that the current action $i$ depends not only on its own state but may also be influenced by the preceding action $i-1$ and the subsequent action $i+1$. Capturing the connections between frames of the image sequence may therefore improve the accuracy of human action recognition.

Block Number | Convolutional Layer Number | Convolutional Kernel Size | Convolutional Kernel Number | Pooling Layer Number | Pooling Kernel Size |
---|---|---|---|---|---|
Block_1 | 1 | $3\times 3\times 3$ | 64 | 1 | $1\times 2\times 2$ |
Block_2 | 1 | $3\times 3\times 3$ | 128 | 1 | $2\times 2\times 2$ |
Block_3 | 2 | $3\times 3\times 3$ | 256 | 1 | $2\times 2\times 2$ |
Block_4 | 2 | $3\times 3\times 3$ | 512 | 1 | $2\times 2\times 2$ |
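
The block configuration in the table above can be sanity-checked by tracing how the pooling layers shrink a clip. The sketch below is illustrative only and rests on assumptions not stated in the table: the input clip size of 16 frames at 112 × 112 pixels is hypothetical, the convolutions are assumed to preserve the temporal and spatial size (stride 1 with padding), and each pooling layer is assumed to use a stride equal to its kernel size.

```python
# Per-block settings from the table: (name, conv layers, kernel count, pooling kernel).
BLOCKS = [
    ("Block_1", 1, 64, (1, 2, 2)),
    ("Block_2", 1, 128, (2, 2, 2)),
    ("Block_3", 2, 256, (2, 2, 2)),
    ("Block_4", 2, 512, (2, 2, 2)),
]


def trace_shapes(frames, height, width):
    """Return (block, channels, frames, height, width) after each block's pooling."""
    shape = (frames, height, width)
    rows = []
    for name, _num_convs, channels, pool in BLOCKS:
        # Assumption: convolutions keep the size; pooling divides each axis
        # by its kernel size (stride == kernel, floor division).
        shape = tuple(size // k for size, k in zip(shape, pool))
        rows.append((name, channels) + shape)
    return rows


# Hypothetical input clip: 16 frames of 112 x 112 pixels.
for row in trace_shapes(16, 112, 112):
    print(row)
```

Under these assumptions, the $1\times 2\times 2$ pooling in Block_1 halves only the spatial resolution and leaves the number of frames untouched, so temporal downsampling starts at Block_2.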

Number | Convolutional Kernel Size | Feature Dimension | Activation Function |
---|---|---|---|
Conv_1 | $1\times 1\times 1$ | 128 | ReLU |
Conv_2 | $1\times 3\times 3$ | 128 | ReLU |
Conv_3 | $3\times 1\times 1$ | 128 | ReLU |
Conv_4 | $1\times 1\times 1$ | 512 | NAN |
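
To give a rough sense of what the factorized kernels in the table above cost, the sketch below counts convolution weights for the four layers and compares the total with a single full $3\times 3\times 3$ convolution. Several things here are assumptions rather than statements from the table: "Feature Dimension" is read as the number of output channels, the module input is taken to have 512 channels (matching the Conv_4 output so that a residual addition is possible), and biases and normalization parameters are ignored.

```python
def conv3d_weights(c_in, c_out, kernel):
    """Weight count of a 3D convolution, excluding bias."""
    k_t, k_h, k_w = kernel
    return c_in * c_out * k_t * k_h * k_w


IN_CHANNELS = 512  # assumption: module input matches the Conv_4 output width

# (name, kernel size, output channels) as listed in the table.
LAYERS = [
    ("Conv_1", (1, 1, 1), 128),
    ("Conv_2", (1, 3, 3), 128),
    ("Conv_3", (3, 1, 1), 128),
    ("Conv_4", (1, 1, 1), 512),
]


def module_weights(c_in=IN_CHANNELS):
    """Sum the weights of the four layers, chaining channel counts."""
    total = 0
    for _name, kernel, c_out in LAYERS:
        total += conv3d_weights(c_in, c_out, kernel)
        c_in = c_out
    return total


factorized = module_weights()
full = conv3d_weights(IN_CHANNELS, IN_CHANNELS, (3, 3, 3))
print(factorized, full, round(full / factorized, 1))
```

Under these assumptions, the $1\times 3\times 3$ and $3\times 1\times 1$ pair covers the same spatio-temporal extent as one $3\times 3\times 3$ kernel while operating at a reduced width of 128 channels, which is where most of the saving comes from.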

Model Name | Dimension | Proposal Time | Category |
---|---|---|---|
3D-ConvNet [2] | 3D | 2013 | 3D CNN |
IDT [21] | 2D | 2013 | 2D CNN |
C3D [3] | 3D | 2015 | 3D CNN |
Two-Stream [1] | 2D | 2015 | 2D CNN |
VideoLSTM [22] | 2D | 2016 | LSTM + Attention |
DB-LSTM [9] | 2D | 2017 | LSTM |
Res3D [8] | 3D | 2017 | 3D ResNet |
P3D-A [18] | 3D | 2017 | 3D ResNet |
I3D [23] | 3D | 2018 | 3D CNN |
MiCT-Net [24] | 2D, 3D | 2018 | 2D CNN + 3D CNN |
3D RAN (ResNet-18) [25] | 3D | 2019 | 3D ResNet + Attention |

Method | Pretraining | UCF101 (%) | Gain vs. C3D (3 nets) | HMDB51 (%) | Gain vs. C3D (3 nets) |
---|---|---|---|---|---|
**Baseline** | | | | | |
3D-ConvNet [2] | – | 51.6 | | 24.3 | |
C3D (1 net) [3] | – | 82.3 | | 40.4 | |
C3D (3 nets) [3] | – | 85.2 | | 46.2 | |
**Others** | | | | | |
IDT [21] | – | 86.4 | | 61.7 | |
Two-Stream [1] | ImageNet | 88 | | 59.4 | |
VideoLSTM [22] | – | 79.6 | | 43.3 | |
DB-LSTM [9] | ImageNet | 91.21 | | 87.64 | |
Res3D [8] | – | 85.8 | | 54.9 | |
P3D-A [18] | ImageNet | 83.7 | | – | |
MiCT-Net [24] | ImageNet | 84.3 | | 48.1 | |
3D RAN (ResNet-18) [25] | – | 47.6 | | 21.3 | |
I3D (RGB) [23] | – | 84.5 | | 49.8 | |
I3D (RGB) [23] | ImageNet+Kinetics | 95.4 | | 74.5 | |
I3D (Flow) [23] | – | 90.6 | | 61.9 | |
I3D (Flow) [23] | ImageNet+Kinetics | 95.4 | | 74.6 | |
I3D (Two-stream) [23] | – | 93.4 | | 66.4 | |
I3D (Two-stream) [23] | ImageNet+Kinetics | 97.9 | | 80.2 | |
**Ours** | | | | | |
R3D | – | 87.89 | +2.69 | 50.27 | +4.07 |
AR3D_V1 | – | 88.39 | +3.19 | 51.53 | +5.33 |
AR3D_V2 | – | 89.28 | +4.08 | 52.51 | +6.31 |
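
The signed values in the last three rows are consistent with differences against the C3D (3 nets) baseline (85.2 on UCF101, 46.2 on HMDB51). The short check below recomputes them from the table entries; the dictionary and function names are illustrative, and the choice of reference row is inferred from the arithmetic rather than stated in the table.

```python
# Accuracies copied from the table (percent).
C3D_3NETS = {"UCF101": 85.2, "HMDB51": 46.2}  # inferred reference row

OURS = {
    "R3D": {"UCF101": 87.89, "HMDB51": 50.27},
    "AR3D_V1": {"UCF101": 88.39, "HMDB51": 51.53},
    "AR3D_V2": {"UCF101": 89.28, "HMDB51": 52.51},
}


def gains(model):
    """Percentage-point gain of a model over C3D (3 nets) on each dataset."""
    return {
        dataset: round(acc - C3D_3NETS[dataset], 2)
        for dataset, acc in OURS[model].items()
    }


for name in OURS:
    print(name, gains(name))
```

The same numbers also give the step from AR3D_V1 to AR3D_V2: 0.89 points on UCF101 and 0.98 points on HMDB51.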

