Scene description refers to the automatic generation of natural language descriptions from videos. In general, deep learning-based scene description networks utilize multimodalities, such as image, motion, audio, and label information, to improve the description quality. In particular, image information plays an important role in scene description. However, scene description has a potential issue, because it may handle images with severe compression artifacts. Hence, this paper analyzes the impact of video compression on scene description, and then proposes a simple network that is robust to compression artifacts. In addition, a network cascading more encoding layers for efficient multimodal embedding is also proposed. Experimental results show that the proposed network is more efficient than conventional networks.
This is an open access article distributed under the Creative Commons Attribution License
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited