Special Issue "Multimodal Deep Learning Methods for Video Analytics"

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: 30 April 2019

Special Issue Editor

Guest Editor
Dr. Seungmin Rho

Department of Media Software, Sungkyul University, Anyang 430-742, Korea
Interests: artificial intelligence; deep learning; multimedia retrieval and recommendation; software engineering; VR/AR/MR applications

Special Issue Information

Dear Colleagues,

Video capture devices are now ubiquitous, and the video data they produce covers virtually all aspects of our daily lives. The videos captured by these devices range from edited content (movies, serials, etc.) at one end to a huge amount of unedited content (consumer videos, egocentric videos, etc.) at the other. Because of this ubiquity, videos contain rich information and knowledge that can be extracted and analyzed for a variety of applications. Video analytics is a broad field encompassing the design and development of systems that can automatically analyze videos to detect spatial and temporal events of interest.

In the last few years, deep learning algorithms have shown tremendous performance in many research areas, especially computer vision and natural language processing (NLP). Deep learning-based algorithms have attained a level of performance in tasks such as image recognition, speech recognition, and NLP that was beyond expectation a decade ago. In multimodal deep learning, data obtained from different sources are used to learn features over multiple modalities, which helps generate a shared representation between the modalities. The use of multiple modalities is expected to yield superior performance. An example in video analytics is the joint use of audio, visual, and (possibly) textual data for analysis.
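To make the idea of a shared representation concrete, consider a minimal late-fusion sketch in PyTorch (illustrative only, not drawn from any paper in this issue; the layer sizes, feature dimensions, and seven-class output are all assumptions): each modality is encoded separately, and the embeddings are concatenated into a joint representation.

    import torch
    import torch.nn as nn

    class LateFusionNet(nn.Module):
        """Encode each modality separately, then concatenate the
        embeddings into a shared representation for classification."""
        def __init__(self, audio_dim=128, visual_dim=512, hidden=256, n_classes=7):
            super().__init__()
            # Modality-specific encoders (sizes are illustrative assumptions).
            self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
            self.visual_enc = nn.Sequential(nn.Linear(visual_dim, hidden), nn.ReLU())
            # Classifier over the fused (shared) representation.
            self.classifier = nn.Linear(2 * hidden, n_classes)

        def forward(self, audio, visual):
            shared = torch.cat([self.audio_enc(audio), self.visual_enc(visual)], dim=1)
            return self.classifier(shared)

    # Example: a batch of 4 clips with precomputed audio and visual features.
    logits = LateFusionNet()(torch.randn(4, 128), torch.randn(4, 512))
    print(logits.shape)  # torch.Size([4, 7])

In practice, early fusion (combining raw features before encoding) and attention-based fusion are common alternatives; where to fuse is one of the central design decisions in multimodal deep learning.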

The objectives of this Special Issue are to gather work on video analytics using multimodal deep learning methods and to introduce new, large-scale, real-world applications of video analytics.

We solicit original research and survey papers addressing topics including (but not limited to):

  • Analysis of first-person/wearable videos using multimodal deep learning techniques
  • Analysis of web videos, egocentric videos, surveillance videos, movies, or any other type of video using multimodal deep learning techniques
  • Data collection, benchmarking, and performance evaluation of deep learning-based video analytics
  • Multimodal deep convolutional neural networks for audio-visual emotion recognition
  • Multimodal deep learning frameworks with cross weights
  • Multimodal information fusion via deep learning or machine learning methods

The topics in video analytics may include (but are not limited to):

  • Object detection and recognition
  • Action recognition
  • Event detection
  • Video highlights, summary and storyboard generation
  • Segmentation and tracking
  • Authoring and editing of videos
  • Scene understanding
  • People analysis
  • Security issues in surveillance videos

Dr. Seungmin Rho
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All papers will be peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1500 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Audio-Visual Emotion Recognition
  • Deep Learning
  • Natural Language Processing
  • Video Analytics

Published Papers (3 papers)


Research

Open Access Article: Bilinear CNN Model for Fine-Grained Classification Based on Subcategory-Similarity Measurement
Appl. Sci. 2019, 9(2), 301; https://doi.org/10.3390/app9020301
Received: 16 November 2018 / Revised: 10 January 2019 / Accepted: 11 January 2019 / Published: 16 January 2019
PDF Full-text (1141 KB) | HTML Full-text | XML Full-text
Abstract
One of the challenges in fine-grained classification is that subcategories with significant similarity are hard to distinguish, because existing algorithms treat all subcategories equally. To solve this problem, a fine-grained image classification method combining a bilinear convolutional neural network (B-CNN) with the measurement of subcategory similarities is proposed. First, an improved weakly supervised localization method is designed to obtain the bounding box of the main object, which allows the model to eliminate the influence of background noise and obtain more accurate features. Then, sample features in the training set are computed by the B-CNN so that a fuzzing similarity matrix for measuring interclass similarities can be obtained. To further improve classification accuracy, the loss function is designed by weighting triplet loss and softmax loss. Extensive experiments on two benchmark datasets, Stanford Cars-196 and Caltech-UCSD Birds-200-2011 (CUB-200-2011), show that the proposed method outperforms several state-of-the-art weakly supervised classification models in accuracy.
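The weighted combination of triplet loss and softmax loss described in the abstract can be sketched as follows (a minimal PyTorch illustration, not the authors' code; the margin and the weight alpha are assumed values):

    import torch.nn as nn

    triplet = nn.TripletMarginLoss(margin=0.5)  # margin is an assumed value
    cross_entropy = nn.CrossEntropyLoss()       # softmax + cross-entropy

    def combined_loss(anchor, positive, negative, logits, labels, alpha=0.5):
        # Weighted sum of the metric-learning term (pulls same-subcategory
        # embeddings together, separates similar subcategories) and the
        # standard classification term.
        metric_term = triplet(anchor, positive, negative)
        class_term = cross_entropy(logits, labels)
        return alpha * metric_term + (1.0 - alpha) * class_term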

Open Access Article: Deep Learning Based Computer Generated Face Identification Using Convolutional Neural Network
Appl. Sci. 2018, 8(12), 2610; https://doi.org/10.3390/app8122610
Received: 30 October 2018 / Revised: 30 November 2018 / Accepted: 10 December 2018 / Published: 13 December 2018
Cited by 1 | PDF Full-text (3670 KB) | HTML Full-text | XML Full-text
Abstract
Generative adversarial networks (GANs) are an emerging class of generative models that have made impressive progress in recent years in generating photorealistic facial images. As a result, it has become more and more difficult to distinguish computer-generated face images from real ones, even with the human eye. If generated images are used with the intent to mislead and deceive viewers, they could cause severe ethical, moral, and legal issues. Moreover, it is challenging to collect a dataset for computer-generated face identification that is large enough for research purposes, because the number of realistic computer-generated images is still limited and scattered across the internet. Thus, the development of a novel decision support system for analyzing and detecting GAN-generated face images is crucial. In this paper, we propose a customized convolutional neural network, namely CGFace, designed specifically for the computer-generated face detection task by customizing the number of convolutional layers, so that it performs well in detecting computer-generated face images. An imbalanced framework (IF-CGFace) is then created by altering CGFace's layer structure to address the imbalanced-data issue: features are extracted from CGFace layers and used to train AdaBoost and eXtreme Gradient Boosting (XGB) classifiers. Next, we explain the process of generating a large computer-generated dataset based on the state-of-the-art PCGAN and BEGAN models. Various experiments are then carried out to show that the proposed model with augmented input yields the highest accuracy, at 98%. Finally, we provide comparative results by applying the proposed CNN architecture to images generated by other GAN studies.
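The framework's idea of training boosted classifiers on deep features can be sketched roughly as follows (an illustration only: TinyCNN is a hypothetical stand-in for CGFace, the labels are placeholders, and all sizes are assumptions):

    import torch
    import torch.nn as nn
    from xgboost import XGBClassifier  # sklearn's AdaBoostClassifier slots in the same way

    # Hypothetical stand-in for the detection CNN: any network whose
    # penultimate layer yields a feature vector that can be tapped.
    class TinyCNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                nn.Linear(16 * 4 * 4, 64), nn.ReLU())
            self.head = nn.Linear(64, 2)  # real vs. computer-generated

        def forward(self, x):
            return self.head(self.features(x))

    cnn = TinyCNN().eval()
    with torch.no_grad():
        feats = cnn.features(torch.randn(32, 3, 64, 64)).numpy()  # deep features
    labels = (feats[:, 0] > 0).astype(int)  # placeholder labels for the sketch
    clf = XGBClassifier(n_estimators=50).fit(feats, labels)  # booster on deep features

Training a tree-based ensemble on fixed deep features decouples the classifier from the network, making it easier to reweight or resample minority-class examples.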

Open Access Article: Temporal Modeling on Multi-Temporal-Scale Spatiotemporal Atoms for Action Recognition
Appl. Sci. 2018, 8(10), 1835; https://doi.org/10.3390/app8101835
Received: 28 August 2018 / Revised: 25 September 2018 / Accepted: 30 September 2018 / Published: 6 October 2018
PDF Full-text (6127 KB) | HTML Full-text | XML Full-text
Abstract
As an important branch of video analysis, human action recognition has attracted extensive research attention in the computer vision and artificial intelligence communities. In this paper, we propose to model the temporal evolution of multi-temporal-scale atoms for action recognition. An action can be considered a temporal sequence of action units. These action units, which we refer to as action atoms, can capture the key semantic and characteristic spatiotemporal features of actions at different temporal scales. We first investigate Res3D, a powerful 3D CNN architecture, and create variants of Res3D for different temporal scales. At each temporal scale, we design practices to transfer the knowledge learned from RGB to optical flow (OF) and build RGB and OF streams to extract deep spatiotemporal information using Res3D. Then we propose an unsupervised method to mine action atoms in the deep spatiotemporal space. Finally, we use long short-term memory (LSTM) to model the temporal evolution of atoms for action recognition. The experimental results show that our proposed multi-temporal-scale spatiotemporal atom modeling method achieves recognition performance comparable to that of state-of-the-art methods on two challenging action recognition datasets: UCF101 and HMDB51.
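The final step, modeling the temporal evolution of atoms with an LSTM, might look like the following sketch (the dimensions, number of atoms, and 101-class output are illustrative assumptions; atom mining is assumed to have already produced one feature vector per atom):

    import torch
    import torch.nn as nn

    class AtomLSTM(nn.Module):
        """Classify an action from a temporal sequence of atom features."""
        def __init__(self, feat_dim=512, hidden=256, n_classes=101):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.fc = nn.Linear(hidden, n_classes)

        def forward(self, atoms):            # atoms: (batch, n_atoms, feat_dim)
            _, (h_n, _) = self.lstm(atoms)   # h_n: (num_layers, batch, hidden)
            return self.fc(h_n[-1])          # logits from the last hidden state

    logits = AtomLSTM()(torch.randn(2, 8, 512))  # 2 videos, 8 atoms each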
