Peer-Review Record

Violence Recognition Based on Auditory-Visual Fusion of Autoencoder Mapping

Electronics 2021, 10(21), 2654; https://doi.org/10.3390/electronics10212654
by Jiu Lou, Decheng Zuo *, Zhan Zhang and Hongwei Liu
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 27 August 2021 / Revised: 24 October 2021 / Accepted: 26 October 2021 / Published: 29 October 2021

Round 1

Reviewer 1 Report

The paper presents a multimodal audio-visual method for the detection of violent content in video streams, tested and validated on the MediaEval 2015 Affective Impact of Movies task. The presented method uses deep neural network-based audio-visual feature extractors whose outputs are encoded via an autoencoder (AE) architecture into a shared semantic subspace, with the final results obtained by a set of fully connected layers. While the results on the MediaEval dataset surpass the top results registered during the MediaEval competition and the approach presented in the paper is interesting and promising, my opinion is that a set of changes and updates must be made to the paper and to the experiments in order for it to be considered for publication:

1. A number of additional studies and upgrades must be performed in order to better understand the performance gains brought by different components presented in the paper, namely:

  • While AlexNet was one of the first convolutional architectures to gain traction and attention, it is an old architecture, and the integration of newer architectures may provide better results. Nonetheless, an ablation study that tests more types of architectures for both audio and visual feature extraction would represent an interesting addition to the paper.
  • CONCAT is used for feature fusion; however, it would be interesting to see how other methods of feature fusion perform (see the sketch after this list).
  • The same observation applies to the use of the AutoEncoder architecture.
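For readers skimming this report, the following is a minimal PyTorch sketch of the kind of pipeline under discussion: concatenated audio and visual feature vectors mapped by an autoencoder into a shared semantic subspace and classified by fully connected layers, together with one possible alternative fusion operator. All layer sizes, class names, and dimensions are illustrative assumptions; this is not the authors' implementation.

```python
# Minimal sketch (not the authors' code): audio and visual feature vectors are
# fused, an autoencoder maps the fused vector into a shared semantic subspace,
# and fully connected layers classify it. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class FusionAEClassifier(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=256, latent_dim=64, num_classes=2):
        super().__init__()
        fused_dim = audio_dim + visual_dim
        # Encoder: fused features -> shared semantic subspace.
        self.encoder = nn.Sequential(nn.Linear(fused_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        # Decoder: reconstructs the fused input (used only for the AE training loss).
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, fused_dim))
        # Final fully connected layers producing violence / non-violence scores.
        self.classifier = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                                        nn.Linear(32, num_classes))

    def forward(self, audio_feat, visual_feat):
        fused = torch.cat([audio_feat, visual_feat], dim=1)   # CONCAT fusion
        latent = self.encoder(fused)
        return self.classifier(latent), self.decoder(latent), fused

class GatedFusion(nn.Module):
    """One possible alternative to CONCAT: a learned gate mixing the modalities
    (assumes both feature vectors have the same dimensionality)."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, audio_feat, visual_feat):
        g = self.gate(torch.cat([audio_feat, visual_feat], dim=1))
        return g * audio_feat + (1.0 - g) * visual_feat

# Quick shape check with random tensors standing in for real features.
model = FusionAEClassifier()
logits, recon, fused = model(torch.randn(4, 128), torch.randn(4, 256))
alt = GatedFusion(64)(torch.randn(4, 64), torch.randn(4, 64))
```

Swapping a `GatedFusion` module (or an element-wise sum or product) in place of the `torch.cat` call is the kind of comparison the ablation request above refers to.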

2. The analysis of the current state-of-the-art with regard to violence prediction and related works can be greatly improved, as in its current state it takes up a small portion of the paper and gives few details regarding current methods.

3. Some references are missing, for example:

  • There are no references for AlexNet, AutoEncoders, LSTMs, Mel cepstrum features, or LPC features.
  • A reference can be given for the advantage of using a waveform envelope map over a spectrogram. It is mentioned that this is determined "through experiments"; however, no examples of such experiments are given.
  • The use of frame differences instead of frames for processing video sequences is widely used; however, some references to examples of this being used in the literature must be given.

4. Some text from the paper is not needed in my opinion, as it describes theoretical aspects from other papers that can be summarized and referenced, thus creating space for more interesting original content and experiments.

  • On Page 4, Equations 1-6 serve to describe the ConvLSTM architecture, and the text that follows these equations (rows 139-146) goes into further detail regarding the description of ConvLSTM. Considering that these equations are not original and the ConvLSTM paper is adequately referenced, perhaps this section can be shortened (the standard formulation is reproduced after this list for reference).
  • In Section 4.1.2 the paper goes into detail with regard to the evaluation methodology used in the MediaEval dataset, including the equations for the provided metrics. I believe that a simple listing of the provided metrics would be sufficient in this case, considering that the MediaEval dataset paper is referenced.
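For reference only, the standard ConvLSTM formulation of Shi et al., which the manuscript's Equations 1-6 presumably follow (the paper's exact notation may differ); here \(*\) denotes convolution and \(\circ\) the Hadamard product, with input \(X_t\), hidden state \(H_t\), and cell state \(C_t\):

```latex
\begin{align}
i_t &= \sigma\!\left(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i\right)\\
f_t &= \sigma\!\left(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f\right)\\
\tilde{C}_t &= \tanh\!\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right)\\
C_t &= f_t \circ C_{t-1} + i_t \circ \tilde{C}_t\\
o_t &= \sigma\!\left(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o\right)\\
H_t &= o_t \circ \tanh\!\left(C_t\right)
\end{align}
```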

5. On the other hand, few details are given regarding the data enhancement process presented in Section 4.1.1, or the additional frame-level annotations presented in the same section, carried out in order to ensure semantic consistency. More details would shed some light on how these processes were performed and their overall contribution to the results.

6. Tables 1 and 4 are unclear - some horizontal borders are missing and it is quite hard to understand which layers belong to which module and which results belong to which feature type.

7. It is unclear how the setup was created for separating the auditory and the visual features for the results presented in the paper.

8. Figure 4 can be redesigned as it is too large at this moment. For example, the audio and visual feature extractors can be represented by single blocks, considering that they were already presented in previous Figures.

9. Two task-specific difficulties are presented in the introduction; however, they are only briefly addressed in the rest of the paper. I was expecting a more in-depth analysis of these two problems, perhaps with some positive and negative examples of how these problems are alleviated.

10. There is a large number of English presentation problems. I would strongly suggest having the paper corrected by a native English speaker or a professional translator before resubmitting. Some examples include:

  • Capital letter after ";" - page 1, row 13
  • Capital letter after "," - page 1, row 40
  • "Garcia-Gomez that recognized ..." - page 1, row 43
  • "... which many progresses ..." - page 2 row 52
  • "... fusion for the violent identification ..." - page 2, row 55
  • "... focuses on the above two problems ..." - page 2, row 63
  • etc

Author Response

The paper presents a multimodal audio-visual method for the detection of violent content in video streams, tested and validated on the MediaEval 2015 Affective Impact of Movies task. The presented method uses deep neural network-based audio-visual feature extractors whose outputs are encoded via an autoencoder (AE) architecture into a shared semantic subspace, with the final results obtained by a set of fully connected layers. While the results on the MediaEval dataset surpass the top results registered during the MediaEval competition and the approach presented in the paper is interesting and promising, my opinion is that a set of changes and updates must be made to the paper and to the experiments in order for it to be considered for publication:

  1. A number of additional studies and upgrades must be performed in order to better understand the performance gains brought by different components presented in the paper, namely:

  • While AlexNet was one of the first convolutional architectures to gain traction and attention, it is an old architecture, and the integration of newer architectures may provide better results. Nonetheless, an ablation study that tests more types of architectures for both audio and visual feature extraction would represent an interesting addition to the paper.
  • CONCAT is used for feature fusion; however, it would be interesting to see how other methods of feature fusion perform.
  • The same observation applies to the use of the AutoEncoder architecture.

Reply: First of all, thank you very much for your suggestions. We quite agree with you: every detail in the model should be explored to improve the results. However, due to the limitations of space and time, we could not do too much. Therefore, we have added a new analysis of feature combinations and a comparison of different fusion methods in lines 363 to 394. Please understand that a detailed analysis of the main part of the feature extraction model will be the focus of our future work, as mentioned at line 428. Thank you again for your suggestions and comments!

  2. The analysis of the current state-of-the-art with regard to violence prediction and related works can be greatly improved, as in its current state it takes up a small portion of the paper and gives few details regarding current methods.

Reply: According to your comments, we have reorganized the introduction, added highly cited articles, and summarized and commented on the current violence detection algorithms. See lines 28 to 108 of the article for details.

  3. Some references are missing, for example:

  • There are no references for AlexNet, AutoEncoders, LSTMs, Mel cepstrum features, or LPC features.
  • A reference can be given for the advantage of using a waveform envelope map over a spectrogram. It is mentioned that this is determined "through experiments"; however, no examples of such experiments are given.
  • The use of frame differences instead of frames for processing video sequences is widely used; however, some references to examples of this being used in the literature must be given.

Reply: Thank you very much for your suggestions. According to your comments, we have added references where needed, for example:

  In line 40: Mel-frequency cepstral and LPC (linear predictive coding) features used to identify violence [8].

  In line 46: (LSTM) [12]

  In line 112: autoencoder [27]

  In line 129: AlexNet [29]

  In line 192: inter-frame differences in a video can theoretically extract the required information more accurately than the video frame itself [7] (a minimal sketch of frame differencing follows this reply).

  For the advantage of using a waveform envelope map over a spectrogram, we give our experimental results in lines 174 to 186 of the paper.
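To make the frame-difference point concrete, here is a minimal OpenCV sketch of inter-frame differencing; the file name is a placeholder, and nothing here is taken from the paper's code or from VSD2015.

```python
# Minimal sketch of inter-frame differencing with OpenCV; "clip.mp4" is a
# placeholder path, not a file from the paper or from the dataset.
import cv2

cap = cv2.VideoCapture("clip.mp4")
ok, prev = cap.read()
if not ok:
    raise SystemExit("could not read the video")
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

diffs = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # The absolute difference of consecutive frames highlights motion regions.
    diffs.append(cv2.absdiff(gray, prev_gray))
    prev_gray = gray
cap.release()
```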

  4. Some text from the paper is not needed in my opinion, as it describes theoretical aspects from other papers that can be summarized and referenced, thus creating space for more interesting original content and experiments.

  • On Page 4, Equations 1-6 serve to describe the ConvLSTM architecture, and the text that follows these equations (rows 139-146) goes into further detail regarding the description of ConvLSTM. Considering that these equations are not original and the ConvLSTM paper is adequately referenced, perhaps this section can be shortened.
  • In Section 4.1.2 the paper goes into detail with regard to the evaluation methodology used in the MediaEval dataset, including the equations for the provided metrics. I believe that a simple listing of the provided metrics would be sufficient in this case, considering that the MediaEval dataset paper is referenced.

Reply: According to your suggestions, we use the reference "the ConvLSTM network [29]" in line 198 instead of the detailed description, and simplify the details of the evaluation indicators. The new introduction of the evaluation indicators is in lines 359 to 361.

  5. On the other hand, few details are given regarding the data enhancement process presented in Section 4.1.1, or the additional frame-level annotations presented in the same section, carried out in order to ensure semantic consistency. More details would shed some light on how these processes were performed and their overall contribution to the results.

Reply: According to your suggestions, we add some details about semantic correspondence from line 258 to line 281, and give the semantic results in lines 377 to 394.

  6. Tables 1 and 4 are unclear - some horizontal borders are missing and it is quite hard to understand which layers belong to which module and which results belong to which feature type.

Reply: According to your suggestions, we have added the missing horizontal borders in these tables.

  7. It is unclear how the setup was created for separating the auditory and the visual features for the results presented in the paper.

Reply: We wonder if our understanding is correct: do you mean to ask how we separate the visual and auditory signals? The auditory and visual signals were already separated in the MediaEval 2015 data when we downloaded them. We add this explanation at line 330. If we have misunderstood, please correct us. Thank you very much.

  8. Figure 4 can be redesigned as it is too large at this moment. For example, the audio and visual feature extractors can be represented by single blocks, considering that they were already presented in previous Figures.

Reply: According to your suggestions, we redesigned Figure 4 (line 298).

  9. Two task-specific difficulties are presented in the introduction; however, they are only briefly addressed in the rest of the paper. I was expecting a more in-depth analysis of these two problems, perhaps with some positive and negative examples of how these problems are alleviated.

Reply: According to your suggestions, we give in-depth explanations of the time-axis misalignment and semantic deviation problems. The explanations are in lines 89 to 108.

  10. There is a large number of English presentation problems. I would strongly suggest having the paper corrected by a native English speaker or a professional translator before resubmitting. Some examples include:
  • Capital letter after ";" - page 1, row 13
  • Capital letter after "," - page 1, row 40
  • "Garcia-Gomez that recognized ..." - page 1, row 43
  • "... which many progresses ..." - page 2 row 52
  • "... fusion for the violent identification ..." - page 2, row 55
  • "... focuses on the above two problems ..." - page 2, row 63
  • etc

Reply: We apologize for the language problems in the original manuscript. We have used the English editing service of the website recommended by the journal.

 

Author Response File: Author Response.pdf

Reviewer 2 Report

The paper presents an interesting idea but it should be reorganized in some parts:
- A section dedicated to the state of the art is missing. It is fundamental, as it helps the reader to focus on the problem.
- Algorithm 1 should be referenced and commented on in the text.
- In Section 4.1.1, a table should be inserted describing the dataset used.
- The experimental phase should be enriched with further tests by adding another dataset.
- In the specific context, the following paper, which adopts a similar deep learning approach, should be cited:


Manzo, M.; Pellino, S. Voting in Transfer Learning System for Ground-Based Cloud Classification. Mach. Learn. Knowl. Extr. 2021, 3, 542-553. https://doi.org/10.3390/make3030028

Author Response

The paper presents an interesting idea but it should be reorganized in some parts:

(1) A section dedicated to the state of the art is missing. It is fundamental, as it helps the reader to focus on the problem.

Reply: First of all, thank you very much for your suggestions. According to your comments, we have reorganized the introduction and summarized and commented on the current violence detection algorithms. See lines 28 to 108 of the article for details.

(2) Algorithm 1 should be referenced and commented on in the text.

Reply: According to your recommendation, we comment on Algorithm 1 at line 362.

(3) In Section 4.1.1, a table should be inserted describing the dataset used.

Reply: According to your recommendation, we add a new table describing the dataset at line 334.

(4) The experimental phase should be enriched with further tests by adding another dataset.

Reply: We agree with you very much. However, among the currently published violence datasets, only VSD2015 contains audio-visual information, so we can only verify the fusion effect on this dataset; we ask for your understanding. Nevertheless, in order to test the effectiveness of the algorithm proposed in this paper, we have added some experiments, including a comparison between the feature combination methods and the late fusion methods, and visualized the distribution of features after the autoencoder, so as to prove the effectiveness of this method. These results are shown in lines 363 to 394.
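As a side note for readers, the distinction between feature-level fusion and decision-level (late) fusion referred to in this reply can be illustrated with a toy example; the arrays and scores below are placeholders, not results from the paper.

```python
# Illustrative contrast between early (feature-level) and late (decision-level)
# fusion; the arrays and scores are placeholders, not the paper's results.
import numpy as np

audio_feat, visual_feat = np.random.rand(64), np.random.rand(128)
audio_score, visual_score = 0.72, 0.41   # per-modality violence probabilities

early_fused = np.concatenate([audio_feat, visual_feat])      # fed to a joint classifier
late_fused_score = 0.5 * audio_score + 0.5 * visual_score    # average of decision scores
print(early_fused.shape, late_fused_score)
```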

(5) In the specific context, the following paper that adopts a similar deep learning approach should be cited:

Reply: According to your recommendation, we have added some highly cited and recent technical articles, including this paper, for example at line 44.

Manzo, M.; Pellino, S. Voting in Transfer Learning System for Ground-Based Cloud Classification. Mach. Learn. Knowl. Extr. 2021, 3, 542-553. https://doi.org/10.3390/make3030028

Round 2

Reviewer 1 Report

I would like to thank and congratulate the authors for the impressive number of updates to the paper, realized in a relatively short timespan. Some minor points do still exist, and it is my opinion that after these are solved the paper will be ready for publishing. These points are as follows:

  1. There are still some places where the presentation of the paper must be improved, either because of English presentation or because of page alignment. These are as follows:
    1. Row 27 - "...are an important..."
    2. Row 39 - "The auditory feature is thus represented..."
    3. Row 53 - "In terms of auditory...."
    4. Rows 120 - 124 - should be "In section X" instead of "in the section"
    5. Row 116 - "semantically corresponding labels..."
    6. Row 129 - "on the temporal axis"
    7. Row 312 - "autoencoder model is used to calculate ..."
    8. Row 324 - Algorithm 1 - is misaligned compared with the rest of the page
    9. Row 334 - "272 samples of violence ..."
    10. Row 339 - "rotation or transition"
    11. Row 340 - "the size of the two classes of samples consistent"
    12. Row 353 - Table 2 - is misaligned compared with the rest of the page
    13. Row 431 - "while only AlexNet is used in our experiments"
  2. A few clarifications regarding some paragraphs:
    1. Rows 46-47: it is not clear to me in what way the "information fusion at the decision level is equivalent to the fusion of features (i.e., decision scores),"
    2. Rows 142-147: That entire paragraph must be clarified, as it is hard to read and some sentences seem incomplete - "Because ..." has no final statement (a cause without an effect), "due to the complexity of violent sound source ...", "where the occurence ...", "It is incorrect".
  3. Figure 5 is a great addition to the paper, as it shows differences in class separation between audio and visual. The description of the Figure should add some details to allow readers to understand it at a glance. For example: "Visualization of videos in ultimate feature space, presenting visual (left) and auditory (right) features."

Author Response

I would like to thank and congratulate the authors for the impressive number of updates to the paper, realized in a relatively short timespan. Some minor points do still exist, and it is my opinion that after these are solved the paper will be ready for publishing. These points are as follows:

1. There are still some places where the presentation of the paper must be improved, either because of English presentation or because of page alignment. These are as follows:

    1. Row 27 - "...are an important..."
    2. Row 39 - "The auditory feature is thus represented..."
    3. Row 53 - "In terms of auditory...."
    4. Rows 120 - 124 - should be "In section X" instead of "in the section"
    5. Row 116 - "semantically corresponding labels..."
    6. Row 129 - "on the temporal axis"
    7. Row 312 - "autoencoder model is used to calculate ..."
    8. Row 324 - Algorithm 1 - is misaligned compared with the rest of the page
    9. Row 334 - "272 samples of violence ..."
    10. Row 339 - "rotation or transition"
    11. Row 340 - "the size of the two classes of samples consistent"
    12. Row 353 - Table 2 - is misaligned compared with the rest of the page
    13. Row 431 - "while only AlexNet is used in our experiments"

Reply: First of all, thank you very much for your professional advice and careful guidance. According to your comments, we have corrected all of the above mistakes and marked the changes in the article, and we have re-checked the grammar of the paper, hoping that it now meets the requirements.

2. A few clarifications regarding some paragraphs:

    1. Rows 46-47: it is not clear to me in what way the "information fusion at the decision level is equivalent to the fusion of features (i.e., decision scores),"
    2. Rows 142-147: That entire paragraph must be clarified, as it is hard to read and some sentences seem incomplete - "Because ..." has no final statement (a cause without an effect), "due to the complexity of violent sound source ...", "where the occurence ...", "It is incorrect".

Reply: We apologize for our unclear expression. According to your comments, we have modified the description as follows:

  1. Rows 76-79: We use "This is mainly because the information fusion at the decision level is equivalent to the fusion of semantically similar features in the same feature space (i.e., decision scores), which has less risk and is relatively easy to achieve." instead of "information fusion at the decision level is equivalent to the fusion of features (i.e., decision scores)".
  2. Rows 142-150: We rearrange the expression of this paragraph as "When deep network is used for end-to-end feature learning, a large number of data sets with uniform distribution of the occurrence frequency of classified events are needed [30]. However, the current violent audio data sets are difficult to meet this requirement. This is because violent sounds are mostly sudden, such as gunfire and shouting, and their occurrence time and frequency are not fixed, random and unpredictable. As a result, the distribution of violent audio events and non-violent audio events in data sets is uneven, and the network cannot fully learn the features of violent audio. Therefore, this paper does not use the end-to-end feature extraction model for feature extraction of violent audio."

3.Figure 5 is a great addition to the paper, as it shows differences in class separation between audio and visual. The description of the Figure should add some details to allow readers to understand it at a glance. For example: "Visualization of videos in ultimate feature space, presenting visual (left) and auditory (right) features."

Reply: In order to describe the visualization results clearly, we add a description of how the visualization features are calculated at rows 387-390, and give a direct description of Figure 5 at rows 391-393, as you suggest.
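As a generic illustration of this kind of feature-space visualization (the paper's own procedure is described at the rows cited above), high-dimensional embeddings are commonly projected to 2-D with t-SNE along the following lines; the random features and labels below are placeholders only, not the paper's data.

```python
# Generic 2-D projection of high-dimensional features for visual inspection;
# the random arrays stand in for the real encoded audio-visual features.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

features = np.random.rand(200, 64)           # placeholder encoded features
labels = np.random.randint(0, 2, size=200)   # 0 = non-violent, 1 = violent

embedded = TSNE(n_components=2, random_state=0).fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="coolwarm", s=10)
plt.title("Feature-space visualization (illustrative only)")
plt.show()
```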

Author Response File: Author Response.pdf

Reviewer 2 Report

As far as I'm concerned, no further changes are required

Author Response

Thank you very much for your comments!
