Peer-Review Record

Comparative Study of Movie Shot Classification Based on Semantic Segmentation

Appl. Sci. 2020, 10(10), 3390; https://doi.org/10.3390/app10103390
by Hui-Yong Bak 1 and Seung-Bo Park 2,*
Reviewer 1:
Reviewer 2: Anonymous
Submission received: 22 April 2020 / Revised: 11 May 2020 / Accepted: 12 May 2020 / Published: 14 May 2020
(This article belongs to the Section Computing and Artificial Intelligence)

Round 1

Reviewer 1 Report

This work presents a system to classify frames of a movie into three classes: close-up, medium, and long shots. The system uses a first component to segment the people, based on a Yolact or Mask R-CNN network, and a second CNN based on a ResNet-50 network. Some questions could improve the work:

  1. Is the output of Yolact/Mask R-CNN directly injected as input to the ResNet? Or do you filter the mask or apply thresholding? It would be interesting to have a complete set of figures of the whole process.
  2. How long did it take to train the ResNet-50, and how did you split the general dataset into training/validation/test? Some frames could be very different across the three datasets.
  3. Did you train Yolact, or did you use a pretrained model?
  4. Does the system work in real time? Two CNNs, such as Yolact (which itself uses a ResNet) and another ResNet-50, can require large hardware resources to run in real time.
  5. I think the state of the art on CNN technology used for shot type classification could be improved. There is a section in the article, but there are not enough related references.

Author Response

Reviewer 1

Thank you for reviewing our paper. The answers are as follows.

1. Is the output of Yolact/Mask R-CNN directly injected as input to the ResNet? Or do you filter the mask or apply thresholding? It would be interesting to have a complete set of figures of the whole process.

==> The output image of Yolact and Mask R-CNN is resized to 224 × 224 and then input to ResNet. This is depicted in Section 3.1 by Figure 8, and the following description was added: “After changing the pre-processed frames to 224 × 224, close-up shots, medium shots, and long shots were classified using ResNet-50 as the CNN-based classifier in Figure 8.”
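The resize step described above can be sketched as follows. This is an illustration only: the paper does not state which interpolation method is used, so nearest-neighbour resizing and the frame dimensions here are assumptions, and `resize_nearest` is a hypothetical helper, not the authors' code.

```python
import numpy as np

def resize_nearest(frame: np.ndarray, size: int = 224) -> np.ndarray:
    """Nearest-neighbour resize of an H x W (x C) frame to size x size.

    Stand-in for the 224 x 224 resize before the ResNet-50 classifier;
    the interpolation method is an assumption, not taken from the paper.
    """
    h, w = frame.shape[:2]
    rows = np.arange(size) * h // size   # source row index for each output row
    cols = np.arange(size) * w // size   # source column index for each output column
    return frame[rows[:, None], cols]

# Example: a mock 480 x 854 segmentation mask, resized for the classifier.
mask = np.zeros((480, 854), dtype=np.uint8)
mask[100:300, 300:500] = 1               # a segmented "person" region
resized = resize_nearest(mask)
print(resized.shape)                     # (224, 224)
```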

2. How long did it take to train the ResNet-50, and how did you split the general dataset into training/validation/test? Some frames could be very different across the three datasets.

- How long did it take to train the ResNet-50?

==> The ResNet's training time was added at the end of Section 3.2 as follows: “The training time of ResNet-50 and Yolact with the highest accuracy was 1,654 s, and this model took 66 s to classify 600 test frames, i.e., approximately 0.11 s per frame.”

- How did you split the general dataset into training/validation/test? Some frames could be very different across the three datasets.

==> How the general dataset was split is described in Section 3.2, and the following description was inserted there: “The ratios of training data and validation data were set to 82% and 15% for shot, and 85% and 15% for medium shot to get enough training data.”

We randomly extracted 11,066 frames from 89 movies. First, 10,466 frames were extracted for training and validation from 75 movies, and then 600 frames were chosen for testing from another 14 movies. This is described at the beginning of Section 3.2, and we inserted the word “randomly” to clarify that our data selection represents the whole. However, classification errors occurred in cases whose characteristics mix two shot types, as shown in Figure 11. We will study how to reduce these errors. We described this in Section 3.3.
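The split described above holds out whole movies for testing, so no test frame comes from a movie seen during training. A minimal sketch of that movie-level split is below; the movie IDs and per-movie frame counts are illustrative, not the paper's actual data, and `split_by_movie` is a hypothetical helper.

```python
import random

def split_by_movie(frames, test_movies):
    """Split (movie_id, frame) pairs so test frames come from held-out movies.

    Keeping whole movies out of training avoids near-duplicate frames
    from the same movie leaking into the test set.
    """
    train_val = [f for f in frames if f[0] not in test_movies]
    test = [f for f in frames if f[0] in test_movies]
    return train_val, test

random.seed(0)
# Illustrative data: 89 movies, each with a random number of sampled frames.
frames = [(m, f) for m in range(89) for f in range(random.randint(50, 200))]
held_out = set(random.sample(range(89), 14))   # 14 movies reserved for testing
train_val, test = split_by_movie(frames, held_out)
# No movie appears on both sides of the split.
assert not {m for m, _ in train_val} & {m for m, _ in test}
```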

3. Did you train Yolact, or did you use a pretrained model?

==> Thank you for pointing it out. We used a pre-trained model of Yolact. This has been added to Section 3.1 as follows: “Pre-trained models were used for Mask R-CNN and Yolact for semantic segmentation.”

4. Does the system work in real time? Two CNNs, such as Yolact (which itself uses a ResNet) and another ResNet-50, can require large hardware resources to run in real time.

==> This model takes approximately 0.11 s to classify each frame. Therefore, the model cannot operate in real time at 30 frames per second. This was added at the end of Section 3.2 as follows: “this model took 66 s to classify 600 test frames, i.e., approximately 0.11 s per frame. Therefore, this model is difficult to apply in real time at 30 frames per second.”

5. I think the state of the art on CNN technology used for shot type classification could be improved. There is a section in the article, but there are not enough related references.

==> The latest studies, references 25 and 26, were added in Section 2.2. They achieve some advances; however, their approaches apply only to specific domains such as sports movies and music concerts. In contrast, the purpose of our approach is to classify shots in general movies, and thus we compared against VGG-16 and ResNet-50.

These descriptions were inserted in Section 2.2 as follows.

“Some CNN-based approaches have been studied for shot type classification [25, 26]. They accomplished some advances; however, their approaches apply only to specific domains such as sports movies and music concerts. In contrast, our approach aims to classify shots of general movies, and thus we will compare VGG-16 and ResNet-50 as applied to movies.”

 

Reviewer 2 Report

This paper describes a novel method to classify shot types as a pre-task of movie analysis. The authors employ pre-processing with semantic segmentation, together with ResNet-50 and Yolact, to improve the average classification accuracy.

The results are quite interesting, and the paper is worth publishing for researchers who work on similar topics.

For readers who are not familiar with this research field, it would be better to add more description of the methods used. For example, Yolact is one of the key technologies introduced; however, the paper does not explain what Yolact is. It should be explained.

Author Response

1. For readers who are not familiar with this research field, it would be better to add more description of the methods used. For example, Yolact is one of the key technologies introduced; however, the paper does not explain what Yolact is. It should be explained.

==> Thank you for reviewing our paper.

Mask R-CNN and Yolact are representative semantic segmentation techniques, used to distinguish the boundaries between objects. Both use a fully convolutional network (FCN). Unlike Mask R-CNN, Yolact uses the full image space without compressing the image, resulting in better semantic segmentation performance than Mask R-CNN.

These descriptions were already given in Section 2.2.3, and we added the sentence “Unlike Mask R-CNN, Yolact uses the full image space without compressing the image, resulting in better semantic segmentation performance than Mask R-CNN.”
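As a rough illustration of what a semantic segmentation mask provides downstream (isolating person pixels from the background before shot-type classification), consider the sketch below. This is not the paper's pre-processing code; the frame, mask, and masking rule are all assumptions for illustration.

```python
import numpy as np

# Illustrative only: apply a binary person mask (as produced by a
# segmentation model such as Yolact or Mask R-CNN) to a frame, so the
# downstream classifier sees the segmented people rather than the background.
rng = np.random.default_rng(0)
frame = rng.integers(1, 256, size=(4, 4, 3), dtype=np.uint8)  # mock RGB frame
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True                        # "person" pixels from segmentation
segmented = np.where(mask[..., None], frame, 0)  # zero out the background
print(segmented.shape)                       # (4, 4, 3)
```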

 
