Open Access Article
Peer-Review Record

Exploring Methods for Predicting Important Utterances Contributing to Meeting Summarization

Multimodal Technologies Interact. 2019, 3(3), 50
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Received: 28 May 2019 / Revised: 3 July 2019 / Accepted: 4 July 2019 / Published: 6 July 2019
(This article belongs to the Special Issue Multimodal Conversational Interaction and Interfaces)

Round 1

Reviewer 1 Report

This paper investigates the problem of extractive meeting summarization. In particular, the authors propose a prediction model for important utterances that exploits multimodal and multiparty features. They first carefully conduct a corpus collection experiment to obtain a dataset suited to the problem. They then propose handcrafted feature models as well as a deep-learning-based model. The experiments show that the deep-learning-based model outperforms the handcrafted model. Finally, they implement a useful meeting browser, conduct a user study with it, and show that it contributes to understanding the content of the discussion.


This paper is very detailed and covers a lot, including the process of the data collection, model description, experiments, and even an application demo.

The writing is ok and the details of technical settings are elaborate.


The deep-learning model is somewhat too naive, especially the design of the verbal model in Section 4.2.3. Recurrent neural networks and their variants have typically proved effective for modeling utterance sequences. However, the authors use a network similar to the nonverbal model.

The related work is not sufficient. Besides the extractive approaches to text, speech, and visual information summarization, abstractive approaches to meeting summarization should be included (e.g., Murray et al., "Abstractive meeting summarization as a Markov decision process"; Singla et al., "Automatic community creation for abstractive spoken conversations summarization"; Zhao et al., "Abstractive Meeting Summarization via Hierarchical Adaptive Segmental Network Learning").

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

This work is on automatic meeting summarization using multimodal features that capture different aspects of the interaction. One of the novel aspects of the paper is in fusing the multimodal features for summarization purposes. Another focus of the paper is on comparing models using hand-crafted features with deep learning models that take relatively low-level inputs. Finally, a user-study is carried out to validate the usefulness of the generated summaries. A novel group interaction dataset was gathered to support this research.

Overall, the paper is interesting, well-written, and represents a very substantial amount of work.

For the first research question, on the usefulness of multimodal features for summarization, and for the third research question, on the utility of the summaries for aiding users with a meeting browser, I think the conclusions from the research are fairly clear: multimodal features are useful, and the meeting summaries are useful for the designated tasks.

For the second question, on whether the deep learning approach outperforms the hand-crafted approach, there is a crucial piece missing which makes the comparison very uncertain. Specifically, the deep learning model combines verbal and nonverbal features, while the hand-crafted features (SP/OT, PR, CO) are all nonverbal. This does not allow us to make a direct head-to-head comparison, and it is a very odd omission from the experiments. It is even more surprising, given that there are then baseline systems that use verbal features (HC_V, BOW, V_ALL). And, indeed, we see from Table 9 that these baseline systems are competitive with the hand-crafted features models.

More specifically, why is there not a V_ALL + NV_ALL hand-crafted model that combines all of the verbal and nonverbal features into one model? Until we see this, we cannot make a fair comparison between the hand-crafted and deep learning models.

As a baseline, you should also implement a system that greedily selects the longest utterances in each meeting. This can work surprisingly well as a baseline for meeting summarization.
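The longest-utterance baseline suggested above could be sketched roughly as follows. This is only an illustration, not the authors' or the reviewer's implementation: the function name `longest_utterance_baseline`, whitespace tokenization, and the `budget_words` stopping criterion are all assumptions, since the review does not say how many utterances should be selected.

```python
from typing import List


def longest_utterance_baseline(utterances: List[str], budget_words: int) -> List[int]:
    """Greedily pick the longest utterances of one meeting.

    Assumptions (not from the paper): length is counted in
    whitespace-separated tokens, and selection stops adding an
    utterance once it would exceed `budget_words`.
    Returns the selected indices in original (chronological) order.
    """
    lengths = [len(u.split()) for u in utterances]
    # Longest first; sorted() is stable, so ties keep meeting order.
    order = sorted(range(len(utterances)), key=lambda i: -lengths[i])

    selected: List[int] = []
    used = 0
    for i in order:
        if used + lengths[i] > budget_words:
            continue  # does not fit; try the next (shorter) utterance
        selected.append(i)
        used += lengths[i]
    return sorted(selected)
```

The selected indices can then be scored against the gold important-utterance labels with the same metrics used for the other models, which puts the baseline on the same footing as the systems in the results tables.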

For Tables 9, 10, and 11, are there statistically significant differences?
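One way such a significance check might be carried out is a paired sign-flip permutation test over per-fold (or per-meeting) scores of two models. The sketch below is an assumption about a suitable procedure, not the test the authors used; the function name, the number of resamples, and the idea that paired per-fold scores are available are all hypothetical.

```python
import random
from typing import Sequence


def paired_permutation_test(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 10000,
    seed: int = 0,
) -> float:
    """Two-sided paired sign-flip permutation test.

    scores_a[i] and scores_b[i] must come from the same fold or
    meeting. Returns an estimated p-value for the null hypothesis
    that the two models perform equally well.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be paired and equal in length")

    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)  # fixed seed for reproducibility

    hits = 0
    for _ in range(n_resamples):
        # Under the null, the sign of each paired difference is arbitrary.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) >= observed - 1e-12:
            hits += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)
```

A small p-value would suggest that the difference between the two compared models is unlikely to be due to chance in how folds were drawn; with several pairwise comparisons per table, a multiple-comparison correction would also be worth considering.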

Regarding lines 781-798, can you provide any similar insight and analysis into cases where the verbal models classified correctly and the nonverbal models did not?

Table 15 is confusing, and it’s not immediately clear what it is showing. Partly this is due to the term “Naive,” which I think is meant to refer to the model previously described as “Simple.” You need to be more clear in this section that you are using the “Simple” model as gold-reference annotations -- initially it sounds like this is meant to be a third experimental condition.

In the related work, you mention two previous corpora (AMI and ELEA), but many others could be added as well, e.g. the ICSI Meeting Corpus, D64 corpus, Team Entrainment Corpus, MULTISIMO corpus, GAP corpus, etc. Not necessarily all of them, but more than just the two.

The Related Work is also lacking in coverage of previous user studies that have been done on meeting summarization, e.g. “Generating and Validating Abstracts of Meeting Conversations: a User Study” by Murray et al. There is also not much discussion of the move towards more abstractive-style summarization in the field over the past few years.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

All the concerns have been addressed and explained clearly by the authors.

Reviewer 2 Report

The paper is much improved in its revised form. The new structure will be much clearer for readers. The related work section is also much more comprehensive. The authors have provided sufficient responses to the initial reviews. 
