Non-Semantic Multimodal Fusion for Predicting Segment Access Frequency in Lecture Archives
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The main question addressed by the research is a multimodal neural network-based approach to predict segment access frequency (SAF) in lecture archives. The study focuses on real classroom lecture archives, characterized by unedited footage. “Non-Semantic Multimodal Fusion for Predicting Segment Access Frequency in Lecture Archives” takes a relevant approach to the field and addresses a gap, namely, how to handle unedited recordings that are difficult to navigate and review efficiently. This research adds a model with significant results. The archives demonstrate a useful hybrid educational strategy by supporting a blended learning model in which students attend in-person classes and use online recordings for additional review. The model incorporates audio spectrograms, slide page progression, and multimodal information from teachers' activities. The methodology is adequate, and only by enhancing the sample could this article gain new and further perspectives on the topic. The conclusions are consistent with the evidence and arguments and do address the main question posed, because the article brings the literature review and the data collected and discussed into a coherent analysis. The references are appropriate, related to contemporary research on the topic, and validate the findings.
Considering the relevance of the theme, the originality of the proposal, the theoretical consistency and the methodological clarity, this joint opinion recommends the acceptance of the article, with praise for the quality of the writing, the relevance of the content and its practical and scientific applicability.
Author Response
We are truly grateful for your thorough review and exceptionally positive assessment of our manuscript. We are greatly encouraged by your praise for the article's originality, theoretical consistency, methodological clarity, and overall quality.
We especially appreciate your insightful comment that "only by enhancing the sample could this article gain new and further perspectives on the topic." We completely agree that this is a crucial direction for future work. As you noted, the scope of the dataset is a key limitation of the current study, and we have highlighted this in both the Section 5.3 "Limitations" and Sections 6 "Conclusion" of our paper. Expanding the dataset to include more diverse courses and instructors is a primary goal for our subsequent research.
Thank you once again for your valuable time and encouraging feedback.
Reviewer 2 Report
Comments and Suggestions for Authors
Dear authors, congratulations on shedding light on a less approached topic, which may have great impact on future education, in an increasingly digitalised society.
Experiments are always valuable because they test practically what should be done to improve educational support for students. It is a well-designed study, with multiple stages and analysis.
The results support the conclusions offered for new approaches in education.
Author Response
We sincerely thank you for your time and your very positive and encouraging feedback on our work.
We are delighted that you recognized the novelty of our research topic and its potential impact on future education. We are also very grateful for your commendation of our work as a "well-designed study" and for your affirmation that our results support the conclusions.
Your encouraging words and positive assessment are a great motivation for us. Thank you once again.
Reviewer 3 Report
Comments and Suggestions for Authors
- Clarity and Scope: "Does the title accurately and clearly reflect the scope and focus of the review article? Does it immediately tell you what the review is about and if it aligns with your interests?" The title highlights the Non-Semantic Multimodal Fusion approach for Predicting Segment Access Frequency in Lecture Archives.
- Impact and Novelty: "Does the title suggest any novel contribution or specific angle the review is taking? Does it promise to synthesize existing knowledge in a new or particularly useful way, or is it a very broad overview?" Yes but there is ambiguity on what is the main novelty and contribution; the non-semantic extraction? The model that follows? The segment access frequency accuracy or validity?
- Summary Accuracy: "Does the abstract effectively and accurately summarize the key objectives, methods, main findings, and conclusions of the review article? Does it feel like a true miniature version of the whole review?" There is ambiguity on the benchmark data used to confirm the segmentation frequency accuracy and validity.
- Information Completeness: "Does the abstract provide enough information to decide if the full review is worth reading for your specific purpose? Are there any crucial details missing that would help you assess its relevance upfront?" Somewhat.
- Key Takeaways and Implications: "Based only on the abstract, what are the most important takeaways from this review? What are the potential implications or significance of these findings as presented in the abstract?" Not well contextualized into qualitative findings and implications; rather, emphasis is placed on numerical reporting of the results.
- Context and Rationale: "Does the introduction clearly establish the background context and significance of the topic being reviewed? Is the rationale for conducting this review well-justified and persuasive?" Yes.
- Research Question/Objective Clarity: "Is the research question or objective of the review explicitly stated and easy to understand?" Yes, but the research questions may be entangled with one another, and the study may lead to overgeneralizations.
- Synthesis and Quality Assessment Transparency: "Is it clear how the authors synthesized the data from included studies (qualitative or quantitative methods)? Did they assess the quality or risk of bias of the included studies? Is the methodology section transparent enough to allow for replication and assessment of rigor?" Somewhat.
- Evidence Strength and Consistency: "Based on the summary of findings, how strong and consistent is the evidence presented? Are the findings supported by multiple studies, or are there inconsistencies and conflicting results highlighted?" Somewhat.
- Critical Analysis and Interpretation (within Results): "Within the results section, do the authors go beyond simply summarizing findings? Do they offer any initial critical analysis or interpretation of the synthesized evidence, or is this primarily left for the discussion?" Limitations are well elaborated. But the connection of the methods and results to real-world application is not well justified and explained.
Author Response
Comments 1: "Does the title accurately and clearly reflect the scope and focus of the review article? Does it immediately tell you what the review is about and if it aligns with your interests?" The title highlights the Non-Semantic Multimodal Fusion approach for Predicting Segment Access Frequency in Lecture Archives.
Response 1: Thank you for your comment. We are pleased that the reviewer found our title to be clear and accurate.
Comments 2: "Does the title suggest any novel contribution or specific angle the review is taking? Does it promise to synthesize existing knowledge in a new or particularly useful way, or is it a very broad overview?" Yes, but there is ambiguity on what is the main novelty and contribution; the non-semantic extraction? The model that follows? The segment access frequency accuracy or validity?
Response 2: Thank you for pointing out this ambiguity. To clarify, the primary novelty of our work is indeed the non-semantic extraction methodology. This is because most previous studies in this field have used natural language processing approaches based on the content of the lecturer's speech. This is a powerful approach, but it is difficult to apply when the speech quality is poor or multilingual support is required. Therefore, we have revised the abstract to make this focus explicit from the very first sentence. For example, the abstract now begins with: "This study proposes a non-semantic multimodal approach to predict segment access frequency (SAF) in lecture archives." We believe this change now clearly establishes the main contribution of our paper.
Comments 3: "Does the abstract effectively and accurately summarize the key objectives, methods, main findings, and conclusions of the review article? Does it feel like a true miniature version of the whole review?" There is ambiguity on the benchmark data used to confirm the segmentation frequency accuracy and validity.
Response 3: Thank you for highlighting the ambiguity concerning the benchmark data in the abstract. To address this, we have revised the abstract to include a more direct description of our evaluation dataset. The following sentence has been added: "The model was evaluated on 665 labeled one-minute segments from one such course." We believe this clarification enhances the completeness and accuracy of the abstract.
Comments 4: "Does the abstract provide enough information to decide if the full review is worth reading for your specific purpose? Are there any crucial details missing that would help you assess its relevance upfront?" Somewhat.
Response 4: We have updated the summary in accordance with the response to Comment 3.
Comments 5: "Based only on the abstract, what are the most important takeaways from this review? What are the potential implications or significance of these findings as presented in the abstract?" Not well contextualized into qualitative findings and implications; rather, emphasis is placed on numerical reporting of the results.
Response 5: We agree that the abstract focused too much on numerical results, without sufficiently contextualizing the qualitative implications. To rectify this, we have added the following concluding sentence: "These results demonstrate our system's capacity to enhance lecture archives by automatically identifying key segments, which aids students in efficient, targeted review and provides instructors with valuable data for pedagogical feedback."
Comments 6: "Does the introduction clearly establish the background context and significance of the topic being reviewed? Is the rationale for conducting this review well-justified and persuasive?" Yes
Response 6: Thank you. We are pleased the reviewer found the background and rationale for our study to be well-justified and persuasive.
Comments 7: "Is the research question or objective of the review explicitly stated and easy to understand?" Yes, but the research questions may be entangled with one another, and the study may lead to overgeneralizations.
Response 7: Thank you for your critical feedback. We agree that the research questions needed more clarity. To resolve the issues of entanglement and potential overgeneralization, we have restructured them into a main question and two sub-questions, as follows:
- Main RQ: How effectively can Segment Access Frequency be predicted in lecture archives using exclusively non-semantic features?
- Sub RQ1: What is the relative contribution of each non-semantic modality (action, voice, and slide) to the overall prediction performance?
- Sub RQ2: Which combination of multimodal fusion strategy and backbone architecture provides the best trade-off between prediction accuracy and computational cost?
We believe this hierarchical structure provides a much clearer logical roadmap and grounds our conclusions more effectively.
Comments 8: "Is it clear how the authors synthesized the data from included studies (qualitative or quantitative methods)? Did they assess the quality or risk of bias of the included studies? Is the methodology section transparent enough to allow for replication and assessment of rigor?" Somewhat.
Response 8: Thank you for your assessment. We understand that the rigor and generalizability of our methodology are inherently linked to the scope of our dataset, which is currently based on a single course by a single instructor. We have explicitly acknowledged this as a key limitation of our study in Section 5.3 “Limitations”. In this section, we also state that expanding the dataset to include more diverse educational contexts is a primary goal of our future work, which we believe is the most direct way to enhance the methodological rigor of our approach.
Comments 9: "Based on the summary of findings, how strong and consistent is the evidence presented? Are the findings supported by multiple studies, or are there inconsistencies and conflicting results highlighted?" Somewhat.
Response 9: Thank you for your feedback. We acknowledge that our results show some performance variance across the different validation folds, which we have transparently presented in Table 6. We attribute this to the inherent diversity within our real-world lecture dataset. Improving the consistency of our findings by expanding the dataset is a primary goal for our future work, as stated in Section 5.3 “Limitations”.
Comments 10: "Within the results section, do the authors go beyond simply summarizing findings? Do they offer any initial critical analysis or interpretation of the synthesized evidence, or is this primarily left for the discussion?" Limitations are well elaborated. But the connection of the methods and results to real-world application is not well justified and explained.
Response 10: Thank you for this invaluable feedback. We agree that the connection between our results and their real-world applications was not sufficiently justified in the previous draft.
To thoroughly address this gap, we have introduced a new dedicated subsection, "5.2 Practical Implications and Potential Applications". In this new section, we use a concrete case study from our results (the model's prediction for Lesson 7) to illustrate the tangible value of our framework from two key perspectives:
- For students: We demonstrate how our system can automatically generate a highly compressed summary (e.g., reducing a 95-minute lecture to a 22-minute highlight reel) to facilitate efficient and targeted review.
- For instructors: We explain how the SAF heatmap serves as a pedagogical diagnostic tool. Crucially, we also elaborate on how the model's predictive capability can address the "cold-start" problem, enabling proactive teaching adjustments for new lectures.
We believe this new section, directly inspired by your insightful comment, now clearly bridges the gap between our methods and their practical utility. We are grateful for your feedback, which has led to a significant improvement in our manuscript.
Reviewer 4 Report
Comments and Suggestions for Authors
The paper focuses on an important aspect of education – recordings of lectures. The authors outline related challenges and background, including studies in the domain presented in the literature. They propose their own approach and solution for identifying high-SAF segments in real educational settings. The focus is on students’ actual viewing behaviors.
The methodology and research are presented in a detailed, well-ordered way. The description of the proposed prediction framework is followed by a description of the experiments carried out to evaluate its effectiveness.
In the Discussion chapter, the authors address all of the questions posed for the study.
The authors discuss the applicability of the developed framework in a logical and convincing way. The limitations of the study are also presented appropriately.
The conclusions provide a relevant summary of the presented research and indicate possible future work.
My personal impression is that the first-person narration does not contribute to the paper comprehension.
Detailed remarks:
LINES 38, 39 The statements “recordings of face-to-face lectures without editing, have emerged as a prevalent choice for many institutions” should be supported with a literature reference. Terms “prevailing choice” and “many institutions” surely should be based on some study. It is also not clear whether the authors mean “worldwide” (which surely is too wide) or, e.g., some part of the world.
LINE 45, REF [4]. A reference dated 2014 seems insufficient to illustrate the statement. During the past ten years, the presence of video in everyday life has changed and, in particular, increased (people prefer “to watch” rather than “to read”). It might also affect comprehension of educational videos (assimilation of the conveyed information), including recorded lectures.
LINES 69-71 The sentence “These recordings are typically unedited, collected without auxiliary hardware such as eye-tracking devices, and involve only a small number of learners” is confusing. In the earlier sentence it is indicated that the study focuses on recording of lecturers. While the next sentence gives impression that the recordings regarded the learners.
LINE 119. To revise, regarding reference [8]. I have access only to the abstract, and it seems that the paper does not concern “video summarization” while reference [8] does.
LINE 126. To revise, regarding reference [11]. I have access only to the abstract, which is insufficient to assess relevance of the paper. Authors are asked to revise whether the paper does illustrate/support the statement “deep learning models have gained prominence in video summarization,”
LINE 127. To consider. Should that be “Singh and Kumar”?
LINE 178. To consider: adding a short information about what study presented in [20] revealed.
LINE 183. Should it be “Chen and Wu”?
Chapter 3. First-person narration is not the best choice. The whole description should present the study as “process/actions that can be repeated”, not as “author’s study/research diary”.
LINES 259-261 – Repeated information about the content of Table 1.
It is not explained why there is focus on Lesson 1 and Lesson 4 (and no mention of the other lessons) – see FIG 1, FIG 2, and further description.
Author Response
Thank you for your valuable and constructive feedback on our manuscript. We have found your comments to be incredibly helpful and have revised the paper thoroughly according to your suggestions.
To facilitate the review process, we have used the “changes” package in LaTeX to highlight all modifications. All added text is shown in blue in the PDF of the revised manuscript.
Below, we address each of your comments in detail.
Comments 1: LINES 38, 39 The statements “recordings of face-to-face lectures without editing, have emerged as a prevalent choice for many institutions” should be supported with a literature reference. Terms “prevailing choice” and “many institutions” surely should be based on some study. It is also not clear whether the authors mean “worldwide” (which surely is too wide) or, e.g., some part of the world.
Response 1: We thank the reviewer for this valuable suggestion. We agree that our original statement was too broad and required evidentiary support. To address this, we have substantially revised the paragraph (lines 39-44). The new version replaces the general claim with a more precise statement, clarifying that lecture capture is a common practice in higher education in countries such as the United Kingdom and the United States. This assertion is now substantiated with three relevant citations [3-5].
Comments 2: LINE 45, REF [4]. A reference dated 2014 seems insufficient to illustrate the statement. During the past ten years, the presence of video in everyday life has changed and, in particular, increased (people prefer “to watch” rather than “to read”). It might also affect comprehension of educational videos (assimilation of the conveyed information), including recorded lectures.
Response 2: We thank the reviewer for this insightful comment. We agree that our argument is strengthened by more contemporary evidence. To address this, we have revised the text (lines 51-53) to present a more evolved perspective, supported by recent literature [8, 9]. The revised sentence now reads:
"For instance, students can find it difficult to maintain attention throughout extended viewing periods [7,8], and contemporary research indicates that the key to enhancing learning with long-form material is not merely shortening it, but providing a meaningful, navigable structure [9]."
This refined argument provides stronger and more timely support for our study's premise.
Comments 3: LINES 69-71 The sentence “These recordings are typically unedited, collected without auxiliary hardware such as eye-tracking devices, and involve only a small number of learners” is confusing. In the earlier sentence it is indicated that the study focuses on recording of lecturers. While the next sentence gives impression that the recordings regarded the learners.
Response 3: We thank the reviewer for pointing out this confusing sentence. We agree that the original phrasing was ambiguous. To resolve this, we have rewritten the paragraph (lines 77-83) to clearly separate the description of the lecture recording itself from the context of its use by learners. The revised text now clarifies that the "small number of learners" and "no eye-tracking" refer to the audience and data collection environment, not the initial recording subject. The revised sentences read:
"These settings are often resource-constrained, which defines the core challenges we address. Specifically, the recordings themselves are typically unedited, capturing the instructor with a fixed, ceiling-mounted camera and microphone, which can result in audio quality that is too noisy or indistinct for reliable automatic transcription. Furthermore, the context of their use is also constrained: the archives serve a small number of learners from a single course, and the viewing data from these learners is collected without any auxiliary hardware such as eye-tracking devices."
We believe this revision resolves the ambiguity.
Comments 4: LINE 119. To revise, regarding reference [8]. I have access only to the abstract, and it seems that the paper does not concern “video summarization” while reference [8] does.
Response 4: The reviewer rightly pointed out that this paper's placement created ambiguity. Our original intention was to use this reference to support the underlying "need" for summarization, as it notes that students often find lecture videos too lengthy. We have now moved this citation to a more appropriate context in the Introduction, where it directly supports this specific point (line 52).
Comments 5: LINE 126. To revise, regarding reference [11]. I have access only to the abstract, which is insufficient to assess relevance of the paper. Authors are asked to revise whether the paper does illustrate/support the statement “deep learning models have gained prominence in video summarization,”
Response 5: The reviewer was also correct in questioning this reference. Upon re-evaluation, we determined it was not an accurate source for our claim. We have therefore removed both the specific clause and the incorrect citation to ensure the manuscript's accuracy.
Comments 6: LINE 127. To consider. Should that be “Singh and Kumar”?
Comments 7: LINE 183. Should it be “Chen and Wu”?
Response 6&7: Thank you for pointing out these formatting inconsistencies. We have reviewed the journal's APA style guidelines and have corrected the in-text citations. In addition to correcting the two instances you pointed out (now on lines 136 and 194), we have also conducted a full review of our bibliography to ensure all other two-author references are now cited correctly according to the journal's guidelines.
Comments 8: LINE 178. To consider: adding a short information about what study presented in [20] revealed.
Response 8: Thank you for this valuable suggestion. As requested, we have revised the text (lines 187-189) to include the key findings from the Deng and Gao (2024) study. The added text reads:
"Interestingly, their study revealed no discernible effect on learning performance but did find that the embedded questions significantly reduced students' total viewing time."
This finding strongly supports our paragraph's subsequent point and makes the argument more coherent.
Comments 9: Chapter 3. First-person narration is not the best choice. The whole description should present the study as “process/actions that can be repeated”, not as “author’s study/research diary”.
Response 9: Thank you for this advice on narration style. We agree that a third-person, objective perspective is more appropriate. We have revised the entire manuscript, particularly Section 3 (Methodology), to remove first-person narration and adopt a more formal style.
Comments 10: LINES 259-261 – Repeated information about the content of Table 1.
Response 10: Thank you for pointing out this redundancy. We have removed the repetitive sentence to improve the manuscript's conciseness.
Comments 11: It is not explained why there is focus on Lesson 1 and Lesson 4 (and no mention of the other lessons) – see FIG 1, FIG 2, and further description.
Response 11: Thank you for pointing out the missing justification. We agree that this explanation is necessary. We have now revised the text (lines 301-302) to explicitly state our rationale. The added sentence reads:
"To illustrate how SAF patterns are aligned with different lecture structures, the distributions for Lesson 1 and Lesson 4 are presented as contrasting examples."
We wish to express our deep gratitude once more. Your sharp insights and detailed suggestions have been transformative for this manuscript. We truly believe that, thanks to your input, the paper is now substantially stronger and makes a much clearer contribution.
Round 2
Reviewer 3 Report
Comments and Suggestions for Authors
Thank you for your work. Please find suggestions below:
Clarity and Scope: The research question and comparison are not clearly scoped. Is the non-semantic approach being compared against benchmarks? What benchmarks are available?
Impact and Novelty: Not apparent.
Summary Accuracy: Contextual factors may play a key role, and the relevance of the method under important contextual factors (e.g., a medical case study) is not well understood.
The study is lacking a clear research question, comparative variables, an understanding of what is common to date, a comparative study, and a methodological framework that stays consistent throughout the article.
Author Response
Thank you for your additional feedback. In the previous round, we made substantial revisions based on your comments, including restructuring our research questions and adding a new section on practical implications.
To address your new comments as clearly as possible, we will outline our core research design below.
Comment 1: Clarity and Scope: The research question and comparison are not clearly scoped. Is the non-semantic approach being compared against benchmarks? What benchmarks are available?
Response 1: Thank you. Our study aims to answer one central question:
How effectively can we predict Segment Access Frequency in lecture archives using only non-semantic features?
This question is specifically about a common but challenging real-world scenario. Our approach is defined by the practical constraints of a typical university setting. Such environments often lack specialized hardware (such as eye-trackers), have a small number of student viewers for any given course, and, critically, lecture recordings frequently suffer from poor audio quality.
For instance, in our specific case, the audio was captured by a single ceiling-mounted microphone. This common setup resulted in distant, noisy sound that is unsuitable for reliable transcription, making subsequent semantic analysis impossible. These combined realities are precisely why our research focuses on a non-semantic approach. A comparison with methods that rely on semantic analysis is therefore not feasible for our dataset, as our goal was to create a method that works exactly when those other methods cannot.
Comment 2: Impact and Novelty: Not apparent.
Response 2: Thank you. Our work is designed for institutions that lack what we call "resource-intensive systems." By this, we specifically mean systems that require:
- Studio-quality audio for accurate semantic analysis.
- Specialized hardware like eye-trackers.
- Large, manually-labeled datasets.
- Powerful computers for training large AI models.
The novelty of our work is providing a lightweight and practical solution for the many institutions that do not have these resources. We detail the practical value of this approach in the "Practical Implications and Potential Applications" section of our paper.
Comment 3: The study is lacking a clear research question, comparative variables, an understanding of what is common to date, a comparative study, and a methodological framework that stays consistent throughout the article.
Response 3: Thank you. We believe our study follows a clear and consistent framework. Here is its logical structure:
First, our research is guided by these questions:
Main RQ: How effectively can we predict SAF using only non-semantic features?
Sub-RQ1: Which non-semantic modality (action, voice, or slide) is most important?
Sub-RQ2: What is the best combination of fusion strategy and model architecture for this task?
Second, to answer these questions, our experiments (Section 4) directly compared several key variables:
- To answer Sub-RQ1, we compared different combinations of features (our Ablation Study, Table 4).
- To answer Sub-RQ2, we compared different Fusion Strategies (Table 3) and different Model Architectures (Table 5).
We hope this shows the clear and direct link between our research questions, the variables we tested, and our experimental framework.
We thank you again for your time and feedback.