LLM-Driven Active Learning for Dependency Analysis of Mobile App Requirements Through Contextual Reasoning and Structural Relationships
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The manuscript presents a timely and innovative contribution to the field of requirements engineering by introducing an ontology-enhanced AI framework for predicting interdependencies among user-stated requirements in mobile banking applications. One of the paper’s greatest strengths lies in its integration of large language models, active learning, and domain-specific ontologies: a combination that enables both high adaptability and strong predictive performance. The approach is methodologically rigorous, demonstrating impressive validation results (F1-score > 0.95), and addresses a real-world need for automated, scalable analysis of complex and evolving user feedback. By embedding both contextual and structural reasoning, the framework sets a new benchmark for intelligent, user-centered requirement dependency analysis, providing valuable decision support for software practitioners in a dynamic development landscape.
Review Suggestions
Dataset Transparency, Basic Statistics
The manuscript currently lacks sufficient detail about exactly how the review database was constructed: which sources the data come from (web scraping, API access, manual collection), how many unique reviews were processed, and how well these represent different applications or time periods. Although it can be inferred from the text that the data come from banking apps and are likely in English or Arabic, these metadata should be stated explicitly. Basic statistics about the dataset (e.g., number of reviews, average length, distribution across apps) are essential for reproducibility.
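To make this request concrete, statistics of the kind below can be computed in a few lines; this is a minimal sketch assuming the reviews sit in a pandas DataFrame with hypothetical app, text, and date columns (none of these names come from the manuscript):

```python
import pandas as pd

# Hypothetical review table; columns and values are illustrative only.
reviews = pd.DataFrame({
    "app":  ["BankA", "BankA", "BankB"],
    "text": ["Login keeps failing", "Add fingerprint login", "Transfers are slow"],
    "date": pd.to_datetime(["2023-01-05", "2023-02-11", "2023-03-20"]),
})

print("unique reviews:", reviews["text"].nunique())
print("average length (tokens):", reviews["text"].str.split().str.len().mean())
print("reviews per app:")
print(reviews["app"].value_counts())
print("reviews per month:")
print(reviews["date"].dt.to_period("M").value_counts().sort_index())
```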
Demonstrating Diversity, Human Validation
While the claim that the dataset is diverse seems intuitively reasonable, Section 3.1 provides no concrete evidence, such as whether any human validation was conducted on a small sample to check for data heterogeneity and relevance. I suggest including either a supplementary diversity analysis (e.g., thematic, author, or temporal breakdown), or at a minimum, a brief qualitative validation.
Preprocessing and NLP Pipeline Details
The current description of the “preprocessing pipeline” is too general; for practical reproducibility and research transparency, it is necessary to detail which specific NLP tools, libraries, or language-specific settings were used (e.g., tokenization, special character handling). It is particularly important to specify how many reviews were filtered out as “non-informative,” “duplicate,” or “concise,” indicating how strict the cleaning process was. If the dataset is not public, these details are even more critical to publish.
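As a reference point, even a simple, documented filter of the following shape, with its removal counts reported, would address this concern; the sketch below uses an illustrative length threshold and exact-duplicate matching, which need not match the authors' actual pipeline:

```python
def filter_reviews(texts, min_tokens=5):
    """Drop duplicate and overly short reviews, reporting how many were removed."""
    seen, kept = set(), []
    dropped_dup = dropped_short = 0
    for text in texts:
        norm = " ".join(text.lower().split())   # normalize case and whitespace
        if norm in seen:
            dropped_dup += 1
        elif len(norm.split()) < min_tokens:
            dropped_short += 1
        else:
            seen.add(norm)
            kept.append(text)
    print(f"kept={len(kept)} duplicates={dropped_dup} too_short={dropped_short}")
    return kept
```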
Wording and Language Quality
There are some odd or misplaced sentences (“For instance, the study evaluated ChatGPT’s ability ...”), presumably left in by oversight. I strongly recommend a careful proofread before finalizing the manuscript to ensure such errors do not detract from the overall impression and international readability.
Explanation of BERT Classification
It is unclear how the BERT model is structured and what output layers are used for classification: is there a separate head for each classification dimension (Intent, Type, Domain), or is a custom multi-head/multi-label solution applied? What outputs does the model produce (softmax/sigmoid probability distributions or just labels)? These are critical implementation details that need to be clarified so the method is testable and replicable by other researchers.
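For concreteness, one common realization of what is being asked is a shared BERT encoder with a separate softmax head per dimension; the PyTorch sketch below (with hypothetical label-set sizes) illustrates the level of architectural detail the manuscript should pin down, without claiming this is the authors' design:

```python
import torch.nn as nn
from transformers import BertModel

class MultiHeadBertClassifier(nn.Module):
    """Shared BERT encoder with one classification head per dimension."""
    def __init__(self, n_intent=5, n_type=4, n_domain=6):  # label counts are illustrative
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.bert.config.hidden_size
        self.intent_head = nn.Linear(hidden, n_intent)
        self.type_head = nn.Linear(hidden, n_type)
        self.domain_head = nn.Linear(hidden, n_domain)

    def forward(self, input_ids, attention_mask):
        # The pooled [CLS] representation feeds all three heads.
        pooled = self.bert(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        return {
            "intent": self.intent_head(pooled),  # logits; softmax applied at inference
            "type": self.type_head(pooled),
            "domain": self.domain_head(pooled),
        }
```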
Model Choice and Fine-tuning Questions
Please clarify exactly which BERT variant (base, large, domain-adapted) and tokenizer were used for the classification task. Was domain adaptation (e.g., further training on a banking corpus) performed, or were standard, pre-trained models used exclusively?
Train/Validation Split, Class Balance
It is not stated how the training and validation data were divided, or how class imbalance issues were handled (e.g., upsampling/downsampling, weighted loss function).
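For reference, the standard remedies alluded to here are a stratified split and an inverse-frequency weighted loss; a minimal sketch on toy labels (not the authors' data or configuration):

```python
import numpy as np
import torch
from sklearn.model_selection import train_test_split

y = np.array([0] * 4 + [1] * 3 + [2] * 3)  # toy, imbalanced label vector
X = np.arange(len(y)).reshape(-1, 1)       # stand-in features

# Stratified split preserves class proportions in train and validation sets.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Inverse-frequency class weights for a weighted cross-entropy loss.
counts = np.bincount(y_tr, minlength=3)
weights = torch.tensor(len(y_tr) / (len(counts) * counts), dtype=torch.float)
loss_fn = torch.nn.CrossEntropyLoss(weight=weights)
```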
Learning Curve, Overfitting Evaluation
How sensitive was model performance to increasing dataset size? Was it visibly plateauing on the learning curve, or were there signs of overfitting? If no such figure was prepared, it would be worthwhile to add and interpret one.
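A figure of the following kind, validation score against training-set size, would answer this question directly; the sketch below uses synthetic data and scikit-learn's learning_curve purely for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic data standing in for the labeled requirement pairs.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="f1",
)

# A widening gap between the curves signals overfitting; a flat
# validation curve signals a plateau with respect to dataset size.
plt.plot(sizes, train_scores.mean(axis=1), label="train F1")
plt.plot(sizes, val_scores.mean(axis=1), label="validation F1")
plt.xlabel("training-set size")
plt.ylabel("F1")
plt.legend()
plt.show()
```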
Ontology: Concept and Relation List
Is the conceptual and relational structure of the ontology available in any form? If not, at least an abstract example of the main concept types and relations should be included in the manuscript.
Active Learning: Thresholds and Batch Size
In the active learning cycles, what BERT uncertainty score threshold was used to forward samples to the LLM (GPT-4)? What was the annotation batch size in a cycle? These parameters significantly affect the generalizability of the system.
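For clarity, the two parameters in question slot into a standard uncertainty-sampling loop roughly as follows; both values below are placeholders, precisely because the manuscript does not report them:

```python
import torch
import torch.nn.functional as F

UNCERTAINTY_THRESHOLD = 0.5  # placeholder: the value the authors should report
BATCH_SIZE = 50              # placeholder: annotation batch size per cycle

def select_for_llm_annotation(logits):
    """Route the most uncertain BERT predictions to the LLM annotator."""
    probs = F.softmax(logits, dim=-1)
    # Entropy-based uncertainty, normalized to [0, 1].
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    uncertainty = entropy / torch.log(torch.tensor(float(probs.shape[-1])))
    candidates = (uncertainty > UNCERTAINTY_THRESHOLD).nonzero(as_tuple=True)[0]
    # Forward the BATCH_SIZE most uncertain samples in this cycle.
    ranked = candidates[uncertainty[candidates].argsort(descending=True)]
    return ranked[:BATCH_SIZE]
```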
In summary: the manuscript presents strong conceptual novelty, but the details above (which are also essential for other researchers and practitioners) are indispensable for meaningful scientific usability and practical reusability. I recommend filling these gaps, possibly in a more detailed technical appendix or an online repository.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The paper presents a framework that combines large language models, active learning, and ontology-based reasoning for requirement dependency analysis from mobile app reviews. The work is interesting and potentially valuable for the community. However, before it can be considered for publication, several issues need to be addressed:
1) The paper is very dense, with long and sometimes redundant explanations (particularly in Sections 1–3). The narrative could be streamlined to improve readability and focus on the main contributions.
2) While the proposed combination of LLM-driven active learning and ontology-based reasoning is interesting, the novelty compared to prior studies (including those cited in the literature review) is not sufficiently highlighted. A clearer discussion of how this framework advances beyond existing methods is needed.
3) The experiments rely on banking app reviews only. This raises concerns about the generalizability of the findings. It would strengthen the work to test the framework in at least one additional domain.
4) The ontology construction process is only briefly described. More details are needed regarding the methodology, scope, and reproducibility. How much of it was manually designed versus automatically generated?
5) Given the focus on LLM distillation, exposure bias, and active learning, the following recent paper is directly relevant and should be cited and discussed in the related work section:
- A. Pozzi, A. Incremona, D. Tessera, D. Toti, “Mitigating exposure bias in large language model distillation: an imitation learning approach,” Neural Computing and Applications (2025), doi: 10.1007/s00521-025-11162-0.
6) The results are promising, but the comparison with baselines is limited. Stronger baselines (e.g., recent transformer-based dependency prediction models without active learning) should be included to contextualize the reported improvements.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
This paper introduces a framework that integrates Large Language Models (LLMs), active learning, ontologies, and contextual reasoning to automatically identify dependency relationships among requirements extracted from mobile app user reviews. The research addresses a problem of considerable theoretical and practical importance, employing a methodologically sound and advanced approach, supported by compelling experimental results and clear contributions. The manuscript is well-structured and clearly written. However, the following aspects require further elaboration or revision:
- The paper does not explicitly state the number of original reviews collected, the number retained after preprocessing, or the number of requirements ultimately extracted. It is recommended to include detailed dataset statistics (e.g., total counts, average review length, distribution of requirement types).
- Although the annotation process involved Amazon Mechanical Turk (MTurk) and three domain experts, the paper omits details on inter-annotator agreement (e.g., a kappa score; see the sketch after this list) and the methodology for resolving annotation conflicts.
- To more comprehensively evaluate the framework's performance advantages, it is advisable to include comparisons with additional baseline methods, such as traditional machine learning models, other BERT variants, or graph neural networks.
- While the prompt used for requirement extraction is provided, the rationale behind its specific structure and whether iterative prompt engineering experiments were conducted to optimize it remain unclear.
- The specific role of LLMs in updating the ontology (such as generating new concepts or proposing new relationships) should be described in greater detail.
- The potential for ontology and LLM updates to introduce noise or bias should be discussed, along with possible mitigation strategies to address these concerns.
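On the agreement point above, the metric in question is computable in two lines once per-annotator labels are available; a sketch with toy labels from two hypothetical annotators (Fleiss' kappa would generalize to all three experts):

```python
from sklearn.metrics import cohen_kappa_score

# Toy labels from two annotators over the same ten items (illustrative only).
annotator_a = ["dep", "dep", "none", "dep", "none", "none", "dep", "none", "dep", "dep"]
annotator_b = ["dep", "none", "none", "dep", "none", "dep", "dep", "none", "dep", "dep"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values above 0.6 are conventionally read as substantial
```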
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
The objective of this study is to develop a comprehensive and scalable solution to the challenge of automating requirement dependency analysis from dynamic user-generated content. By explicitly considering dependencies between requirements, the proposed framework gives developers a richer set of data for decision making and supports the creation of more consistent and reliable release maps for practical, user-centered software development.
The scientific novelty of the work lies in the authors' integration of contextual and structural reasoning using ontologies and LLMs.
The work has clear practical significance, since it focuses on mobile banking applications, an important segment of real-world industry.
The authors demonstrate high model accuracy and an innovative method.
A further strength of the work is its comparative analysis, in which the authors benchmark against existing methods and show the advantages of their model.
The article is distinguished by a clear, well-thought-out structure and a visual presentation of the results.
Overall, the article addresses a highly important topic and shows strong potential for publication.
Comments and Suggestions for the Authors:
- Please explain how the results of the study can be applied in other areas besides mobile banking.
- Please explain the validity of the data. The authors analyze feedback from banking applications in Saudi Arabia; could cultural and linguistic barriers prevent the method from being applied in other environments?
- Please explain whether there is a risk of error accumulation due to uncontrolled LLM generation.
- Please mention the ethical and practical risks of using LLMs in software development.
- Please clarify running time, resource requirements, and scalability.
These remarks are of a specific nature and in no way diminish the merits of the work. They are intended solely to make the presentation of the research results clearer and more accessible to readers.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
Thank you for addressing my previous comments. I have no further concerns, and I find the revised version suitable for publication.