1. Introduction
Requirements Engineering (RE) represents a critical phase in the software development lifecycle, as it systematically captures, analyzes, and validates user needs to guide the specification of system functionalities. Its role extends beyond technical translation, functioning as a strategic process that ensures alignment between developed software and the evolving expectations of users and business stakeholders [1,2]. The precision in eliciting and prioritizing requirements directly influences the relevance, quality, and usability of the final product [3,4]. Empirical studies underscore that effective requirements engineering contributes to optimizing development resources, reducing project uncertainty, and enhancing stakeholder satisfaction [1], thereby establishing RE as a key determinant of project success in contemporary software practice. In light of its foundational significance, recent research has increasingly explored user-generated content as a rich and scalable source for identifying real-world requirements. Mobile app stores have emerged as a dynamic repository of user needs, offering a continuous stream of changing expectations that are reflected in user reviews. These reviews, rich in insights, reflect real-world user experiences, feature requests, and pain points, offering a direct line to understanding what users truly seek in their mobile applications [3,5].
With the advancement of Artificial Intelligence (AI), the shift from traditional Machine Learning (ML) to sophisticated Natural Language Processing (NLP) has significantly improved the extraction of actionable insights from user reviews [1,6,7]. Early ML techniques enabled basic sentiment detection and keyword-based classification, laying the groundwork for more complex analyses. With the advent of NLP, particularly deep-learning-based models such as BERT, it became feasible to perform nuanced classification of user feedback by intent, type, and domain, thereby capturing the contextual and semantic richness embedded in natural language. However, the emergence of Large Language Models (LLMs) such as the Generative Pre-Trained Transformer (GPT) has further expanded the creative and generative potential of these systems, enabling more coherent synthesis of feedback, automated summarization, and interactive dialog generation [6]. Despite these advancements, one critical challenge remains largely underexplored: the discovery of latent dependencies among user-stated requirements. While current models excel at extracting and classifying individual requirements, they often fail to identify how these requirements influence or constrain each other, a gap that hampers holistic understanding and optimal decision making in software evolution [1]. The current emphasis on isolated requirement extraction neglects the intricate web of relationships that exist between different user needs, overlooking how one requirement might influence or be influenced by others [1,7]. Manual identification of these dependencies is not only a resource-intensive endeavor, demanding significant time and effort from analysts, but it is also susceptible to the biases and subjective interpretations of those involved. A sophisticated natural language processing technique is necessary to capture and predict these dependencies effectively from dynamically evolving user reviews. Recent work has demonstrated that transformer-based models, particularly BERT, have the potential to accurately interpret nuanced linguistic patterns. As a binary classifier, BERT reliably distinguishes dependent from independent requirement pairs, harnessing deep contextual embeddings that uncover subtle semantic signals often overlooked by traditional classifiers [4,8,9,10].
To address these challenges, this study introduces an innovative framework that integrates knowledge graphs with multiple AI-driven components—most notably large language models and classifiers based on Bidirectional Encoder Representations from Transformers (BERT). Central to the framework’s architecture is an LLM-driven active learning loop, which facilitates iterative refinement of the classification model through cycles of prediction, uncertainty-based instance selection, expert annotation, and retraining [11,12]. This adaptive mechanism ensures that the system remains responsive to the evolving landscape of user needs as expressed in app store feedback. In this way, the proposed framework enhances the accuracy, scalability, and adaptability of requirement dependency extraction in dynamic software environments. In addition, the BERT classifier integrates contextual and structural reasoning to enhance prediction capabilities. Contextual reasoning leverages semantic dimensions—including requirement intent, functional or non-functional type, and domain-specific relevance—factors that are essential for accurately distinguishing subtle variations in user needs and ensuring that predictions align with the contextual meaning of user feedback [13]. Structural reasoning plays a vital role by leveraging graph-based measures of relatedness between requirements, allowing the model to uncover implicit hierarchical relationships and directional dependencies that are often overlooked in purely semantic analyses, thereby supporting consistency and accuracy in predicted dependency relations [14,15,16]. However, the accuracy and adaptability of BERT depend substantially on the availability of labeled training data and the ability to respond promptly to continuously evolving user-generated requirements.
In the realm of RE, ontologies provide cognitive and structural mappings of domain-specific knowledge that demonstrate utility across a spectrum of tasks: they facilitate requirements elicitation by providing a structured vocabulary and semantic context for understanding stakeholder needs; they enable more precise and unambiguous specification of requirements through the formal definition of concepts and relationships; and they support conflict management by identifying inconsistencies and overlaps between requirements [17,18,19]. Dependency analysis frameworks leverage ontologies to represent various dependency types between requirements and to facilitate conflict detection throughout the software lifecycle. Furthermore, a knowledge graph represents the structural relationships between requirements, providing a valuable resource for identifying dependencies that may not be immediately apparent from the text of reviews [14].
The proposed framework integrates large language models in three key scenarios, creating a robust and coherent AI-driven approach. First, LLMs facilitate the initial extraction and generation of structured requirement statements from unstructured, user-generated reviews, significantly reducing the manual effort required and enabling more consistent representation of evolving user needs [3]. Second, within an active learning loop, LLMs replace human annotators: the requirement pairs about which the classifier is most uncertain are identified through model-based uncertainty estimates and then labeled by the LLM for retraining the BERT classifier. With this automated active-learning process, the classifier can be continuously refined while requiring minimal human intervention, thereby enhancing its responsiveness to evolving requirements and lowering annotation costs [20,21]. Third, LLMs play a pivotal role in ontology construction by automatically proposing new domain concepts, relations, and structural refinements, allowing the ontology to remain aligned with emerging software features and user expectations [22,23]. Our study investigates two research questions. RQ1: To what extent can knowledge integration—through contextual and structural reasoning—enhance the prediction performance of requirement dependencies extracted from unstructured user feedback? RQ2: To what extent can iterative refinement through dynamic knowledge feedback improve the learning efficiency of the dependency prediction process?
In conclusion, our research contributes a comprehensive and scalable solution to the challenge of automating requirement dependency analysis from dynamic user-generated content. By integrating contextual reasoning—capturing the intent, type, and domain of each requirement—with ontology-driven structural reasoning, the proposed framework offers a holistic understanding of inter-requirement relationships. The incorporation of a large-language-model-based active learning loop ensures that both the classifier and the ontology evolve in alignment with shifting user needs, enabling continuous refinement of prediction accuracy. This dual-layer reasoning not only improves the interpretability and reliability of dependency detection but also enhances the coherence of release planning decisions. Moreover, the framework’s fully automated pipeline overcomes the scalability constraints of manual analysis, enabling development teams to harness the vast and growing volume of app-store feedback effectively. By explicitly modeling inter-requirement dependencies, the framework provides developers with richer decision-making inputs, contributing to the construction of more coherent and optimized release roadmaps for user-centered software development practices.
2. Literature Review
The increasing complexity of software systems and the dynamic nature of user expectations necessitate robust methods for managing requirements throughout the software development lifecycle [7]. A critical aspect of effective requirement management involves accurately identifying and understanding the dependencies that exist between various requirements, particularly those elicited directly from customer feedback [24]. Such understanding is crucial for effective decision making in release planning, resource allocation, and ensuring system coherence [24]. However, capturing these dependencies remains inherently challenging due to the informal and often ambiguous nature of customer input, which can obscure subtle interconnections between desired functionalities or constraints [25]. The proliferation of natural language processing techniques offers promising avenues for automated extraction and analysis of such dependencies from unstructured textual feedback [26]. However, the inherent ambiguities and nuances within natural language often pose significant hurdles for fully automated dependency identification, underscoring the need for advanced approaches that can capture more nuanced semantic relationships [27]. The complexity of this challenge is further exacerbated by the volume and velocity of feedback in agile development environments, highlighting the need for scalable and precise methodologies [1,3,7,28,29].
The following section evaluates the current state of automated dependency extraction from natural language, focusing on the application of AI and ontological methods to provide a comprehensive understanding of their effectiveness and limitations. Recent research has focused on automating the identification of dependencies between software requirements using natural language processing techniques. In this context, the authors of [4] proposed a two-stage approach using NLP and weakly supervised learning to extract binary dependencies and their types. In addition, the authors of [30] developed a method for identifying dependencies in evolutionary requirements, extracting relations such as PART OF, AND, OR, and XOR. Another study presented two content-based recommendation approaches, one using document classification techniques and another based on Latent Semantic Analysis; the evaluation showed that Random Forest with probabilistic features achieved the best prediction quality (F1 = 0.89) [31]. Similarly, the authors of [32] implemented an approach to identify requirement dependencies using supervised classification techniques, with Random Forest classifiers achieving an F1-score of ~82%. These studies demonstrate the potential of NLP and machine learning techniques in automating the complex task of identifying requirement dependencies. A series of studies by the same authors has made notable contributions to advancing the extraction of requirement dependencies. In their first work, the authors of [15] proposed a method utilizing part-of-speech features and an enhanced stacking ensemble learning model to improve extraction performance. Building upon this foundation, their subsequent study [11] developed an integrated active learning approach to optimize sample selection for dependency extraction. In addition, the authors of [24] addressed value-related dependencies by mining user preferences and incorporating them into the release planning process. To improve requirement analysis, the authors of [18] introduced an ontology-driven framework for domain-specific requirements, incorporating formal scenario descriptions and dependency graphs. These approaches aim to overcome challenges in traditional requirement engineering methods, such as manual analysis inefficiencies, heterogeneous model interoperability, and reusability across domains. By leveraging machine learning, semantic techniques, and user preferences, these studies demonstrate significant improvements in requirement dependency extraction accuracy and analysis capabilities.
The inherent difficulty in detecting requirement dependencies stems from their natural language documentation, which often leads to incomplete or incorrect identification, thereby reducing release quality and necessitating substantial rework [14]. Moreover, the subjective interpretation of natural language requirements by different stakeholders can lead to inconsistencies in identifying dependencies, further complicating the development process. Such manual analysis is time-consuming and effort-intensive, underscoring the critical need for automated solutions to enhance efficiency and accuracy in requirement management [4,29,30,33,34]. To address these challenges, advanced AI techniques, specifically machine learning and natural language processing, have been increasingly employed to automate the detection of dependencies, offering a more scalable and precise alternative to manual methods [11,14]. Natural language processing has shown promise in identifying semantic dependencies between requirements; however, much of the existing research remains theoretical or focuses on pairwise analyses, overlooking the dynamic interplay of multiple requirements with varying degrees of dependency [4]. A significant challenge lies in the lack of robust evaluation methods for these approaches, often relying on simple case studies rather than comprehensive comparative assessments [26].
Despite these advancements, a significant gap persists in the extensive integration of ontological knowledge with AI methods to enrich the semantic understanding of dependencies, moving beyond mere syntactic or lexical associations to capture deeper conceptual relationships [4]. This integration is vital for mitigating the inherent verbosity, ambiguity, and inconsistency often found in natural language requirements, which traditionally complicate the selection of appropriate system architectures and the evaluation of alternative designs [30]. Further, while NLP techniques are critical for processing natural language requirements, their effectiveness can be significantly bolstered by integrating ontological frameworks that provide a structured and formal representation of domain knowledge, enabling a deeper, semantically aware analysis of dependencies [19,35]. This structured approach can help to formalize the often-informal knowledge embedded within natural language requirements, thereby facilitating more precise and automated dependency detection [14,18]. Furthermore, the integration of domain-specific ontologies can help resolve ambiguities and inconsistencies present in natural language, providing a standardized framework for interpreting and classifying various types of dependencies and thereby enhancing the precision of automated tools [18].
Our study investigates how artificial intelligence, particularly natural language processing, can be synergistically combined with ontological reasoning to overcome these limitations, thus providing a robust framework for automated dependency detection in software requirements. Natural language processing, a subfield of artificial intelligence, enables computers to comprehend, interpret, and generate human language, which is crucial for transforming raw textual data into structured, analyzable information [2,3,28]. Moreover, while recent advances in LLMs and BERT-based architectures have improved the extraction of requirement-related information [3,6,20,36,37], there is no unified framework that integrates semantic features (such as intent, domain, and requirement type) with structural reasoning powered by ontological knowledge to detect requirement dependencies from mobile app reviews. This gap motivates our proposed ontology-enhanced AI framework, which combines semantic analysis with structural dependency modeling to uncover the hidden architecture of user needs directly from natural language feedback.
In parallel, recent studies have explored strategies to reduce distributional bias in large language models. For example, the authors of [38] proposed an imitation-learning-based approach (DaD-DAgger) to mitigate exposure bias in LLM distillation, showing that combining teacher outputs (soft labels) with ground-truth data (hard labels) yields substantial improvements in predictive accuracy and generative quality. Their work underscores the importance of addressing discrepancies between training and deployment conditions. Building on this line of research, our framework targets a different but related challenge: dependency bias in requirement engineering. Unlike exposure bias in autoregressive text generation, dependency bias emerges when labeled data do not fully capture the semantic and structural relations underlying user requirements. By integrating ontology-driven structural features with LLM-driven active learning, we aim to enhance generalizability and robustness in requirement dependency detection.
3. Methodology
The methodology adopted in this study is grounded in our previously proposed framework [39], which outlines a set of structured criteria for calculating requirement priorities—one of which involves detecting inter-requirement dependencies. While the prior framework offered a conceptual basis for prioritization that includes inter-requirement dependency as one of its key components, this aspect has not yet been empirically validated. The proposed methodology introduces an ontology-enhanced AI framework specifically designed to predict such dependencies directly from customer feedback. It leverages an LLM-driven active learning loop to iteratively train and refine a BERT classifier, ensuring adaptive and accurate detection of relational links between features or requirements. As part of this extension, the framework integrates contextual understanding with structural reasoning to derive meaningful relationships from unstructured user-generated content [3], reflecting a refined methodological direction tailored to the goals of this investigation. The robust architecture of the framework allows continuous improvement and adaptation, making it highly suitable for dynamic environments where customer needs and product features evolve rapidly. Its methodological backbone relies on several interconnected stages to achieve accurate dependency prediction and effective visualization.
As illustrated in Figure 1, the research workflow is composed of stages that systematically process raw customer feedback and convert it into structured insights. These insights support the identification and classification of requirements and enable the prediction of complex interdependencies among them. The following subsections elaborate on each stage, offering a detailed account of the data processing pipeline and the advanced techniques integrated into the framework.
3.1. Data Collection and Preparation
This foundational step is critical for ensuring the high quality and relevance of the input data, directly impacting the accuracy and efficacy of subsequent analytical stages, particularly within a deep learning context where model performance is susceptible to input data characteristics. Gathering large quantities of user-generated content from diverse sources is a key part of this foundational step, which ensures data quality and coverage by accounting for volume, velocity, and variety, thereby reducing biases in user feedback.
This study selected the mobile banking domain based on several compelling factors. First, the financial technology (FinTech) sector in Saudi Arabia has witnessed rapid growth and widespread digital adoption in recent years, leading to a surge in mobile banking usage. Consequently, user reviews in this domain are both voluminous and rich in requirement-related feedback. Second, banking applications often serve a broad spectrum of user needs—ranging from transactional services to security and user experience—which results in highly diverse and detailed feedback. This diversity enables the extraction of both functional and non-functional requirements, making the dataset particularly suitable for semantic and structural analysis of dependencies. Third, as banking applications are mission-critical systems, users tend to express more precise and task-oriented feedback, often reporting concrete issues, feature gaps, or usability concerns. As a result, user reviews in this context tend to be more issue-focused, feature-specific, and valuable for guiding requirement engineering decisions. By focusing on this context, our dataset offers high ecological validity and ensures the practical relevance of the proposed framework to real-world and user-centered applications.
The dataset employed in this study comprises 8816 user reviews collected over a seven-year period (2018–2025), providing a temporally diverse snapshot of user feedback that reflects evolving expectations and system maturity. The reviews were drawn from 12 widely used mobile banking applications in Saudi Arabia, with data sourced from both the Apple App Store and the Google Play Store. In the initial phase, the AppBot platform was utilized to extract historical reviews, primarily from the Apple App Store; however, to improve recency and ensure coverage across platforms, this was complemented by manual retrieval of more recent reviews from both marketplaces. All reviews were authored in English, consistent with the bilingual (Arabic–English) interfaces commonly offered by Saudi banking applications and the frequent tendency of users to provide feedback in English. On average, the reviews contained 10.25 words, with the longest extending to 158 words and the shortest comprising a single word. In terms of lexical characteristics, the average word length was 4.37 characters, reflecting the generally concise and direct style of user feedback.
The primary objective of this study was to maximize the extraction of user requirements irrespective of the application source. Accordingly, the dataset preserves its natural distribution, which mirrors real-world market dynamics, whereby leading applications, such as those of Al Rajhi Bank and Saudi National Bank, generate the largest share of reviews. Retaining this distribution ensured that the dataset captured the full breadth of user needs, with review frequency itself serving as an indicator of both application popularity and user engagement. Consequently, balancing across applications was not enforced, as the study’s focus was on requirement diversity rather than cross-application comparability.
The collected user reviews underwent a comprehensive preprocessing pipeline to ensure compatibility with downstream language understanding models. This pipeline was implemented using Python (3.12.11) libraries, including NumPy and Pandas for data handling, scikit-learn for preprocessing utilities, and the HuggingFace BERT tokenizer for subword-level tokenization. As part of the normalization process, all text was converted into lowercase, punctuation and special characters were removed, and sentence segmentation was applied. Filtering steps excluded duplicate reviews, non-informative entries (e.g., trivial comments such as “very good”, “Worst bank ever!!”, or “Too nice”), and reviews shorter than three words. These procedures reduced the dataset from 8816 to 6487 reviews, thereby enhancing its quality and representativeness. Beyond this preprocessing stage, the refined dataset was prepared for a strict LLM-based prompt filtering phase, in which large language models are guided to validate and retain only content that is semantically relevant to requirements. The detailed methodology of requirement extraction is described in the subsequent section.
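For illustration, a minimal sketch of the normalization and filtering steps is shown below, assuming the reviews are held in a Pandas DataFrame with a hypothetical `review` column; the actual pipeline configuration may differ.

```python
import re
import pandas as pd

def preprocess_reviews(df: pd.DataFrame, text_col: str = "review") -> pd.DataFrame:
    """Normalize and filter raw app-store reviews (illustrative sketch only)."""
    out = df.copy()
    # Normalization: lowercase, strip punctuation/special characters, collapse whitespace.
    out[text_col] = (
        out[text_col]
        .astype(str)
        .str.lower()
        .apply(lambda t: re.sub(r"[^a-z0-9\s]", " ", t))
        .apply(lambda t: re.sub(r"\s+", " ", t).strip())
    )
    # Filtering: drop duplicate reviews and reviews shorter than three words.
    out = out.drop_duplicates(subset=[text_col])
    out = out[out[text_col].str.split().str.len() >= 3]
    return out.reset_index(drop=True)

# Toy usage: the duplicate is removed and the two-word review is filtered out.
reviews = pd.DataFrame({"review": [
    "Very good",
    "Transfers fail after the last update!!",
    "Transfers fail after the last update!!",
]})
print(preprocess_reviews(reviews))
```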
3.2. Requirement Extraction and Classification
The extraction of requirements from user-generated content has become a complex task in requirements engineering, particularly in the context of mobile app store reviews. Traditional manual elicitation techniques are often inadequate for processing such unstructured and large-scale feedback. Recent advances in natural language processing and deep learning have enabled the automation of this task through the use of LLMs, which are capable of capturing complex semantics and generating coherent textual outputs. The adoption of LLMs in requirement extraction marks a shift from simple keyword-based or classification techniques to models capable of understanding context and generating requirements in a format consistent with engineering standards.
This study adopts a generative modeling paradigm for requirement extraction, using LLMs to synthesize requirement statements from raw user reviews. Recent empirical evidence informs this direction. For instance, the authors of [3] demonstrated the effectiveness of a generative model (GEMMA-7B) in producing requirements from app store feedback with high performance (accuracy = 92.00%, F1 = 91.39%). Similarly, recent studies have demonstrated that users often perceive requirements generated by ChatGPT-3.5 or GPT-4 as transparent, self-contained, and consistent. For instance, the authors of [40] evaluated ChatGPT’s ability to elicit software requirements and found that its outputs were frequently rated by experts as understandable and logically structured, albeit with occasional gaps in traceability and completeness. Moreover, Hymel and Johnson [20] conducted a comparative study between GPT-4 and human analysts, reporting that LLM-generated requirements achieved higher alignment scores and comparable completeness while being generated at a fraction of the time and cost. Guided by these empirical insights, our framework integrates a generative LLM into the requirement extraction pipeline to automatically generate both functional and non-functional requirements in a clear and consistent form. To reach the final prompt design, we conducted a series of iterative refinements on a representative subset of the dataset, testing alternative formulations to balance coverage and precision. The final version was selected, as it consistently yielded high-quality extractions aligned with diverse requirement types. To ensure consistent model behavior, a tailored prompt was designed to clearly define the extraction targets for each user review. It includes sentence patterns for positive features and different types of requirements—such as bugs, improvements, and new feature requests—and provides an example of the expected output. The prompt ensures that the responses are structured as a JSON array and filters out unrelated or unclear feedback. Figure 2 presents the complete prompt.
Upon successful extraction, the identified user-stated requirements undergo a multi-label classification process to categorize them based on predefined typologies; such feedback often embeds a mix of functional requirements, non-functional expectations, defect reports, and feature enhancement suggestions, making manual categorization labor-intensive and error-prone. To address this challenge, our study adopts a BERT-based deep learning approach to automatically classify requirements along three key dimensions: requirement type (functional vs. non-functional), requirement intent (defect fix, service request, feature improvement), and requirement domain (e.g., performance, security, availability). This approach is grounded in recent empirical studies that confirm BERT’s ability to capture the contextual and semantic nuances required for fine-grained requirement classification. The integration of BERT-based models for requirement classification has demonstrated superior performance in categorizing both functional and non-functional requirements, often outperforming traditional methods through fine-tuning and transfer learning techniques [1,36]. Another study [3] demonstrated that BERT significantly outperforms traditional classifiers in identifying requirement intent—such as bug reports and feature requests—from app reviews, achieving over 92% accuracy. Their findings confirm BERT’s effectiveness in capturing the implicit purpose behind user feedback, making it ideal for intent-based requirement classification. In addition, the authors of [41] demonstrated that transformer-based models significantly outperform traditional NLP methods in classifying software requirements, particularly in accurately distinguishing non-functional domains such as performance, security, and usability, making them well suited for fine-grained domain categorization in requirements engineering. This classification leverages the contextual embeddings generated by the BERT model, enabling a robust understanding of the nature and scope of each requirement within the broader product ecosystem. This systematic categorization is vital for organizing the vast array of extracted requirements into manageable and analytically useful groups, facilitating a more granular analysis of their interrelationships and implications for product development. This classification then informs the subsequent stage of dependency prediction.
So, for each user review u, a set of candidate requirements 𝓡u is extracted using a generative model; each extracted requirement ri is then enriched through three independent BERT-based classifiers, each fine-tuned to predict one of the following critical attributes: the intent (e.g., feature request or bug report), the requirement type (functional or non-functional), and the domain (e.g., security, performance). Each classifier employs a dense output layer with softmax activation to produce categorical probability distributions, from which the most likely label is selected. Equation (1) denotes the extraction function that identifies candidate requirements ri from the review. For each requirement ri, the attribute Intent(ri) represents its purpose (e.g., feature request, bug report), Type(ri) classifies it as functional or non-functional, and Domain(ri) assigns it to a relevant domain, such as performance, security, or availability. The set 𝓡u thus captures all enriched requirements associated with review u, each annotated with its semantic and structural properties.
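A compact way to write this extraction and enrichment step is sketched below; this is one plausible reading of Equation (1), using the names Extract, Intent, Type, and Domain introduced above, and the exact published formulation may differ.

```latex
\mathcal{R}_u \;=\; \mathrm{Extract}(u) \;=\; \{\, r_1, r_2, \ldots, r_{n_u} \,\}, \qquad
r_i \;\longmapsto\; \big\langle\, r_i,\ \mathrm{Intent}(r_i),\ \mathrm{Type}(r_i),\ \mathrm{Domain}(r_i) \,\big\rangle
```

Here each attribute is taken as the arg max over the softmax output of its dedicated BERT classifier.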
The dataset comprises 2797 extracted requirements, distributed across three major categories: Bug Reports (1620), Enhancement Requests (992), and New Feature Requests (185). As shown in Figure 3, these requirements are further classified into functional and non-functional dimensions, with noticeable variation across categories (e.g., 89% of bug reports are functional, while 53% of enhancement requests are non-functional). This natural imbalance reflects real-world user feedback patterns, where error reporting and incremental improvements dominate over novel feature requests. Importantly, this skew does not bias the dependency detection task, since the classification is ultimately binary (dependency vs. no dependency), and the integration of contextual features with ontology-based structural reasoning ensures robustness against uneven category sizes.
3.3. Requirement Pairing
Following the classification, requirements are systematically paired to identify potential interdependencies, leveraging sophisticated algorithms that consider both their semantic content and classified attributes to infer logical connections. This pairing process is critical for mapping out the intricate web of relationships between diverse requirements, laying the groundwork for a comprehensive understanding of how changes or implementations in one area might impact others within the product ecosystem. To exhaustively model potential dependencies among requirements, our study adopts a pairwise comparison strategy, where each requirement is evaluated in relation to all others. This method aligns with contemporary practices in requirement dependency extraction, where ML models assess relational features between every possible pair. For instance, the authors of [15] implemented a stacked ensemble (P-Stacking) that processes POS, TF-IDF, and Word2Vec features from requirement pairs, demonstrating significant improvements in F1-scores for dependency detection and validating the effectiveness of comprehensive pairwise feature modeling.
To explore dependencies between user-stated requirements, all pairwise combinations of requirements extracted from each user review are constructed. Let 𝓡u be the enriched set of requirements from review u; the pairwise set is defined as 𝒫u = {(ri, rj) | ri, rj ∈ 𝓡u, i ≠ j} (2). Formulation (2) enables the systematic evaluation of directional dependencies among co-occurring requirements within the same contextual source, forming the basis for subsequent dependency prediction. If the number of extracted requirements in a review u is nu = |𝓡u|, then the number of generated pairs is nu × (nu − 1).
This systematic pairing forms the basis for constructing a dependency matrix, thereby enhancing the overall robustness of the system architecture. The accuracy of this process directly influences the reliability of subsequent dependency prediction and visualization, enabling a proactive approach to managing complex interdependencies in product development. Aggregating across the dataset yields millions of requirement pairs, which is logically consistent with the objective of modeling directional dependencies, as each requirement can potentially influence or be influenced by others. Although the raw pair count is large, active learning and ontology-guided filtering are employed to narrow the focus to the most meaningful candidate pairs, thereby ensuring both computational feasibility and semantic relevance.
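As a minimal sketch, the ordered pairs described above can be generated with `itertools.permutations`, which yields the nu × (nu − 1) directional candidates per review; the grouping of requirements by review ID is an assumption made here for illustration.

```python
from itertools import permutations

def build_pairs(requirements_per_review):
    """Yield ordered (source, target) requirement pairs within each review."""
    for review_id, reqs in requirements_per_review.items():
        # permutations(reqs, 2) enumerates all ordered pairs without self-pairs,
        # i.e., n_u * (n_u - 1) candidates per review.
        for r_i, r_j in permutations(reqs, 2):
            yield review_id, r_i, r_j

# Example: one review with three extracted requirements -> 3 * 2 = 6 directional pairs.
sample = {"rev_001": ["enable fingerprint login", "fix app crash on login", "add dark mode"]}
print(sum(1 for _ in build_pairs(sample)))  # 6
```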
3.4. Model Training and Validation
The initial phase of training began with a small, manually labeled subset representing 10% of the user feedback data. Annotation was initially carried out using Amazon Mechanical Turk (MTurk), where multiple annotators provided independent labels for requirement pairs based on logical and functional dependencies. To ensure consistency and reliability, disagreements were first resolved through majority voting, while any remaining conflicts were adjudicated by two domain experts, who are also the authors of this study. With complementary expertise in software requirements engineering, natural language processing, and mobile banking systems, the authors ensured that the resulting consensus yielded a high-confidence reference set. This reference set was subsequently used to fine-tune a pre-trained BERT model and establish the baseline performance for the framework.
The training process is grounded in a hybrid multi-input architecture that combines language understanding with structural and contextual reasoning to enhance the prediction of inter-requirement dependencies. The model ingests three types of input features: first, contextualized textual embeddings are extracted from BERT by concatenating each requirement pair (ri, rj) into a single input sequence. These embeddings capture the semantic relationships, linguistic nuances, and discourse-level dependencies between the textual descriptions of the requirements. Second, a set of structured features is introduced to incorporate domain-informed signals derived from the underlying ontology and graph-based reasoning. Third, contextual features are utilized to simultaneously capture semantic intent and quantify the functional alignment and stylistic similarity between requirement pairs. This triadic feature design enables the model to reason jointly over linguistic content, structural proximity, and semantic consistency. In the following sections, more details about structural and contextual reasoning are provided.
Structural Reasoning: To enable structured reasoning about mobile application requirements, a domain-specific ontology was developed to formalize the hierarchical, semantic, and relational structure of the services and features.
Figure 4 presents the proposed ontology architecture, which was initially manually designed to establish a semantic foundation linking user reviews to software services, features, and their interrelationships. The ontology introduces a review class that stores the feedback content, rating, and timestamp, and it connects each review to one or more requirements. To enrich these requirements with contextual meaning, the model uses a review annotation class that captures the user’s intent and sentiment. Each requirement explicitly refers to a service offered by an app, which includes metadata such as its name and version. The ontology categorizes services into functional or non-functional types and associates each service with relevant features. The model arranges features hierarchically through structural relations such as hasFeature, isSubFeatureOf, and isPartOf, and it defines semantic links between them—such as dependsOn, requiresRelation, conflictsWith, complements, and triggersRelation—to capture dynamic and logical dependencies. This manual design enables the ontology to capture compositional relationships between features, allowing for a structured and interpretable representation of requirement interactions that support reasoning, prediction, and visualization tasks.
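For illustration, a few of the classes and relations named above could be encoded as follows; the namespace, identifiers, and instance assertions are hypothetical and serve only to sketch the ontology structure.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF

EX = Namespace("http://example.org/app-requirements#")  # illustrative namespace
g = Graph()
g.bind("ex", EX)

# Core classes from the ontology description.
for cls in ("Review", "ReviewAnnotation", "Requirement", "App", "Service", "Feature"):
    g.add((EX[cls], RDF.type, OWL.Class))

# Structural relations (hierarchy/composition) between services and features.
for prop in ("hasFeature", "isSubFeatureOf", "isPartOf"):
    g.add((EX[prop], RDF.type, OWL.ObjectProperty))

# Semantic relations capturing dynamic and logical dependencies between features.
for prop in ("dependsOn", "requiresRelation", "conflictsWith", "complements", "triggersRelation"):
    g.add((EX[prop], RDF.type, OWL.ObjectProperty))

# Hypothetical instance-level assertions.
g.add((EX.points_transfer, RDF.type, EX.Feature))
g.add((EX.loyalty_points_services, RDF.type, EX.Service))
g.add((EX.points_transfer, EX.isPartOf, EX.loyalty_points_services))

print(g.serialize(format="turtle"))
```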
Structural reasoning in the proposed framework is operationalized by combining two ontology-based similarity metrics that reflect both hierarchical positioning and topological proximity between requirements. These metrics—depth-based similarity and shortest-path similarity—jointly model the structural characteristics embedded in a graph-encoded domain ontology. To estimate whether a source requirement ri depends on a target rj, the model first locates the shortest semantic path between their corresponding nodes in the ontology. If a valid path exists, the model calculates the dependency strength using a normalized score derived from the inverse of the shortest-path length. This formulation aligns with established approaches in network analysis that utilize the inverse shortest-path length to infer the relational strength between nodes. Recent studies have demonstrated the effectiveness of such metrics in diverse applications, including similarity-based link prediction and structural proximity estimation in complex networks [42,43].
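One common instantiation of the inverse shortest-path idea is 1/(1 + d), where d is the number of hops between the two ontology nodes; the exact normalized score used in the framework may differ. A minimal sketch under that assumption, with a toy ontology graph, is shown below.

```python
import networkx as nx

# Toy ontology graph: edges stand for structural/semantic links (e.g., isPartOf, dependsOn).
onto = nx.Graph()
onto.add_edges_from([
    ("points_transfer", "loyalty_points_services"),
    ("points_redemption", "loyalty_points_services"),
    ("manage_beneficiaries", "transfers_services"),
])

def path_dependency_strength(graph: nx.Graph, src: str, dst: str) -> float:
    """Assumed inverse shortest-path score in (0, 1]; 0.0 if no path exists."""
    try:
        d = nx.shortest_path_length(graph, source=src, target=dst)
    except (nx.NetworkXNoPath, nx.NodeNotFound):
        return 0.0
    return 1.0 / (1.0 + d)

# Two hops via the shared service node -> 1 / (1 + 2) ≈ 0.33.
print(path_dependency_strength(onto, "points_transfer", "points_redemption"))
```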
As part of the structural reasoning component, an ontology-based method is employed to estimate the hierarchical relatedness between pairs of requirements. Each requirement is semantically aligned to a corresponding concept within a domain ontology that encodes structural relationships such as ‘isPartOf’ and ‘isSubFeatureOf’. To quantify structural proximity, the depth of each requirement’s ancestors is computed using a traversal function, which iteratively explores parent nodes and assigns a depth score based on their distance from the original requirement node. Given two requirements ri and rj, their respective ancestor sets Anc(ri) and Anc(rj) are extracted. The degree of structural similarity is then determined by identifying the shared ancestors A = Anc(ri) ∩ Anc(rj) and computing their Lowest Common Ancestor (LCA). The resulting depth-based similarity score favors requirements that share close common ancestors, capturing latent structural dependencies within the ontology; such a depth-based metric supports finer-grained inference beyond surface-level textual similarity. The approach aligns with recent ontology-driven methods in requirements engineering, where structured reasoning has proven effective in enhancing traceability and dependency modeling [44].
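One plausible Wu–Palmer-style instantiation consistent with this description (shared ancestors and their lowest common ancestor) is the following; the exact formulation used in the framework may differ.

```latex
\mathrm{sim}_{\mathrm{depth}}(r_i, r_j) \;=\;
\frac{2 \cdot \mathrm{depth}\!\big(\mathrm{LCA}(r_i, r_j)\big)}
     {\mathrm{depth}(r_i) + \mathrm{depth}(r_j)},
\qquad \mathrm{LCA}(r_i, r_j) \in \mathrm{Anc}(r_i) \cap \mathrm{Anc}(r_j)
```

Under this form, pairs whose lowest common ancestor lies deep in the hierarchy (i.e., close to both concepts) receive scores near 1, while pairs that only share the ontology root score near 0.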
Contextual Reasoning: Traditional methods for identifying requirement dependencies often rely on shallow lexical or syntactic similarity, which limits their ability to capture deeper semantic relationships or user-driven intent. To overcome this limitation, the proposed approach adopts a structured reasoning process that considers whether a source requirement ri depends on a target rj based on their contextual alignment. This alignment reflects not only their functional classification and domain-specific category (e.g., performance, usability) but also the nature of their annotated intent, such as issue reporting, enhancement, or feature addition. Rather than treating each requirement in isolation, the approach emphasizes how distinct yet causally related intents—particularly when originating from the same domain—may indicate a stronger likelihood of interdependence. For instance, a complaint about a system limitation followed by a request for enhancement may represent a natural progression of refinement within the same functional area. Empirical studies have shown that such intent signals embedded in user feedback often reflect underlying causal and hierarchical relationships that traditional similarity-based models fail to capture [3]. Moreover, prior work on causal requirement reasoning has shown that nearly one-third of requirement artifacts contain explicit or implicit causal markers, which, when systematically modeled, can enhance traceability and prioritization [45]. By incorporating these patterns into a principled evaluation process, the model enables a more nuanced and context-aware estimation of whether ri depends on rj. This is particularly valuable in large-scale user feedback scenarios, where subtle intent cues are essential for distinguishing between co-supportive, conflicting, or independent features. As such, this methodology offers a scalable and cognitively informed foundation for automated requirement analysis and release planning.
During training, the dataset was initially split into 80% training and 20% validation, with the validation set kept fixed across all active learning iterations to ensure comparability. We utilized the pre-trained BERT-base (uncased) model together with its default WordPiece tokenizer, as implemented in HuggingFace. The model was fine-tuned directly on the constructed dataset of requirement pairs, without additional domain adaptation, as prior studies have shown that fine-tuning BERT-base on task-specific data is sufficient to capture the contextual and semantic nuances of user feedback. The contextual embeddings generated by BERT were fused with the transformed numerical features and passed to a joint classification layer. Model optimization was performed using the AdamW optimizer (learning rate = 2 × 10⁻⁵, weight decay = 0.03), with gradient clipping and warm-up scheduling to ensure stable convergence. To mitigate class imbalance, we applied a hybrid strategy—at the data level, majority-class downsampling (Label = 0) was performed when composing augmented training batches, and at the loss level, either Binary Cross-Entropy (BCE) with class weights or Focal Loss was employed, depending on the experiment. Validation was conducted at each epoch, with multiple threshold values being evaluated and the one yielding the highest F1-score being selected.
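A condensed sketch of the hybrid multi-input design described above is given below; the layer sizes, feature dimensionality, and fusion strategy are illustrative assumptions rather than the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class HybridDependencyClassifier(nn.Module):
    """BERT text encoding fused with structural/contextual features (illustrative sketch)."""
    def __init__(self, num_extra_features: int = 8):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.head = nn.Sequential(
            nn.Linear(self.bert.config.hidden_size + num_extra_features, 128),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(128, 1),  # single logit for the binary dependency decision
        )

    def forward(self, input_ids, attention_mask, extra_features):
        # [CLS] embedding of the concatenated requirement pair (ri, rj).
        cls = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        fused = torch.cat([cls, extra_features], dim=-1)
        return self.head(fused).squeeze(-1)

model = HybridDependencyClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.03)
loss_fn = nn.BCEWithLogitsLoss()  # class weights or a focal loss can be substituted here
```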
All experiments were executed in a GPU-enabled environment (NVIDIA Tesla T4, 16 GB VRAM, and 25 GB RAM). On average, fine-tuning a single training loop (3–5 epochs, batch size: 16–32) required approximately 20–30 min, while completing a full active learning cycle with GPT-4-assisted annotation required an additional 1–2 h depending on the batch size. The framework is modular and scalable; ontology-based structural reasoning is computationally lightweight, BERT fine-tuning benefits from GPU acceleration, and LLM annotation was managed in controlled batches to ensure feasibility. This integrated architecture, which combines structural and contextual signals, enables the model to generalize effectively across varied requirement scenarios and supports more accurate and interpretable dependency prediction.
3.5. Model Improvement
To enhance the predictive performance of the proposed framework, an LLM-driven active learning loop was implemented to iteratively refine the fine-tuned BERT classifier while minimizing manual annotation effort. Rather than depending exclusively on human labels, the framework leveraged a large language model, GPT-4, to assist in classifying instances that the model was most uncertain about. After each iteration, the classifier produced probability scores for unlabeled data, and uncertainty was quantified using binary entropy, defined as H(p) = −p log₂(p) − (1 − p) log₂(1 − p), where p denotes the predicted probability of the positive class; the entropy reaches its maximum when p = 0.5, indicating that the model is least certain and is effectively guessing. Accordingly, unlabeled instances with probabilities in the range [0.40–0.60]—representing the uncertainty zone—were considered the most ambiguous and were forwarded to GPT-4 for automated labeling. The annotation batch size was adjusted adaptively across cycles; larger batches (≈400 pairs) were used in the initial iterations to accelerate learning, while smaller batches (≈200 pairs) were employed in later cycles to improve efficiency and focus on the most uncertain samples. To ensure model robustness and avoid introducing class imbalance during refinement, a class-balancing technique was applied to the newly labeled samples prior to their integration into the training data. These balanced annotations were then merged with the existing labeled set, and the model was retrained, completing a full refinement cycle. This iterative process allowed the framework to progressively improve classification performance by focusing annotation efforts only on the most informative samples, enabling efficient learning while maintaining minimal human supervision.
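A short sketch of the entropy computation and the [0.40–0.60] uncertainty-zone selection is shown below, assuming `probs` holds the classifier's positive-class probabilities for the unlabeled pairs.

```python
import numpy as np

def binary_entropy(p: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """H(p) = -p*log2(p) - (1-p)*log2(1-p); maximal (1 bit) at p = 0.5."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

probs = np.array([0.05, 0.42, 0.58, 0.91, 0.50])   # illustrative classifier outputs
uncertain_idx = np.where((probs >= 0.40) & (probs <= 0.60))[0]
print(uncertain_idx, binary_entropy(probs[uncertain_idx]))
# Indices 1, 2, and 4 fall in the uncertainty zone and would be forwarded to GPT-4 for labeling.
```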
In parallel, the domain ontology—which serves as the foundation for structural reasoning—was incrementally updated using LLM-generated prompts that were reviewed and validated by domain experts.
The ontology construction prompt (see Figure 5) was not arbitrarily designed but was the result of several iterative refinements conducted on a representative subset of the dataset to ensure both accuracy and coverage. Through these experiments, we arrived at a prompt structure capable of consistently extracting candidate concepts from user requirements, including features, services, and potential semantic or structural relations. While the prompt ensured consistency in the format of extracted suggestions, the outputs were not directly integrated into the ontology. Instead, they were treated as candidate updates and subjected to systematic expert validation to ensure semantic correctness and banking-domain relevance. To further illustrate this process, Figure 6 presents a concrete example of the human-in-the-loop validation step. For the requirement “Implement a feature to transfer points between family and friends”, the LLM generated multiple candidate relations (e.g., isPartOf: loyalty_points_services, complements: points_redemption, dependsOn: points_for_card_spending, requiresRelation: manage_beneficiaries, triggersRelation: sms_email_alerts). The reviewer partially approved these suggestions, retaining only those consistent with the ontology structure and discarding irrelevant ones. This interactive step highlights how automated discovery by LLMs is balanced with expert oversight to reduce potential noise or bias.
The framework establishes a self-improving loop by integrating LLM-driven active learning, refined prompt design, and expert-guided ontology validation. This hybrid and incremental strategy ensures semantic coherence, accuracy, and domain relevance while continuously enriching the ontology. As a result, the framework significantly enhances its ability to capture and reason about complex interdependencies among user-review requirements.
To evaluate model refinement across active learning iterations, our study measured performance improvement using standard classification metrics, including the F1-score, accuracy, and recall. After each training loop, these metrics were recalculated on a fixed validation set to capture the incremental gains attributable to the newly added LLM-labeled instances and ontology updates. To assess the effectiveness of knowledge refinement over successive active learning loops, the ratio of performance improvement (ΔF1-score) to the number of LLM-based interventions and ontology updates was monitored. This ratio served as a practical indicator of knowledge-driven learning efficiency, capturing how much predictive gain was achieved per unit of refinement effort. While this metric is not formally established in prior work, its underlying principle aligns with the efficiency-oriented evaluations adopted in recent active learning studies, particularly in low-resource and annotation-constrained settings [12,21]. By tracking this ratio across loops, the analysis provided insights not only into the extent of model improvement but also into how efficiently that improvement was achieved, highlighting the role of intelligent data selection and dynamic ontology enrichment in minimizing annotation cost while maximizing model performance.
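In symbols, one plausible form of the monitored ratio, following the description above, is

```latex
\mathrm{Efficiency}_t \;=\; \frac{\Delta F1_t}{N^{\mathrm{LLM}}_t + N^{\mathrm{onto}}_t}
\;=\; \frac{F1_t - F1_{t-1}}{N^{\mathrm{LLM}}_t + N^{\mathrm{onto}}_t}
```

where ΔF1_t is the F1-score gain obtained in loop t, and N^LLM_t and N^onto_t denote the numbers of LLM-labeled instances and ontology updates applied in that loop; the exact operationalization in the experiments may differ.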
4. Results
This section presents the experimental results in alignment with the research questions, focusing first on the effectiveness of knowledge integration on model performance (RQ1), followed by an analysis of the learning efficiency during the iterative refinement process (RQ2). To support this evaluation, a series of comparative experiments were conducted across three progressively enhanced model configurations. The first serves as a baseline, employing a standard BERT-based classifier trained without any contextual or structural augmentation. The second configuration integrates domain-specific ontology and contextual signals but excludes the active learning component, representing a static knowledge-enhanced setup. The final configuration comprises the whole framework, incorporating an LLM-driven active learning loop that iteratively refines the classifier through uncertainty-guided feedback and dynamic knowledge enrichment. This comparative design enables a systematic evaluation of the individual and combined contributions of contextual reasoning, structural domain knowledge, and adaptive learning dynamics to both predictive performance and knowledge-driven efficiency.
4.1. Impact of Knowledge Integration on Dependency Prediction
The initial evaluation of the baseline model—comprising a standard BERT-based binary classifier trained exclusively on raw textual pairs—reveals important limitations in its predictive behavior. Despite leveraging pre-trained contextual embeddings, the model achieved an F1-score of 0.766, with high recall (0.8738) but substantially lower precision (0.6818). This imbalance indicates that while the model is effective at capturing a large proportion of true dependencies, it also produces a high rate of false positives, reflecting an overgeneralized decision boundary. The elevated validation loss (0.8738) and AUC (0.8738) further support this interpretation, suggesting that the model struggles to clearly distinguish between dependent and non-dependent requirement pairs in the absence of explicit structural or semantic signals.
To investigate the effect of incorporating domain knowledge, a knowledge-enriched configuration was implemented in the first loop, where context-awareness and structural relation cues derived from a domain-specific ontology were integrated into the training process alongside the fine-tuning of BERT. This modification led to a substantial improvement in predictive performance. The model’s F1-score increased to 0.8641, and more importantly, the precision and recall became fully balanced (both at 0.8641). This transition reflects a significant enhancement in the model’s ability to avoid both false positives and false negatives, resulting in more calibrated and reliable predictions. Furthermore, the validation loss dropped dramatically to 0.0298, indicating that the addition of structured knowledge not only improved output quality but also facilitated faster and more stable convergence during training. The AUC rose to 0.9141, confirming that the model had developed stronger discriminative capabilities across a range of decision thresholds.
These findings highlight the value of incorporating contextual and ontological knowledge into the predictive process, even in the absence of iterative refinement. The model configuration in the first loop, enhanced with structural relations and semantic reasoning, exhibits a markedly improved ability to interpret nuanced relationships between requirement pairs. Compared with the baseline, it more effectively captures latent dependencies that would otherwise be overlooked or misclassified. This outcome affirms the central principle of the proposed framework—that accurate and efficient dependency prediction from unstructured user reviews depends not only on linguistic patterns but also on the integration of functional relationships and intent-aware understanding.
4.2. Efficiency Gains Through Active Learning and Iteration
Beyond the initial accuracy improvements obtained through static knowledge integration, the proposed framework demonstrates substantial gains in learning efficiency through iterative refinement and active learning. As shown in Table 1, the transition from Loop#2 to Loop#4 reveals consistent improvements in both performance metrics and training dynamics. In Loop#2, the model achieved an F1-score of 0.9091. While these results marked a clear advancement over the static configuration, the efficiency of this gain—when evaluated in terms of the ratio of performance improvement (ΔF1-score) to the number of LLM-based interventions and ontology updates—remained moderate.
Finding 1: Knowledge integration through contextual awareness and structural relations led to a clear improvement in dependency prediction performance, raising the F1-score from 0.766 to 0.8641 and achieving balanced precision and recall, even without iterative refinement.
Importantly, Loop#4 demonstrated not only the highest F1-score (0.9565) and the lowest validation loss (0.0128) but also the most efficient use of refinement effort, with a significantly smaller number of interventions. This loop yielded the largest single-step performance gain (ΔF1 = 0.0327) relative to the amount of refinement effort invested—indicating an improvement in knowledge-driven learning efficiency of over eightfold compared with Loop#3. These findings suggest that, as the model matures, it becomes increasingly effective at selecting the most informative instances for annotation and knowledge injection, thereby maximizing learning gains while minimizing intervention cost.
Figure 7 collectively illustrates the complementary roles of the training volume and knowledge-guided refinement in enhancing model performance. In the left plot, the learning curve depicts the impact of increasing training data volume on model performance. As the relative training set size increases from 10% to 80%, the F1-score steadily improves. Notably, the model surpasses an F1-score of 0.91 at 70% data utilization and approaches 0.935 at 80%, reflecting strong learning capacity with moderate data volumes. The right plot shows the progression of performance across successive refinement loops within the proposed framework. Starting from a baseline model trained only on textual input, each loop incrementally incorporates contextual and structural knowledge, resulting in consistent performance gains. Notably, Loop#1 (static integration) achieves a significant jump over the baseline, while subsequent loops (Loop#2 to Loop#4) demonstrate further improvement through selective, knowledge-efficient updates. Together, these visualizations provide evidence for both the scalability and adaptability of the proposed framework in handling dependency prediction with minimal supervision.
The results depicted in Figure 8 illustrate the progression of model performance and refinement efficiency across active learning loops. While the F1-score increases consistently from Loop#2 to Loop#4, the relative performance gain per unit of annotation and knowledge update effort becomes most pronounced in the final loop. At the same time, the learning curve indicates that performance improvements eventually plateau, reflecting stable generalization rather than overfitting. This trend confirms that the model not only becomes more accurate but also more selective in identifying the most informative instances for annotation and ontology enrichment. Overall, the analysis affirms the effectiveness of integrating domain knowledge and adaptive selection mechanisms in enhancing learning efficiency during iterative refinement.
The final model’s classification performance demonstrates both precision and consistency, as evidenced by the normalized confusion matrix applied to the validation set (see Figure 9). The model correctly identified 96.12% of truly dependent requirement pairs (true positives) while misclassifying only 3.88% as non-dependent (false negatives). Similarly, 94.51% of actual non-dependent pairs were accurately predicted, with a 5.49% false positive rate. This distribution reflects a high degree of sensitivity and specificity, indicating that the model is capable of capturing subtle dependency patterns while avoiding over-generalization. Moreover, the relatively low error rates on both classes validate the model’s balanced decision boundaries and reinforce the strong F1 performance reported in previous evaluations. This level of reliability is particularly critical in requirement dependency prediction, where false positives may introduce unnecessary constraints between unrelated features, while false negatives risk overlooking essential dependencies. Both errors can undermine the integrity of release planning and compromise the overall coherence of software evolution.
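A row-normalized confusion matrix of this kind, together with the derived sensitivity and specificity, can be computed as in the following minimal sketch; the toy label arrays stand in for the actual validation predictions.

```python
# Minimal sketch: row-normalized confusion matrix plus sensitivity/specificity.
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels only (1 = dependent pair, 0 = non-dependent pair).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])

cm = confusion_matrix(y_true, y_pred, normalize="true")  # each row sums to 1
tn_rate, fp_rate = cm[0]
fn_rate, tp_rate = cm[1]
print(f"sensitivity (TP rate): {tp_rate:.2%}, specificity (TN rate): {tn_rate:.2%}")
print(f"false-negative rate: {fn_rate:.2%}, false-positive rate: {fp_rate:.2%}")
```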
This robust performance further justified terminating the iterative refinement process at Loop#4, as the model exhibited clear signs of convergence in both predictive accuracy and refinement efficiency. Although each loop introduced measurable gains, the improvement from Loop#3 to Loop#4 was achieved with significantly fewer interventions than earlier cycles. Notably, Loop#4 attained the highest F1-score (0.9565) and an optimal balance between precision and recall, indicating the model’s maturity in capturing complex dependencies. Additionally, the ratio of performance gain to refinement effort showed a marked increase, suggesting that the model had become increasingly selective and effective in utilizing new annotations and ontology updates. Further refinement beyond this stage would likely incur higher costs with minimal added value, making Loop#4 a methodologically sound and practically efficient endpoint for the active learning cycle.
Finding 2. The LLM-driven active learning approach substantially improved the framework’s learning efficiency through iterative refinement. In the fourth refinement cycle, the model achieved its highest F1-score (0.9565) while requiring notably fewer annotations and update efforts. This outcome highlights the effectiveness of targeted, knowledge-guided feedback in maximizing predictive performance with minimal supervision, demonstrating that the model becomes increasingly selective and impactful in leveraging new information across refinement cycles.
To illustrate the full capacity of the proposed framework, a 15 × 15 dependency structure matrix was generated, as seen in Figure 10, synthesizing insights derived from contextual reasoning, ontological structure, and LLM-driven iterative refinement. This matrix captures the pairwise dependency scores among 15 user-generated requirements, each processed through a pipeline that is not only context-aware but also structurally informed. The framework’s contextual awareness allows it to interpret nuanced signals from natural language feedback, such as intent, function, and domain cues, enabling the model to recognize implicit dependencies that extend beyond surface-level lexical similarity. For example, strong associations such as (R2–R13), (R7–R10), and (R8–R4) reveal clusters of requirements that are semantically or functionally aligned, suggesting that they contribute to a shared user objective or functional goal. These high-scoring links reflect not just textual correlation but also latent causal and structural logic that emerges from understanding user intent and domain semantics. Meanwhile, weak associations such as (R2–R11) or (R9–R10) point to modular, independent features that can be addressed in isolation, facilitating more efficient resource allocation and lower development risk.
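Conceptually, such a matrix can be assembled by querying the trained classifier for every requirement pair, as in the minimal sketch below; predict_dependency is a hypothetical placeholder for the framework’s pairwise scoring function and is not part of the actual implementation.

```python
# Minimal sketch: assembling a 15 x 15 dependency structure matrix from
# pairwise dependency scores.
import numpy as np

def predict_dependency(req_i: str, req_j: str) -> float:
    """Hypothetical stand-in for the trained classifier's dependency score."""
    return 0.5  # placeholder value for illustration only

requirements = [f"R{i}" for i in range(1, 16)]  # R1 .. R15
n = len(requirements)
dsm = np.zeros((n, n))

for i in range(n):
    for j in range(i + 1, n):
        score = predict_dependency(requirements[i], requirements[j])
        dsm[i, j] = dsm[j, i] = score  # symmetric pairwise dependency score
```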
Importantly, this matrix is more than a static visualization—it embodies the dynamic nature of the proposed model. Through its iterative learning process, driven by selective refinement and feedback loops, the model continuously adapts to new data and evolving user expectations. As users’ needs shift over time, the system remains responsive; newly emerging requirements, domain concepts, or semantic patterns can be integrated without reengineering the entire framework. This adaptability ensures that the dependency analysis remains current and actionable, even in rapidly evolving application domains. Furthermore, the architecture is inherently scalable; while the visualization here is limited to 15 requirements for clarity, the model is capable of processing substantially larger sets, maintaining both efficiency and interpretability.
By translating fragmented user input into structured dependency knowledge—grounded in both context and structure—the matrix offers a practical decision-support artifact. It aids stakeholders in identifying influential requirements, resolving conflicts, prioritizing release bundles, and planning modular deployments. Ultimately, this demonstration reinforces the framework’s core contribution: providing an intelligent, scalable, and continuously adaptive mechanism to support real-world software development and evolution through precise, context-aware dependency understanding.
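The following sketch illustrates, under simplified assumptions, how such a matrix could be turned into decision-support signals by ranking requirements according to their aggregate coupling and flagging weakly coupled ones for independent scheduling. The random symmetric matrix and the 0.3 cut-off are illustrative assumptions, not values prescribed by the framework.

```python
# Minimal sketch: using a dependency structure matrix as a decision-support artifact.
import numpy as np

rng = np.random.default_rng(0)
n = 15
requirements = [f"R{i}" for i in range(1, n + 1)]
upper = np.triu(rng.random((n, n)), 1)
dsm = upper + upper.T  # symmetric matrix of illustrative pairwise scores

influence = dsm.sum(axis=1)  # aggregate coupling strength per requirement
ranked = sorted(zip(requirements, influence), key=lambda t: t[1], reverse=True)
print("most influential requirements:", [r for r, _ in ranked[:3]])

# Requirements whose strongest dependency stays below an arbitrary threshold
weakly_coupled = [r for r, m in zip(requirements, dsm.max(axis=1)) if m < 0.3]
print("candidates for independent scheduling:", weakly_coupled)
```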
5. Discussion
This study explores unstructured user reviews as a central data source for predicting requirement dependencies—an area that remains underexplored in the existing literature, which predominantly focuses on structured requirement documents or stakeholder-defined inputs. In contrast, app reviews reflect the lived experiences and evolving expectations of users, capturing functional demands, usability concerns, and enhancement suggestions naturally and dynamically. To the best of our knowledge, this is the first study to address the challenge of dependency prediction in this context by integrating contextual semantics, structural ontology reasoning, and iterative refinement through large language models. This shift—from developer-curated requirements to raw, user-generated feedback—introduces a unique set of complexities, including ambiguous phrasing, inconsistent terminology, and rapidly changing user needs. While traditional models often perform well in controlled settings, they typically lack the architectural flexibility needed to interpret informal, evolving inputs. To address this gap, the proposed framework employs a dual-layered reasoning strategy: contextual understanding is enhanced through the analysis of requirement intent and semantic similarity, while structural insight is derived from ontology-based metrics that capture hierarchical and functional relationships. These signals are further refined through an LLM-driven active learning loop, in which the model iteratively identifies uncertain predictions, receives automated labels from the LLM, and retrains accordingly. This feedback-driven pipeline enables the model not only to make accurate predictions but also to adapt over time—continuously aligning its internal representations with the emerging concerns, expectations, and priorities expressed by users. This dynamic adaptability represents a meaningful advancement over prior approaches and reinforces the framework’s relevance to real-world, user-centered software development.
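To make the dual-layered idea concrete, the sketch below assembles a pair-level feature vector from a contextual similarity signal (approximated here by TF-IDF cosine similarity as a lightweight stand-in for deep contextual embeddings) and hypothetical structural ontology features. The specific feature values are assumptions for illustration, not outputs of the actual framework.

```python
# Minimal sketch: combining a contextual signal with structural ontology
# features into one pair-level feature vector for the dependency classifier.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

req_a = "the transfer fails whenever I try to send money abroad"
req_b = "please add an option to track international transfers"

# Contextual layer: lexical stand-in for the deep semantic similarity signal.
vec = TfidfVectorizer().fit([req_a, req_b])
contextual_sim = cosine_similarity(vec.transform([req_a]), vec.transform([req_b]))[0, 0]

# Structural layer: hypothetical ontology-derived metrics
# (path similarity, ancestor depth, directional alignment); values illustrative.
structural_features = np.array([0.5, 2.0, 1.0])

pair_features = np.concatenate([[contextual_sim], structural_features])
print(pair_features)  # feature vector passed to the dependency classifier
```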
The experimental findings demonstrate a consistent and significant improvement in dependency prediction performance as a result of the proposed framework’s integration of contextual and structural reasoning mechanisms. Notably, the iterative learning strategy supported by LLM feedback loops resulted in a gradual yet robust enhancement of the model’s predictive capabilities across multiple evaluation metrics, including the F1-score, AUC, and precision. These results provide empirical support for the claim that leveraging both semantic context and ontological structure can yield more accurate and interpretable representations of requirement dependencies—particularly those derived from unstructured, natural language app reviews. The substantial performance gains observed across successive refinement loops further highlight the effectiveness of dynamic knowledge infusion in reinforcing the classifier’s ability to discern complex inter-requirement relationships within noisy, user-generated content. Collectively, these outcomes underscore the potential of the proposed hybrid AI-driven approach in addressing the intricacies inherent in extracting and analyzing requirement dependencies from informal, real-world feedback sources.
The results for the first research question highlight the central role that contextual and structural reasoning played in enhancing the accuracy and robustness of dependency prediction. By incorporating a deep semantic understanding of requirement intent, functional categories, and domain relevance, the model was able to interpret nuanced distinctions between user needs that are often obscured by the variability and ambiguity of natural language. Such integration proved particularly effective in uncovering hidden relationships between user statements that, on the surface, appear unrelated. For example, the model correctly identified a dependency between the complaint “the transfer fails whenever I try to send money abroad” and the enhancement request “please add an option to track international transfers.” While one reflects a transactional error and the other a feature request, both point to a shared concern regarding the reliability and visibility of international transfer functions. Even in more subtle cases, such as linking the grievance “I get a notification that my card was blocked after failed ATM withdrawal, but there was no explanation why” to the suggestion “please allow users to view real-time status updates for their cards in the app,” the framework demonstrated an ability to infer contextual dependencies rooted in user frustration and the desire for greater transparency. These findings underscore the framework’s capacity to move beyond superficial lexical similarity and capture deeper semantic relationships, a limitation that constrains many conventional approaches. This depth of reasoning enables more accurate and meaningful modeling of user needs, ultimately supporting more informed and responsive software planning.
In addition, structural reasoning further reinforced prediction quality by incorporating ontological features that quantify the relational distance between requirements. Metrics such as path similarity, ancestor depth, and directional alignment helped the model capture latent hierarchies and implicit dependencies, even in cases where surface-level textual similarity was weak or misleading. Together, these reasoning layers provided a more comprehensive representation of inter-requirement relationships, leading to the observed improvements across all evaluation criteria. The complementary integration of contextual and structural signals thus emerged as a key driver of performance gains in this domain.
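The structural metrics named above can be approximated over a concept hierarchy as in the following sketch. The small ontology, its node names, and the specific formulas are illustrative assumptions rather than the framework’s exact definitions.

```python
# Minimal sketch: ontology-based structural metrics (path similarity,
# ancestor depth, directional alignment) over a hypothetical concept hierarchy.
import networkx as nx

onto = nx.DiGraph()  # edges point from parent concept to child concept
onto.add_edges_from([
    ("banking", "payments"), ("banking", "account_security"),
    ("payments", "international_transfer"), ("payments", "transfer_tracking"),
    ("account_security", "card_status"),
])

def structural_metrics(g, root, a, b):
    # Path similarity: inverse of the undirected distance between concepts.
    path_len = nx.shortest_path_length(g.to_undirected(), a, b)
    path_similarity = 1.0 / (1.0 + path_len)
    # Ancestor depth: depth of the lowest common ancestor below the root.
    lca = nx.lowest_common_ancestor(g, a, b)
    ancestor_depth = nx.shortest_path_length(g, root, lca)
    # Directional alignment: 1 if one concept subsumes the other, else 0.
    aligned = int(nx.has_path(g, a, b) or nx.has_path(g, b, a))
    return path_similarity, ancestor_depth, aligned

print(structural_metrics(onto, "banking", "international_transfer", "transfer_tracking"))
```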
The performance of our framework was further enhanced by incorporating iterative refinement strategies, where uncertain predictions identified by the model were selectively reviewed and improved through LLM-assisted feedback. Rather than relying on static supervision or exhaustive manual labeling, the proposed framework employed an LLM within an active learning strategy. The uncertain requirement pairs identified by the classifier were processed within LLM-driven active learning loops, where the large language model functioned as a labeler. For instance, the model initially struggled to link the complaint “Sometimes it just logs me out without asking—especially when I switch to another app for a second” with the request “Please enable re-authentication when switching back to the app, especially for sensitive actions like transfers.” Although these statements lacked lexical similarity, their latent relationship emerged more clearly through repeated exposure and LLM-supported contextualization. Over time, the classifier learned to associate session interruption events with security enhancement requests, capturing a dependency that is subtle and semantically indirect and that requires a deeper contextual understanding beyond surface similarity. This mechanism replaced manual annotation by enabling the LLM to assign dependency labels to selected high-uncertainty pairs. This process enabled the efficient generation of additional training data without human intervention while maintaining semantic consistency and contextual awareness. The impact of this LLM-guided refinement was evident across successive training loops. A subsequent iteration using LLM-labeled data produced noticeable and consistent improvements in the F1-score, precision, and AUC. These gains underscore the effectiveness of the proposed active learning strategy, which combines uncertainty-based selection with automated, knowledge-infused labeling to progressively enhance the model’s generalization capability. Importantly, this iterative refinement process extended beyond the model’s parameter updates to include ongoing enhancements to the supporting ontology. As the system encountered novel expressions and emerging conceptual relations, the LLM contributed to structural updates, such as introducing new nodes, redefining hierarchies, or expanding relation types. This co-evolution of the model and ontology ensured that the framework’s semantic and structural reasoning capabilities remained aligned with the evolving language of user feedback.
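A minimal skeleton of such an uncertainty-driven refinement loop, with the LLM acting as the labeler, is sketched below. The stub llm_label function, the synthetic feature pools, and the query size of 20 pairs per loop are illustrative assumptions and do not correspond to the actual implementation or its hyperparameters.

```python
# Minimal sketch of LLM-assisted active learning: the classifier flags the
# pairs it is least certain about, an LLM (stubbed here) assigns their labels,
# and the model is retrained on the enlarged training set.
import numpy as np
from sklearn.linear_model import LogisticRegression

def llm_label(pair_features):
    """Placeholder for an LLM call returning 0/1 dependency labels."""
    return np.random.default_rng(0).integers(0, 2, size=len(pair_features))

rng = np.random.default_rng(1)
X_labeled, y_labeled = rng.random((100, 4)), rng.integers(0, 2, 100)
X_pool = rng.random((500, 4))  # unlabeled requirement-pair features

for loop in range(1, 5):  # Loop#1 .. Loop#4
    clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
    proba = clf.predict_proba(X_pool)[:, 1]
    uncertainty = 1.0 - np.abs(proba - 0.5) * 2   # highest near p = 0.5
    query = np.argsort(uncertainty)[-20:]         # most uncertain pairs
    new_y = llm_label(X_pool[query])              # LLM acts as the labeler
    X_labeled = np.vstack([X_labeled, X_pool[query]])
    y_labeled = np.concatenate([y_labeled, new_y])
    X_pool = np.delete(X_pool, query, axis=0)     # remove queried pairs from pool
```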
As shown in Table 2, the current findings not only align with prior research in requirement dependency extraction but also contribute to its advancement by addressing limitations in adaptability, supervision, and knowledge integration. Early efforts, such as those in [46], demonstrated the feasibility of detecting functional dependencies through semantic clustering of natural language artifacts, achieving F1-scores of up to 75% by leveraging Doc2Vec embeddings combined with density-based or fuzzy clustering techniques. These unsupervised methods offered scalable alternatives to manual analysis, particularly in contexts lacking formal specification structures. Building on this direction, the authors of [11] introduced an integrated active learning framework that combined PV-DM embeddings with an ensemble learning model and a multi-criteria sample selection strategy. This approach yielded measurable gains in the weighted F1-score (+2.71%) while reducing the annotation cost by 46%, illustrating the efficiency of iterative refinement guided by uncertainty and semantic variation. Notably, the authors of [14] advanced this integration further by combining ontology-based reasoning with active learning mechanisms, allowing the model to benefit from structured domain knowledge while dynamically refining its predictions through iterative annotation. This hybridization enabled more context-aware dependency detection without fully relying on static ontological rules. Similarly, the authors of [16] leveraged ontology and semantic similarity for clustering-based extraction, achieving high precision (up to 89.2%) but remaining limited in adaptability due to fixed structures. Other hybrid models, such as those of [15,31], incorporated TF-IDF, POS tagging, and ensemble classification to enhance accuracy (e.g., an F1-score of 89.7% in [31]), yet they lacked contextual awareness or learning adaptation. More recently, BERT-based models [9,45] achieved higher performance levels (F1-scores of up to 93%), but they rely on batch learning, and the absence of user-driven feedback adaptation constrains their flexibility.
The inherently dynamic and continuously evolving nature of user needs presents a significant challenge to traditional requirement analysis approaches, particularly those grounded in static, batch-based learning paradigms. While such models have demonstrated the capacity to capture complex semantic relationships—especially within large-scale systems—they often fall short in terms of adaptability. Their inability to incorporate real-time user feedback or assimilate evolving domain knowledge limits their effectiveness in dynamic environments. Moreover, dependency reasoning in these approaches tends to remain implicit, lacking explicit integration with structural cues or domain-specific ontologies. In contrast, our framework exhibits clear advantages in its adaptive capability, achieving superior performance (F1-score > 0.95) across successive refinement loops. More importantly, its design embraces user-centric adaptability—treating requirement prediction not as a one-time inference task but as a dynamic, learning-driven process shaped by ongoing user interaction. This perspective marks a paradigm shift in requirements engineering, underscoring the importance of scalable, context-aware models in supporting modern software evolution.