Re-Distill: A Multi-Stage Retrieval Framework for Functional–Non-Functional Requirement Linking in Software Engineering

Almohammady, Ashwag; Alnanih, Reem; Alowidi, Nahed

doi:10.3390/app16115482


by Ashwag Almohammady ^1,2,*, Reem Alnanih ^1,3 and Nahed Alowidi

¹ Department of Computer Science, King Abdul Aziz University, Jeddah 21589, Saudi Arabia
² Applied College, Taibah University, Madinah 41477, Saudi Arabia
³ Software Engineering and Distributed System Research Group, King Abdulaziz University, Jeddah 21589, Saudi Arabia
* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(11), 5482; https://doi.org/10.3390/app16115482

Submission received: 29 April 2026 / Revised: 18 May 2026 / Accepted: 29 May 2026 / Published: 1 June 2026


Abstract

Non-functional requirements (NFRs) are critical for ensuring software quality, yet they remain difficult to identify due to their implicit and loosely defined relationship with functional requirements (FRs). Existing research predominantly focuses on NFR classification, leaving the more practical problem of linking FRs with their corresponding NFRs largely underexplored. To bridge this gap, this research introduces Re-Distill, a framework that treats FR–NFR association as a retrieval task. It adopts a curriculum-guided, data-centric distillation strategy to improve semantic representations and capture the interdependencies between FRs and NFRs. The framework combines general semantic adaptation, domain-specific specialization, and teacher-guided hard-negative mining in a contrastive learning setting. During inference, it integrates dense and lexical retrieval with cross-encoder reranking to produce ranked NFR candidates for unseen FR queries. Experiments on an expanded FR–NFR dataset show consistent improvements throughout all training stages. The resulting model achieves a Recall@10 of 70.79%, significantly outperforming the zero-shot baseline (42.36% Recall@10). These results highlight the effectiveness of retrieval-based approaches for functional–non-functional requirement linking, providing a practical and scalable way to undertake software requirement analysis.

Keywords: software requirement analysis; non-functional requirements; functional requirements; semantic retrieval; transformer models; knowledge distillation; information retrieval systems

1. Introduction

Requirements engineering (RE) involves eliciting, analyzing, documenting, validating, and managing system requirements. Its primary objective is to produce complete, clear, and relevant specifications that satisfy user needs. Poorly defined requirements can have serious consequences during software development. Previous studies have reported that faulty requirement specifications contribute to a large proportion of software project failures and account for many system-related issues [1,2].

One important outcome of RE is the Software Requirements Specification (SRS) document, which describes the intended functionality and expected system behavior. SRS documents generally include both functional requirements (FRs) and non-functional requirements (NFRs). FRs specify what the system should do, whereas NFRs describe quality-related characteristics such as performance, security, and usability. Identifying NFRs is often more difficult than defining FRs. Users frequently express such requirements ambiguously, and they are commonly embedded within FR statements [3,4,5,6]. Nevertheless, early identification of NFRs remains important because delayed recognition often increases development cost and effort.

Although SRS documents are expected to include both FRs and NFRs, many real-world specifications, particularly those from educational and early-stage projects, are dominated by FRs. As discussed in [7], quality-related requirements are often omitted or described only in vague terms. Users may easily define system functions but struggle to specify quality expectations. For example, statements such as “the system should be fast” are frequently used without measurable criteria [8]. This limitation motivates the need for automated approaches capable of identifying implicit NFR information rather than relying only on explicitly documented requirements [9,10].

Automated extraction and identification of NFRs from FRs remains challenging because the task often requires domain expertise and contextual understanding [11]. Traditional approaches, including keyword matching, rule-based heuristics, and classical machine learning methods, have shown limited success [12]. Such techniques often struggle to capture the relationships between functional and quality-related requirements. More recently, transformer-based models such as BERT have demonstrated strong capabilities for contextual language understanding [9,10]. However, most existing studies focus on classifying NFRs into predefined categories and assume that NFR statements are already available [9]. Comparatively less attention has been given to learning direct relationships between FRs and their corresponding NFRs through retrieval-based approaches.

The association between FRs and NFRs is closely related to software quality frameworks such as ISO/IEC 25010 [13], where quality attributes act as constraints on functional behavior. Requirements traceability further provides a conceptual basis for connecting functional specifications with quality concerns throughout software development [14,15]. From this perspective, FR–NFR linking can be formulated as a retrieval problem that identifies relevant quality requirements for a given functional specification.

This work introduces Re-Distill, a multi-stage retrieval framework for FR–NFR linking in software engineering. Unlike prior studies that formulate NFR analysis as a classification problem, Re-Distill models the task as semantic retrieval. The framework combines domain adaptation, teacher-guided hard-negative selection, and hybrid retrieval techniques to improve the identification of relevant NFRs for unseen FR queries.

Experiments were conducted using an expanded FR–NFR dataset under realistic requirement analysis settings. The results show that each training stage contributes to retrieval performance improvements, and Re-Distill consistently outperforms zero-shot retrieval baselines. These findings suggest that retrieval-based modeling can effectively capture relationships between FRs and NFRs. The main contributions of this paper are summarized as follows:

A large-scale FR–NFR dataset is constructed to support retrieval-based requirement analysis and facilitate the study of FR–NFR relationships beyond conventional classification settings.
Re-Distill is proposed as a curriculum-guided multi-stage retrieval framework for linking FRs and NFRs through progressive representation refinement.
A data-centric distillation strategy is introduced. A cross-encoder teacher supports curriculum-based hard-negative selection to improve training stability and retrieval performance.
An evaluation is conducted using ranking-based metrics, showing improved performance compared with zero-shot and baseline retrieval models.

The remainder of this paper is organized as follows. Section 2 reviews related work on NFR extraction and semantic retrieval. Section 3 describes the proposed methodology, including dataset construction, preprocessing, augmentation, and framework design. Section 4 presents the experimental results. Section 5 discusses the findings, and Section 6 concludes the paper and outlines future directions.

2. Related Work

This section reviews relevant studies on NFR modeling and retrieval-based semantic matching, and highlights the research gap in linking functional and non-functional requirements.

2.1. Software Engineering and NFR Modeling

NFR identification and extraction have evolved substantially over the past decade. They have shifted from early syntactic and keyword-based methods toward deeper semantic understanding, enabled by modern deep learning and transformer-based models. The following review summarizes the key contributions involved in this evolution and highlights the limitations that motivate our retrieval-based framework.

Early research on NFR extraction mainly relied on lexical analysis and traditional machine-learning techniques. Researchers commonly used Bag-of-Words (BoW), TF-IDF, and classifiers such as SVM and Naïve Bayes to separate FRs and NFRs in SRS documents. These approaches achieved promising results in classification tasks but remained largely dependent on surface-level textual features [16,17].

Researchers later introduced unsupervised and hybrid approaches to improve feature extraction. Methods based on keyphrase extraction and clustering demonstrated improved performance using syntactic and statistical features, achieving high recall and precision on benchmark datasets such as PROMISE and PURE [18]. Similarly, combining keyword extraction with neural models such as CNN enabled more robust classification of NFR categories from textual data [19]. Despite these improvements, these methods still relied heavily on handcrafted or shallow features and lacked deeper semantic understanding.

Subsequent deep learning models built on these approaches to focus on automatic semantic representation learning. Neural architectures such as CNNs, MLPs, and BiLSTM models demonstrated improved performance by capturing contextual dependencies within requirement texts [20,21]. These models reduced reliance on manual feature engineering and improved classification accuracy across multiple NFR categories.

The introduction of transfer learning and pre-trained language models (PLMs) was a major advance in NFR analysis. Transformer-based models such as BERT considerably improved contextual understanding and generalization in requirement classification tasks [22,23,24]. Subsequent studies further enhanced performance by using prompt-based learning and fine-tuning strategies under limited data conditions [25]. Pre-trained models remain the dominant approach for automated NFR classification in modern software development environments [26].

Recent studies have also explored the potential of zero-shot learning to mitigate the challenges of scarce labeled data in requirements engineering. For instance, Alhoshan et al. [27] investigated zero-shot learning for requirements classification, demonstrating that pre-trained language models can generalize to unseen requirement categories under limited supervision settings. Their findings further highlight the challenges associated with scarce labeled datasets in requirements engineering, which motivates the retrieval-oriented learning strategy adopted in this work for modeling FR–NFR associations.

Despite these advances, most existing approaches focus primarily on sentence-level classification and assume that explicit NFRs are available. As a result, they do not capture the semantic dependencies between functional and non-functional requirements, nor do they address the problem of retrieving relevant NFRs for a given FR. These limitations motivate the exploration of retrieval-based and semantic matching approaches, discussed in the next subsection.

2.2. Retrieval and Semantic Matching

Recent advances have introduced dense and semantic retrieval methods that better capture contextual relationships between texts. While lexical approaches are efficient, they often fail to represent deeper semantic connections, which has motivated the adoption of neural retrieval models. Bi-encoder architectures, such as Sentence-BERT (SBERT) and Dense Passage Retrieval (DPR), encode queries and documents into dense vector representations, enabling efficient similarity computation and large-scale retrieval [28,29]. Their effectiveness has led to their widespread adoption in modern retrieval systems. However, retrieval quality depends heavily on the training strategy, particularly the selection of informative negative samples and effective representation learning.

To improve representation learning, several studies have explored augmentation and distillation strategies for bi-encoders. DistilBERT [30] showed that Transformer models can be compressed effectively while retaining much of their representational power. Augmented SBERT [31] showed that data augmentation can improve pairwise sentence scoring and strengthen bi-encoder training. In dense retrieval, ERNIE-Search [32] bridges cross-encoders and dual-encoders through self-distillation, showing that interaction-aware knowledge transfer can significantly improve dense retrieval performance.

Hard negative mining has also become a central component of retrieval training. Early approaches relied on random or in-batch negatives, which were often insufficiently challenging. More advanced methods, such as ANCE, actively mine hard negatives from the retrieval space to improve model discrimination [33]. However, aggressive mining may introduce false negatives, which can degrade model performance. To mitigate this issue, teacher-guided approaches such as RocketQA [34] employ stronger models to filter noisy negatives, leading to more stable and practical training.

Knowledge distillation has also emerged as a key technique for improving retrieval efficiency. Initially introduced for model compression [35], it has since been extended to semantic representation learning, in which student models learn to approximate the representations produced by larger teacher models. This paradigm has also been applied in multilingual SBERT settings, demonstrating that distilled sentence encoders can preserve strong semantic matching capabilities while improving efficiency [36].

Curriculum learning has further improved training stability in retrieval systems. Instead of training on difficult examples immediately, models are exposed to progressively harder samples, leading to better convergence and generalization [37]. This strategy is particularly beneficial when combined with hard negative mining, as shown in retrieval-oriented training studies that emphasize the importance of difficulty scheduling during optimization [38].

Hybrid retrieval approaches combine lexical methods such as BM25 with dense neural representations, leveraging the complementary strengths of both paradigms. Such approaches have been shown to improve robustness across diverse datasets [39,40]. Replication studies on dense passage retrieval have highlighted that score combination and validation-based tuning play an important role in building reliable hybrid retrieval pipelines [41]. Interaction-based reranking models, such as Cross-Encoders, further refine the initial ranking by modeling fine-grained interactions between queries and documents [42]. More efficient alternatives, such as late-interaction models like ColBERT [43], balance retrieval performance with computational cost.

Recent research has also explored semantic retrieval in software engineering contexts. Pre-trained models such as CodeBERT [44] have demonstrated that dense contextual representations can capture semantic relationships between natural language and software artifacts. Similarly, BERT-based traceability models show that contextual retrieval methods can improve link recovery between requirements and source code [45]. Other work has explored requirement-level analysis using BERT-based models for extracting and analyzing software requirements from app reviews [10], while hybrid retrieval approaches combining semantic search with structured relation extraction have been applied to software requirements retrieval [46]. These studies suggest that semantic retrieval has potential in software engineering; however, they do not explicitly address the problem of linking functional and non-functional requirements through retrieval-based methods. In general, while dense retrieval, knowledge distillation, curriculum learning, and hybrid ranking techniques have significantly advanced semantic matching, their application to requirement-level semantic association remains largely unexplored.

2.3. FR–NFR Association and Research Gap

While retrieval techniques have been applied in software engineering, few studies have explored the relationship between functional and non-functional requirements. Most existing work focuses on classifying NFRs or FRs, rather than modeling their direct association. Ref. [4] presents a representative approach, proposing a semi-supervised method to infer NFR information from FRs. Their method extracts terms from FRs and groups them into semantically coherent clusters using similarity measures such as Latent Semantic Analysis (LSA), pointwise mutual information (PMI), thesaurus-based similarity, and Normalized Google Distance (NGD). The researchers then map these clusters to predefined NFR categories (e.g., performance and security) based on their semantic similarity to NFR keywords. In addition, they link these clusters to implementation artifacts by analyzing their similarity with source code descriptions across multiple case studies.

Similarly, ref. [5] employs a semi-supervised approach that utilizes a pre-trained Word2Vec model to compute semantic similarity between requirement statements and predefined NFR keyword sets. Each requirement is assigned to the closest NFR category based on similarity scores, following a keyword-driven classification strategy.

Despite their relevance, these approaches remain limited to coarse-grained classification and rely on predefined keyword sets, clustering techniques, and similarity thresholds. They do not model direct semantic relationships between FR and NFR sentences, nor do they support retrieving full NFR statements for unseen FR queries.

While recent advances in transformer-based semantic retrieval have shown strong capabilities in modeling contextual relationships, these techniques have not been explicitly applied to FR–NFR association. To the best of our knowledge, prior work has not explicitly formulated FR–NFR association as a semantic retrieval problem within a unified framework. This gap motivates the proposed Re-Distill framework, which models fine-grained relationships between functional and non-functional requirements through multi-stage retrieval.

3. Methodology

This section describes the methodology adopted to support the proposed retrieval framework. First, the construction and preparation of the FR–NFR dataset used for training and evaluation are presented. Next, the proposed Re-Distill framework, a curriculum-guided multi-stage retrieval pipeline for linking FRs and NFRs, is introduced. Finally, the training strategy, implementation details, and hyperparameter optimization procedures used across the different stages of the framework are described.

3.1. FR–NFR Dataset

This subsection describes the construction and preparation of the FR–NFR dataset used in this study. It covers the creation of requirement pairs, the preprocessing pipeline, the dataset partitioning strategy, and the data augmentation method applied to the training set.

3.1.1. Data Construction

The objective of this study is to retrieve NFRs from FRs by modeling the semantic dependencies between them. Specifically, four primary NFR categories are considered: performance, security, availability, and usability, which represent the most critical quality attributes in standard software engineering taxonomies [8,13]. This formulation requires a dataset that supports retrieval-based learning, where each FR is associated with one or more relevant NFRs.

However, constructing such a dataset is challenging due to the limited availability of publicly annotated FR–NFR pairs. The only publicly available dataset is derived from a subset of the Tera-PROMISE repository [2], consisting of 196 FRs mapped to 97 NFRs. In this study, this subset was accessed through a publicly available implementation [47], which provides a processed version of the Tera-PROMISE dataset. While this dataset serves as a valuable starting point for FR–NFR analysis, its limited scale makes it insufficient for training modern transformer-based retrieval models.

To address this limitation, the dataset was expanded by incorporating additional projects from the Tera-PROMISE repository. From 39 available projects, those containing both FRs and NFRs across the target categories were selected, excluding incomplete or highly imbalanced cases. This process resulted in the inclusion of four additional projects.

To further enhance dataset diversity, the PURE dataset [48] was incorporated. From 79 available projects, 28 were selected based on the presence of clearly defined FRs and NFRs within the desired categories. Requirement sentences were manually extracted from SRS documents and organized into structured FR and NFR lists.

The mapping between FRs and NFRs was guided by domain knowledge and followed a structured annotation and verification process. We hired an external business analyst with over 4 years of experience in software requirements analysis to perform the initial pairwise associations. The process started with a pilot phase that included a small number of projects. These were reviewed under academic supervision to establish consistent mapping criteria and resolve early ambiguities. For the remaining projects, the primary researcher conducted continuous manual verification of the mappings to ensure consistent criteria across all 43 projects.

An NFR was linked to an FR only when a direct and meaningful relationship was identified, where the NFR constrained or described the quality aspects of the specific functionality. For example, an FR describing user login functionality was linked only to NFR statements directly related to its quality constraints, such as secure credential handling or authentication performance. System-level NFRs were linked only when a clear and justifiable connection to a specific FR could be established.

These verified FR–NFR associations were retained as the ground-truth links used during retrieval training and evaluation. We intentionally excluded weak, indirect, or ambiguous relationships to reduce annotation noise.

This study did not use a formal multi-annotator setup or calculate an inter-annotator agreement metric such as Cohen’s kappa. However, the annotation process included several rounds of manual review and verification across the dataset. Similar iterative review procedures have been adopted in prior studies to improve annotation quality and reduce inconsistencies during dataset construction [49,50]. In our case, the annotation workflow involved an external business analyst for the initial mapping, followed by academic supervision and continuous manual verification by the primary researcher. This approach helped improve consistency across the dataset and resulted in more than 11,000 pairwise evaluations.

The final dataset consists of 1162 FRs mapped to 393 NFRs across 43 projects, combining both PROMISE and PURE sources. Compared to the original dataset, this represents a substantial increase in scale and diversity, making it suitable for training retrieval-based models. Figure 1 compares the original and expanded datasets. Figure 2 shows the distribution of NFR categories. Performance (43.3%) and security (28.8%) represent the largest proportions, followed by usability (19.6%) and availability (8.4%).

3.1.2. Preprocessing and Partitioning

To reduce noise and ensure consistency, all requirement sentences (FRs and NFRs) were preprocessed before data partitioning and model training. A lightweight preprocessing pipeline was applied, including: (i) lowercasing all text, (ii) removing punctuation and special characters while preserving semicolons as separators for multiple NFRs, (iii) normalizing whitespace, and (iv) lemmatization using WordNet [51] to retain base word forms without altering semantic meaning. The cleaned dataset was stored with essential fields only, including the project identifier, FR, and associated NFRs.
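The four preprocessing steps can be illustrated with a minimal sketch. It is not the authors' code: the function name `preprocess_requirement` is hypothetical, and the lemmatizer is passed in as a callable (defaulting to the identity) so that the example runs without external resources. In practice, a WordNet-based lemmatizer such as NLTK's `WordNetLemmatizer().lemmatize` could be supplied.

```python
import re
from typing import Callable


def preprocess_requirement(
    text: str,
    lemmatize: Callable[[str], str] = lambda word: word,  # assumption: identity by default
) -> str:
    """Lowercase, strip punctuation (keeping ';' as the NFR separator),
    normalize whitespace, and lemmatize each token."""
    cleaned_segments = []
    # Split on ';' first so the separator between multiple NFRs survives cleaning.
    for segment in text.lower().split(";"):
        # Replace anything that is not a lowercase ASCII letter, digit, or
        # whitespace with a space. Note that this also splits decimals
        # such as "2.5" into "2 5".
        segment = re.sub(r"[^a-z0-9\s]", " ", segment)
        # str.split() with no arguments collapses runs of whitespace.
        tokens = segment.split()
        if tokens:
            cleaned_segments.append(" ".join(lemmatize(tok) for tok in tokens))
    return "; ".join(cleaned_segments)


if __name__ == "__main__":
    raw = "The System SHALL respond in 2 seconds!;  Data must be encrypted."
    print(preprocess_requirement(raw))
```

Splitting on the semicolon before cleaning keeps each NFR in a multi-NFR field intact, which is what later allows the field to be expanded into individual FR–NFR pairs.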

Table 1 presents a sample from the preprocessed dataset, illustrating the structure of FR–NFR pairs and the use of semicolon-separated NFRs within a single field. It also highlights the inclusion of the project identifier, which is later used to support the project-level data partitioning strategy.

To determine an appropriate training configuration, four partitioning strategies were evaluated: random split, semantic clustering, controlled overlap, and project-level split. These strategies were compared using Cosine Similarity and Jaccard Overlap metrics [52]. Among them, the project-level split proved the most reliable, as it isolates entire projects across subsets, preventing data leakage and better simulating real-world generalization [53].

Accordingly, the dataset of 43 projects was divided using a 70–15–15 ratio, resulting in 30 projects for training (939 FRs), 6 for validation (98 FRs), and 7 for testing (125 FRs). To verify the independence of these subsets, similarity analysis was conducted. The Jaccard overlap was negligible (0.00% for Train–Validation and 0.98% for Train–Test), indicating minimal redundancy. Meanwhile, cosine similarity scores (e.g., 0.434 for Train–Test) indicate that the subsets remain within the same domain while being sufficiently distinct for evaluation. In general, this partitioning strategy ensures semantic diversity, prevents information leakage, and provides a realistic evaluation setting in which the model retrieves NFRs for unseen projects.
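A project-level split and the accompanying overlap check can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the study: the row layout (a `project` key per record), the seeded shuffle, and the rounding rule are hypothetical, and the Jaccard overlap is computed here over sets of exact requirement strings because the granularity used in the paper is not specified.

```python
import random
from collections import defaultdict


def project_level_split(rows, ratios=(0.70, 0.15, 0.15), seed=0):
    """Split records so that every project lands in exactly one subset.

    rows: iterable of dicts that each carry a 'project' key (assumed layout).
    Returns (train, validation, test) lists of records.
    """
    by_project = defaultdict(list)
    for row in rows:
        by_project[row["project"]].append(row)

    projects = sorted(by_project)            # deterministic order before shuffling
    random.Random(seed).shuffle(projects)

    n_train = round(ratios[0] * len(projects))
    n_val = round(ratios[1] * len(projects))
    groups = (
        projects[:n_train],
        projects[n_train:n_train + n_val],
        projects[n_train + n_val:],           # the remainder goes to the test set
    )
    return tuple([r for p in group for r in by_project[p]] for group in groups)


def jaccard_overlap(a, b):
    """|A ∩ B| / |A ∪ B| over two collections of requirement strings."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

With 43 projects and the default ratios, this rounding rule yields 30, 6, and 7 projects, matching the partition sizes reported above.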

3.1.3. Data Augmentation

To increase the semantic diversity of NFR candidates, data augmentation was applied exclusively to the NFR side of the training set. This design choice reflects the retrieval setting, where FRs act as fixed input queries and should remain unchanged to preserve their natural distribution during training and evaluation. In contrast, NFRs are more abstract and exhibit higher lexical variation, making them better suited for augmentation.

For each original FR–NFR pair, the FR sentence was kept unchanged, while the corresponding NFR was expanded into multiple paraphrased variants using a GPT-based model (GPT-3.5-turbo) [54]. Each original NFR was augmented with three paraphrased variants, resulting in four NFR candidates per FR–NFR pair. This strategy improves semantic coverage and helps the model generalize across lexically diverse yet semantically equivalent expressions, consistent with prior work on LLM-based augmentation [55,56].

To reduce the possibility of semantic drift during augmentation, the GPT-3.5-turbo model was instructed to preserve the original technical meaning and professional tone of the requirements. After generation, a subset of the paraphrased NFRs was manually reviewed to check whether the original meaning had been preserved. The review focused on maintaining technical constraints, numerical values (e.g., response-time limits), and requirement strength expressed through modal verbs such as ‘must’ and ‘shall’. Variants that introduced ambiguity or changed the original NFR intent were excluded from the augmented dataset. This process helped ensure that the observed performance improvements were mainly driven by increased semantic diversity rather than noisy or incorrect signals.
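The review described here was manual. As a complement, two of the review criteria (numerical values and modal verbs) lend themselves to a simple automated pre-filter, sketched below. The function name `preserves_constraints` and the modal-verb list are assumptions made for illustration; such a check can only flag obvious drift and does not replace human judgment on meaning.

```python
import re

# Assumption: a small, illustrative set of requirement-strength modal verbs.
MODALS = {"must", "shall", "should", "may", "will"}


def _numbers(text: str) -> list:
    """All integer or decimal literals in the text, sorted for comparison."""
    return sorted(re.findall(r"\d+(?:\.\d+)?", text))


def _modals(text: str) -> set:
    """The modal verbs from MODALS that occur in the text."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w in MODALS}


def preserves_constraints(original: str, paraphrase: str) -> bool:
    """True if the paraphrase keeps the same numeric values and modal verbs."""
    return (_numbers(original) == _numbers(paraphrase)
            and _modals(original) == _modals(paraphrase))


if __name__ == "__main__":
    src = "The system shall respond within 2 seconds."
    print(preserves_constraints(src, "Responses shall be returned in at most 2 seconds."))
    print(preserves_constraints(src, "The system should respond within 3 seconds."))
```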

An illustrative example of the FR–NFR mapping and augmentation process is shown in Table 2. The example indicates how a single FR is associated with a performance-related NFR, along with its paraphrased variants.

In this example, the responsiveness-related NFR directly constrains the interaction quality of the dispute-management functionality by defining an acceptable response-time limit during screen navigation. This practical case shows how a specific functional operation—real-time data entry—corresponds to measurable quality requirements, ensuring acceptable system responsiveness during user interaction.

3.2. Proposed Re-Distill Framework

Figure 3 shows a high-level overview of the proposed Re-Distill framework, a multi-stage retrieval pipeline for modeling relationships between FRs and NFRs.

The pipeline begins with general semantic adaptation through bi-encoder training, followed by domain-specific specialization using partial transformer fine-tuning. The learned representations are then further enhanced through curriculum-guided distillation, where a cross-encoder teacher guides safe hard-negative mining and data construction. Finally, a hybrid retrieval and model selection stage integrates dense and lexical retrieval with cross-encoder reranking to produce the final ranked NFR output. Together, these stages form a domain-aware retrieval framework tailored for FR–NFR association.

The design of this multi-stage pipeline is motivated by the progressive complexity of modeling FR–NFR relationships. Early stages focus on learning general semantic representations, while later stages introduce domain-specific refinement and controlled hard-negative learning to capture fine-grained dependencies. This staged design enables more stable training and improves generalization compared to single-stage retrieval approaches.

3.2.1. Stage 1: General Semantic Adaptation

In the first stage, a bi-encoder is fine-tuned to adapt general-purpose language representations to the domain of software requirements. Different pre-trained transformer backbones are explored, including MiniLM [57], MPNet [58], and its variant Multi-QA MPNet, to learn semantic representations of FR–NFR pairs.

Training is performed using MultipleNegativesRankingLoss (MNRL) [28], which encourages the model to bring semantically related FR–NFR pairs closer in the embedding space while pushing unrelated pairs apart. To improve robustness and generalization, training is conducted on a mixture of original samples and augmented data (20% of the training set). This ratio was selected empirically to introduce semantic diversity while preserving the natural distribution of the original requirement specifications.
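The objective behind MNRL can be made concrete with a small, framework-free sketch. For a batch of FR–NFR pairs, each FR treats its own NFR as the positive and every other NFR in the batch as a negative, and the loss is the mean cross-entropy over scaled cosine similarities. The helper names below are hypothetical, and the scale of 20 is an assumption chosen to mirror a commonly used default rather than a value reported in this study.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mnr_loss(fr_embs, nfr_embs, scale=20.0):
    """In-batch softmax loss: nfr_embs[i] is the positive for fr_embs[i];
    all other rows of nfr_embs act as negatives for that FR."""
    total = 0.0
    for i, query in enumerate(fr_embs):
        logits = [scale * cosine(query, doc) for doc in nfr_embs]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]          # negative log-probability of the positive
    return total / len(fr_embs)


if __name__ == "__main__":
    embs = [[1.0, 0.0], [0.0, 1.0]]
    print(mnr_loss(embs, embs))             # well-aligned pairs: loss near zero
    print(mnr_loss(embs, embs[::-1]))       # mismatched pairs: large loss
```

Because negatives come from the rest of the batch, larger batches expose each FR to more unrelated NFRs per step, which is one reason batch composition matters for this loss.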

3.2.2. Stage 2: Domain-Specific Specialization

In this stage, the bi-encoder obtained from Stage 1 is further refined to capture domain-specific patterns in FR–NFR relationships. Instead of full fine-tuning, partial fine-tuning is applied by unfreezing only the upper layers of the transformer encoder.

This strategy is motivated by findings that lower layers of transformer models capture general syntactic information, while upper layers encode task-specific semantic features [59]. By applying this strategy consistently across the evaluated backbone models, efficient domain adaptation is obtained while preserving the general linguistic knowledge learned during pre-training. As a result, the model becomes more sensitive to requirement-level semantics and domain-specific phrasing.
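Partial fine-tuning of this kind is usually implemented by toggling the gradient flag on each parameter. The sketch below assumes parameter names that contain a `layer.<index>.` segment, as is common in Hugging Face-style transformer encoders, and accepts any iterable of `(name, parameter)` pairs such as the one returned by a PyTorch module's `named_parameters()`. The function name and the number of unfrozen layers are hypothetical; the paper does not state how many upper layers were unfrozen.

```python
import re


def unfreeze_top_layers(named_parameters, num_layers, top_k):
    """Enable gradients only for the top_k encoder layers; freeze the rest.

    named_parameters: iterable of (name, param) pairs where each param has a
    writable `requires_grad` attribute.
    Returns the names of the parameters left trainable.
    """
    first_trainable = num_layers - top_k
    trainable = []
    for name, param in named_parameters:
        # Assumed naming scheme: "...encoder.layer.<index>.<submodule>...".
        match = re.search(r"\blayer\.(\d+)\.", name)
        is_top = match is not None and int(match.group(1)) >= first_trainable
        # Embeddings and lower layers do not match or fall below the cutoff,
        # so they are frozen.
        param.requires_grad = is_top
        if is_top:
            trainable.append(name)
    return trainable
```

For a 12-layer backbone, calling `unfreeze_top_layers(model.named_parameters(), num_layers=12, top_k=2)` would, under these naming assumptions, leave only layers 10 and 11 trainable.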

3.2.3. Stage 3: Curriculum-Guided Distillation

Figure 4 shows Stage 3, where curriculum-guided distillation is applied through a multi-stage teacher-assisted training process. In this stage, the bi-encoder obtained from Stage 2 is refined using a data-centric distillation strategy supported by a cross-encoder teacher.

Unlike traditional knowledge distillation approaches that directly regress toward teacher similarity scores, our method leverages the teacher model to guide data construction through hard-negative mining. Specifically, the bi-encoder first retrieves a broad pool of candidate NFRs for each FR using FAISS-based dense retrieval [60]. This step captures the semantic neighborhood of each query.
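The candidate-retrieval step can be illustrated without FAISS itself. The sketch below is a brute-force stand-in: it ranks NFR embeddings by cosine similarity to the FR embedding, which is the same neighbor ordering an exact inner-product index over normalized vectors would produce. The function names are hypothetical, and the embeddings are assumed to come from the student bi-encoder.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dense_top_k(query_emb, nfr_embs, k):
    """Return the k most similar NFRs as (index, cosine_score) pairs.

    Brute-force replacement for an approximate or exact vector index;
    suitable only for small candidate sets.
    """
    q = normalize(query_emb)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(doc))), idx)
        for idx, doc in enumerate(nfr_embs)
    ]
    # Highest score first; ties are broken by the lower index.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(idx, score) for score, idx in scored[:k]]
```

With only a few hundred NFRs, exhaustive search of this kind is cheap; an index structure matters mainly when the candidate set grows much larger.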

A high-capacity cross-encoder teacher, cross-encoder/ms-marco-MiniLM-L-12-v2, was used as a semantic evaluator to score the retrieved FR–NFR pairs. Based on these scores, a Safe Mining Protocol is introduced to filter out misleading training signals. Candidate pairs with extremely high similarity scores are treated as potential false negatives (toxic negatives) and are removed from the training set to prevent incorrect supervision.

To construct the curriculum-guided training data, candidate NFRs were first retrieved for each FR using the current student bi-encoder and FAISS dense retrieval. Ground-truth NFRs associated with the same FR were removed from the candidate pool to avoid sampling true positives as negatives. The remaining candidate FR–NFR pairs were then scored using the teacher cross-encoder, cross-encoder/ms-marco-MiniLM-L-12-v2.

To reduce the risk of false negatives, candidate pairs with extremely high semantic similarity were excluded using the Safe Mining Protocol and a conservative filtering threshold. The remaining negative samples were then divided into easy and hard negatives according to their relative semantic difficulty. This ranking strategy enabled the curriculum to progressively transition from clearer semantic distinctions toward more challenging hard-negative samples during training.

The remaining candidate pairs are used to train the student bi-encoder following an easy-to-hard curriculum [37]. Training begins with easier negative samples and progressively incorporates more challenging ones, allowing the model to gradually refine its representation space.
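One possible pacing rule for such an easy-to-hard curriculum is sketched below. The linear schedule and the function name `negatives_for_epoch` are assumptions introduced for illustration; the exact pacing function is not specified in the text.

```python
def negatives_for_epoch(easy, hard, epoch, num_epochs):
    """Return the negatives available at a given 0-based epoch.

    All easy negatives are always available. Hard negatives are released
    linearly: none in the first epoch, all of them in the last epoch.
    `hard` is assumed to be ordered from least to most difficult.
    """
    if num_epochs <= 1:
        return list(easy) + list(hard)
    n_hard = len(hard) * epoch // (num_epochs - 1)
    return list(easy) + list(hard[:n_hard])


easy = ["e1", "e2"]
hard = ["h1", "h2", "h3", "h4"]
schedule = [negatives_for_epoch(easy, hard, e, 3) for e in range(3)]
```

With three epochs, this schedule exposes only easy negatives first, then half of the hard negatives, and finally the complete pool.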

Finally, an iterative refinement step is applied within the same stage, where the model is further optimized using harder negatives and stricter training settings. This process sharpens the embedding space and improves the model’s ability to distinguish subtle semantic differences between relevant and non-relevant NFRs.

3.2.4. Stage 4: Hybrid Retrieval and Model Selection

Figure 5 shows the workflow of Stage 4 and the hybrid retrieval process used during inference. In this stage, the distilled bi-encoder is integrated into a hybrid retrieval pipeline to retrieve and rank candidate NFRs for unseen FR queries.

Unlike standard dense retrieval, which relies solely on semantic embeddings, the proposed approach integrates both dense and lexical retrieval to ensure broad coverage of candidate NFRs. Specifically, an initial candidate pool is constructed by combining results from FAISS-based dense retrieval and BM25 lexical retrieval [61].
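The construction of the candidate pool can be illustrated with a short sketch that merges two ranked lists. The function name `build_candidate_pool` and the cutoffs `k_dense` and `k_lexical` are hypothetical; the text does not report the pool sizes, and the ranked lists are assumed to come from the dense and BM25 retrievers, respectively.

```python
def build_candidate_pool(dense_ranked, bm25_ranked, k_dense, k_lexical):
    """Union of the top dense and top lexical candidates.

    Duplicates are removed while preserving first-seen order, so an NFR
    found by both retrievers appears only once in the pool.
    """
    pool = []
    seen = set()
    for nfr_id in dense_ranked[:k_dense] + bm25_ranked[:k_lexical]:
        if nfr_id not in seen:
            seen.add(nfr_id)
            pool.append(nfr_id)
    return pool


pool = build_candidate_pool(
    dense_ranked=["nfr_a", "nfr_b", "nfr_c"],
    bm25_ranked=["nfr_c", "nfr_d", "nfr_a"],
    k_dense=2,
    k_lexical=2,
)
```

The merged pool is then passed to the reranking step, which assigns the final order.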

The retrieved candidates are then reranked using cross-encoder models that directly evaluate the semantic relevance between the FR query and each candidate NFR. Two types of rerankers are employed: (i) a general-purpose cross-encoder and (ii) a domain-specific cross-encoder fine-tuned for FR–NFR matching.

To balance efficiency and accuracy, a score fusion mechanism is applied to combine the initial retrieval scores with the reranking scores. This hybrid strategy leverages the high recall of dense and lexical retrieval with the precision of cross-encoder reranking, resulting in a final ranked list of NFRs.

3.3. Training and Implementation Details

This subsection summarizes the training strategy and implementation setup across the four stages of the proposed framework. All models were implemented using the Sentence-Transformers library and trained on the FR–NFR dataset described earlier.

Stage 1: The bi-encoder is fine-tuned for 10 epochs using MultipleNegativesRankingLoss with the AdamW optimizer [62] (learning rate $2 \times 10^{-5}$, batch size 32). The maximum sequence length is set to 256 tokens. This stage focuses on adapting general semantic representations to the requirements domain. Preliminary experiments were conducted to determine the optimal augmentation ratio, comparing 20%, 30%, 40%, and full augmentation. The results showed that using 20% augmented data provided the best balance between improved generalization and training stability, and this ratio was therefore selected.

Stage 2: The model is initialized from Stage 1 and refined using partial fine-tuning, where only the upper transformer layers are unfrozen. Training is performed for 4 epochs using MultipleNegativesRankingLoss with a reduced learning rate ($1 \times 10^{-5}$) and weight decay (0.01). This stage enhances domain-specific representation learning while preserving general knowledge.

Stage 3: The model is trained using a curriculum-guided contrastive learning strategy based on the data constructed in this stage. Training combines Triplet Loss with Matryoshka Representation Learning [63] to enable robust multi-scale embedding representations. Hard negative samples, mined using a teacher-guided process, are incorporated during training and organized under an easy-to-hard curriculum across training epochs. The stage concludes with an iterative refinement phase that further improves the embedding space under stricter optimization settings.
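The combination of a triplet objective with Matryoshka-style multi-scale embeddings can be sketched as follows. This is a simplified illustration rather than the training code: Euclidean distance, equal weighting of the prefix dimensions, the margin value, and the toy vectors are all assumptions.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin):
    """Hinge loss: the positive should be closer than the negative by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def matryoshka_triplet_loss(anchor, positive, negative, dims, margin=1.0):
    """Average the triplet loss over several embedding prefixes.

    Each entry of `dims` truncates the embeddings to their first d
    components, so short prefixes are also trained to separate
    positives from negatives.
    """
    per_dim = [
        triplet_loss(anchor[:d], positive[:d], negative[:d], margin)
        for d in dims
    ]
    return sum(per_dim) / len(per_dim)


# The negative differs from the anchor only in the third component, so the
# 2-dimensional prefix cannot separate it while the full vector can.
loss = matryoshka_triplet_loss(
    anchor=[1.0, 0.0, 0.0, 0.0],
    positive=[1.0, 0.0, 0.0, 0.0],
    negative=[1.0, 0.0, 3.0, 0.0],
    dims=[2, 4],
)
```

In this toy case the full-dimensional term is zero and the truncated term equals the margin, which shows how the multi-scale objective penalizes prefixes that do not yet carry the discriminative signal.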

To ensure training stability and maximize retrieval coverage, we implemented a validation-driven grid search over learning rates, triplet margins, and batch sizes. Specifically, the batch size was increased to 90 for pairs and 24 for triplets, providing a richer set of in-batch negatives during the distillation process. Key training configurations for this stage are summarized in Table 3.

The thresholds for curriculum pacing (the bottom 30% for easy negatives and the top 30% for hard negatives) were treated as hyperparameters and tuned based on empirical evaluation on the validation set. Preliminary experiments with alternative ranges (e.g., 20% and 40%) showed that the selected 30% setting provided the most stable convergence during representation learning.

Similarly, the false-negative filtering threshold (0.95) was intentionally set conservatively to remove only candidate negatives with extremely high semantic similarity, thereby reducing the risk of introducing false negatives while preserving sufficiently challenging hard-negative samples.
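The filtering and splitting steps of the Safe Mining Protocol can be summarized in a short sketch. It assumes that teacher scores lie in $[0, 1]$, that difficulty is ranked by teacher score, and that the 0.95 boundary is exclusive; the function name `mine_negatives` and the toy scores are hypothetical.

```python
def mine_negatives(candidates, gold_ids, fn_threshold=0.95,
                   easy_pct=30, hard_pct=30):
    """Split teacher-scored candidates into easy and hard negatives.

    candidates: list of (nfr_id, teacher_score), scores assumed in [0, 1].
    gold_ids:   set of NFR ids annotated as relevant for the FR.
    """
    # Step 1: never sample a ground-truth NFR as a negative.
    pool = [(nid, s) for nid, s in candidates if nid not in gold_ids]
    # Step 2: drop suspected false negatives (very high teacher score).
    pool = [(nid, s) for nid, s in pool if s < fn_threshold]
    # Step 3: rank by teacher score; low score = easy, high score = hard.
    pool.sort(key=lambda item: item[1])
    n_easy = len(pool) * easy_pct // 100
    n_hard = len(pool) * hard_pct // 100
    easy = [nid for nid, _ in pool[:n_easy]]
    hard = [nid for nid, _ in pool[len(pool) - n_hard:]]
    return easy, hard


candidates = [
    ("n0", 0.10), ("n1", 0.20), ("n2", 0.30), ("n3", 0.40),
    ("n4", 0.50), ("n5", 0.60), ("n6", 0.70), ("n7", 0.80),
    ("n8", 0.90), ("n9", 0.97), ("g", 0.99), ("n10", 0.05),
]
easy, hard = mine_negatives(candidates, gold_ids={"g"})
```

Here the gold NFR `g` and the suspected false negative `n9` are removed first, and the remaining ten candidates yield three easy and three hard negatives; the middle band is left unused in this sketch.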

Stage 4: To improve ranking precision, a domain-specific cross-encoder reranker is fine-tuned using hard negative samples generated in Stage 3. These negatives were mined using the domain-specific bi-encoder from Stage 2 as a base retriever, then evaluated by the cross-encoder teacher. The training data consists of binary labeled pairs constructed from positive (FR, NFR) and hard negative (FR, NFR) examples. The model is trained for 2 epochs using the AdamW optimizer with a learning rate of $2 \times 10^{-5}$ and a batch size of 32.

During inference, the final ranking score is computed as a weighted combination of bi-encoder and cross-encoder outputs:

$$S_{final}(q, d) = \alpha \cdot S_{bi}(q, d) + (1 - \alpha) \cdot S_{ce}(q, d) \tag{1}$$

where $S_{bi}$ denotes the cosine similarity from the bi-encoder, $S_{ce}$ is the sigmoid-normalized score from the cross-encoder, and $\alpha$ is a weighting parameter tuned on the validation set. Both scores are normalized to the $[0, 1]$ range before fusion.
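A minimal sketch of the fusion in Equation (1) is given below. The mapping of cosine similarity from $[-1, 1]$ to $[0, 1]$ via $(s + 1)/2$ is an assumption, since the exact normalization of the bi-encoder score is not specified; the function names and toy scores are likewise illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse(cos_sim, ce_logit, alpha):
    """Equation (1): alpha * S_bi + (1 - alpha) * S_ce."""
    s_bi = (cos_sim + 1.0) / 2.0   # assumed mapping of cosine to [0, 1]
    s_ce = sigmoid(ce_logit)       # sigmoid-normalized cross-encoder score
    return alpha * s_bi + (1.0 - alpha) * s_ce


def rerank(candidates, alpha):
    """candidates: list of (nfr_id, cos_sim, ce_logit); returns ids, best first."""
    scored = [(fuse(c, l, alpha), nid) for nid, c, l in candidates]
    return [nid for _, nid in sorted(scored, key=lambda t: t[0], reverse=True)]


candidates = [("x", 0.9, -2.0), ("y", 0.2, 3.0)]
dense_only = rerank(candidates, alpha=1.0)      # bi-encoder decides
reranker_only = rerank(candidates, alpha=0.0)   # cross-encoder decides
```

Setting $\alpha = 1$ reduces the ranking to the dense retriever, whereas $\alpha = 0$ relies entirely on the cross-encoder; intermediate values trade off the two signals.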

3.4. Hyperparameter Optimization and Model Selection

A validation-based grid search is conducted across both training and inference stages to identify the optimal configuration of the proposed framework. During training (Stage 3), multiple configurations are explored by varying key parameters such as Triplet Loss margins, learning rates, and Matryoshka embedding dimensions. Each configuration is trained through the complete curriculum and iterative refinement schedule. The resulting models are then evaluated on the validation set using retrieval-oriented ranking metrics, and the best-performing configuration is selected.

During inference (Stage 4), fusion weights are tuned separately for different reranking setups, including both general-purpose and domain-specific cross-encoder rerankers. Specifically, the fusion parameter $\alpha \in \{0.0, 0.1, 0.3, 0.5, 0.7, 1.0\}$ is optimized on the validation set. The final model is selected based on validation performance and evaluated once on the held-out test set.
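The selection of the fusion weight can be sketched as a grid search over the values listed above. The sketch assumes that mean Recall@k on the validation queries is the selection criterion and that ties are resolved in favor of the first value in the grid; both choices, as well as the function names, are assumptions.

```python
ALPHA_GRID = [0.0, 0.1, 0.3, 0.5, 0.7, 1.0]


def recall_at_k(ranked, gold, k=10):
    return len(set(ranked[:k]) & set(gold)) / len(gold)


def select_alpha(val_queries, k=10):
    """Pick the alpha with the highest mean Recall@k on validation queries.

    val_queries: list of (candidates, gold), where candidates is a list of
    (nfr_id, s_bi, s_ce) with both scores already normalized to [0, 1].
    """
    best_alpha, best_recall = None, -1.0
    for alpha in ALPHA_GRID:
        total = 0.0
        for candidates, gold in val_queries:
            ranked = sorted(
                candidates,
                key=lambda c: alpha * c[1] + (1 - alpha) * c[2],
                reverse=True,
            )
            total += recall_at_k([c[0] for c in ranked], gold, k)
        mean_recall = total / len(val_queries)
        if mean_recall > best_recall:   # strict: first best value wins ties
            best_alpha, best_recall = alpha, mean_recall
    return best_alpha, best_recall


val = [([("a", 0.9, 0.2), ("b", 0.1, 0.9)], {"a"})]
alpha, recall = select_alpha(val, k=1)
```

The selected value is then fixed before the single evaluation on the held-out test set, in line with the protocol described above.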

3.5. Evaluation Metrics

The retrieval performance is evaluated on both the validation and test sets using ranking-based metrics that account for multiple relevant NFRs per query. For a given query $q$, let $G(q)$ denote the set of relevant (gold) NFRs with $|G(q)| = R$, and let $\pi(q) = \{n_1, n_2, \dots\}$ denote the ranked list of retrieved NFR candidates.

Recall@10: This metric measures the proportion of relevant NFRs retrieved within the top-10 ranked results:

$$\mathrm{Recall@10}(q) = \frac{|G(q) \cap \{n_1, \dots, n_{10}\}|}{|G(q)|}. \tag{2}$$

NDCG@10: Normalized Discounted Cumulative Gain at depth 10 is used to account for ranking quality, emphasizing the position of relevant items by assigning higher importance to those appearing earlier in the ranked list:

$$\mathrm{NDCG@10}(q) = \frac{\mathrm{DCG@10}(q)}{\mathrm{IDCG@10}(q)}, \tag{3}$$

where

$$\mathrm{DCG@10}(q) = \sum_{i=1}^{10} \frac{I(n_i \in G(q))}{\log_2(i + 1)}, \tag{4}$$

and $I(\cdot)$ is an indicator function. $\mathrm{IDCG@10}(q)$ is the ideal ranking score obtained when all relevant NFRs are ranked at the top positions.
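The two metrics defined above can be computed with a few lines of Python. The sketch assumes binary relevance, a ranked list without duplicate entries, and a non-empty gold set; the identifiers in the example are hypothetical.

```python
import math


def recall_at_k(ranked, gold, k=10):
    """Fraction of gold NFRs that appear in the top-k results."""
    return len(set(ranked[:k]) & gold) / len(gold)


def dcg_at_k(ranked, gold, k=10):
    """Sum of 1 / log2(i + 1) over the relevant positions i = 1..k."""
    return sum(
        1.0 / math.log2(i + 1)
        for i, nfr_id in enumerate(ranked[:k], start=1)
        if nfr_id in gold
    )


def ndcg_at_k(ranked, gold, k=10):
    """DCG normalized by the DCG of an ideal ranking (all gold items first)."""
    ideal_hits = min(len(gold), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg_at_k(ranked, gold, k) / idcg


ranked = ["n1", "n2", "n3", "n4"]
gold = {"n1", "n3", "n9"}
recall = recall_at_k(ranked, gold)   # 2 of 3 gold NFRs retrieved
ndcg = ndcg_at_k(ranked, gold)       # hits at ranks 1 and 3
```

In this example two of the three gold NFRs are retrieved, and the NDCG is below 1 both because one gold NFR is missing and because the second hit appears at rank 3 rather than rank 2.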

The choice of cutoff $k = 10$ is motivated by empirical observations, where many FR queries are associated with multiple relevant NFRs. Compared to smaller cutoffs (e.g., @5), Recall@10 and NDCG@10 provide a more stable and representative evaluation of retrieval performance by balancing coverage and ranking quality.

The evaluation focuses on Recall@10 and NDCG@10, while omitting metrics such as Mean Reciprocal Rank (MRR) and Mean Average Precision (MAP). In software engineering traceability and requirements analysis, retrieving a single relevant artifact (as emphasized by MRR) is insufficient, since a single FR is often associated with multiple NFRs.

Furthermore, developers typically examine only a limited number of top-ranked candidates due to time constraints, making full-ranking metrics such as MAP less representative of practical usage scenarios [64].

Therefore, Recall@10 is used to measure the proportion of relevant dependencies successfully retrieved within a realistic review budget, while NDCG@10 evaluates the ranking quality by prioritizing relevant NFRs appearing earlier in the result list.

While Recall@10 and NDCG@10 are the primary metrics for this retrieval task, additional fixed-depth metrics such as Precision@10 and F1@10 are provided in Appendix A (Detailed Fixed-Depth Metrics) to illustrate the balance between retrieval coverage and ranking precision. It is important to note that the number of relevant NFRs associated with each FR query varies substantially across the dataset. Consequently, fixed-depth metrics such as Precision@10 may appear inherently limited, since even highly effective retrieval models may still return non-relevant candidates when the number of relevant ground-truth NFRs is smaller than the inspection depth.
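The ceiling on Precision@10 noted above can be made concrete with a small sketch. It assumes binary relevance and hypothetical identifiers, and it shows that a query with only two gold NFRs cannot exceed a Precision@10 of 0.2 even under a perfect ranking.

```python
def precision_at_k(ranked, gold, k=10):
    """Fraction of the top-k slots occupied by gold NFRs."""
    return len(set(ranked[:k]) & gold) / k


def max_precision_at_k(num_relevant, k=10):
    """Upper bound on Precision@k for a query with `num_relevant` gold NFRs."""
    return min(num_relevant, k) / k


def f1_at_k(ranked, gold, k=10):
    p = precision_at_k(ranked, gold, k)
    r = len(set(ranked[:k]) & gold) / len(gold)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


# Perfect ranking for a query with two gold NFRs, padded with non-relevant ones.
gold = {"g1", "g2"}
ranked = ["g1", "g2"] + [f"x{i}" for i in range(8)]
p10 = precision_at_k(ranked, gold)   # equals the ceiling for R = 2
f1 = f1_at_k(ranked, gold)
```

Because Recall@10 is 1.0 and Precision@10 is capped at 0.2 in this case, F1@10 is bounded as well, which is why these fixed-depth metrics should be read alongside the distribution of gold NFRs per query.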

4. Results

Following the methodology described in Section 3, the proposed retrieval framework is evaluated through a structured experimental protocol that analyzes the impact of backbone selection, representation learning stages, and hybrid retrieval components.

Multiple transformer backbones are first compared within the hybrid retrieval setting to identify the most effective encoder. Next, a stage-wise dense retrieval ablation is performed to examine how each training stage improves semantic representations. Finally, the complete hybrid retrieval pipeline is evaluated by combining dense retrieval with cross-encoder reranking. Performance is measured using Recall@10 as the primary metric and NDCG@10 as a secondary indicator of ranking quality. Model selection is based exclusively on validation performance, with final results reported on the held-out test set.

4.1. Backbone Selection and Hybrid Retrieval Analysis

In this experiment, the impact of the underlying bi-encoder backbone on retrieval performance within the proposed hybrid framework is investigated. Three transformer-based backbones—MiniLM, MPNet, and Multi-QA MPNet—are evaluated, representing lightweight, general-purpose, and retrieval-optimized architectures.

For each backbone, hybrid retrieval is evaluated using both general-purpose and domain-specific cross-encoder rerankers. Fusion weights are tuned on the validation set, and only the best-performing configuration for each backbone is reported.

Table 4 summarizes the test-set performance of the best hybrid configurations. Across all backbones, the domain-specific reranker consistently outperforms the general-purpose reranker, confirming the benefit of domain-aware reranking. Notably, the MPNet backbone reaches the highest Recall@10 and NDCG@10 under domain-specific reranking, indicating superior semantic matching and ranking quality.

This result highlights the importance of domain-aware representation learning, where MPNet provides a stronger semantic foundation for capturing FR–NFR relationships compared to lighter or retrieval-optimized alternatives. Based on these validation-driven results, MPNet is selected as the backbone encoder for all subsequent experiments.

4.2. Stage-Wise Dense Retrieval Ablation

Following backbone selection, a stage-wise ablation analysis is conducted to quantitatively evaluate the contribution of each training stage to retrieval performance. This experiment evaluates whether the multi-stage training strategy provides measurable improvements that justify the additional training complexity. To isolate the effect of representation learning, dense-only retrieval is evaluated without lexical fusion or cross-encoder reranking.

To provide comparison against a traditional lexical retrieval strategy, a BM25-based baseline is also included. In addition, the zero-shot configuration represents a generic pretrained SBERT-style dense retrieval baseline using MPNet without domain adaptation or task-specific fine-tuning.

Four MPNet-based checkpoints are then compared: a zero-shot model, Stage 1 (general adaptation), Stage 2 (domain specialization), and Stage 3 (curriculum-guided distillation). Table 5 reports Recall@10 and NDCG@10.

The results show a clear and consistent improvement across stages, beginning from the traditional BM25 lexical baseline up to the final Stage 3 model. While BM25 captures basic keyword-level matching between FRs and NFRs, the zero-shot dense baseline already provides a notable improvement by modeling semantic similarity. The largest performance gain is observed between the zero-shot model and Stage 1, highlighting the importance of adapting general-purpose representations to the requirements domain.

Further improvements observed in Stage 2 and Stage 3 indicate that domain specialization and curriculum-guided distillation provide additional refinement beyond general semantic adaptation. These results indicate that each stage contributes to retrieval improvements, supporting the multi-stage design of the proposed framework. Notably, Stage 3 achieves the highest dense retrieval performance, suggesting that the proposed training strategy contributes meaningfully even before hybrid reranking is applied.

4.3. Hybrid Retrieval Performance

Following the dense-only ablation, the complete hybrid retrieval pipeline described in Stage 4 is evaluated, combining dense semantic retrieval with cross-encoder reranking.

The same four encoder checkpoints (Zero-shot, Stage 1, Stage 2, and Stage 3) are compared under identical hybrid inference settings. For each configuration, the fusion weight $\alpha$ is selected exclusively on the validation set and fixed before test evaluation.

4.3.1. General-Purpose Reranker

Table 6 reports hybrid retrieval performance using an off-the-shelf, general-purpose cross-encoder reranker. Interestingly, compared to dense-only retrieval (Table 5), the general-purpose reranker shows mixed behavior. While it improves performance for the Zero-shot and Stage 3 models, it does not consistently improve the intermediate models (Stage 1 and Stage 2), where slight performance drops are observed.

This behavior suggests that a general-purpose reranker may not fully align with the domain-specific representations learned during intermediate training stages. In particular, the semantic patterns captured in Stage 1 and Stage 2 may differ from the generic relevance signals learned by off-the-shelf rerankers.

This observation indicates a limitation of applying general-purpose reranking models to software engineering tasks, where domain-specific context plays a critical role. It further motivates the use of a domain-specific reranker tailored to the FR–NFR matching problem.

4.3.2. Domain-Specific Reranker

Table 7 presents results using the domain-specific reranker trained on FR–NFR hard negatives. Compared to the general-purpose reranker, the domain-specific model consistently yields higher performance, indicating that task-specific supervision plays a critical role in improving ranking precision.

The best overall performance is obtained by combining the Stage 3 encoder with the domain-specific reranker, reaching the highest Recall@10 and NDCG@10. This suggests that the proposed multi-stage training strategy and hybrid retrieval pipeline improve the modeling of fine-grained relationships between FRs and NFRs.

5. Discussion

Our experimental results show a progressive improvement in retrieval performance across the different training stages. The backbone comparison indicates that MPNet provides the strongest representations within the hybrid framework, achieving a favorable balance between semantic alignment and ranking performance compared with lighter alternatives. This behavior is likely related to MPNet’s stronger contextual representation capability, which enables improved modeling of complex relationships in software requirements.

The dense retrieval ablation further highlights the value of progressive representation learning. The transition from zero-shot retrieval to Stage 1 produces a noticeable improvement, increasing Recall@10 from 38.25 to above 61 and demonstrating the importance of domain adaptation. Improvements between Stage 1 and Stage 2 remain smaller but still consistent. This suggests that partial fine-tuning mainly adjusts existing representations rather than introducing substantial changes to the embedding space.

Stage 3, however, produces a more noticeable performance increase. This observation suggests that curriculum-guided distillation and safe hard-negative mining contribute to a more structured embedding space. The model is gradually exposed to more difficult negative examples, allowing it to better distinguish between semantically similar but irrelevant NFRs. As a result, the learned representations become more discriminative.

Under the hybrid setting, both reranking strategies generally improve retrieval performance compared with dense-only retrieval, although the gains vary across stages. Cross-encoder rerankers directly model token-level interactions between FRs and candidate NFRs, allowing more precise relevance estimation than independent embedding-based similarity approaches.

However, the domain-specific reranker consistently reaches the strongest results. This suggests that explicitly training the reranker using FR–NFR hard negatives helps capture task-specific relevance patterns more effectively than a generic model. The overall effect becomes clearer when comparing the zero-shot hybrid baseline (42.36 Recall@10) with the final Stage 3 configuration, which achieves 70.79 Recall@10 and 51.23 NDCG@10. These improvements indicate that refined representations contribute directly to stronger ranking performance.

While direct comparison with previous work remains difficult because of differences in problem formulation, a contextual comparison can still be made. Existing approaches such as [4,5] primarily formulate FR–NFR association as a classification problem using clustering methods, keyword matching, or semantic similarity measures. These approaches assign NFR categories instead of retrieving complete NFR statements and therefore operate at a more coarse-grained level.

In contrast, the proposed Re-Distill framework formulates FR–NFR linking as a retrieval task, allowing complete NFR statements to be identified for unseen FR queries. Although this formulation introduces additional complexity, the framework achieves strong retrieval performance. Despite differences in evaluation settings, the comparison suggests that retrieval-based approaches may provide a more practical mechanism for modeling FR–NFR relationships.

However, analysis of the remaining false negatives reveals several limitations associated with semantic retrieval of FR–NFR relationships. In certain cases, different NFRs express related quality concerns using similar terminology or partially overlapping meanings, making strict separation between positive and negative instances difficult. This observation is consistent with prior studies indicating that natural language requirements frequently contain ambiguity and unclear boundaries [65]. Although the proposed Safe Mining Protocol reduces this issue by filtering candidates with very high teacher similarity scores, ambiguous or indirectly related samples may still remain within the hard-negative pool. This behavior reflects the context-dependent nature of software requirements and suggests that future work may benefit from soft-label supervision or uncertainty-aware retrieval approaches [5].

Cross-category confusion may also arise when different NFR categories share related terminology or similar quality concepts. For example, performance and usability requirements may both refer to responsiveness or perceived latency. Likewise, availability and security requirements can occasionally contain overlapping operational terminology. Such similarities make strict separation more difficult, particularly when requirements are expressed using abstract or context-dependent language. This reflects the inherent complexity of the requirements domain, where a single FR may influence multiple quality attributes simultaneously [8,66].

From a practical perspective, the proposed framework can support requirement engineers by automatically suggesting relevant quality attributes during the early stages of software design. Linking FRs with their corresponding NFRs can improve requirement completeness and reduce the likelihood of overlooking important quality constraints that are commonly missed in real-world SRS documents [7,8].

In addition, variation in the optimal fusion weight $\alpha$ across stages provides an interesting observation. The balance between bi-encoder and cross-encoder contributions appears to depend on representation quality. Earlier stages rely more heavily on reranking signals, whereas stronger representations, as observed in Stage 3, allow the dense bi-encoder to contribute more substantially.

This behavior suggests that improved representations may reduce reliance on computationally expensive reranking components, potentially leading to more efficient retrieval systems. Overall, the findings support the proposed Re-Distill framework as a progressive and domain-aware retrieval approach for FR–NFR matching.

6. Conclusions and Future Research Directions

This paper introduced Re-Distill, a curriculum-guided multi-stage retrieval framework for linking FRs and NFRs by formulating the task as a retrieval problem. The study also presented an expanded FR–NFR dataset, a progressive training strategy, and a data-centric distillation process supported by a cross-encoder teacher for safe hard-negative selection.

Experimental results showed improvements across the different training stages. The final framework reached a Recall@10 of 70.79% and an NDCG@10 of 51.23%, compared with a Recall@10 of 42.36% obtained by the zero-shot baseline. These results indicate that retrieval-based modeling can capture meaningful relationships between FRs and NFRs while improving the identification of relevant NFR statements for unseen FR queries.

Beyond retrieval accuracy, the framework provides a practical approach for identifying implicit quality requirements that may otherwise remain unspecified during early requirement analysis. Such capability may contribute to more complete and reliable requirement specifications in real-world software engineering environments.

Future work could extend the framework toward a unified pipeline that combines retrieval and NFR classification within a single architecture. Future studies may also investigate more advanced approaches for handling semantic overlap between NFR categories, potentially drawing inspiration from adaptive boundary-tracking and hybrid representation techniques explored in other computational domains [67].

Author Contributions

Conceptualization, A.A., R.A. and N.A.; methodology, A.A., R.A. and N.A.; software, A.A.; validation, A.A.; formal analysis, A.A.; investigation, A.A.; data curation, A.A.; writing—original draft preparation, A.A.; writing—review and editing, A.A., R.A. and N.A.; visualization, A.A.; supervision, R.A. and N.A.; project administration, A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The FR–NFR dataset and implementation source code used in this study are available from the corresponding author upon reasonable request. These resources are not publicly available at present because they are part of an ongoing postgraduate thesis project and are planned for public release after project completion.

Acknowledgments

The research was funded by KAU Endowment (WAQF) at King Abdulaziz University, Jeddah, Saudi Arabia. The authors, therefore, acknowledge with thanks WAQF and the Deanship of Scientific Research (DSR) for technical and financial support. Additionally, the authors acknowledge the use of AI-based language assistance tools to improve the linguistic quality of the manuscript; however, all scientific contributions and conclusions remain the sole responsibility of the authors.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A. Detailed Fixed-Depth Metrics

Table A1 reports additional fixed-depth metrics, including Precision@10 and F1@10, to provide complementary insight into retrieval precision and ranking noise beyond Recall@10 and NDCG@10.

Table A1. Detailed fixed-depth performance metrics on the test set comparing general and domain-specific rerankers.

Model Setup	Reranker Type	Recall@10	Precision@10	F1@10
Re-Distill	Domain-specific	70.79	19.16	27.30
Re-Distill	General-purpose	68.45	18.04	25.53

References

1. Sun, P. A Multi-Layered Desires Based Framework to Detect Evolving Non-Functional Requirements of Users. Ph.D. Thesis, Iowa State University, Ames, IA, USA, 2021.
2. Muqeem, M.; Beg, M.R. Validation of Requirement Elicitation Framework Using Finite State Machine. In Proceedings of the International Conference on Control, Instrumentation, Communication and Computational Technologies (ICCICCT); IEEE: New York, NY, USA, 2014; pp. 1210–1216.
3. Cysneiros, L.M. Evaluating the Effectiveness of Using Catalogues to Elicit Non-Functional Requirements. In Proceedings of the Workshop on Requirements Engineering (WER), Toronto, ON, Canada, 17–18 May 2007; pp. 107–115.
4. Mahmoud, A.; Williams, G. Detecting, Classifying, and Tracing Non-Functional Software Requirements. Requir. Eng. 2016, 21, 357–381.
5. Younas, M.; Jawawi, D.N.A.; Ghani, I.; Shah, M.A. Extraction of Non-Functional Requirement Using Semantic Similarity Distance. Neural Comput. Appl. 2020, 32, 7383–7397.
6. Ullah, S.; Naz, R.; Khan, M.A. A Survey on Issues in Non-Functional Requirements Elicitation. In Proceedings of the International Conference on Computer Networks and Information Technology (ICCNIT), Bangalore, India, 2–4 January 2011.
7. Alsaqaf, W.; Daneva, M.; Wieringa, R. Agile Quality Requirements Engineering Challenges: First Results from a Case Study. In Proceedings of the ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM); IEEE: New York, NY, USA, 2017; pp. 454–459.
8. Eckhardt, J.; Vogelsang, A.; Fernandez, D.M. Are “Non-Functional” Requirements Really Non-Functional? An Investigation of Non-Functional Requirements in Practice. In Proceedings of the International Conference on Software Engineering (ICSE); IEEE: New York, NY, USA; ACM: New York, NY, USA, 2016; pp. 832–842.
9. Hey, T.; Keim, J.; Koziolek, A.; Tichy, M. NoRBERT: Transfer Learning for Requirements Classification. In Proceedings of the IEEE International Requirements Engineering Conference (RE); IEEE: New York, NY, USA, 2020; pp. 169–179.
10. Mihany, F.A.; Galal-Edeen, G.H.; Hassanein, E.E.; Moussa, H. Data-Driven Requirements Elicitation from App Reviews Framework Based on BERT. Appl. Sci. 2025, 15, 9709.
11. Slankas, J.; Williams, L. Automated Extraction of Non-Functional Requirements in Available Documentation. In Proceedings of the International Workshop on Natural Language Analysis in Software Engineering (NaturaLiSE); IEEE: New York, NY, USA, 2013; pp. 9–16.
12. Kurtanovic, Z.; Maalej, W. Automatically Classifying Functional and Non-Functional Requirements Using Supervised Machine Learning. In Proceedings of the IEEE International Requirements Engineering Conference (RE); IEEE: New York, NY, USA, 2017; pp. 490–495.
13. ISO/IEC 25010:2011; Systems and Software Quality Models. International Organization for Standardization: Geneva, Switzerland, 2011.
14. Gotel, O.C.; Finkelstein, A.C. An analysis of the requirements traceability problem. In Proceedings of 1st International Conference on Requirements Engineering; IEEE: New York, NY, USA, 1994; pp. 94–101.
15. Cleland-Huang, J.; Gotel, O.; Zisman, A. Software and Systems Traceability; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012.
16. Jindal, R.; Malhotra, R.; Jain, A.; Bansal, A. Mining Non-Functional Requirements Using Machine Learning Techniques. Egypt. Inform. J. 2021, 15, 85–114.
17. Tarawneh, M.M. Software Requirements Classification Using Natural Language Processing and SVD. Int. J. Comput. Appl. 2017, 164, 7–12.
18. Alrehamy, H.H.; Walker, C. SemCluster: Unsupervised Automatic Keyphrase Extraction Using Affinity Propagation. In Proceedings of the Advances in Computational Intelligence Systems; Springer: Berlin/Heidelberg, Germany, 2018; pp. 222–235.
19. Hidayat, T.; Rochimah, S. NFR Classification Using Keyword Extraction and CNN on App Reviews. In Proceedings of the International Seminar on Research of Information Technology and Intelligent Systems (ISRITI); IEEE: New York, NY, USA, 2021; pp. 211–216.
20. Jp, S.; Menon, V.K.; Soman, K.; Ojha, A.K.R. A Non-Exclusive Multi-Class Convolutional Neural Network for Requirements Classification. IEEE Access 2022, 10, 117707–117714.
21. Rahman, K.; Ghani, A.; Misra, S.; Rahman, A.U. A Deep Learning Framework for Non-Functional Requirement Classification. Sci. Rep. 2024, 14, 3216.
22. Khan, M.A.; Khan, M.S.; Khan, I.; Ahmad, S.; Huda, S. Non-Functional Requirements Identification and Classification Using Transfer Learning. IEEE Access 2023, 11, 74997–75005.
23. Yucalar, F. Developing an Advanced Software Requirements Classification Model Using BERT. Appl. Sci. 2023, 13, 11127.
24. Kaur, K.; Kaur, P. MNoR-BERT: Multi-Label Classification of Non-Functional Requirements Using BERT. Neural Comput. Appl. 2023, 35, 22487–22509.
25. Luo, X.; Xue, Y.; Xing, Z.; Sun, J. PRCBERT: Prompt Learning for Requirement Classification Using BERT-Based Pretrained Language Models. In Proceedings of the IEEE/ACM International Conference on Automated Software Engineering (ASE); Association for Computing Machinery: New York, NY, USA, 2022.
26. Alhaizaey, A.; Al-Mashari, M. Automated Classification and Identification of Non-Functional Requirements Using Pre-Trained Language Models. IEEE Access 2025, 13, 87401–87417.
27. Alhoshan, W.; Ferrari, A.; Zhao, L. Zero-shot learning for requirements classification: An exploratory study. Inf. Softw. Technol. 2023, 159, 107202.
28. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. arXiv 2019, arXiv:1908.10084.
29. Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W.t. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP); Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 6769–6781.
30. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT: A Distilled Version of BERT. arXiv 2020, arXiv:1910.01108.
31. Thakur, N.; Reimers, N.; Daxenberger, J.; Gurevych, I. Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL); Association for Computational Linguistics: Stroudsburg, PA, USA, 2021.
32. Lu, Y.; Liu, Y.; Liu, J.; Shi, Y.; Huang, Z.; Sun, S.F.Y.; Tian, H.; Wu, H.; Wang, S.; Yin, D.; et al. ERNIE-Search: Bridging Cross-Encoder and Dual-Encoder for Information Retrieval. arXiv 2022, arXiv:2205.09153.
33. Xiong, L.; Xiong, C.; Li, Y.; Tang, K.-F.; Liu, J.; Bennett, P.; Ahmed, J.; Overwijk, A. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. arXiv 2020, arXiv:2007.00808.
34. Qu, Y.; Ding, Y.; Liu, J.; Liu, K.; Ren, R.; Zhao, W.X.; Dong, D.; Wu, H.; Wang, H. RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL); Association for Computational Linguistics: Stroudsburg, PA, USA, 2021.
Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar] [CrossRef]
Reimers, N.; Gurevych, I. Multilingual Sentence-BERT Using Knowledge Distillation. arXiv 2020, arXiv:2004.09813. [Google Scholar]
Bengio, Y.; Louradour, J.; Collobert, R.; Weston, J. Curriculum Learning. In Proceedings of the International Conference on Machine Learning (ICML); Association for Computing Machinery: New York, NY, USA, 2009; pp. 41–48. [Google Scholar]
Zhan, J.; Mao, J.; Liu, Y.; Zhang, M.; Ma, S. Optimizing Dense Retrieval with Hard Negatives. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR); Association for Computing Machinery: New York, NY, USA, 2021. [Google Scholar]
Thakur, N.; Reimers, N.; Daxenberger, J.; Gurevych, I. BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. arXiv 2021, arXiv:2104.08663. [Google Scholar] [CrossRef]
Luan, Y.; Eisenstein, J.; Toutanova, K.; Collins, M. Sparse, Dense, and Attentional Representations for Text Retrieval. Trans. Assoc. Comput. Linguist. (TACL) 2021, 9, 329–345. [Google Scholar] [CrossRef]
Ma, X.; Sun, K.; Pradeep, R.; Lin, J. A Replication Study of Dense Passage Retrieval. arXiv 2021, arXiv:2104.05740. [Google Scholar] [CrossRef]
Nogueira, R.; Cho, K. Passage Re-Ranking with BERT. arXiv 2020, arXiv:1901.04085. [Google Scholar] [CrossRef]
Khattab, O.; Zaharia, M. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR); Association for Computing Machinery: New York, NY, USA, 2020; pp. 39–48. [Google Scholar]
Feng, Z.; Guo, D.; Tang, D.; Duan, N.; Feng, X.; Gong, M.; Shou, L.; Qin, B.; Liu, T.; Jiang, D.; et al. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP); Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 1536–1547. [Google Scholar]
Lin, J.; Nogueira, R.; Yates, A. Pretrained Transformers for Text Ranking: BERT and Beyond. In Proceedings of the International Conference on Software Engineering (ICSE); Association for Computational Linguistics: Stroudsburg, PA, USA, 2021. [Google Scholar]
Lakshmi, S.; Thushara, M. Enhancing Software Requirements Retrieval Using Semantic Search. In Proceedings of the International Conference on Electronics, Computing and Communication Technologies (CONECCT), Bangalore, India, 10–13 July 2025. [Google Scholar]
Mitrevski, A. Software Requirements Classification Using Machine Learning Algorithms. GitHub Repository. 2019. Available online: https://github.com/AleksandarMitrevski/se-requirements-classification (accessed on 28 May 2026).
Ferrari, A.; Spagnolo, G.O.; Gnesi, S. PURE: A Dataset of Public Requirements Documents. In Proceedings of the IEEE International Requirements Engineering Conference (RE); IEEE: New York, NY, USA, 2017; pp. 502–505. [Google Scholar]
Li, B.; Nong, X. Automatically classifying non-functional requirements using deep neural networks. Pattern Recognit. 2022, 132, 108948. [Google Scholar] [CrossRef]
Foppiano, L.; Dieb, S.; Suzuki, A.; Baptista de Castro, P.; Iwasaki, S.; Uzuki, A.; Esparza Echevarria, M.G.; Meng, Y.; Terashima, K.; Romary, L.; et al. SuperMat: Construction of a linked annotated dataset from superconductors-related publications. Sci. Technol. Adv. Mater. Methods 2021, 1, 34–44. [Google Scholar] [CrossRef]
Miller, G.A. WordNet: A Lexical Database for English. Commun. ACM 1995, 38, 39–41. [Google Scholar] [CrossRef]
Manning, C.D.; Raghavan, P.; Schutze, H. Introduction to Information Retrieval; Cambridge University Press: Cambridge, UK, 2008. [Google Scholar]
Kapoor, S.; Narayanan, A. Leakage and the Reproducibility Crisis in Machine Learning-Based Science. Patterns 2023, 4, 100804. [Google Scholar] [CrossRef] [PubMed]
OpenAI. ChatGPT: Optimizing Language Models for Dialogue. 2022. Available online: https://openai.com/blog/chatgpt (accessed on 28 May 2026).
Yoo, K.M.; Park, D.; Kang, J.; Lee, S.W.; Park, W. GPT3Mix: Leveraging Large-Scale Language Models for Text Augmentation. In Proceedings of the Findings of the Association for Computational Linguistics (EMNLP); Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 2225–2239. [Google Scholar]
Anaby-Tavor, A.; Carmeli, B.; Goldbraich, E.; Kantor, A.; Kour, G.; Shlomov, S.; Tepper, N.; Zwerdling, N. Do Not Have Enough Data? Deep Learning to the Rescue! In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 7383–7390. [Google Scholar]
Wang, W.; Wei, F.; Dong, L.; Bao, H.; Yang, N.; Zhou, M. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Online, 6–12 December 2020; Volume 33, pp. 5776–5788. [Google Scholar]
Song, K.; Tan, X.; Qin, T.; Lu, J.; Liu, T.Y. MPNet: Masked and Permuted Pre-Training for Language Understanding. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Online, 6–12 December 2020; Volume 33, pp. 16857–16867. [Google Scholar]
Jawahar, G.; Sagot, B.; Seddah, D. What Does BERT Learn about the Structure of Language? In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, 28 July–2 August 2019; pp. 3651–3657. [Google Scholar]
Johnson, J.; Douze, M.; Jegou, H. Faiss: A Library for Efficient Similarity Search and Clustering of Dense Vectors. IEEE Trans. Big Data 2021, 7, 535–547. [Google Scholar] [CrossRef]
Robertson, S.; Zaragoza, H. The Probabilistic Relevance Framework: BM25 and Beyond. Found. Trends Inf. Retr. 2009, 3, 333–389. [Google Scholar] [CrossRef]
Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
Kusupati, A.; Bhatt, G.; Rebuffi, S.A.; Pugh, M.; Khorrami, V.; Wallach, H.; Farhadi, A.; Kakade, S. Matryoshka Representation Learning. In Proceedings of the Advances in Neural Information Processing System (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022; Volume 35, pp. 30233–30249. [Google Scholar]
Borg, M.; Runeson, P.; Ardö, A. Recovering from a Decade: A Systematic Mapping of Information Retrieval Approaches to Software Traceability. Empir. Softw. Eng. 2014, 19, 1565–1616. [Google Scholar] [CrossRef]
Zhao, L.; Alhoshan, W.; Ferrari, A.; Letsholo, K.J.; Ajagbe, M.A.; Chioasca, E.V.; Batista-Navarro, R.T. Natural language processing for requirements engineering: The state of the art. ACM Comput. Surv. (CSUR) 2021, 54, 55. [Google Scholar]
Li, Y.; Zhao, L.; Liang, P. Towards Automated Requirements Traceability: A Survey. In Proceedings of the 2021 IEEE/ACM 43rd International Conference on Software Engineering: Companion Proceedings (ICSE-Companion), Virtual, 25–28 May 2021; pp. 139–141. [Google Scholar]
Xu, Y.; Yang, G.; Xing, Y.; Hu, D. Interface tracking simulation of multifluid flow by ISPH-FVM coupling method. Exp. Comput. Multiph. Flow 2025, 7, 465–489. [Google Scholar] [CrossRef]

Figure 1. Comparison of the original and expanded FR–NFR datasets in terms of the number of FRs and NFRs.

Figure 2. Distribution of NFR categories in the expanded dataset.

Figure 3. Overview of the proposed Re-Distill framework and its four-stage retrieval pipeline for linking FRs and NFRs.

Figure 4. Workflow of Stage 3: curriculum-guided contrastive distillation with teacher-guided hard-negative mining and iterative refinement.

Figure 5. Workflow of Stage 4: hybrid retrieval and model selection. Dense retrieval (FAISS) and lexical retrieval (BM25) are combined to form a candidate pool, which is reranked using general-purpose and domain-specific cross-encoders before score fusion generates the final ranked NFR candidates.
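The candidate pooling and score fusion summarized in the Figure 5 caption can be sketched as follows. This section does not state the exact fusion rule, so the sketch assumes a convex combination in which $\alpha$ weights the min-max-normalized reranker score against the normalized dense score. The function names `min_max` and `fuse` and the toy scores are hypothetical and are not taken from the authors' implementation.

```python
from typing import Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant score set maps to all zeros."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {k: (v - lo) / span if span else 0.0 for k, v in scores.items()}


def fuse(
    dense: Dict[str, float],
    lexical: Dict[str, float],
    rerank: Dict[str, float],
    alpha: float,
    top_k: int = 10,
) -> List[str]:
    """Rank the union of dense and lexical candidates by a fused score."""
    # BM25 contributes candidates only; its raw scores are not fused here.
    pool = set(dense) | set(lexical)
    dense_n = min_max(dense)
    rerank_n = min_max({c: rerank[c] for c in pool})
    fused = {
        # Assumption: candidates found only by BM25 get a dense score of 0.
        c: alpha * rerank_n[c] + (1 - alpha) * dense_n.get(c, 0.0)
        for c in pool
    }
    # Sort by descending fused score, breaking ties by candidate ID.
    return sorted(fused, key=lambda c: (-fused[c], c))[:top_k]


# Toy scores for four candidate NFRs (illustrative values only).
dense = {"a": 0.9, "b": 0.5, "c": 0.1}
lexical = {"c": 7.0, "d": 3.0}
rerank = {"a": 0.2, "b": 0.4, "c": 0.9, "d": 0.6}
ranking = fuse(dense, lexical, rerank, alpha=0.7)
```

Under this assumed rule, $\alpha = 0$ reduces to the dense ranking and $\alpha = 1$ ranks purely by the cross-encoder score.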

Table 1. Example of the preprocessed dataset, showing lowercased FR–NFR pairs with multiple NFRs stored in a single field and separated by semicolons.

Project	Preprocessed FR	Preprocessed NFRs (Semicolon-Separated)
project 4	the dispute system will facilitate direct data entry of a dispute case via a user interface that supports real-time responses to the users.	100% of cardmember services representatives shall be able to successfully create a dispute case on the first encounter after completing the training course; the maximum wait time for a user navigating from one screen to another within the disputes application shall be no more than 5 s; 100% of the cardmember and merchant services representatives shall use the disputes application regularly after a 2-day training course.
project 5	the user shall search for the preferred repair facility using vehicle vehicle location and radius in miles	the search for the preferred repair facility shall take no longer than 8 s. the preferred repair facility is returned within 8 s; the user shall easily locate instructions while using the product. user help can be found within 90% of the system; users shall feel satisfied using the product. 85% of all users will be satisfied with the product
project 11	users shall be required to log in to the cafeteria ordering system for all operations except viewing a menu.	responses to queries shall take no longer than 7 s to load onto the screen after the user submits the query; all network transactions that involve financial information or personally identifiable information shall be encrypted; all web pages generated by the system shall be fully downloadable in no more than 10 s over a 40 kbps modem connection; only users who have been authorized for home access to the corporate intranet may use the cos from non-company locations; patrons shall log in according to the restricted computer system access policy
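The semicolon-separated layout shown in Table 1 can be unpacked into one FR–NFR pair per NFR before training. The following is a minimal sketch under that assumption; the function name `expand_pairs` is hypothetical, and the sample strings are abbreviated from the project 11 row for illustration only.

```python
from typing import List, Tuple


def expand_pairs(fr: str, nfr_field: str) -> List[Tuple[str, str]]:
    """Split a semicolon-separated NFR field into individual (FR, NFR) pairs."""
    nfrs = [part.strip() for part in nfr_field.split(";")]
    # Drop empty fragments produced by trailing or doubled semicolons.
    return [(fr.strip(), nfr) for nfr in nfrs if nfr]


# Abbreviated strings modeled on the project 11 row (illustrative only).
fr = "users shall be required to log in to the cafeteria ordering system"
nfr_field = (
    "responses to queries shall take no longer than 7 s; "
    "all network transactions shall be encrypted; "
)
pairs = expand_pairs(fr, nfr_field)
```

Each resulting tuple pairs the same FR with a single NFR, which is the shape that pair-based contrastive losses typically expect.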

Table 2. An illustrative example of an FR mapped to an original NFR and its augmented variants.

Functional Requirement (FR)	Non-Functional Requirement (NFR Candidates)
The disputes system will facilitate direct data entry via a user interface that supports real-time interaction.	Original: The maximum wait time for a user navigating between screens shall be no more than 5 s.
	Aug 1: Users should not experience a delay of more than 5 s when moving between screens.
	Aug 2: The system must ensure that screen transitions do not exceed a delay of 5 s.
	Aug 3: Users must not encounter response delays longer than 5 s during navigation.

Table 3. Key training settings for Stage 3 (Curriculum-Guided Distillation).

Component	Configuration
Teacher Model	`cross-encoder/ms-marco-MiniLM-L-12-v2`
Candidate Retrieval	FAISS dense retrieval (Top-100 per FR)
False-negative Filtering Threshold	0.95 teacher score
Curriculum Strategy	Easy negatives (bottom 30%) → Hard negatives (top 30%)
Stage 3 Training Schedule	1 epoch (Easy) + 2 epochs (Hard)
Stage 3 Losses	MultipleNegativesRankingLoss + Triplet Loss + MatryoshkaLoss
Stage 3 Learning Rates	$1 \times 10^{-5}$ and $7 \times 10^{-6}$ explored
Weight Decay	0.01
Triplet Margins	0.20, 0.25, 0.30, 0.35 explored
Pair Batch Size/Triplet Batch Size	90/24
Warmup Ratio	0.10
Matryoshka Dimensions	[768, 512, 256, 128, 64]
Iterative Refinement	2 continuous epochs using hard negatives only
Refinement Learning Rates	$3 \times 10^{-6}$ and $5 \times 10^{-6}$ explored
Refinement Triplet Margins	0.50 and 0.55 explored
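The false-negative filter and the easy-to-hard split listed in Table 3 can be illustrated with a short sketch. It assumes that teacher scores are normalized to [0, 1], that candidates at or above the 0.95 threshold are discarded as likely false negatives, and that the bottom and top 30% are taken over the remaining candidates ordered by teacher score. The table does not specify these details, and the function name `split_negatives` is hypothetical.

```python
from typing import Dict, List, Set, Tuple


def split_negatives(
    teacher_scores: Dict[str, float],
    positives: Set[str],
    fn_threshold: float = 0.95,
    fraction: float = 0.30,
) -> Tuple[List[str], List[str]]:
    """Return (easy, hard) negative IDs for one FR query."""
    # Remove labeled positives and likely false negatives (high teacher score).
    pool = [
        (nfr_id, score)
        for nfr_id, score in teacher_scores.items()
        if nfr_id not in positives and score < fn_threshold
    ]
    # Sort ascending so the least similar candidates come first.
    pool.sort(key=lambda item: item[1])
    k = int(len(pool) * fraction)
    easy = [nfr_id for nfr_id, _ in pool[:k]]
    hard = [nfr_id for nfr_id, _ in pool[len(pool) - k:]]
    return easy, hard


# Toy teacher scores for candidates retrieved for a single FR.
scores = {f"nfr{i}": i / 10 for i in range(10)}  # 0.0 .. 0.9
scores["nfr_dup"] = 0.97  # filtered as a likely false negative
easy, hard = split_negatives(scores, positives={"nfr9"})
```

In a curriculum schedule such as the one in Table 3, the `easy` list would feed the first epoch and the `hard` list the later epochs.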

Table 4. Hybrid retrieval performance across different transformer backbones using general and domain-specific rerankers. Results are reported on the test set using Recall@10 and NDCG@10.

Backbone	Reranker	$\alpha$	Recall@10	NDCG@10
MiniLM	General	0.1	60.93	39.13
MiniLM	Domain	1.0	65.17	41.89
MPNet	General	0.7	62.17	41.39
MPNet	Domain	0.7	70.79	51.23
Multi-QA MPNet	General	0.1	55.44	38.39
Multi-QA MPNet	Domain	0.7	61.19	45.21

Table 5. Stage-wise retrieval performance compared to a lexical baseline. Results are reported on the test set using Recall@10 and NDCG@10.

Model	Recall@10	NDCG@10
BM25 (Lexical Baseline)	25.58	16.05
Zero-shot (MPNet Dense Baseline)	38.25	24.94
Stage 1	61.71	38.34
Stage 2	62.04	39.25
Stage 3	64.13	44.91
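For readers who want to reproduce the metric columns, the sketch below computes Recall@10 and NDCG@10 for a single query. It assumes the standard binary-relevance definitions, since the paper's own definitions appear in Section 3.5 rather than here. The function names and the toy ranking are illustrative.

```python
import math
from typing import List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Toy example: two of the three relevant NFRs are retrieved, at ranks 1 and 3.
ranked = ["n3", "n7", "n1", "n9"]
relevant = {"n1", "n3", "n5"}
r = recall_at_k(ranked, relevant)
n = ndcg_at_k(ranked, relevant)
```

Values like those in the tables would then be obtained by averaging the per-query scores over the test set and expressing the result as a percentage.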

Table 6. Hybrid retrieval performance using a general-purpose cross-encoder reranker across different training stages. Results are reported on the test set using Recall@10 and NDCG@10.

Model	$\alpha$	Recall@10	NDCG@10
Zero-shot	0.1	39.65	26.53
Stage 1	0.9	60.64	37.48
Stage 2	0.9	61.24	38.44
Stage 3	0.7	65.13	45.58

Table 7. Hybrid retrieval performance using a domain-specific cross-encoder reranker trained on FR–NFR hard negatives. Results are reported on the test set using Recall@10 and NDCG@10.

Model	$\alpha$	Recall@10	NDCG@10
Zero-shot	0.7	42.36	26.69
Stage 1	0.9	62.24	39.30
Stage 2	0.5	63.91	41.45
Stage 3	0.7	70.79	51.23

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Almohammady, A.; Alnanih, R.; Alowidi, N. Re-Distill: A Multi-Stage Retrieval Framework for Functional–Non-Functional Requirement Linking in Software Engineering. Appl. Sci. 2026, 16, 5482. https://doi.org/10.3390/app16115482

