A Cross-Level Requirement Trace Link Update Model Based on Bidirectional Encoder Representations from Transformers

Abstract: Cross-level requirement trace links (i.e., links between high-level requirements (HLRs) and low-level requirements (LLRs)) record the top-down decomposition process of requirements and support various development and management activities (e.g., requirement validation). Undoubtedly, updating trace links synchronously with requirement changes is critical for their constant availability. However, large-scale open-source software that is rapidly iterative and continually released has numerous requirements that are dynamic. These requirements render timely update of trace links challenging. To address these problems, in this study, a novel deep-learning-based method, deep requirement trace analyzer fusing heterogeneous features (DRAFT), was proposed for updating trace links between various levels of requirements. Considering both the semantic information of requirement text descriptions and the process features based on metadata, trace link data accumulated in the early stage are comprehensively used to train the trace link identification model. Particularly, first, we performed second-phase pre-training for the bidirectional encoder representations from transformers (BERT) language model based on the project document corpus to realize project-related knowledge transfer, which yields superior text embedding. Second, we designed 11 heuristic features based on the requirement metadata in the open-source system. Based on these features and semantic similarity between HLRs and LLRs, we designed a cross-level requirement tracing model for new requirements. The superiority of DRAFT was verified based on the requirement datasets of eight open-source projects. The average F1 and F2 scores of DRAFT were 69.3% and 76.9%, respectively, which were 16.5% and 22.3% higher than baselines. An ablation experiment proved the positive role of two key steps in trace link construction.


Introduction
In complex software systems, requirements are decomposed layer by layer from top to bottom [1]. In this process, ensuring that each requirement at the high abstraction level is refined into a requirement at a lower level is critical. Each low-level requirement (LLR) should trace up to a specific high-level requirement (HLR); otherwise, subsequent design and implementation cannot satisfy system objectives or may exceed the system scope (over-standard) [2]. Many standards and norms such as DO-178C [3], IEEE Std. 830 [4], and CMMI [5] have emphasized the importance of requirement traceability to software development. In particular, DO-178C clearly stipulates the necessity to ensure that LLRs can satisfy HLRs and that each HLR is developed into LLRs subsequently. In an open-source system, developers cooperate across regions, and personnel mobility is strong [6]. Creating cross-level requirement trace links helps participants to swiftly understand the origin and development of requirements. Therefore, creating trace links between the requirements of different levels to support activities, such as requirement verification, validation, and change management, is crucial for ensuring that system development is correct.
With the constant gathering of requirements during system evolution, the requirement set continues to expand, and the integrity of trace links created in the early stage degrades. Untimely update of trace links weakens their support for other activities and causes more errors [7]. However, manually updating trace links requires considerable manpower and material resources, and this problem is especially obvious in open-source systems. Linus Torvalds (https://en.wikipedia.org/wiki/Linus_Torvalds, accessed on 25 January 2023), the father of Linux, proposed the principle of "Release early. Release often. And listen to your customers" for open-source software. This principle is characterized by short-cycle iterative development, rapid release, and continuous gathering of user requirements. In the constant iteration and release process, following the initial version release of a software project, new requirements from various origins such as new features, user feedback, and technical updates are frequently raised, and the number of trace links also increases rapidly [8]. In an open-source system featuring a short cycle, fast iterations, and high-frequency addition of new requirements, the cost of updating trace links during the evolution process is extremely high and may even exceed the cost of creating trace links at the initial stage of the project [9].
Although both academia and industry have recognized the importance of automated update and maintenance of trace links [10,11], few related studies have focused on it. Mader et al. [12] proposed to maintain the trace links between UML artifacts of different development activities (e.g., requirements and analysis) by capturing the relevant change events. Furthermore, existing studies on trace link maintenance focus on updating the trace links between requirements and code [9,13] and between requirements and Unified Modeling Language (UML) models [12]. However, limited studies have focused on different levels of requirements.
Most traceability-related studies have focused on automatic identification of requirement trace links. Existing methods typically use text semantic analysis of the requirement artifacts to automatically create trace links. Mainstream methods include information retrieval-based methods (e.g., vector space model (VSM) [14], latent semantic indexing (LSI) [15], VSM-Part-of-Speech (POS) [16], VSM-Thesaurus [17], and relevance feedback (RF) [18,19]), machine learning-based methods (e.g., methods mentioned in Refs. [20][21][22][23]), and deep-learning-based methods [24][25][26] (e.g., TraceNN [25] and TraceBERT [26]). Although trace link creation technologies can support the updating of trace links, they consider only the textual distance between requirements, and in many cases, the process-related information (e.g., writers and assigners) of requirement creation is ignored. Furthermore, the trace links of historical requirements are not comprehensively utilized. These two types of information are crucial for automatically creating trace links for new requirements.
As displayed in Figure 1, although the text descriptions (i.e., description and summary) of the high-level requirement JBIDE-26652 (https://issues.redhat.com/browse/JBIDE-26652, accessed on 25 January 2023) and the requirement JBIDE-26790 (https://issues.redhat.com/browse/JBIDE-26790, accessed on 25 January 2023) exhibit a low similarity, the two requirements highly overlap in terms of process data (e.g., assignee, creator, and components). A decomposition relationship, that is, a cross-level trace link, exists between these two requirements because the same author is very likely to decompose HLRs into LLRs after creating them. Figure 2 shows that historical requirements also help trace link identification. For the pair of JBIDE-27384 (https://issues.redhat.com/browse/JBIDE-27384, accessed on 25 January 2023) and JBIDE-27673 (https://issues.redhat.com/browse/JBIDE-27673, accessed on 25 January 2023), the similarity between them is insufficient in both textual and process information. However, analysis revealed that JBIDE-27672 (https://issues.redhat.com/browse/JBIDE-27672, accessed on 25 January 2023), an LLR already traced to JBIDE-27384, has a high similarity with JBIDE-27673 in terms of textual and process information. Therefore, a trace link is very likely to exist between JBIDE-27673 and JBIDE-27384. Inspired by these two points, this study proposed a novel deep-learning-based method, deep requirement trace analyzer fusing heterogeneous features (DRAFT), for automatically updating trace links.
This method can learn the trace link identification model from historical data, automatically recommend candidate trace links for new requirements to analysts, and assist analysts in updating cross-level requirement trace links during requirement evolution. In DRAFT, the joint feature representation (i.e., text features and process features) of requirements is established from the perspectives of natural language description and process information. Based on the BERT model [27], DRAFT also integrates the direct features extracted from the pairs of candidate requirements, as well as the extended features obtained by retrieving the historical trace links, to automatically develop trace links between cross-level requirements. In terms of capturing text semantic similarity, considering the semantic differences of terms in various contexts, we proposed to perform second-phase pre-training for the BERT language model to ensure that the encoding of the requirement text is highly suitable for the project context. In terms of extracting process features, DRAFT introduces 11 heuristic features based on metadata and utilizes historical trace link data when extracting features. We collected requirement trace links from eight open-source projects to construct datasets and conducted experimental evaluations for DRAFT. The evaluation results revealed that DRAFT outperformed the existing baseline methods in identifying trace links.
The contributions of this paper are as follows:

1. A pre-trained model-based approach, DRAFT, is proposed for updating cross-level requirement trace links. Compared with existing studies, we extended the features along two dimensions. In terms of feature types, process features are added in addition to text features. In terms of requirement types, in addition to directly analyzing the candidate requirement pairs (i.e., direct features), the requirements related to the candidate requirements (i.e., extended features) are also analyzed.
2. An experimental evaluation is performed for eight open-source projects in different domains and scales. The results revealed that the performance of DRAFT is considerably better than that of the baseline methods such as VSM, relevance feedback, and TraceBERT. DRAFT achieved average F1 and F2 scores of 69.3% and 76.9%, which are up to 16.5% and 22.3% higher than those of the baselines, respectively.

The datasets and DRAFT-related code are made available online (https://gitee.com/ttstr/DRAFT, accessed on 25 January 2023).
The remainder of this paper is organized as follows: Section 2 describes the research background and defines the problems of requirement trace links update; Section 3 summarizes the relevant research status in detail; Section 4 introduces the overall framework of DRAFT; Sections 5-7 introduce the three core steps of DRAFT, i.e., project-specific pretraining, heuristic feature extraction, and the deep neural network architecture of the trace link identification model; Section 8 details an experimental evaluation and comparative analysis of the proposed method and the baseline methods; Section 9 discusses the validity threats and limitations of this study. Finally, a conclusion is given.

Research Background: Cross-Level Requirements Traceability
To adapt to the loosely coupled and cross-regional cooperative development pattern, a lightweight, informal just-in-time requirement engineering [28] is adopted in open-source systems, and an issue log management system is used to record and manage requirements. Requirement development in an open-source system is a top-down decomposition process. At the beginning of a project, analysts define the HLRs, known as "epics", that describe the long-term goals of the project and decompose these into requirements such as "features" (or "feature requests"), "enhancements" (or "improvements"), and "tasks". Tasks are broken down into finer-grained sub-tasks. Some issue log types are predefined in the issue tracking system to record requirements [29]. The requirement issue log types in JIRA, the most widely used issue management tool (Issue management tools-popularity ranking (2017). https://project-management.zone/ranking/category/issue, accessed on 25 January 2023), are displayed in Figure 3. Requirements have three abstraction levels (from high to low): parent (including epics), standard (including features, enhancements, and tasks), and child (including sub-tasks).
Cross-level requirement traceability means trace links between requirements in different abstraction levels. Given that traceability is primarily used to record the decomposition relationship between requirements, we focus on the trace links between the requirements of adjacent levels, i.e., parent-standard and standard-child in JIRA.
In an open-source system, the raw data of requirement issue logs contain rich textual information and process information [30]. The textual information of the issue log is contained in two fields: summary (a concise summary of the issue) and description (a detailed description of the issue) [31]. When manually constructing a requirement trace link, an analyst should read and understand the text descriptions, such as the summary and description, of the requirement. Next, the analyst analyzes the semantic association between artifacts, which is the most intuitive basis for identifying trace links between requirements. Therefore, requirement traceability can be treated as a sentence-pair classification problem in natural language understanding: the semantic similarity of a pair of requirements is measured with technologies such as information retrieval and deep learning language models to determine whether a trace link exists between the artifacts.
Additionally, the metadata of issue logs record important process information, for example, people-related fields such as creator and assignee, as well as creation time, components, and labels. This type of information is a potential basis for analyzing requirement trace links. For example, requirements raised by the same developer/user are likely to correspond to the same or related functions. Thus, trace links are more likely to exist between requirements raised by the same developer/user.
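As a minimal sketch of how such process information could be turned into numeric signals for a requirement pair, the following Python snippet computes a few overlap indicators. The `Issue` class, its field names, and the four features are illustrative assumptions and are not the 11 heuristic features defined by DRAFT.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    """Minimal stand-in for an issue log; field names are illustrative."""
    key: str
    creator: str
    assignee: str
    components: set = field(default_factory=set)
    labels: set = field(default_factory=set)


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def process_features(hlr, llr):
    """Return a few hypothetical process-overlap features for one HLR/LLR pair."""
    return {
        "same_creator": float(hlr.creator == llr.creator),
        "same_assignee": float(hlr.assignee == llr.assignee),
        "component_overlap": jaccard(hlr.components, llr.components),
        "label_overlap": jaccard(hlr.labels, llr.labels),
    }


# Hypothetical issues, not taken from any real project.
epic = Issue("EPIC-1", "alice", "bob", {"server", "ui"}, {"openshift"})
task = Issue("TASK-7", "alice", "bob", {"server"}, set())
features = process_features(epic, task)
print(features)
```

Binary indicators capture the "same developer/user" intuition directly, while set overlaps give a graded signal for multi-valued fields such as components and labels.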
Therefore, combining text semantic similarity with the analysis of process features between cross-level requirements enables accurate and complete identification of cross-level requirement trace links.

Problem Definition: Update of Requirement Trace Links
Because of feedback from end users, project technology update, and constant proposal of new requirements during the version evolution of software systems, analysts are required to update the trace links between requirements to ensure their continuous availability. Figure 4 describes a trace link update task. Let a historical project requirement set $R_{hist} = \{HLRS_{hist}, LLRS_{hist}\}$, where $HLRS_{hist}$ is the HLR set and $LLRS_{hist}$ is the LLR set. $TLS_{hist}$ is a set of relationships between $HLRS_{hist}$ and $LLRS_{hist}$. Each element in $TLS_{hist}$ is a two-tuple $(hlr, llr)$, where $hlr \in HLRS_{hist}$ is the source requirement (HLR entry), and $llr \in LLRS_{hist}$ is the target requirement (LLR entry). When new requirements $R_{new}$ ($R_{new} = HLRS_{new} \cup LLRS_{new}$) are obtained, the existing trace link set $TLS_{hist}$ needs to be updated. This task is to construct the trace links between requirements of different levels in $R_{new}$ and $R_{hist}$.
The following three cases may exist when a trace link is created for new requirements: between a new HLR and a historical LLR, between a historical HLR and a new LLR, and between a new HLR and a new LLR. The trace link sets defined by these three cases are $Trace_{\langle HLRS_{new}, LLRS_{hist} \rangle}$, $Trace_{\langle HLRS_{hist}, LLRS_{new} \rangle}$, and $Trace_{\langle HLRS_{new}, LLRS_{new} \rangle}$. The set of trace links created for new requirements to be solved in this task, that is, $Trace_{new}$, is the union of the three sets. Thus, $Trace_{new} = Trace_{\langle HLRS_{new}, LLRS_{hist} \rangle} \cup Trace_{\langle HLRS_{hist}, LLRS_{new} \rangle} \cup Trace_{\langle HLRS_{new}, LLRS_{new} \rangle}$.
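The three cases can be illustrated by enumerating the candidate pairs that a trace link identification model would have to classify. The Python sketch below is only an illustration under simplifying assumptions: the function name and the requirement identifiers are hypothetical, and the restriction to adjacent abstraction levels is ignored.

```python
from itertools import product


def candidate_pairs(hlrs_hist, llrs_hist, hlrs_new, llrs_new):
    """Enumerate (hlr, llr) candidates that involve at least one new requirement."""
    return {
        "new_hlr_hist_llr": list(product(hlrs_new, llrs_hist)),
        "hist_hlr_new_llr": list(product(hlrs_hist, llrs_new)),
        "new_hlr_new_llr": list(product(hlrs_new, llrs_new)),
    }


# Hypothetical identifiers: two historical HLRs, one historical LLR,
# one new HLR, and two new LLRs.
cases = candidate_pairs(["H1", "H2"], ["L1"], ["H3"], ["L2", "L3"])
total = sum(len(pairs) for pairs in cases.values())
print(total)
```

Pairs consisting only of historical requirements are deliberately excluded, because their links are already recorded in the historical trace link set.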
Note that our proposed method can support the addition, deletion, and modification scenarios of requirement evolution and is not limited to establishing trace links for newly added requirements. For requirements to be deleted, all related trace links should be deleted, which is not difficult technically. The modification of a requirement is equivalent to the deletion of the old requirement and the addition of a new requirement (the requirement after modification).
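The deletion and modification scenarios can be sketched on a trace link set stored as (HLR, LLR) tuples. The function names and the `recommend` callable are hypothetical placeholders; `recommend` stands in for whatever model proposes links for the modified requirement.

```python
def delete_requirement(links, req_id):
    """Drop every (hlr, llr) link that touches the deleted requirement."""
    return {(h, l) for (h, l) in links if req_id not in (h, l)}


def modify_requirement(links, req_id, recommend):
    """Treat a modification as deletion of the old version plus re-linking.

    `recommend` is a hypothetical callable that returns the links predicted
    for the modified requirement.
    """
    return delete_requirement(links, req_id) | set(recommend(req_id))


# Hypothetical historical trace link set.
links = {("H1", "L1"), ("H1", "L2"), ("H2", "L3")}
after_delete = delete_requirement(links, "H1")
after_modify = modify_requirement(links, "L3", lambda r: [("H1", r)])
print(after_delete, after_modify)
```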

Related Work
The Trace Link Evolver proposed by Rahimi and Cleland-Huang in 2019 [9] can automatically update the trace links between requirements and code during the system iteration process. They first analyzed 24 common scenarios of code change and defined trace link evolution rules for each change scenario to update the trace links. In 2012, Mader et al. [12] proposed a semi-automatic approach for maintaining the trace links between requirements and design models expressed in UML, which can update the trace links with the progress of development activities. Under this framework, a specific UML modeling tool was used to capture the flow of change events caused by various development activities, and heuristic rules were predefined for development activities, leading to automatic updating of trace links.
In addition, few studies have been conducted on trace link update for cross-levels of requirements. Most studies have focused on the updating of trace links between requirements and codes and between requirements and UML models.
However, in the requirement tracing domain, numerous studies have focused on automatically identifying and creating trace links. Such methods can ensure the continuous availability of trace links by regularly recreating trace links during project version iterations. Rath et al. [23] proposed a machine-learning-based method for identifying trace links between requirement problems and submission records of open-source systems. This method not only calculates the text similarity between artifacts based on the VSM [14] model but also considers process-related attributes such as stakeholder information and timing relationships. Weka's J48 decision tree [32] training model was applied to verify the identification effect when using various feature sets. Their experimental results revealed that the best results can be achieved when both similarity and process features are used. We incorporated their ideas and analyzed the heuristic features related to the process while considering the text description content of requirement entries as the main identification basis.
The semantic similarity of the text description is the most intuitive basis for creating trace links between requirements. Most studies have only relied on the text descriptions of requirements to identify trace links. Information retrieval techniques and learning-based methods have been adopted for identifying requirement trace links based on the text features of requirements. Early methods for automated requirement tracing include the classic VSM [14] and LSI [15], which determine the trace link by capturing the same words used in the text descriptions of source and target artifacts and calculating the similarity between the two text vectors. However, in practice, artifacts are written by various people or organizations, and different words may have been used to describe the same concept. Thus, a term mismatch problem may occur, which is the primary concern of this type of method [33]. To address this problem, semantic information has been added to improve identification. For example, a thesaurus [17] and domain knowledge [34] can help to capture the semantic association between different words based on vocabulary support. The relevance feedback technique [19] can improve the query statement and expand the scope of semantic retrieval based on user feedback on information retrieval results, which increases accuracy. With the development of artificial intelligence, methods based on machine learning and deep learning have gradually received considerable research attention. Learning-based methods typically regard the identification of requirement trace links as a binary classification problem. In methods based on machine learning, semantic features are first extracted from the requirement descriptions, and models such as Naive Bayes [35] and random forest [36] are then used to predict the trace links between requirement pairs.
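The term mismatch problem can be made concrete with a small vector space model sketch. The snippet below builds TF-IDF vectors with a smoothed IDF and compares them by cosine similarity; the whitespace tokenizer, the weighting variant, and the three requirement texts are assumptions chosen for illustration, not the configuration used in the cited studies.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; a deliberate simplification."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) with a smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


# Hypothetical requirement texts.
docs = [
    "support login with oauth token",        # HLR
    "add oauth token refresh for login",     # LLR that shares vocabulary
    "enable sign-in via access credential",  # LLR that uses different words
]
hlr, llr_shared, llr_mismatch = tfidf_vectors(docs)
sim_shared = cosine(hlr, llr_shared)
sim_mismatch = cosine(hlr, llr_mismatch)
print(sim_shared, sim_mismatch)
```

The third text describes a related concept but shares no term with the HLR, so its similarity is zero; this is the gap that thesaurus support, relevance feedback, and learned embeddings try to close.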
Mathematics 2023, 11, 623 7 of 24
Methods based on deep learning can automatically embed text features through a deep neural network, which reduces the dependence on manual selection in the feature representation stage in case of machine learning methods. The recurrent neural network (RNN) is widely used in natural language processing. The method can embed contextual semantic information into word vectors. The RNN has a stronger ability to represent text than the information retrieval and machine learning methods [37]. Guo et al. proposed the TraceNN method [25] to trace requirements to design documents. In TraceNN, first, the features of text sequences are embedded based on the RNN, and a multilayer perceptron (MLP) is then used to complete the classification of trace links. They evaluated two types of RNN models, namely long short-term memory (LSTM) and gated recurrent unit (GRU), on large-scale industrial datasets. Their results showed that GRU delivered better results in terms of mean average precision. However, in RNNs, usually only one side of the context information is encoded because of its unidirectional structure. With the increase in the sequence length, the embedded context information gradually weakens [37]. In 2018, Devlin et al. proposed BERT [27] based on the transformers [38], which solved this problem satisfactorily. This method achieved state-of-the-art results in a series of natural language processing tasks. BERT consists of two stages, namely pre-training and fine-tuning on downstream tasks. In the pre-training stage, unsupervised task training is performed on a large corpus. The semantic knowledge in the corpus is encoded into the language model, which is used to embed the text representation vector. This representation is applied to downstream tasks. Native BERT is trained on general corpora such as Wikipedia.
Currently, numerous corpora have been developed for various domains and used to train domain-specific language models such as BioBERT [39], FinBERT [40], and CodeBERT [41]. In the requirement trace field, TraceBERT [26], proposed by Lin et al. in 2021, investigated the application effect of BERT in tracing requirements to code. This is the first study that applied BERT or other transformer-based methods to software traceability tasks. They performed second-phase pre-training on large datasets of similar tasks and subsequently transferred domain knowledge into the language model. The model was then fine-tuned and applied to the downstream task "issue (natural language)-commit (programming language)" to improve the trace effect. They evaluated three commonly used BERT architectures (i.e., single, twin, and Siamese) on open-source projects. Their experimental results showed that the single architecture achieved the best accuracy, while the Siamese architecture achieved similar accuracy with a shorter training time [26]. Considering that the requirement trace task is typically a project-specific task, we also developed a corpus from project-related documents based on transfer learning. After second-phase pre-training, we encoded the contextual semantic information and knowledge of the projects into the language model to improve the quality of text representations in the text embedding stage and subsequently achieved superior results in cross-level requirement trace activities.
Studies have proved that the deep learning-based language model BERT achieves excellent performance in downstream tasks of natural language understanding. Refs. [23] and [42] have confirmed that the introduction of process features can improve the quality of trace link identification. However, the two methods are yet to be combined in a study. This study combined these two methods. First, the adaptability of the BERT model to the project context was improved through second-phase pre-training on the project corpus, and the resulting model was used as an encoder for the requirement text. Second, the trace links in the historical version were fully utilized to extract heuristic features based on the requirement metadata. We constructed a deep neural network to fuse the two kinds of features and improved the effect of trace link identification.
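One way to picture how historical trace links can contribute to a feature is the sketch below: for a candidate pair, it looks up the LLRs already traced to the HLR and takes the best similarity between any of them and the new LLR. This is only an illustration of the idea; the function names, the token-level Jaccard similarity, and the requirement texts are assumptions, and DRAFT's actual extended features may be defined differently.

```python
def extended_similarity(hlr, new_llr, hist_links, sim):
    """Best similarity between `new_llr` and LLRs already traced to `hlr`.

    Returns 0.0 when the HLR has no historical links.
    """
    siblings = [l for (h, l) in hist_links if h == hlr and l != new_llr]
    return max((sim(s, new_llr) for s in siblings), default=0.0)


# Hypothetical requirement texts keyed by identifier.
texts = {
    "H1": "improve server adapter",
    "L1": "add wizard page for cluster login",
    "L2": "add wizard page for cluster logout",
}


def token_jaccard(a, b):
    """Toy similarity: Jaccard overlap of whitespace tokens."""
    ta, tb = set(texts[a].split()), set(texts[b].split())
    return len(ta & tb) / len(ta | tb)


hist_links = {("H1", "L1")}  # L1 is already traced to H1
direct = token_jaccard("H1", "L2")
extended = extended_similarity("H1", "L2", hist_links, token_jaccard)
print(direct, extended)
```

Here the direct similarity between the HLR and the new LLR is zero, while the similarity to an already linked sibling is high, which suggests that the new LLR probably belongs to the same HLR.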

DRAFT Framework
To comprehensively analyze the correlation between cross-level requirements, the features of requirements in terms of text description are analyzed, and information such as components, task labels, and stakeholders is processed based on the metadata of issue logs. A network architecture, DRAFT, that integrates text features and process features is then designed. Based on the BERT language model, DRAFT supports the embedding and joint representation of heterogeneous features, which enables the creation of cross-level requirement trace links for new requirements. As displayed in Figure 5, the DRAFT architecture includes three key components: the project-specific BERT second-phase pre-training module, the heterogeneous feature extraction module, and the trace link identification model that integrates heterogeneous features. These three components are executed in sequence. Once the trace link identification model has been trained, trace links can be created for new requirements.
In this study, the BERT pre-training model [27] was selected to embed requirement text and features. We selected BERT because it performs excellently in various natural language understanding tasks (such as question answering and sentence-pair classification) [37]. Furthermore, Lin et al. [24,26] used considerable data and showed that the excellent transfer ability and context understanding ability of BERT make their approach, TraceBERT, more effective in establishing trace links between issues and commits than baseline methods such as the VSM and LSTM. However, the BERT pre-training model is pre-trained on general corpora. To enhance its adaptability to projects and domains and improve its ability to understand domain-specific corpora, we collected all text descriptions related to the requirements in the projects. A second-phase pre-training was then performed for BERT to obtain a project-specific language model.
The second step is to identify and embed the heterogeneous features of the requirements. In most current requirement trace methods based on deep learning, only the text features of requirements are considered [17-22,24-26,30,34]. The proposed DRAFT method fully utilizes the limited historical trace data to obtain a complete representation of requirement features. In addition to the natural language descriptions of requirements, process information such as the requirement creator and the creation time was also considered as a basis for trace link identification. Besides directly extracting the text features and process features between the requirements in a pair, we retrieved the historical trace list of each requirement from the historical trace links and analyzed the extended features between the requirements in the historical trace list and the newly added requirements.
Finally, a trace link identification model fusing heterogeneous features was constructed. The model consists of three modules, namely the requirement-pair feature embedding layer, the feature fusion layer, and the trace link identification layer. The primary function of the requirement-pair feature embedding layer is to embed the text features and heuristic features of HLR-LLR pairs based on the aforementioned feature extraction method. Textual features are high-dimensional (200-dimensional), whereas the heuristic process features are multiple low-dimensional (1-dimensional) features. The feature fusion layer reduces the dimensionality of the high-dimensional text features based on cosine similarity and concatenates the result with the low-dimensional features to fuse the heterogeneous features. Finally, the fused features are input into the trace link identification layer to obtain the trace link identification result for the pair of requirements.

Project-Specific Pre-Training
In this study, we built DRAFT on the BERT language model [27] for the following reasons. First, the BERT architecture is based on transformers [38]. Compared with classic unidirectional models such as RNNs, BERT can encode the surrounding context bidirectionally when embedding a word in a given sequence. Second, the training process of BERT can be parallelized, which allows BERT to obtain richer context vectors in much less time. Third, BERT has been successfully applied to requirement traceability problems of open-source projects in recent studies [26,30] and outperformed two popular RNN baselines (i.e., GRU and LSTM).
Current BERT pre-trained language models are trained on general corpora such as BooksCorpus and English Wikipedia. Therefore, the context of these corpora differs from that of the technical documents of a project. Ref. [26] showed that a second phase of pre-training using a domain corpus (i.e., domain-adaptive pre-training) could lead to performance gains. Thus, to obtain a language representation model that understands project-related documents more accurately and represents domain/project-related vocabulary better, the natural language text contained in the summary and description fields of all requirements (including HLRs and LLRs) of each project was extracted to construct a project-related corpus. We performed second-phase pre-training for the BERT model "uncased_L-12_H-768_A-12" (http://github.com/google-research/bert, accessed on 25 January 2023) based on two unsupervised tasks: the masked language model (MLM) task and the next sentence prediction (NSP) task.
The MLM task randomly masks some words in the original text to construct a training set. The training goal is for the encoder to predict the masked words based on the context, so that the encoding produced by the model contains context information. The original MLM training strategy of BERT was adopted: (1) 15% of the tokens in the sentence are randomly selected; (2) among these selected tokens, 80% are replaced by "[MASK]", 10% remain unchanged, and the remaining 10% are replaced with a random token; (3) the tokens selected in the first step are predicted based on the context. NSP, a sentence-pair classification task, is used to train the sentence-level feature extraction ability of BERT. Given a pair of input sentences S1 and S2, NSP predicts whether S2 is the next sentence of S1. The NSP task was selected to increase the ability of the language model to understand sentence relationships. For the NSP task, the paragraphs and sentences of the long natural language descriptions in the corpus are first identified using the Stanford CoreNLP tool. Next, sentence pairs are extracted from each natural language description with a sliding window of length 2. Finally, to construct the training set, the sentence pairs in the original order are considered positive samples, whereas the sentence pairs in the reversed order are considered negative samples.
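The two sample-construction procedures described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names `mask_tokens` and `nsp_pairs` are hypothetical, each token is selected independently with probability 0.15 (so the 15% rate holds only in expectation), and sentence splitting (done with Stanford CoreNLP in the study) is assumed to have happened already.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Apply the 80/10/10 masking rule and return (masked_tokens, labels).

    labels[i] is the original token at every selected position, else None.
    """
    rng = rng or random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_TOKEN         # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


def nsp_pairs(sentences):
    """Slide a window of length 2 over the sentences of one description.

    A pair in the original order is a positive sample (label 1); the same
    pair in reversed order is a negative sample (label 0).
    """
    samples = []
    for first, second in zip(sentences, sentences[1:]):
        samples.append((first, second, 1))
        samples.append((second, first, 0))
    return samples
```

For a description with sentences s1, s2, s3, `nsp_pairs` yields two positive pairs, (s1, s2) and (s2, s3), and their two reversed counterparts as negatives.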
After second-phase pre-training, a project-specific pre-trained model was obtained and used to extract the features of the natural language descriptions in the requirements.

Heuristic Feature Extraction Based on Metadata
Based on the three scenarios defined in Section 2.2, the following three types of requirement pairs should be considered when updating trace links between different levels of requirements: (a) a new HLR and a historical LLR, hlr_new-llr_hist; (b) a historical HLR and a new LLR, hlr_hist-llr_new; and (c) a new HLR and a new LLR, hlr_new-llr_new. This section designs 11 features of cross-level requirement pairs for these three scenarios, including process features and a text feature (Table 1). In Table 1, HLRS_hist and LLRS_hist are the sets of HLRs and LLRs in the historical version, and TLS_hist is the set of historical trace links.
Ft.11 is a set of text embedding features which we extracted from the natural language descriptions (typically included in two metadata fields, summary and description) of the requirement using the language model obtained in Section 5.

Heuristic Features Related to Components and Labels
The component attribute of a requirement marks the specific components that the requirement targets. If two cross-level requirements share the same component, a trace link is more likely to exist between them. The same holds for a pair of cross-level requirements with the same labels. Therefore, for a pair of cross-level requirements (hlr, llr), heuristic features are designed based on the components and labels fields of the requirements. Specifically, two features (same_coms and find_coms) are extracted from the components field, and one feature (same_labels) is extracted from the labels field.
The feature same_coms is a floating-point number in the range of 0~1.0 and is a normalized representation of the number of components shared by a pair of cross-level requirements, as presented in Equation (1).
$$pair.same\_coms = \frac{|hlr.coms \cap llr.coms|}{\max(|hlr.coms|, |llr.coms|) + 1} \tag{1}$$
We also extract the find_coms feature, which reflects how many keywords in the component list of one requirement can be found in the summary and description of the other requirement in a pair of cross-level requirements. The find_coms feature of a requirement pair is the maximum of find_hcoms and find_lcoms, as presented in Equation (2).
$$pair.find\_coms = \max(pair.find\_hcoms, pair.find\_lcoms) \tag{2}$$
In this equation, find_hcoms is the normalized representation of the number of keywords in the component list of the HLR that appear in the LLR summary and description, as shown in Equation (3). Here, |llr.sum+llr.des| indicates the number of words in the summary and description of llr. Similarly, Equation (3) can be used to calculate find_lcoms, that is, the number of keywords in llr.components appearing in hlr.summary and hlr.description, and then find_coms can be obtained.
Similarly, the feature same_labels is designed for labels. Its extraction method is the same as that of same_coms.
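As a rough illustration of the component-based features, the sketch below computes same_coms as in Equation (1) and find_coms as the maximum of the two directional values. The normalization inside the hypothetical helper `_find` (matched keywords divided by the word count of the other requirement's summary and description) is an assumption made for this example, as are the dictionary keys; requirements are represented as plain dictionaries.

```python
def same_coms(hlr_coms, llr_coms):
    """Shared components normalized as in Equation (1)."""
    h, l = set(hlr_coms), set(llr_coms)
    return len(h & l) / (max(len(h), len(l)) + 1)


def _find(keywords, text):
    """Assumed normalization: matched keywords / number of words in text."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for k in set(keywords) if k.lower() in words)
    return hits / len(words)


def find_coms(hlr, llr):
    """Maximum of the two directional keyword-match features."""
    find_hcoms = _find(hlr["components"], llr["summary"] + " " + llr["description"])
    find_lcoms = _find(llr["components"], hlr["summary"] + " " + hlr["description"])
    return max(find_hcoms, find_lcoms)
```

The same `same_coms` function can be reused for same_labels by passing the labels lists instead of the component lists.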

Heuristic Features of Stakeholder Information
The person raising a requirement may be a project developer or an end user. A pair of cross-level requirements with the same stakeholder indicates that their themes or contents are correlated to some extent, and such a pair is more likely to have a trace link.
To mark the correlation between a pair of cross-level requirements in terms of stakeholders, we extract the following three Boolean features: whether the two requirements are raised by the same user/developer (same_creator), whether they are assigned to the same person (same_assignee), and whether they share the same stakeholder (same_stk).

Extended Features Based on Historical Trace Links
Because of the intricate relationships between requirements in the same system, associations may exist between new and existing requirements. Such associations are helpful for creating traces for new requirements. Therefore, extended features were designed based on the historical trace list. As displayed in Figure 6, when determining whether a trace link exists between the new LLR_new and the existing HLR_hist, the features between these two requirements (i.e., direct features) should be extracted. In addition, if LLR_new is highly similar to one or more LLRs related to HLR_hist, the probability that LLR_new has a trace link with HLR_hist is also high. Therefore, we pair LLR_new with each LLR contained in HLR_hist.traced_reqs (i.e., the historical trace list of HLR_hist) and extract features such as same_coms, find_coms, same_creator, and same_assignee from these pairs as extended features (Ft.7-10). The vectors of direct features and extended features were concatenated into the final feature vector between LLR_new and HLR_hist.

Similarly, when identifying a trace link between a pair of requirements hlr ∈ HLRS_new and llr ∈ LLRS_hist, extended features based on the historical trace list of llr can be extracted.
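The extended-feature idea can be sketched as follows. The text does not state how the per-pair values over a historical trace list are combined into one value per feature, so taking the maximum is an assumption of this example; the dictionary schema and function names are likewise hypothetical, and find_coms is omitted for brevity.

```python
FEATURES = ("same_coms", "same_creator", "same_assignee")


def pair_features(a, b):
    """Direct features between two requirements given as dicts."""
    ca, cb = set(a["components"]), set(b["components"])
    return {
        "same_coms": len(ca & cb) / (max(len(ca), len(cb)) + 1),
        "same_creator": float(a["creator"] == b["creator"]),
        "same_assignee": float(a["assignee"] == b["assignee"]),
    }


def extended_features(new_req, traced_reqs):
    """Pair new_req with every requirement in a historical trace list.

    Assumption: each feature is aggregated with max; an empty trace list
    yields zeros.
    """
    if not traced_reqs:
        return {name: 0.0 for name in FEATURES}
    per_pair = [pair_features(new_req, req) for req in traced_reqs]
    return {name: max(f[name] for f in per_pair) for name in FEATURES}
```

For a new LLR and a historical HLR, `traced_reqs` would be the LLRs in the HLR's historical trace list; in the symmetric case it would be the HLRs already linked to the historical LLR.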

Textual Feature
The natural language descriptions contain the primary semantic information of requirements, and the textual feature is an important factor when identifying trace links. The textual content of a requirement mainly lies in the metadata attributes summary and description. Therefore, for a pair of hlr and llr, we extract the text features (Ft. 11) by embedding the texts in hlr.summary, hlr.description, llr.summary, and llr.description using the second-phase pre-trained project-specific language representation model BERT_pjt (see Section 5).

Trace Link Identification Model Fusing Heterogeneous Features
To support the joint analysis of high-dimensional text features of requirement description and heuristic features based on metadata, a neural network model that fuses heterogeneous features was designed.

Model Structure
As displayed in Figure 7, the model includes three layers: the requirement-pair feature embedding layer, the heterogeneous feature fusion layer, and the trace link identification layer.

Figure 7. Neural network structure of DRAFT, which consists of three modules: the requirement-pair feature embedding, feature fusion, and trace link identification layers.

1. Requirement-pair feature embedding layer. This layer embeds text features and process features for the input cross-level requirement pair <hlr, llr> using the feature extraction method described in Section 6.
First, the text features of requirements are embedded by the second-phase pre-trained project-specific language representation model BERT_pjt (see Section 5). For a serialized text T = {t_1, t_2, ..., t_n}, the last layer of BERT_pjt (embedding layer) outputs the embeddings {E_CLS, E_t1, E_t2, ..., E_tn, E_SEP}, where each E_i is a 768-dimensional vector. These embeddings are typically used to build a sentence representation for downstream tasks. Reimers et al. [43] identified the three most commonly used strategies: (1) use E_CLS directly; (2) use the average pooling strategy, that is, calculate the average of the representation vectors of all words in T to obtain E_mean; (3) use the maximum pooling strategy, that is, calculate the maximum of all word vectors in T in each dimension to obtain E_max. Their study revealed that using E_mean as the input of the downstream sentence-pair relationship classification task yields the best performance. Therefore, we adopted this strategy, as presented in Equation (7).
$$E_{mean} = mean\_pooling(BERT_{pjt}.encoder(T)) \tag{7}$$
To improve the adaptability of the sentence representation output by the pre-trained model to the requirement trace task, we added a fully connected layer after the pooling layer. After E_mean passes through the fully connected layer, the embedding vector E_T is obtained for downstream tasks, as presented in Equation (8). In this equation, W is a k × j-dimensional trainable parameter, j is the dimension of E_mean, and k is the dimension of the output text representation vector, which is set to 200.
The text content of a requirement is stored in the summary and description fields. As displayed in Figure 7, after the natural language descriptions of HLRs and LLRs, that is, hlr.sum, hlr.des, llr.sum, and llr.des, pass through the feature embedding layer, we obtain four 200-dimensional sentence representation vectors, one for each field.
Second, based on keywords and process information, we extract the low-dimensional heuristic features of the requirement pair, such as find_coms and same_creator, using the method described in Section 6. If the historical requirement in the pair has a historical trace list, extended features are also extracted based on that list. Finally, 10 one-dimensional heuristic features F_heu were obtained.

2. Heterogeneous feature fusion layer. The feature fusion layer fuses text features and heuristic features of different dimensions to comprehensively analyze the commonality of a pair of requirements in terms of text semantics and process data. After processing at the feature embedding layer, each natural language description is represented by a 200-dimensional text embedding vector, whereas the heuristic features are one-dimensional.
The text embedding vector is an abstract feature obtained through a deep network. Heuristic features are shallow features, which directly represent the common features of the HLR and LLR. To fuse the two types of features, a cosine similarity layer is used to reduce the dimensionality of the text vectors. In particular, the cosine similarity is computed between the summary and description representations of the HLR and those of the LLR, which yields four similarity values. The similarity values represent the semantic similarity between the requirements in the requirement pair <hlr, llr>, and F_heu represents the commonality of the pair of requirements in terms of process features, as presented in Equation (9). Next, the similarity values were fused with the low-dimensional heuristic features F_heu at the concatenation layer to obtain a complete requirement-pair feature representation F_pair, whose dimension is 14 (4 similarities plus 10 heuristic features).

3. Trace link identification layer. This layer includes two fully connected layers and one Softmax output layer. As presented in Equation (10), the feature F_pair of the requirement pair obtained in the previous step is used as the input, and the trace link identification result C_pair of this pair of cross-level requirements is output. Here, C_pair is one-dimensional and takes the value 0 (no trace link) or 1 (with a trace link), and W is a 1 × 10-dimensional trainable parameter.
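The pooling and fusion steps of the first two layers can be illustrated with a small pure-Python sketch. It is not the network used in the study: the trainable fully connected layers are omitted, the function names are hypothetical, and the choice of the four similarities (every HLR field representation against every LLR field representation) is an assumption that is consistent with the stated count of four.

```python
import math


def mean_pool(token_vectors):
    """Average the token vectors dimension-wise (the E_mean strategy)."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fuse(hlr_sum, hlr_des, llr_sum, llr_des, f_heu):
    """Build F_pair: four cosine similarities followed by heuristic features."""
    sims = [cosine(h, l) for h in (hlr_sum, hlr_des) for l in (llr_sum, llr_des)]
    return sims + list(f_heu)
```

With 10 heuristic features, `fuse` returns a 14-dimensional vector, matching the dimension of F_pair given above.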

Loss Function
We determine whether a trace link exists between a pair of requirements based on the joint feature representation of the pair in terms of text information and process information, and we consider the identification of cross-level trace links to be a binary classification problem. Therefore, cross entropy (CE) was selected as the loss function to train the trace link identification model. The training goal is to minimize this loss. CE measures the difference between the true distribution and the predicted distribution of a random variable, as defined in Equation (11).
$$CE = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{n} y_{ij}\log y'_{ij} \tag{11}$$
In this equation, m is the number of samples; n is the number of class labels of the samples; y_ij is the true probability that the class of sample i is label j; and y'_ij is the probability with which the neural network predicts the class of sample i to be label j.
In the binary classification scenario considered here, the true label of a sample can only be 1 (with a trace link) or 0 (no trace link). The binary cross entropy (BCE) can be defined as follows:
$$BCE = -\frac{1}{m}\sum_{i=1}^{m}\left[y_i\log y'_i + (1-y_i)\log(1-y'_i)\right] \tag{12}$$
In requirement trace, the numbers of positive and negative samples are highly unbalanced: negative samples far exceed positive samples. This imbalance is more severe in projects with more requirements. The ratio of negative samples to positive samples in the eight projects selected in this study is approximately 300:1 on average. This ratio is the highest in the JBIDE project, reaching 570:1. In this case, predicting a sample as the majority class (here, the negative class) causes a smaller loss. Thus, the model always tends to predict a pair of cross-level requirements as having no trace link, which causes the model to fail. To solve this problem, weight coefficients were set for the BCE loss function according to the ratio of positive and negative samples, and the loss weight of positive examples is increased as follows:
$$BCE_{weighted} = -\frac{1}{m}\sum_{i=1}^{m}\left[\alpha\, y_i\log y'_i + \beta\,(1-y_i)\log(1-y'_i)\right] \tag{13}$$
where α and β are calculated from the numbers of positive and negative samples. If the numbers of positive and negative samples are n_pos and n_neg, respectively, then α = (n_pos + n_neg)/n_pos and β = (n_pos + n_neg)/n_neg.
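The effect of the class weights can be checked with a short numeric sketch. It assumes that α scales the positive-sample term and β the negative-sample term of the BCE, and that the loss is averaged over samples; the function names and the clipping constant `eps` are choices made for this example only.

```python
import math


def class_weights(n_pos, n_neg):
    """alpha = (n_pos + n_neg) / n_pos, beta = (n_pos + n_neg) / n_neg."""
    total = n_pos + n_neg
    return total / n_pos, total / n_neg


def weighted_bce(y_true, y_pred, alpha, beta, eps=1e-12):
    """Mean weighted binary cross entropy over all samples."""
    loss = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        loss -= alpha * y * math.log(p) + beta * (1 - y) * math.log(1.0 - p)
    return loss / len(y_true)
```

With a 300:1 imbalance, `class_weights(1, 300)` gives α = 301 and β ≈ 1.003, so a missed positive sample is penalized roughly 300 times more heavily than an equally confident false alarm.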

Experimental Evaluation
An experimental evaluation was performed to verify the effectiveness of DRAFT. We selected eight open-source projects from different domains and collected requirements and trace links from their issue tracking systems. The evaluation was conducted from two aspects. First, the overall effects of DRAFT and the baseline methods in cross-level trace link identification were compared, and the strengths and weaknesses of the proposed method were analyzed. Then, two ablation experiments were performed to study the effects of two key designs: pre-training and extended features.

Objectives of Experimental Evaluation
RQ1 (overall evaluation): How well does DRAFT perform in identifying cross-level trace links for new requirements? Is DRAFT better than the baseline methods?
RQ2 (ablation experiment): Do the project-specific second-phase pre-training in DRAFT and the heuristic feature extraction based on metadata play a positive role?

Data Acquisition
This sub-section details the project selection and the process of collecting raw data (HLRs, LLRs, and trace links).
Project selection. Following the process in Figure 8, we selected eight open-source projects hosted by Apache (https://issues.apache.org/jira/, accessed on 25 January 2023) and Red Hat (https://issues.redhat.com/, accessed on 25 January 2023) and constructed a dataset for each project by collecting requirements and trace links from its issue tracker. First, using the API provided by JIRA, we obtained all Apache and Red Hat projects that use JIRA as the issue tracker and have been active for more than three years, and we collected their issue logs and the trace link data between issue logs. Next, we counted the number of cross-level trace links in each project and selected the projects with more than 600 links. Because this study investigated the trace link prediction method for new requirements in the continuous release process of a project, the verification data should have clear version information to distinguish historical requirements from new requirements. Therefore, we selected the projects in which the version number and release time were clearly recorded. To improve the generalizability of the experimental conclusions, we manually screened projects from various domains and finally obtained the following eight projects: Apache's Beam and CB (Apache Cordova), and Red Hat's FH (FeedHenry), JBIDE (JBoss Tools), AAH (Automation Hub), KEYCLOAK, KOGITO, and PROJQUAY (Project Quay). Details of these eight projects are presented in Table 2.
Data collection. For each requirement entry, we collected the requirement ID, text information, process information, and other metadata fields (see Section 6.1), including type, summary, description, labels, components, creator, assignee, and create_time (creation time).
All requirement texts of the project (including the text descriptions of historical requirements and new requirements) were used to construct a corpus to pre-train the BERT model. To make the constructed dataset more suitable for practical scenarios, a released version of each project was selected, and the requirements created before and after its release time were treated as historical and new requirements, respectively. The splitting version had to satisfy one condition: after the project dataset is split at this version, the ratio of trace links in the training set to those in the test set is between 2:1 and 4:1. Table 3 lists, for each project, the training set, the test set, the specific dataset size, the ratio of the trace links in the training set to those in the test set, and the splitting version. The trace space is the product of the numbers of HLRs and LLRs [44].
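The selection of a splitting version can be sketched as a simple scan over the releases. The data layout is hypothetical: each trace link is represented by the creation times of its HLR and LLR, and a link is assumed to belong to the training set only if both requirements were created before the release. How a single version was chosen when several satisfy the condition is not stated in the text, so the sketch returns all candidates.

```python
def candidate_split_versions(releases, link_times, low=2.0, high=4.0):
    """Return the versions whose train:test link ratio lies in [low, high].

    releases:   list of (version, release_time) tuples
    link_times: list of (hlr_created, llr_created) tuples, one per trace link
    """
    candidates = []
    for version, released in releases:
        # A link is historical only if both endpoints predate the release.
        train = sum(1 for hlr_t, llr_t in link_times if max(hlr_t, llr_t) < released)
        test = len(link_times) - train
        if test > 0 and low <= train / test <= high:
            candidates.append(version)
    return candidates
```

Times can be any mutually comparable values, for example `datetime` objects or day offsets.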

Baseline Methods and Evaluation Indicators
Four baseline methods were selected: the classic information retrieval algorithms VSM [14] and LSI [15], the relevance feedback technique RF [18], and TraceBERT [26]. VSM and LSI were selected primarily because they calculate the text similarity between artifacts in a lightweight and intuitive manner and perform well in practice. The relevance feedback method is an improvement on VSM that makes full use of historical trace links to optimize the text vector of the query statements. Thus, this method can achieve results superior to those of VSM when retrieving trace links [15,16]. We used TraceBERT as a baseline for two reasons. First, it uses an online negative sampling strategy to address the overfitting problem that tends to occur when deep learning is applied to requirement tracing. Second, it improves trace link identification through domain knowledge transfer. Experiments have shown that this method produces results superior to those of information retrieval and other technologies in terms of both recall and precision.
When applying RF to the identification of new trace links, the query vector of HLRs can be improved based on the trace links recorded in the system. When implementing the TraceBERT method, the second-phase pre-trained model provided by Ref. [26] was fine-tuned on the training set in Table 3, and its performance in cross-level requirement trace tasks was then evaluated on the test set of each project.
We used the F1 and F2 scores, which are weighted harmonic means of precision and recall, as the evaluation indicators. They measure the ability of the algorithm to balance precision and recall and can be calculated using Equations (14) and (15). For a given HLR set and LLR set, T represents the set of real trace links between them, C represents the set of candidate trace links identified by the algorithm, and C_r and C_w represent the correct-link set and incorrect-link set in the candidate link set, respectively. Thus, C = C_r ∪ C_w. Next, we have the following expressions:
$$Precision = \frac{|C_r|}{|C|}, \qquad Recall = \frac{|C_r|}{|T|} \tag{14}$$
The F1 and F2 scores are weighted harmonic means of recall and precision:
$$F_\beta = \frac{(1+\beta^2)\cdot Precision \cdot Recall}{\beta^2 \cdot Precision + Recall} \tag{15}$$
In the F1 score, precision and recall have the same weight, that is, β = 1. In the F2 score (β = 2), recall is weighted twice as heavily as precision. The F2 score is used because a more complete but less accurate set of candidate trace links is more useful when assisting analysts in constructing or updating trace links. In this case, analysts only need to manually filter out the incorrect links. If recall is insufficient, analysts must find the few missing linked requirement pairs among numerous requirement pairs, which requires considerable time and effort.
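As a worked illustration of these indicators, the sketch below computes the F-score from a set of true links and a set of candidate links using the standard definitions of precision, recall, and F_β. The function name and the representation of links as hashable identifiers are choices made for this example.

```python
def f_beta(true_links, candidate_links, beta=1.0):
    """F_beta score of a candidate link set against the true link set."""
    true_links, candidate_links = set(true_links), set(candidate_links)
    correct = len(true_links & candidate_links)  # |C_r|
    if correct == 0:
        return 0.0
    precision = correct / len(candidate_links)   # |C_r| / |C|
    recall = correct / len(true_links)           # |C_r| / |T|
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For example, with four true links and six candidates of which three are correct, precision is 0.5 and recall is 0.75, so F1 = 0.6 while F2 ≈ 0.68; the F2 score rewards the higher recall.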

Evaluation Results and Analysis
RQ1: Performances of DRAFT (the Proposed Method) and Baseline Methods in Trace Link Identification
The experimental results are presented in Table 4. The F2 score of DRAFT (the proposed method) is close to or exceeds 80% in five projects and is higher than 60% in the other three projects; its F1 score reaches 70-80% in five projects and 50-65% in the other three. On average, the proposed method achieves F1 and F2 scores of 69.3% and 76.9%, respectively. These results indicate that DRAFT can identify high-quality trace link sets and provide automatic assistance for analysts. Additionally, the pair-wise Wilcoxon signed-rank test was used to test whether the differences between the results of DRAFT and those of the four baseline methods are significant, with two p-value thresholds: 5% and 1%. The results are presented in Table 5. DRAFT achieved the highest F2 score in all eight projects of different scales. Its F2 score is significantly higher than those of the baseline methods VSM and LSI, by 32% (p-value < 0.01) and 40% (p-value < 0.01), respectively. In the FH project, the F2 score of DRAFT was up to 62.3% higher than that of the LSI method. The similarity in the text description and semantic information of cross-level requirements is the primary basis for determining the trace link between two requirements. Conventional VSM and LSI are more intuitive and calculate the similarity by capturing the common vocabulary between cross-level requirements. However, they do not consider semantic information; for example, they cannot handle polysemy or near-synonymous words. In contrast, DRAFT uses the BERT language model as its text feature embedding module. Therefore, the model can capture the implicit semantic association between words when calculating the semantic similarity of the requirement text [27]. Moreover, DRAFT uses the language model obtained after second-phase pre-training on the project corpus. Therefore, DRAFT can adapt to the project context, which leads to a performance superior to those of information retrieval methods such as VSM and LSI.
The RF method improves the query statement vector of information retrieval based on historical trace links. In TraceBERT, domain knowledge transfer is realized by using the pre-trained model, which enhances the semantic expression ability of the model, and its trace effect is improved compared with those of VSM and LSI. TraceBERT achieves an F2 score higher than 0.763 in the JBIDE project, which has the largest training set. DRAFT considerably outperformed RF and TraceBERT, with F2 improvements of 22% (p-value < 0.01) and 33% (p-value < 0.01), respectively, because it considers heuristic features based on metadata in addition to the similarity in text semantics when creating trace links. For example, in some requirement pairs (e.g., between AAH-1074 and AAH-1138, and between JBIDE-26680 and JBIDE-26652), capturing the similarity from the text semantics alone is difficult because of differences in the use of terms. However, the process features of these pairs share common properties (the same creator and component). Therefore, DRAFT can identify these trace links, whereas the baseline methods miss them.
In terms of F1, the average improvements of the proposed method over the four baseline methods were 0.265 (p-value < 0.01), 0.344 (p-value < 0.01), 0.165 (p-value < 0.05), and 0.268 (p-value < 0.01), respectively. DRAFT achieved the highest F1 in seven projects; in the KOGITO project, it was only slightly lower (1.3% lower) than the RF-keydim method. The highest F1 and F2 scores in each project are presented in bold in the table. Overall, DRAFT shows clear superiority over VSM, LSI, RF-keydim, and TraceBERT.
RQ1 can be answered as follows: DRAFT can identify high-quality cross-level trace link sets for new requirements, and its performance is significantly superior to those of the four baseline methods.
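The pair-wise Wilcoxon signed-rank test used for the significance analysis can be sketched as follows for a small number of paired scores. This is a self-contained exact version written for illustration, not the tooling used in this study, and the score lists are hypothetical values, not the results in Table 4.

```python
from itertools import product


def average_ranks(values):
    """Rank values ascending from 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_exact(x, y):
    """Two-sided exact Wilcoxon signed-rank test for paired samples.

    Returns (statistic, p_value), where the statistic is the smaller of the
    positive and negative rank sums. Zero differences are discarded.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    ranks = average_ranks([abs(d) for d in diffs])
    total = sum(ranks)
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    observed = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern is equally likely, so
    # enumerate all 2^n patterns (feasible for a handful of projects).
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for s, r in zip(signs, ranks) if s)
        if min(w, total - w) <= observed:
            extreme += 1
    return observed, extreme / 2 ** len(ranks)


# Hypothetical F2 scores of two methods on eight projects.
method_a = [0.81, 0.79, 0.66, 0.83, 0.62, 0.80, 0.85, 0.64]
method_b = [0.60, 0.55, 0.50, 0.58, 0.41, 0.61, 0.70, 0.45]
stat, p_value = wilcoxon_exact(method_a, method_b)
print(stat, p_value)
```

With eight paired observations and no zero differences, the smallest attainable two-sided p-value of the exact test is 2/256, which is reached only when one method is better on every project.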

Role of Second-Phase Pre-Training in DRAFT
Table 6 presents the trace link identification results when the BERT pre-trained model provided by Google and the model after second-phase pre-training are used for extracting the natural language description features of requirements. The improvements and gains (percentage of improvements) are displayed in the last two columns of the table. The results of eight projects reveal that second-phase pre-training improves the performance in identifying cross-level trace links, and the average F1 and F2 scores increased by 0.03 (5%) and 0.05 (7%), respectively. Both F1 and F2 scores are improved after second-phase pre-training in five projects. The improvements (F1: 16%; F2: 13%) are the most obvious in the BEAM project. In the PROJQUAY project, the F2 score improves by 0.07, but the F1 score decreases slightly (−2%). In the FH project, the performance is similar to that of the original pre-trained model, but F1 and F2 scores are slightly reduced after the second-phase pre-training (−3%, −2%).
The dataset size of each project in Table 3 reveals that in projects with small requirement sets, such as AAH, PROJQUAY, and FH, the effect of second-phase pre-training is not as obvious as that in other projects because the pre-training effect of a language model is directly related to the size and quality of the corpus. BooksCorpus (Zhu et al., 2015) and English Wikipedia are used for BERT pre-training. BooksCorpus contains 800 million words, whereas English Wikipedia contains 2500 million words. In both corpora, grammatically standardized natural language is used. In second-phase pre-training, the requirement documents of the project were used to build the corpus, which is small by comparison. Moreover, for convenience, open-source projects do not impose restrictions on how the text descriptions in the requirement issue log are written. Therefore, the quality of the corpus is poor. Figure 9 displays two examples of requirement descriptions that are less standard. In Figure 9a, the requirement description contains considerable debugging information; in Figure 9b, the summary field is too short, and the description field has no usable text content. Therefore, only a limited improvement is achieved in requirement trace link identification, and slightly worse results were obtained on a small number of datasets.
Another reason for the different performance of DRAFT across projects is the varied quality of requirement descriptions. In terms of semantics, DRAFT can capture the textual similarities and identify the trace links better for the projects whose traced requirement pairs are described with more consistent terminology. For example, when taking the HLR KEYCLOAK-7445 ("Test performance of Authorization Services") and LLR KEYCLOAK-7620 ("Generating performance datasets for authorization services") as input, DRAFT could yield a high semantic similarity score between their textual descriptions and create their trace link. In contrast, for the projects (e.g., AAH) whose traced requirement pairs share fewer semantically close terms, the trace links are more challenging to identify.
Therefore, RQ2 can be answered as follows: second-phase pre-training plays a positive role in trace link identification, and F1 and F2 scores for most projects are improved. However, the performance of cross-level requirement trace link identification for one project degrades slightly.

Metadata-Based Heuristic Features in DRAFT
To verify the effect of heuristic features (Section 6), an ablation experiment was designed to compare trace link identification for new requirements when only text features are used with that when the complete heuristic features are also used. The results are presented in Table 7. Compared with using text features only, DRAFT considerably improves precision, recall, F1 score, and F2 score in all projects when the complete heuristic features are used. The average improvements of F1 and F2 scores are 0.274 (72%) and 0.325 (84%), respectively. These two scores improve the most in the PROJQUAY project; their improvements are 0.44 (130%) and 0.50 (150%), respectively. The results revealed that the heuristic features extracted from metadata contain important commonalities between cross-level requirements, which provide a basis for the creation of trace links. For example, the find_coms feature in Section 6.1 reflects whether the components involved in a pair of cross-level requirements are identical. If the two requirements are submitted for the same component, a trace link is more likely to exist between them. The commonality in these process features supplements the commonality in the requirement text descriptions. DRAFT jointly analyzes the semantic similarity and the process information commonality between cross-level requirements and thus provides a comprehensive and sufficient basis for trace link creation.
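The way a metadata feature complements text similarity can be illustrated with a minimal sketch. Only the feature name find_coms and the notion of a shared creator come from this paper; the dictionary field names, the exact feature definitions, and the three-element feature vector are assumptions for illustration and do not reproduce the 11 features or the fusion network of DRAFT.

```python
def find_coms(hlr, llr):
    """1.0 if both requirements list the same non-empty component set.

    Assumption: "identical components" is read as set equality; Section 6.1
    of the paper gives the actual definition.
    """
    hlr_components = set(hlr.get("components", []))
    llr_components = set(llr.get("components", []))
    return 1.0 if hlr_components and hlr_components == llr_components else 0.0


def same_creator(hlr, llr):
    """1.0 if both requirements were created by the same known user."""
    creator = hlr.get("creator")
    return 1.0 if creator is not None and creator == llr.get("creator") else 0.0


def pair_features(hlr, llr, text_similarity):
    """Concatenate a text similarity score with metadata features."""
    return [text_similarity, find_coms(hlr, llr), same_creator(hlr, llr)]


# Hypothetical issue records with hypothetical field names.
hlr = {"key": "HLR-1", "creator": "alice", "components": ["auth"]}
llr_same = {"key": "LLR-1", "creator": "alice", "components": ["auth"]}
llr_other = {"key": "LLR-2", "creator": "bob", "components": ["ui"]}

features_same = pair_features(hlr, llr_same, text_similarity=0.12)
features_other = pair_features(hlr, llr_other, text_similarity=0.12)
print(features_same, features_other)
```

Both pairs have the same low text similarity, yet the metadata features separate them, which is the kind of signal that a text-only model cannot use.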
RQ3 can be answered as follows: by fusing metadata-based heuristic features, DRAFT can comprehensively capture the commonality in cross-level requirement pairs, which helps to create higher-quality trace links.

Validity Threats
External validity risks typically originate from the selection of test items and the construction of datasets. To mitigate external validity threats, eight open-source software projects were selected from various domains. We collected cross-level requirements and their trace links from the corresponding JIRA issue trackers and constructed datasets for experimental evaluation. In addition to JIRA, other widely used issue trackers, such as GitHub and Bugzilla, are used for requirement acquisition and management. Although these platforms store and manage requirements differently from JIRA, the process-related features of the requirements selected in this study (such as the requirement creator) are also recorded on these platforms and are all available. Therefore, the proposed method provides a reference for the construction and evolution of cross-level requirement trace links on these platforms.
To mitigate internal validity threats, we selected the release date of an actual historical version of the project as the splitting point between the training set and the test set to ensure that the experimental scenario was as close as possible to requirement trace practice. We selected four automated trace methods (including information retrieval-based and deep-learning-based methods) from previous studies as baseline methods and reused their open-source code to avoid implementation errors and ensure accurate experiment execution.
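A release-date split of this kind can be sketched as follows. The record layout, the "created" field name, and the choice to place requirements created on the release date in the test set are assumptions for illustration.

```python
from datetime import date


def split_by_release(requirements, release_date):
    """Split requirements into (train, test) around a release date.

    Requirements created before the release date go to the training set;
    those created on or after it are treated as new requirements.
    """
    train = [r for r in requirements if r["created"] < release_date]
    test = [r for r in requirements if r["created"] >= release_date]
    return train, test


# Hypothetical requirement records.
requirements = [
    {"key": "REQ-1", "created": date(2020, 1, 10)},
    {"key": "REQ-2", "created": date(2020, 5, 2)},
    {"key": "REQ-3", "created": date(2020, 6, 1)},
    {"key": "REQ-4", "created": date(2020, 9, 15)},
]

train, test = split_by_release(requirements, release_date=date(2020, 6, 1))
print([r["key"] for r in train], [r["key"] for r in test])
```

Splitting by time rather than at random keeps links of future requirements out of the training data, which mirrors how the model would be used when new requirements arrive.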
To alleviate threats to construct validity, we selected the indicators most widely used in requirement tracing, namely precision, recall, F1 score, and F2 score, for the experimental evaluation. Thus, the trace link identification results can be comprehensively and objectively evaluated in terms of accuracy and completeness.

Limitations
Second-phase pre-training was performed for the BERT language model on the project requirement text corpus, and a project-specific language model that encodes the semantic knowledge of the project was obtained. The pre-training effect of a language model is maximized when the corpus is large and of high quality. In this study, the data size for the second-phase pre-training is considerably smaller than the data size for the initial pre-training (0.5-15.5 MB vs. 800 M and 2500 M words), and the data are also less standard. Nevertheless, the second-phase pre-training still plays a positive role in most projects in this study. In the future, superior results may be obtained by collecting more project-related data (not limited to requirements) and preprocessing them more carefully.
The proposed requirement trace link update method applies to new requirements added during project evolution. In practice, in addition to the addition of new requirements, deletions and modifications of requirements should also be considered. As mentioned in Section 2.2, traceability changes caused by the deletion of requirements are simple, and traceability updates caused by the modification of requirements can be performed based on the proposed method. However, no related experiments were conducted. In the future, the support of the proposed method for all three evolution scenarios of cross-level requirement trace should be studied.

Conclusions
During the iteration process of open-source software projects, new requirements are frequently added. Therefore, cross-level trace links should be updated in a timely manner. To address this problem, a cross-level requirement trace method fusing heterogeneous features, that is, DRAFT, was proposed to fully use historical data and derive a trace link identification model from them. First, we investigated the project-specific second-phase pre-training method based on BERT. This method can enhance the ability of the pre-trained model to represent project-related terms and extract text features that integrate context information from the natural language requirement descriptions. Second, we studied the heuristic feature extraction method for process data to obtain comprehensive feature representations of requirement entries. This method can extract direct features between candidate requirement pairs and extended features based on historical trace links. We then explored a neural network architecture that can fuse heterogeneous features. This architecture can train the requirement trace link identification model, providing automated support for analysts. Finally, we collected cross-level requirements and the trace links between them from real-world open-source projects, developed datasets based on the trace link update scenario for new requirements, and verified the application effect of DRAFT. The experimental results revealed that DRAFT outperformed baseline methods such as VSM, LSI, and TraceBERT in the trace link update task. Although DRAFT is designed to trace requirements at different levels, it can also serve as a reference for the construction or evolution of trace links between other textual software artifacts (e.g., design documents, UML models, and test cases). Moreover, the architecture of DRAFT can be easily extended to incorporate more features of the related artifacts.

Data Availability Statement:
The data that support the findings of this study are available within the article.