Requirement documents carry complex information in unstructured formats, and designers must interpret them consistently. To address this, the system adopts SysML’s definition of requirements for document management. It classifies system requirements (SRs) into five major categories: functional requirements, interface requirements, performance requirements, physical requirements, and design constraints. Because each category contains different elements, a generalized information extraction template is designed for each one. These templates clearly define the key information elements in the requirement documents, such as subjects, objects, and actions. Based on these templates, prompts are designed to standardize the large language model’s extraction of elements, transforming unstructured text into structured data, including nodes (subject, attribute, object), edges (action), and textual elements (function, value, principle, condition). Finally, a domain-specific requirement representation knowledge graph is constructed from the extracted structured data, systematically presenting the semantic relationships and hierarchical connections between the requirements and thereby providing intuitive support for requirement analysis and management.
3.1. Data Preprocessing
In natural language processing tasks, data cleaning is a crucial step for ensuring data quality and enhancing model performance [41,42]. Given the characteristics of requirement documents, this study designs and implements three data cleaning strategies to ensure data consistency and usability: (1) Removal of irrelevant numbering and markings: Requirement documents are typically organized by chapters and often contain noise such as chapter numbers, section markers, and other formatting symbols. If included in the input, this noise could interfere with the subsequent text parsing performed by the large language model. (2) Text format normalization: The raw documents frequently contain extra spaces, line breaks, and other special characters, which may cause errors in text segmentation or parsing. (3) Maintaining consistency in requirement text context: Requirement documents often have multi-level nested structures, so the coherence and integrity of each requirement description must be preserved. For sub-items or specific functionality entries that follow a main requirement description, their content is concatenated with the primary description so that the requirement subject remains consistent throughout.
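The three cleaning strategies can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function names and the numbering pattern are assumptions made for illustration, and real requirement documents may need a different pattern (for instance, a requirement that begins with a quantity such as "5 pumps" would be wrongly stripped by this one).

```python
import re

# Hypothetical pattern for a leading section number or list marker such as
# "3.2.1", "(a)", "b)" or "-". This is an assumption, not taken from the paper.
_NUMBERING = re.compile(
    r"^\s*(?:\d+(?:\.\d+)*\.?|\([a-z0-9]+\)|[a-z]\)|[-*•])\s+",
    re.IGNORECASE,
)


def strip_numbering(line: str) -> str:
    """Strategy 1: drop a leading chapter number or list marker."""
    return _NUMBERING.sub("", line)


def normalize_whitespace(text: str) -> str:
    """Strategy 2: collapse runs of spaces, tabs, and line breaks."""
    return re.sub(r"\s+", " ", text).strip()


def merge_sub_items(parent: str, sub_items: list[str]) -> list[str]:
    """Strategy 3: prefix each sub-item with its parent description."""
    parent = normalize_whitespace(strip_numbering(parent))
    merged = []
    for item in sub_items:
        item = normalize_whitespace(strip_numbering(item))
        merged.append(f"{parent} {item}")
    return merged
```

For example, calling `merge_sub_items` with the parent "3.1.2 The pump shall:" and the sub-item "(a)  deliver   20 L/min" yields the single self-contained requirement "The pump shall: deliver 20 L/min", so the subject is preserved in every sub-item.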
After the data cleaning process, the requirement document data were annotated according to the five types of requirement labels defined in SysML: functional requirements, interface requirements, performance requirements, physical requirements, and design constraints. These labels cover the main types of requirements in complex system design and provide clear standards for the classification and analysis of requirement documents. The document descriptions corresponding to these five requirement labels are collectively referred to as system requirements (SRs). Classifying the scattered SRs into categories not only improves the organization of information but also lays the foundation for the subsequent construction of information extraction templates. The specific definitions and examples of the five requirement labels are provided in Table 1. This study strictly adheres to the definition standards for each requirement label and systematically annotates the SRs to ensure the accuracy and consistency of the annotation results, thus providing a high-quality corpus for subsequent research.
3.2. Construction of SRs Information Extraction Templates
The information extraction templates serve as the core foundation for enabling large language models to perform information extraction tasks, and their completeness directly impacts the accuracy and reliability of the extracted information. To achieve the precise extraction of information from different categories of system requirements (SRs), this paper designs corresponding structured information extraction templates based on the five requirement labels defined by SysML. Each template clearly defines the key information elements that should be present for each requirement type, aiming to ensure consistency and completeness in the extraction process. The design of these templates is based on a comprehensive analysis of the requirement categories, thoroughly considering the unique characteristics and shared attributes of various requirements in complex system design, ensuring the accurate extraction of the core components of each requirement category from the requirement documents.
The elements within the templates encompass various types of characteristic information, including the following:
Subject: The subject of each requirement represents the initiator, executor, or the primary system components involved. In the five types of requirements, the subject is a core component that clarifies the responsible party or related system elements, helping to define the key roles and their functions within the requirement. In the process of requirement data extraction using large language models, the subject is a necessary information element.
Attribute: In some SRs, in addition to the subject, specific attributes related to the subject may also appear, typically expressed in the form of “AA’s BB”, where BB is the core subject of the requirement. These attributes are used to further describe the specific characteristics or functionalities of the subject.
Object: In SRs, the object is the target entity involved when the subject performs certain actions or behaviors. It refers to the specific thing or system component that the described operation or function in the requirement acts upon.
Action: In the requirement document, the action refers to the operation or behavior performed by the subject on the object, specifically describing the interaction between the subject and the object.
Function: In functional requirements, the function refers to the behavior or operational capability that a system or component should possess. It specifies the tasks that the system needs to perform or the services it needs to provide. For functional requirements, functionality is an essential key information element.
Value: In performance and physical requirements, the value refers to specific performance metrics or physical attributes that a subject or object must achieve.
Principle: In design constraints, the principle refers to the standards, norms, or guidelines that must be followed during the system design process. For design constraints, principles are essential key information elements.
Condition: In the requirements document, a condition refers to the prerequisite or specific environment that must be met to fulfill a particular requirement.
For each of the five categories of requirement text, this paper selects the key elements that the category should contain. The template structure is shown in Table 2, where ✓ marks the key elements that should be present for each requirement type. For a given SR, certain elements (such as subject, object, and function) may appear multiple times, since a single SR may involve multiple subjects, objects, or functions. To incorporate the structured data into knowledge graph construction, this paper groups the SR elements into three types based on the characteristics of the information extraction templates: nodes, edges, and text. As shown in Figure 2, the node elements are subject, attribute, and object, which describe the key entities in the requirements. The edge element is action, which represents the relationship between nodes, with the specific content of the behavior recorded on the edge. The text elements are function, value, principle, and condition, which further describe the specific content of the requirements. During graph construction, nodes, edges, and text are combined progressively, so the domain knowledge graph expands continuously toward a comprehensive representation of the requirement information.
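The grouping of elements into nodes, edges, and text can be sketched as a small lookup table. The Python snippet below is only an illustration: the mapping follows the grouping described for Figure 2, while the record layout (a dictionary keyed by element name) and the function name `split_by_role` are assumptions.

```python
# Graph role of each template element, following the grouping described in the text.
ELEMENT_ROLE = {
    "subject": "node", "attribute": "node", "object": "node",
    "action": "edge",
    "function": "text", "value": "text", "principle": "text", "condition": "text",
}


def split_by_role(record: dict) -> dict:
    """Group the non-empty elements of one extracted SR by graph role.

    An element may occur several times in one SR, so every value is
    normalized to a list.
    """
    grouped = {"node": {}, "edge": {}, "text": {}}
    for element, value in record.items():
        role = ELEMENT_ROLE.get(element)
        if role is None or not value:
            continue  # skip non-element keys (e.g. "category") and empty fields
        values = value if isinstance(value, list) else [value]
        grouped[role][element] = values
    return grouped
```

Keeping every value as a list means that an SR with several subjects or objects is handled in the same way as one with a single subject.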
3.3. Design of GPT-4 Prompts for SRs Information Extraction
With its powerful natural language understanding and generation capabilities, GPT-4 excels in processing complex texts, recognizing contextual relationships, and generating structured outputs. Especially when dealing with technically demanding texts such as requirement documents, GPT-4 is capable of accurately identifying key elements within the text and effectively transforming them into structured data. Based on this, GPT-4 is selected as the core model for information extraction in this paper, and prompts are designed to guide it in performing the information extraction tasks, with these prompts being grounded in the information extraction templates.
The construction of prompts is a critical step in guiding large language models to perform specific tasks. Although several studies have explored how to design efficient prompt templates [43], creating a highly effective template remains a process that requires continuous iteration and accumulated experience. To achieve the precise extraction of SRs, this paper designs the following prompt template, aiming to ensure that GPT-4 can efficiently and accurately complete the information extraction tasks. The prompt is divided into the following modules:
(1) Task description: The task description clearly defines GPT-4’s core responsibility, which is to extract key information from the requirements document based on the provided <description, category> using the defined SRs information extraction template.
(2) Task objective: The task objective involves tokenizing the text, removing stop words, retaining the main content and key terms, and ultimately transforming the text into structured data. During this process, GPT-4 will strictly follow the SRs information extraction template to simplify the complex content into a structured output that is easy to understand and process. This objective ensures that the model can efficiently and accurately extract core information from technical documents.
(3) SRs information extraction template: This section outlines the SRs information extraction template established based on the characteristics of the requirement documents. The template offers specific extraction elements for the five types of requirements, such as subject, object, action, and conditions, ensuring that the extraction results are structured and accurate.
(4) Output format: To ensure that the extracted structured data are easy to process and interpret in subsequent stages, GPT-4 is instructed to output the results in a standardized JSON format. The output format includes a “category” label and “structured data”, where the “structured data” section is filled in accordance with the SRs information extraction template. This approach ensures that all extracted content maintains consistency and clarity.
(5) Examples: To further guide GPT-4 in understanding the task, few-shot learning [44] is employed by providing multiple examples. Each example includes input text and its corresponding expected output, ensuring that the model can generate structured results in accordance with the template.
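The five modules can be assembled into a single prompt string, as in the following Python sketch. The wording of each module here is a placeholder and not the prompt used in this paper. The function name `build_prompt` and the shapes of `template` and `examples` are assumptions, and the call to the model itself is deliberately left out.

```python
import json


def build_prompt(description, category, template, examples):
    """Assemble the five prompt modules into one prompt string.

    `template` maps each requirement category to its list of elements;
    `examples` is a list of (input, expected_output) pairs for few-shot use.
    """
    parts = [
        # (1) Task description
        "Task: extract key information from the requirement below, given its "
        "<description, category>, using the extraction template.",
        # (2) Task objective
        "Objective: keep the main content and key terms and return structured "
        "data that strictly follows the template.",
        # (3) Extraction template for the requirement categories
        "Template (elements per category):\n" + json.dumps(template, indent=2),
        # (4) Output format
        'Output: a JSON object with the keys "category" and "structured data".',
    ]
    # (5) Few-shot examples: each pairs an input with its expected output
    for i, (example_input, example_output) in enumerate(examples, start=1):
        parts.append(
            f"Example {i}\nInput: {json.dumps(example_input)}\n"
            f"Output: {json.dumps(example_output)}"
        )
    # The requirement to process goes last, in the <description, category> form.
    parts.append(
        "Input: " + json.dumps({"description": description, "category": category})
    )
    return "\n\n".join(parts)
```

Placing the examples before the actual input keeps the requirement to be processed at the end of the prompt, which is a common arrangement for few-shot prompting; other orderings are equally possible.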
After using GPT-4 for SRs information extraction, the originally unstructured requirement texts are transformed into labeled tuples. Each structured data tuple fully represents all of the information within the requirement text. These structured SR tuples and their corresponding category labels are then used as input for further construction of the knowledge graph for requirement representation.
3.4. Construction of SRs Requirement Representation Knowledge Graph
Requirement analysis for complex systems often involves intricate interactions and relationships between multiple entities. To better understand and manage these requirements, a knowledge graph serves as an effective visualization tool, depicting key elements and their interrelationships in system requirements with nodes and edges [18]. The knowledge graph not only visually represents the structure of the requirements, but also uncovers the underlying dependencies and functional interactions between them, aiding designers in gaining a comprehensive understanding of the overall SRs.
Analysis of the structure of the five types of requirements shows that physical requirements define the components of the system and their interrelationships, exhibiting static structural characteristics. Therefore, physical requirements can serve as the foundational framework for constructing the requirement representation knowledge graph. By describing the relationships between subjects, objects, and their attributes, physical requirements provide a clear structural backbone for the knowledge graph. Based on this framework, the other types of requirements (functional requirements, interface requirements, performance requirements, and design constraints) gradually expand and refine the graph. Each requirement category further enriches the graph content based on its specific elements (e.g., functions in functional requirements, interface specifications in interface requirements, numerical indicators in performance requirements, etc.), ensuring that the logical relationships between different types of requirements are fully represented. The relationships between the various nodes are shown in Table 3, and the SRs representation knowledge graph is constructed based on these node relationships to form the domain knowledge graph.
The framework for constructing the knowledge graph based on physical requirements is shown in Figure 3. The first step of the algorithm is to create an empty directed graph G to store all of the entities and their interrelationships within the physical requirements. As the requirement data are parsed, nodes and edges are incrementally added to the graph. In the second step, the algorithm iterates through each row of the physical requirement sheet, adding the value in the “subject” column as a node of graph G. If the “attribute” column in the row has a value, the subject has a specific attribute: the algorithm adds the attribute as a node and connects the subject node to the attribute node with a directed edge. Next, if the “object” column in the row contains a value, the algorithm adds the object as a node and connects it to the corresponding subject or attribute node with a directed edge. If the row has values in the “action”, “value”, or “condition” columns, the corresponding text is attached to the relevant edge (between the subject and the object, between the subject and the attribute, between the attribute and the object, or a self-loop on the subject) as a supplementary description of that connection. Finally, the graph G is output.
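The construction steps for the physical requirements can be sketched in Python as follows. This is a simplified illustration, not the algorithm of Figure 3 itself. The graph is stored in plain dictionaries instead of a graph library, each row is assumed to be a dictionary keyed by column name, and the rule for choosing which edge receives the text (the last edge created for the row, or a self-loop on the subject when the row has neither attribute nor object) is an assumption.

```python
def build_physical_graph(rows):
    """Build a directed graph from physical-requirement rows.

    The graph is a dict holding a node set and an edge map of the form
    {(source, target): [text, ...]}.
    """
    graph = {"nodes": set(), "edges": {}}
    for row in rows:
        subject = row["subject"]
        attribute = row.get("attribute")
        obj = row.get("object")
        graph["nodes"].add(subject)

        source, edge = subject, None
        if attribute:
            graph["nodes"].add(attribute)
            edge = (subject, attribute)
            graph["edges"].setdefault(edge, [])
            source = attribute  # assumption: the object hangs off the attribute
        if obj:
            graph["nodes"].add(obj)
            edge = (source, obj)
            graph["edges"].setdefault(edge, [])
        if edge is None:
            edge = (subject, subject)  # self-loop for a subject-only row

        # Attach the action, value, and condition text to the chosen edge.
        text = [
            f"{key}: {row[key]}"
            for key in ("action", "value", "condition")
            if row.get(key)
        ]
        if text:
            graph["edges"].setdefault(edge, []).extend(text)
    return graph
```

Because nodes are kept in a set and edges in a dictionary keyed by (source, target), a subject that appears in several rows is stored once and accumulates the text of all its rows.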
With the basic framework of the domain-specific requirement knowledge graph built from the physical requirement data, the functional requirements, performance requirements, design constraints, and interface requirements are integrated into the graph step by step. The specific algorithmic steps are shown in Figure 4. The first step involves importing the knowledge graph G built from the physical requirements and traversing each row of the functional requirements sheet. The algorithm checks whether the values in the “subject”, “attribute”, and “object” columns already exist in graph G. If any values are missing, they are added as nodes to the graph G. If the “attribute” column contains functional attributes, directed edges are created to connect the subject to the attribute. If the “object” column contains values, directed edges are used to connect the subject or attribute nodes to the object node. The “function”, “action”, and “condition” columns are attached as hover text on the edges. The second step addresses performance requirements. The algorithm traverses each row of the performance requirements sheet, checking whether the values in the “subject” and “attribute” columns exist in G. If any values are missing, they are added as nodes to the graph G. If the “attribute” column contains performance attributes, directed edges are created to connect the subject to the attribute. If the attribute has specific values or conditions, this information is attached as hover text on the edges. The third step involves processing design constraints. The algorithm traverses each row of the design constraints sheet, checking whether the values in the “subject” and “attribute” columns exist in G. If any values are missing, they are added as nodes to the graph G. If design constraint attributes are present in the “attribute” column, these are added as nodes to the graph and are connected to the subject nodes via directed edges. Descriptions of “action”, “principle”, or “condition” are appended as hover text on the edges, clearly displaying the design constraint requirements. The fourth step focuses on interface requirements. The algorithm traverses each row of the interface requirements sheet, checking whether the values in the “subject” and “object” columns exist in G. If any values are missing, they are added as nodes to the graph G. Subsequently, if the “object” column contains values, directed edges are used to connect the subject node to the object node. Descriptions of the interface requirement actions and conditions are attached as hover text on the edges.
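The four integration steps share one pattern: add any missing nodes, link them, and attach the category-specific text to the edge. The Python sketch below captures that pattern under stated assumptions. The graph is a plain dictionary with a node set and an edge map, each row is a dictionary keyed by column name, the column lists per category are read from the description above, and attaching the text to the last edge of each row (or to a self-loop when only a subject is present) is an assumption, not a detail given in Figure 4.

```python
# Columns read for each category, following the four steps described in the text.
CATEGORY_COLUMNS = {
    "functional": {"nodes": ("subject", "attribute", "object"),
                   "text": ("function", "action", "condition")},
    "performance": {"nodes": ("subject", "attribute"),
                    "text": ("value", "condition")},
    "design constraint": {"nodes": ("subject", "attribute"),
                          "text": ("action", "principle", "condition")},
    "interface": {"nodes": ("subject", "object"),
                  "text": ("action", "condition")},
}


def extend_graph(graph, rows, category):
    """Merge the rows of one requirement category into an existing graph in place.

    `graph` is {"nodes": set, "edges": {(source, target): [text, ...]}}.
    """
    columns = CATEGORY_COLUMNS[category]
    for row in rows:
        # Non-empty node columns, kept in subject -> attribute -> object order.
        chain = [row[c] for c in columns["nodes"] if row.get(c)]
        if not chain:
            continue
        graph["nodes"].update(chain)  # nodes already in the graph are unchanged
        text = [f"{k}: {row[k]}" for k in columns["text"] if row.get(k)]
        if len(chain) == 1:
            if text:  # assumption: a subject-only row is stored as a self-loop
                graph["edges"].setdefault((chain[0], chain[0]), []).extend(text)
            continue
        for source, target in zip(chain, chain[1:]):
            graph["edges"].setdefault((source, target), [])
        # Assumption: the text is attached to the last edge of the chain.
        graph["edges"][(chain[-2], chain[-1])].extend(text)
    return graph
```

Calling `extend_graph` once per category, in the order functional, performance, design constraint, interface, mirrors the stepwise integration described above.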
Based on the aforementioned steps, a complete knowledge graph for the domain-specific requirement documentation can be constructed. Because designers often focus on specific nodes, a node extraction algorithm is designed for this purpose. The algorithm extracts all of the relationships and associated nodes of a target node from the knowledge graph, thereby presenting all of the related requirement information in one view. The algorithmic logic is depicted in Figure 5. The first step is to import the domain-specific requirement knowledge graph G and initialize two empty structures: one for storing node information and the other for storing edge information. The second step involves extracting the properties of the target node from the network objects and storing them in the node information structure. In the third step, the algorithm traverses all adjacent nodes pointed to by the target node, collects their properties, and adds them to the node information. Simultaneously, it retrieves the edge information between the adjacent nodes and the target node, storing this in the edge information structure. The fourth step entails recursively traversing all neighbors of the adjacent nodes, collecting their properties and the edge information between them, and continuing this process until no further related nodes are found. The final step is to plot the relationship graph G’ of the target node, based on all of the collected node and edge information.
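The node extraction steps amount to collecting everything reachable from the target node along directed edges. The Python sketch below shows one way to do this, assuming the graph is stored as a dictionary with a node set and an edge map. It replaces the recursive traversal with an iterative breadth-first search, which visits the same nodes while avoiding deep recursion on large graphs, and it omits the final plotting step.

```python
from collections import deque


def extract_subgraph(graph, target):
    """Collect every node reachable from `target` and the edges between them.

    `graph` is {"nodes": set, "edges": {(source, target): [text, ...]}}.
    """
    # Index the outgoing edges once so that each neighbor lookup is cheap.
    successors = {}
    for source, dest in graph["edges"]:
        successors.setdefault(source, []).append(dest)

    nodes, edges = {target}, {}  # the node and edge information structures
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for neighbor in successors.get(current, []):
            edges[(current, neighbor)] = graph["edges"][(current, neighbor)]
            if neighbor not in nodes:  # the visited check also guards against cycles
                nodes.add(neighbor)
                queue.append(neighbor)
    return {"nodes": nodes, "edges": edges}
```

The returned dictionary plays the role of G’: it can be passed to any plotting routine, and nodes that only point to the target (without being reachable from it) are excluded because the traversal follows edge direction.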