BiLSTM-CRF-KG: A Construction Method of Software Requirements Specification Graph

Wang, Zhengdi; Pan, Jeng-Shyang; Chen, Qinyin; Yang, Shuangyuan

doi:10.3390/app12126016

Open AccessArticle

BiLSTM-CRF-KG: A Construction Method of Software Requirements Specification Graph

¹

Department of Software Engineering, Xiamen University, Xiamen 361005, China

²

College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China

³

Xiamen City University, Xiamen 361005, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2022, 12(12), 6016; https://doi.org/10.3390/app12126016

Submission received: 19 April 2022 / Revised: 7 June 2022 / Accepted: 7 June 2022 / Published: 13 June 2022

(This article belongs to the Section Computing and Artificial Intelligence)

Download

Browse Figures

Versions Notes

Abstract

:

A requirement analysis is the basis and source of software system development, and its accuracy, consistency and completeness are the keys to determining software quality. However, at present, most software requirements specifications are prepared manually, which has some problems, such as inconsistency with business description, low preparation efficiency, being error prone and difficulty communicating effectively with business personnel. Aiming at the above problems, this paper realizes a construction model of the software requirements specification graph BiSLTM-CRF-KG by using natural language processing and knowledge graph technology. Simulation experiments on 150 real software system business requirements description corpora show that the BiSLTM-CRF-KG model can obtain 96.31% functional entity recognition accuracy directly from the original corpus, which is better than the classical BiSLTM-CRF, IDCNN-CRF and CRF++ models, and has good performance on different kinds of data sets.

Keywords:

software requirement specifications; BiLSTM-CRF-KG; knowledge graph

1. Introduction

The software requirement analysis [1,2] is the basis and source of software system design and implementation, and its accuracy and completeness are the keys to determining software quality. However, in the process of compiling software project requirements specifications, manual methods are often used. In the process of requirements research, there are often inconsistencies and ambiguities in the understanding of requirements among business personnel, requirements researchers and system developers, such as omissions in requirements research or unclear business processes [3]. This leads to fatal errors in the design of subsequent system functional modules and greatly increases the projected R & D cost [4]. At the same time, the manually prepared software requirements specification documents are mainly for computer professionals. Their readability is poor, and it is difficult for business personnel to understand. As a result, business personnel cannot effectively confirm whether the requirements specification truly reflects the business requirements. In the final development process, there are often a large number of requirements changes and rework, resulting in a significant increase in development time and cost. Finally, the manually prepared requirements specifications are cumbersome, error prone and make it difficult to iterate the version.

At present, the preparation methods of requirements specification documents mainly include the manual preparation of Unified Modeling Language (UML), generation of requirements specifications based on extraction rules [5,6] and automatic generation of requirements specifications based on natural language processing [7]. Although manual preparation has a strong personalized customization ability to generate demand specifications, it is usually difficult to solve the problems of cumbersome manual preparation, being error prone and difficulty communicating effectively with business personnel, which are easily affected by the subjective ability of the writer [8]. The requirement specification generation method based on extraction rules can effectively extract the requirement specification information from the text, but the corresponding extraction rules will be slightly different for different functional requirements. Theoretically, the formulation of extraction rules cannot cover all software scenarios.

The development of natural language processing (NLP) technology provides a new idea for the automatic extraction of requirements specifications. On the one hand, it can effectively improve the efficiency of the formation of software requirements specifications; on the other hand, it can effectively avoid the manual errors of the writer. At present, the application of NLP in English knowledge extraction has been relatively mature and effective, and there are also applications in the field of Chinese knowledge extraction [9,10]. In the software requirements specification document, the most involved parts are user roles, functional requirements, data tables and business processes. Business process is often the flow process of data tables and operation instructions. Function, data and the relationship between them are the core of software requirements specification. If NLP technology can be used to quickly and automatically extract relevant functions, data tables and effectively express the relationship between them, it will effectively improve the efficiency and level of compiling software requirements specification documents.

Based on the above problems, this study focuses on the inconsistency, ambiguity, weak readability, and cumbersome and error prone problems in the preparation of software requirements specification documents and defines the automatic generation model of software requirements specification BiLSTM-CRF-KG (Bi-Directional Long Short-Term Memory-Conditional Random Field-Knowledge Graph) by borrowing the knowledge extraction ability of NLP, the visual expression of a knowledge graph and the ability to remove ambiguity.

The main contributions and innovations of the study are as follows:

Introducing natural language processing and knowledge graph technology into the software requirements specification. Furthermore, this paper, using the knowledge unity and visual expression characteristics of a knowledge graph, can effectively eliminate the fuzziness, ambiguity and possible lack of research in the preparation of software requirements specification, which shorten the level of business understanding of requirements analysts and business personnel.
Improving the traditional U/C matrix into an S/U/C matrix, which is used to effectively express the data form flow relationship between the creator of the data form, the sender and the user. It can effectively solve the problems of function omission, lack of data form and fuzzy business process in the process of some demand research and analysis.

This study conducts simulation experiments on 150 real software system business requirements, and experimental result shows that the BiSLTM-CRF-KG model can obtain 96.48% functional entity recognition accuracy directly from the original corpus, which is more than 5 percentage points higher than the classical BiSLTM-CRF, IDCNN-CRF and CRF++ models, and has a good performance in different kinds of data sets.

This paper consists of five sections. Section 1 introduces the research background and significance, and briefly introduces the research content. Section 2 is related work and research. Section 3 describes the theory and framework process of BiLSTM-CRF-KG. Section 4 is the simulation experiment and its results. Section 5 is the conclusion and future work.

2. Related Work

A range of methods have been introduced in previous research to (semi-)automatically extract functional requirements written in natural language text.

Ghosh et al. [11] used the method of semantic analysis to extract the intermediate representation table, then used a set of heuristic post-processing rules to transform the intermediate representation table into a formal specification. Roth et al. [12] proposed the technology from semantic role annotation to high-level ontology to describe the concepts and relationships of static software functions. However, this technology cannot capture more fine-grained structured information in the requirements specification. Zhou, ZD et al. [13] proposed mathematical methods to analyze customer needs from the perspective of information, and to obtain, classify and process the information. Li, Y et al. [14] used NLP technology to match the defined pattern to achieve information extraction. However, they did not analyze the semantics of fine-grained elements. Using machine learning, natural language processing, semantic analysis and other methods, Yinglin Wang [15] proposed a method to automatically extract the structural information of functional requirements from natural language software requirements description and compared the performance of two different named entity recognition models. However, this study only carries out the experiment at the sentence level and does not perform a comprehensive analysis of the requirements according to the text paragraph and the context information.

Pavlova et al. [16] conducted a semantic analysis on the requirements of the Internet of Things system, increased the capacity of the Internet of Things system, ensured the sufficiency of the measurement information in the software requirements, and improved the quality of the Internet of Things system software. Li, MY et al. [17] presented a novel approach named RENE, which employs the LSTM-CRF model for requirement entity extraction and introduces the general knowledge to reduce the demands for labeled data. However, due to the reduction of data annotation, the performance of the model was unsatisfactory. Through clustering and entry point selection, Ieva, C et al. [18] proposed a method to automatically extract the main functions of the program and carried out feature extraction and tracking. Park, BK et al. [19] proposed a linguistic analysis method based on the semantic analysis of the Fillmore’s textual approach. This method extracts use-cases from informal requirement specifications.

For Chinese knowledge extraction, there are also some scholars engaged in related research. Sun, X et al. [20] proposed a method for Chinese medical named entities recognition on the combination of Chinese radicals and etymon features in the classic character-based BiLSTM-CRF (Bi-Directional Long Short-Term Memory-Conditional Random Field), which performs better than the state-of-art deep learning models in our experiment. Liu, WM et al. [21] proposed a Chinese named entity recognition method based on rules and conditional random fields, which could effectively identify the named entities and improve the processing speed and efficiency. In addition, there is a certain practical value. Dao, AT et al. [22] proposed an approach to improve monolingual named entity recognition systems by exploiting an existing unannotated English–Chinese bilingual corpus.

3. Structural Information Extraction Model of Functional Requirements

Aiming at the problems of inconsistency, ambiguity, being cumbersome and error prone in manually writing software requirements specifications and that which cannot be expressed visually, this study proposes a construction method of software requirements specification graph BiLSTM-CRF-KG based on the BiLSTM-CRF model and knowledge graph technology. The framework design is shown in Figure 1.

The main contents of building the software requirements specification graph BiLSTM-CRF-KG are as follows:

Preprocessing of the original corpus: it mainly realizes the Chinese sentence segmentation, word segmentation, part of speech tagging, named entity and relation tagging of the original business requirements description corpus.
Entity relationship extraction: input the segmentation annotation results into the BiLSTM-CRF model to extract the relationship between functional entities and functions, then generate a preliminary entity relationship set.
Entity disambiguation and hidden relation learning: use context to disambiguate functional entities such as pronouns and abbreviations and learn the hidden relation between functional entities in the preliminary entity relation set, so as to further optimize the entity relation set.
Generation of functional structure graph: use the hierarchical structure of multi tree to transform the optimized entity relationship set into the demanded relationship tree. Then, using the breadth first search strategy, the requirement relationship tree is transformed into a (parent function, parent–child relationship, child function) function structure triplet, which is saved in the neo4j database to realize the automatic embedding of the function structure graph.
Study the method of converting the traditional U/C matrix into the S/U/C matrix to reflect the real flow relationship between the creator/sender/user of the business data table.
Transform the business data table description into the S/U/C matrix.
Use the relationship between the S/U/C matrix mapping function and the data table to generate the (function-create/send/use relationship-data table) triplet and save it in the neo4j database to realize the automatic embedding of the function-data graph.
Integrate the function structure graph and function-data graph to generate the software requirements specification graph.

3.1. Preprocessing of the Original Corpus

The steps of preprocessing the original corpus are shown in Figure 2.

Before entity recognition and relation extraction, this project must carry out sentence segmentation, word segmentation and part of speech tagging on the original corpus.

First of all, Chinese sentence segmentation needs to be carried out, which is based on regular matching on the original corpus. The characteristic characters should be counted at the end of Chinese text sentences. A regular expression group should be constructed that matches a complete sentence, and it is utilized as the input parameter of the split() segmentation function to carry out Chinese sentence segmentation.

Secondly, the Jieba tool is utilized to segment Chinese words and label parts of speech of each clause. This is because a part of speech is used in the following, named entity pruning and disambiguation operations. The word attribute of each participle represents the content of the participle, and the flag attribute represents the participle’s part of speech.

Then, the BMEWO named entity and 0-1 relation dimension are utilized for each participle. BMEWO annotation is the most common method of naming entity annotation, among which B represents the entity start word, M stands for the entity intermediate word, E represents the entity ending word, W represents the entity whole word and O represents the non-entity word. For the relational words, the 0-1 annotation method is mainly used. Among them, 1 represents the relation word and 0 represents the non-relational word.

Finally, the original corpus is divided into independent words, and the features of each word are recorded as

w_{i}

.

w_{i} = w (w o r d, f l a g, e, r)

(1)

Among them,

w o r d

represents the content of word segmentation,

f l a g

represents the part of speech of word segmentation, e represents the attribute of named entity and r represents the attribute of relation. The implementation of this part is shown in Algorithm 1.

Algorithm 1. Preprocessing of the Original Corpus

Input: Corpus
Output:

w

1: sentences = Corpus.split()
2: i = 0
3: for s in sentences do
4: temp = jieba.cut(s)
5:

w_{i} . w o r d = t e m p . w o r d

6:

w_{i} . f l a g = t e m p . w o r d

7:

w_{i} . e = B M E W O (w_{i} . w o r d)

8:

w_{i} . r = r e l a t i o n t a g g i n g (w_{i} . w o r d)

9: i++
10: end for
11: return w

3.2. Named Entity Recognition and Relation Extraction

In this study, the BiLSTM-CRF model is used for named entity recognition and relationship extraction. The BiLSTM-CRF model is a combination of the traditional BiLSTM and CRF models, which can not only consider the correlation between sequences as CRF, but also have the feature extraction and fitting ability of LSTM. Among them, BiLSTM is composed of forward LSTM and backward LSTM. The structure of the BiLSTM-CRF model is shown in Figure 3.

The LSTM model is composed of input words

X_{t}

, cell state

C_{t}

, temporary cell state

\tilde{C_{t}}

, hidden state

h_{t}

, forgetting gate

f_{t}

, memory gate

i_{t}

and output gate

o_{t}

. The calculation process of LSTM can be summarized as follows: by forgetting and memorizing the information in the cell state, the useful information for the subsequent calculation can be transferred, while the useless information is discarded, and the hidden layer state

h_{t}

will be the output at each time step. Forgetting, memory and output are controlled by forgetting gate

f_{t}

, memory gate

i_{t}

and output gate

o_{t}

, respectively, which are calculated by the hidden layer state

h_{t - 1}

of the last time and the current input

X_{t}

.

The traditional LSTM named entity recognition model cannot get the basic information from the back to the front when coding each statement data. In the classification of more fine-grained word segmentation, such as the classification of different degrees of commendatory or derogatory emotional words, we need to pay special attention to the logical relationship between each word. BiLSTM integrates two groups of LSTM layers with opposite directions, one in sentence order and the other in reverse order, so that words contain both historical information and future information, which are more conducive to capturing two-way semantic dependence.

The process of named entity recognition is shown in Figure 4. When each word segment in a sentence is input into the BiLSTM-CRF model, the embedding layer will map it to an initial word vector, and then transfer it to the BiLSTM layer to obtain the forward and backward vectors of the sentence. These forward and backward vectors are spliced as the hidden state vector of the current vocabulary. Although a softmax layer can be used to predict the label of each hidden vector h directly, the predicted label does not consider the information of the front and back labels.

For example, when the named entity attribute of

w_{1}

is “O”, the named entity attribute of

w_{2}

cannot be “M” and “E”. Therefore, this project inputs a hidden state vector hidden to the CRF layer, and then gives more reasonable named entity recognition and relationship extraction results by combining current input information and context information. The result of the named entity recognition is recorded as

p w

. The implementation of this part is shown in Algorithm 2.

Algorithm2. Entity relationship extraction based on BiLSTM-CRF

Input: w
Output:

e

1: pw = BiLSTM-CRF(w)
2: i = 0; temp=’’; j=0;
3: for c in pw do
4:

i f

c.flag == ‘x’ && c.e != ‘O’ then
5:

c . e = = ‘ O ’

6:

i f

c.e == ‘B’ then
7:

i f {pw}_{i + 1} . e = = ‘ M ’ t h e n

8:

{pw}_{i + 1} . e = = ‘ B ’

9: else

i f {pw}_{i + 1} . e = = ‘ E ’ t h e n

10:

{pw}_{i + 1} . e = = ‘ W ’

11: end if
12: end if
13: end if
14: i++
15: end for
16: for d in pw do
17:

i f

d.e == ‘W’ then
18:

e_{j} . entity = d . word; e_{j} . type = 0;

j++;
19: else

i f

d.e == ‘B’ || d.e == ‘M’ then
20:

temp + = d . word;

21: else

i f

d.e == ‘E’ then
22:

temp + = d . word; e_{j} . entity = temp; e_{j} . type = 0;

j++;temp = ‘’;
23:

e l s e i f

d.r == 1 then
24:

e_{j} . entity = d . word; e_{j} . type = 1;

j++;
25: end if
26: end for
27: return e

3.3. Entity Disambiguation and Hidden Relation Learning

The structure of entity disambiguation and hidden relation learning is shown in Figure 5.

Firstly, based on the part of speech that is word segmentation, the experiment disambiguates the recognition result of the named entity of word segmentation. Secondly, the entity and relationship of the disambiguated results are generated based on regular matching. Thirdly, after the entity relation set is generated, we need to align the abbreviations and pronouns, and further learn the hidden relations in the original corpus, so as to deeply mine the structured information of the functional requirements.

3.3.1. Word Sense Disambiguation Model Based on Part of Speech

In the BiLSTM-CRF model, although CRF is based on context information, it can obtain reasonable named entity labels. However, there are some errors in the entity label. For example, numbers or punctuation marks are often mistakenly identified as “B” labels (relational words usually do not have such problems because they are mostly composed of single words). Therefore, some parts of speech features are utilized as a screening condition to correct the labels which should be marked as named entity words, but not be positioned as named entity words. The correction definition is shown in Formula (2).

{pw}_{i} . e = O, {if pw}_{i} . flag =^{} {‘ x ’ and pw}_{i} . e! =^{} ‘ O ’

(2)

After the word sense disambiguation model, if the symbol marked as “B” is corrected to “O”, there will be no word marked as “B” in front of the segmentation labeled as “M” or “E”. Therefore, the output of the model needs to be filtered by the CRF model to prevent unreasonable entity tag collocation.

3.3.2. Entity Generation Based on Regular Matching

After word sense disambiguation, the named entity label of each participle can be obtained. In this project, entity generation based on regular matching is carried out according to entity and relationship composition rules, and the generation rules are shown in Formulas (3) and (4).

e_{j} . word = \{\begin{array}{l} {pw}_{if (e =^{} ‘ B ’)} + n * {pw}_{if (e =^{} ‘ M ’)} + {pw}_{if (e =^{} ‘ E ’)} \\ \cup {pw}_{if (e =^{} ‘ W ’)} \\ \cup {pw}_{if (r = 1)} \end{array}

(3)

e_{j} . type = \{\begin{matrix} 0, if {pw}_{j} . r = 0 \\ 1, if {pw}_{j} . r = 1 \end{matrix}

(4)

A named entity is composed of a participle marked with ‘B’, several participles marked with ‘M’ and a participle marked with ‘E’; or, it can also be composed of a participle marked with ‘W’.

Here,

t y p e = 0

means that the item is an entity and

t y p e = 1

means that the item is a relationship. Because of the sequence of entities and relations in sentences, they cannot be stored separately.

3.3.3. Alignment of Abbreviations and Pronouns Based on Context Information

In the description of business requirements, abbreviations and pronouns are common. For this kind of entity, this project needs to align abbreviations and pronouns based on context information. For the demonstrative pronoun, the “principle of proximity” is adopted to correct the demonstrative pronoun to the nearest entity before the entity. The definition is shown in Formula (5).

e_{pron} . word = e_{pron - k} . word e_{pron - k} {is the nearest entity in front of e}_{pron}

(5)

For abbreviations, the content can be considered as the entity contained in the subject in the previous sentence. If the subject in the previous sentence does not contain an entity, it is considered as the nearest entity, as shown in Formula (6).

e_{abbr} . word = \{\begin{array}{l} e_{subject} . word, if there is entity in subject . \\ e_{abbr - k} . word, otherwise . \end{array}

(6)

3.3.4. Entity Hiding Relation Learning Based on Punctuation and Number

In the process of relation extraction, it is not enough to extract key words. Relationships are often hidden within and between sentences. The most representative is the order and hierarchy information hidden by numbers and punctuation. If it is not fully utilized in relation extraction, it is easy to break the relation chain. Therefore, the use of hidden relation words is considered in this project, such as punctuation marks and numbers for relation extraction and disambiguation.

Numerical ranking is often seen in the requirement description corpora, which includes both juxtaposition and inclusion. For example, “(1)” and “(2)” represent juxtaposition, while the order of “1”, “(1)” or “①” represents inclusion. Therefore, the extraction rules for the hidden relationship of numbers are shown in Formula (7).

{add e}_{j} = \{\begin{array}{l} (include, x), if non - parallel relation \\ (and, x), if parallel relation \end{array}

(7)

Every time a number such as a serial number appears, a relation node needs to be added to the e set. It is assumed that the primary functions in the corpus are labeled with “1”, “2”. The secondary functions are labeled with “(1)” “(2)”. The three-level functions use labels such as “①” and “②”. Then, when the number level is the same, it represents the parallel relationship between the next entity and the previous entity; when the number levels are different, the number represents the inclusion relationship between the next entity and the previous entity at the same level.

Punctuation, as the connector of two sentences or phrases, also carries the hidden entity relationship. In this project, the enumeration method is used to extract relations, and the definition of rules is shown in Formula (8).

{add e}_{j} = \{\begin{array}{l} (and, e_{j - 1} . type), {if pw}_{number} . word = “ 、 ” \\ (end, 1), {if pw}_{number} . word = “ 。 ” \cup “; ” \end{array}

(8)

Every time a punctuation mark appears, it is necessary to add a relation node to the e set. When “、” appears in the sentence, it represents the juxtaposition between the next entity and the previous entity of the same level, so the type of the entity needs to be the same as the previous entity; when “。” or “;” is in the sentence, the end relation node is inserted to facilitate the construction and extraction of triples. The implementation of this part is shown in Algorithm 3.

Algorithm3. Entity disambiguation and hidden relation learning

Input: e, refword, abbword, includeword, andword, numericsign
Output:

e

1: for i = 0:

e . length

do
2:

i f e_{i} . w o r d i n r e f w o r d

then
3:

e_{i} . w o r d

=

e_{i - 1} . w o r d

4:

e l s e i f e_{i} . w o r d i n a b b w o r d

then
5:

e_{i} . w o r d

= subject(i)
6: end if
7: end for
8: for i = 0:

e . length

do
9:

i f e_{i} . w o r d i n i n c l u d e w o r d

then
10:

e_{i} . w o r d

=

“ i n c l u d e ”

11:

e l s e i f e_{i} . w o r d i n a n d w o r d

then
12: e.delete(i)
13: end if
14: end for
15: for i = 0:

e . length

do
16:

i f

isnumeric(

e_{i} . w o r d

) then
17:

i f e_{i} . w o r d

=1 &&

e_{i - 1} . w o r d

not in numericsign then
18: e.splice(i+1,1,{word:include,type:1})
19: else

i f e_{i} . w o r d

=1 &&

e_{i - 1} . w o r d

in numericsign then
20: e.splice(i,1,{word:include,type:1})
21: end if
22: end if
23: end for
24: return e

3.4. Generation of Functional Structure Graph

3.4.1. Generation of Requirement Relation Tree

Since the entity relation set is actually a sequential sequence, the extraction sequence can be generated based on the idea of stack, so as to establish the requirement relation tree RTree.

First, start to stack from the first element of the entity relationship set, that is, the begin node, until the end node is encountered. Then, start to build the requirement relation tree according to the elements in the stack: perform the stack out operation on the elements in the stack successively, assuming that the stack out element is e_i. If e_i.type = 0, traverse the RTree of the requirement relation tree. If the e_i already exists in the requirement relation tree, it is located on the tree node. Then continue to jump below, if the next e_i.type = 0, the extraction sequence error. If e_i.type = 1, record the relationship node information, and then continue to jump below. If the next e_i.type = 1, the extraction sequence is wrong.

In this way, a part of the requirement relation tree RTree is obtained. Repeat the above operations again, and finally obtain the complete requirement relation RTree. The name of the software system is the root node of the requirement relation tree, and each relation node is both the child node of the parent function and the parent node of the child function.

3.4.2. Embedding of Functional Structure Graph

In the embedded part of the function structure graph, the search is carried out layer by layer according to the breadth first strategy of the tree, and the triples are constructed step by step. The head node is

{RTree}_{j - 1}

, the relationship node is

{RTree}_{j}

, and the tail node is

{RTree}_{j + 1 : n}

.

After generating the function-function triplet, this study uses the py2neo module of Python to embed the function structure graph and save it to the neo4j database. The implementation of this part is shown in Algorithm 4.

Algorithm4. Generation of functional structure graph

Input: e
Output:

Function_Function_KG

1: tree = {}
2: for i = 0:

e . length

do
3: while

e_{i} . w o r d! = ‘ e n d ’

do
4:

s t a c k . p u s h (e_{i})

5: i++
6: end while
7: while

s t a c k . p e e k () . w o r d = ‘ b e g i n ’

do
8: if

s t a c k . p e e k () . t y p e = 1

then
9: if

s t a c k . p e e k () . w o r d n o t i n \{‘ b e g i n ’, ‘ e n d ’\}

then
10: if

! T R E E_E X I S T (s t a c k . p e e k (i - 1))

then
11: relationnode = TREE_CREATENODE(stack.pop)
12: end if
13:

n o d e = s t a c k . p e e k (i - 1)

14: TREE_ADDNODE(node,

s t a c k . p e e k (i + 1 : s t a c k . l e n g t h - 1)

)
15: end if
16: end if
17: end while
18: stack.empty()
19: end for
20: tree = Concact_Tree()
21: relationNode = BFS(tree.root,type=1)
22: for node in relationNode do
23: triple.append((FatherNode(node),node,ChildNode(node)))
24: end for
25: Function_Function_KG = triple2KG(triple)
26: return Function_Function_KG

3.5. Method of Converting Traditional U/C Matrix into S/U/C Matrix

The U/C matrix can only reflect the relationship between creator and user, but it cannot reflect the flow of business data tables to multiple users at the same time. For example, function A generates data1, and function B and C use data1. However, the U/C matrix cannot determine whether function A is really sent to function B and C successfully. If this situation is not solved, it will often lead to problems such as the omission of functions and lack of data tables in the process of demand research and analysis.

In order to solve this problem, this study disassembles the U/C matrix into multiple send/use/create (S/U/C) matrices. The main operations are as follows: based on a functional entity, only retain its corresponding entire row information and data column information with “C”, and then modify all “U” in the column with “C” to “S”. The process is shown in Figure 6.

In the process of demand research and demand analysis, the S/U/C matrix can be used to record the relationship between functions and data, so as to find the possible problems such as function omission and data loss in the process of demand research. The implementation of this part is shown in Algorithm 5.

Algorithm5. Converting traditional U/C matrix into S/U/C matrix

Input: UC
Output:

SUC

1: for f in functionSet do
2: Cset=set()
3: for i in functionSet do
4: for j in dataSet do
5: if i==j then
6: SUC[f][i][j] = UC[i][j]
7: if UC[i][j]==’C’ or UC[i][j]==’CU’ then
8: Cset.add(j)
9: end if
10: end if
11: end for
12: end for
13: for k in CSet do
14: for j in functionSet do
15: if UC[j][k] == ’U’ then
16: SUC[f][j][k] = ’S’
17: end if
18: end for
19: end for
20: end for
21: return SUC

3.6. Generation of S/U/C Matrix

Using the improved S/U/C matrix, this study extracts the mapping relationship between business functions and data tables in the original business requirements’ description. First, expand according to the functions. Each function corresponds to a two-dimensional S/U/C table. For each table, all data tables are taken as the first row of the S/U/C matrix and all functions as the first column of the S/U/C matrix. The use, create and send relationship between functions and data in the data table extraction results is represented by “C”, “S” and “U”, and filled in the corresponding position of the table to form a complete S/U/C matrix.

3.7. Generation of Functional-Data Graph

This study extracts the function data relationship pairs with “C”, “S” and “U” relationships in the S/U/C matrix; generates the (function-relationship-data table) triplet (the relationships are create, send and use); and uses the py2neo module of Python to save it to the neo4j database to realize the embedding of the function-data graph.

The head node corresponding to the “create” and “send” relationship is the function entity, and the tail node is the data table entity; the head node corresponding to the “use” relationship is a data table entity, and the tail node is a function entity. The implementation of this part is shown in Algorithm 6.

Algorithm6. Generation of function-data graph

Input: SUC
Output: Function_Data_KG
1: for f in functionSet do
2: for x in functionSet do
3: for y in dataSet do
4: if SUC[f][x][y] == ‘C’ then
5: triple.append(x,’create’,y)
6: else if SUC[f][x][y] == ‘U’ then
7: triple.append(y,’use’,x)
8: else if SUC[f][x][y] == ‘S’ then
9: triple.append(x,’send’,y)
10: end if
11: end for
12: end for
13: end for
14: Function_Data_KG = triple2KG(triple)
15: return Function_Data_KG

3.8. Generation of Software Requirements Specification Graph

In this study, the function structure graph and function-data graph are fused to form the software requirements specification graph.

4. Experimental Results and Analysis

4.1. Method of Converting Traditional U/C Matrix into S/U/C Matrix

The experimental data come from 150 real software system description documents collected manually to form the original business corpus, which can be divided into four categories according to the characteristics of the original business description corpus, as shown in Table 1.

In this experiment, the Python based deep learning framework tensorflow is utilized to realize the simulation of the BiLSTM-CRF model. The specific software and hardware configuration of the experimental environment is shown in Table 2.

4.2. Typical Experimental Results and Evaluation

Because the experiment in this study belongs to the problem of classification prediction, the accuracy rate, precision rate, recall rate and F1-score are used as evaluation indexes.

After preprocessing the original corpus, the experiment divides the data into the training set, verification set and test set according to the ratio of 6:2:2. Then, the features and labels of each clause in the training set are input into BiLSTM-CRF to train the model. The training parameters are shown in Table 3.

The training loss is calculated by the loss_function interface function of the CRF model, and the change curve is shown in Figure 7.

When the epoch value is 9, the loss of the training set tends to converge, while the loss of the verification set has begun to increase, indicating that the model has been over fitted; therefore, the epoch in the training parameters is 9.

After the model training, the performance of the model is tested through the test set, and then named entity recognition and context-based word meaning and entity disambiguation are carried out. Finally, the function-data graph is generated based on the tree model. The results of the typical experimental process are shown in Figure 8.

Figure 8 shows the process from the original corpus to the generation of the function structure graph. Figure 8a is a paragraph from the original corpus, which describes the business requirements of the software in natural language; Figure 8b is a preliminary entity relationship set generated after named entity recognition. Each element in the set contains the content of the entity (relationship) and its type (whether the element is an entity or a relationship). Figure 8c is the entity relation set after entity disambiguation and hidden relation learning based on the preliminary entity relation set. The meaning of each element is consistent with Figure 8b; Figure 8d shows the function structure graph. The label of each node in the graph represents the content of the function node.

At the same time, the experiment extracts the data table from the business requirements description, arranges it into the S/U/C matrix, extracts the triples and finally generates the function-data graph. Typical experimental results are shown in Figure 9.

Figure 9 shows the process from data table extraction results to function-data graph generation. Figure 9a is the extraction result of the data table, and each row represents the relationship between the data table and the function. The first column shows the name of the data table, the second column is the name of the relationship between the data table and the function and the third column is the name of the function. Figure 9b is the S/U/C matrix generated from the results of Figure 9a. Each two-dimensional table corresponds to a function dimension. The rows in the table represent functions and the columns represent data. The table content represents the relationship and includes “Send”, “Create” and “Use”. Figure 9c shows the function-data graph. Each yellow node in the graph is a function entity node, and its label represents the content of the function node. Each pink node is a data entity node, and its label represents the content of the data node.

After merging the two parts of the atlas, the final software requirements specification graph is shown in Figure 10.

4.3. Analysis of Experimental Results

4.3.1. Effectiveness Analysis

In this study, 150 original software business requirements documents are embedded in the software requirements specification graph. The precision and recall scatter diagram of each sample is shown in Figure 11 and Figure 12.

It can be seen that the performance of the BiLSTM-CRF-KG model is very stable, and the precision rate and recall rate are stable at about 96%. However, the performance of two samples is significantly lower than that of other samples. The experiment defines them as sample 1 and sample 2. After analyzing the original business requirements description of the two samples, it is found that the low performance of sample 1 is due to the long distance between the system name and the function module in the original business requirements description, and the subject is omitted in the statement describing the parent–child relationship between functions, resulting in performance problems. The reason for the low performance of sample 2 is that the serial number adopts the ancient Greek alphabet format, which is not defined in the hidden relationship learning.

In order to further verify the effectiveness of the BiLSTM-CRF-KG model, the experimental results of BiLSTM-CRF-KG and the other three named entity recognition models are compared and analyzed. The evaluation results of named entity recognition are shown in Table 4.

The simulation results show that:

From the perspective of named entity recognition alone, the performance of the BiLSTM-CRF model for the functional entity recognition of Chinese software requirements is better than the IDCNN-CRF and CRF++ models.
The performance of BiLSTM-CRF-KG is much better than that of BiLSTM-CRF and the other models. It is proven that the word sense disambiguation network, entity disambiguation and hidden relationship learning network greatly improve the performance of the model, and especially make up for the low recall rate of BiLSTM-CRF.

4.3.2. Applicability Analysis

In order to further verify the applicability of the model, experiments are carried out on data sets with different characteristics. The overall performance comparison is shown in Table 5. It can be seen that there is little difference in the performance of the BiLSTM-CRF-KG model under different data sets.

Figure 13 shows that in the named entity recognition stage, due to the large number of abbreviations and pronouns, the recall rate of the second set of data sets is far lower than the average level; the performance of the first data set is the best, much higher than the average, while the performance of the fourth data set is the worst.

In order to further verify the performance of each group of data sets in different stages, simulation experiments are carried out for each data set in different stages. The specific F1-score is shown in Figure 14.

The experimental results show that the performance of each data set is improving with the deepening of the experiment. Moreover, each step of the model will significantly improve the performance of a set of data sets. For example, the model of acronym and pronoun alignment greatly improves the performance of the second set of data sets, while there are more acronyms and pronouns in the second set of data sets, which prove the necessity of knowledge alignment.

The changes of each evaluation index of this data set before and after the model of acronym and pronoun alignment are shown in Figure 15.

It can be seen from the performance comparison results that although the alignment of abbreviations and pronouns cannot greatly improve the accuracy of the model, the improvement of the recall rate is huge. Through the observation and analysis of the original corpus, it is found that for acronyms and pronouns, especially pronouns, the named entity recognition model is prone to omission in the recognition process, resulting in a low recall rate.

At the same time, it can be found that for the hidden relationship learning steps, the performance of the fourth set of data set is better than that of the third set of data set. This experiment shows that learning the implicit relationship between numbers and punctuation can effectively promote the extraction and embedding of function requirement structure information and avoid the problem of a lack of relationship between parent–child functions.

5. Conclusions

Aiming at the problems of inconsistency between the manual preparation of software requirements specification and business description, low preparation efficiency, being error prone and difficulty communicating effectively with business personnel, this paper proposes a construction model of software requirements specification graph BiLSTM-CRF-KG. The model combines the knowledge graph technology with the requirements specification and makes use of the visual expression characteristics of the knowledge graph, which can greatly eliminate the possible inconsistency of understanding in the preparation process of the requirements specification and facilitate the needs communication between business personnel and technicians. This paper conducts simulation experiments on 150 real software system business requirements description corpora. The experimental results show that the BiLSTM-CRF-KG model can obtain 96.31% functional entity recognition precision directly from the original corpus, which is better than the classical BiLSTM-CRF, IDCNN-CRF and CRF++ models, and has good performance on different kinds of data sets.

The BiLSTM-CRF-KG model combines the knowledge graph technology with the requirements specification and makes use of the visual expression characteristics of the knowledge graph, which can greatly eliminate the possible inconsistency of understanding in the preparation process of the requirements specification and facilitate the need for communication between business personnel and technicians.

At the same time, this paper makes an effective supplement to the field of Chinese named entity recognition, and further improves the accuracy of the automatic extraction of software requirements specification information, which plays a great role in the (semi-)automatic production at the level of software requirements.

The future work of this paper includes: (1) the requirements specification graph constructed in this paper only realizes the embedding of functions, data and the relationship between them, but has not realized the embedding of user roles and other entities; (2) UML diagrams such as the software function hierarchy diagram and data flow diagram can be automatically generated according to the software requirements specification graph.

Author Contributions

Conceptualization, S.Y. and J.-S.P.; methodology, Q.C.; software, Z.W.; validation, Z.W., S.Y. and Q.C.; formal analysis, J.-S.P.; investigation, Q.C.; resources, S.Y.; data curation, Z.W.; writing—original draft preparation, Z.W.; writing—review and editing, S.Y. and Z.W.; visualization, Q.C.; supervision, S.Y.; project administration, J.-S.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

Adikara, F.; Anggarani, A. Qualitative Requirements Analysis Process in Organization Goal-Oriented Requirements Engineering (OGORE) for E-Commerce Development. Lect. Notes Electr. Eng. 2018, 449, 318–324. [Google Scholar]
Boyarchuk, A.; Pavlova, O.; Bodnar, M.; Lopatto, I. Approach to the Analysis of Software Requirements Specification on Its Structure Correctness. CEUR Workshop Proc. -Ser. 2020, 2623, 85–95. [Google Scholar]
Pamuji, R.; Saputra, Y.A.; Ramdani, F. The Relation of Fit Tables and Textual Requirements to Understand Software Requirements. In Proceedings of the 3rd International Conference on Sustainable Information Engineering and Technology, Malang, Indonesia, 10–12 November 2018; pp. 142–146. [Google Scholar]
Bilal, M.; Gani, A.; Liaqat, M.; Bashir, N.; Malik, N. Risk assessment across life cycle phases for small and medium software projects. J. Eng. Sci. Technol. 2020, 15, 572–588. [Google Scholar]
Shimada, H.; Nakagawa, H.; Tsuchiya, T. Constructing a Goal Model from Requirements Descriptions Based on Extraction Rules. Commun. Comput. Inf. Sci. 2018, 809, 175–188. [Google Scholar]
Murugesh, S.; Jaya, A. An integrated approach towards automated software requirements elicitation from unstructured documents. J. Ambient. Intell. Humaniz. Comput. 2021, 12, 3763–3773. [Google Scholar] [CrossRef]
Bashir, N.; Bilal, M.; Liaqat, M.; Marjani, M.; Malik, N.; Ali, M. Modeling Class Diagram using NLP in Object-Oriented Designing. In Proceedings of the 2021 IEEE National Computing Colleges Conference, Taif, Saudi Arabia, 27–28 March 2021; p. 155. [Google Scholar]
Stanik, C.; Maalej, W. Requirements Intelligence with OpenReq Analytics. In Proceedings of the 2019 IEEE 27th International Requirements Engineering Conference, Jeju, Korea, 23–27 September 2019; pp. 428–483. [Google Scholar]
Jia, S.B.; Shijia, E.; Li, M.Z.; Xiang, Y. Chinese Open Relation Extraction and Knowledge Base Establishment. Acm Trans. Asian Low-Resour. Lang. Inf. Process. 2018, 17, 1–22. [Google Scholar] [CrossRef]
Wu, G.Q.; Hu, S.J.; Wang, Y.H.; Zhang, Z.; Bao, X.Y. Subject Event Extraction from Chinese Court Verdict Case via Frame-filling. In Proceedings of the 11th IEEE International Conference on Knowledge Graph, Nanjing, China, 9–11 August 2020; pp. 12–19. [Google Scholar]
Ghosh, S.; Elenius, D.; Li, W.; Lincoln, P.; Shankar, N.; Steiner, W. Automatically extracting requirements specifications from natural language. arXiv 2004, arXiv:1403.3142. [Google Scholar]
Roth, M.; Klein, E. Parsing software requirements with anontology-based semantic role labeler. Lang. Ontol. 2015, 15, 15–21. [Google Scholar]
Zhou, Z.; Xiao, Z.; Liu, Q.; Ai, Q. An analytical approach to customer requirement information processing. Enterp. Inf. Syst. 2013, 7, 543–557. [Google Scholar] [CrossRef]
Li, Y.; Guzman, E.; Tsiamoura, K.; Schneider, F.; Bruegge, B. Automated requirements extraction for scientific software. Procedia Comput. Sci. 2015, 51, 582–591. [Google Scholar] [CrossRef] [Green Version]
Wang, Y.; Zhang, J. Experiment on Automatic Functional Requirements Analysis with the EFRF’s Semantic Cases. In Proceedings of the 2016 IEEE International Conference on Progress in Informatics and Computing, Shanghai, China, 23–25 December 2016; pp. 36–642. [Google Scholar]
Pavlova, O.; Hovorushchenko, T.; Boyarchuk, A. Method of Activity of Intelligent Agent for Semantic Analysis of Software Requirements. In Proceedings of the 2019 10th IEEE International Workshop on Intelligent Data Acquisition and Advanced Computing Systems-Technology and Applications-IDAACS, Metz, France, 18–21 September 2019; pp. 902–906. [Google Scholar]
Li, M.Y.; Yang, Y.; Shi, L.; Wang, Q.; Hu, J.; Peng, X.H.; Liao, W.M.; Pi, G.Z. Automated Extraction of Requirement Entities by Leveraging LSTM-CRF and Transfer Learning. In Proceedings of the IEEE International Conference on Software Maintenance, Adelaide, Australia, 28 September–2 October 2020; pp. 208–219. [Google Scholar]
Ieva, C.; Gotlieb, A.; Kaci, S.; Lazaar, N. Discovering Program Topoi via Hierarchical Agglomerative Clustering. IEEE Trans. Reliab. 2018, 67, 758–770. [Google Scholar] [CrossRef] [Green Version]
Park, B.K.; Kim, R.Y.C. Effort Estimation Approach through Extracting Use Cases via Informal Requirement Specifications. Appl. Sci. 2020, 10, 3044. [Google Scholar] [CrossRef]
Sun, X.; Man, Y. Enhance Chinese Medical Name Entity Recognition with Etymon Features. DEStech Trans. Comput. Sci. Eng. 2018, 291, 490–494. [Google Scholar] [CrossRef]
Liu, W.M.; Yu, B.; Zhang, C.; Wang, H.; Pan, K. Chinese Named Entity Recognition Based on Rules and Conditional Random Field. In Proceedings of the 2nd International Conference on Computer Science and Artificial Intelligence (CSAI)/10th International Conference on Information and Multimedia Technology, Shenzhen, China, 8–10 December 2018; pp. 268–272. [Google Scholar]
Dao, A.T.; Truong, T.H.; Nguyen, L.; Dinh, D. Improving Named Entity Recognition using Bilingual Constraints and Word Alignment. IOP Conf. Ser. -Mater. Sci. Eng. 2018, 435, 58–67. [Google Scholar] [CrossRef]

Figure 1. Overall framework design.

Figure 2. Preprocessing of the original corpus.

Figure 3. Structure of BiLSTM-CRF model.

Figure 4. The process of named entity recognition.

Figure 5. The structure of entity disambiguation and hidden relation learning.

Figure 6. Convert U/C matrix to S/U/C matrix.

Figure 7. Loss curve.

Figure 8. Typical experimental result of function structure graph.

Figure 9. Typical experimental result of function-data graph.

Figure 10. Typical result of software requirements specification graph.

Figure 11. Precision scatter diagram.

Figure 12. Recall scatter diagram.

Figure 13. Model performance of named entity recognition in different data sets.

Figure 14. F1-score of each data set in different stages.

Figure 15. The changes before and after model of acronym and pronoun alignment.

Table 1. Data set grouping results.

Data Set Name	Data Set Characteristics	Number of Documents	Number of Sentences
Data set 1	Normal	46	69,941
Data set 2	More words and sentences ambiguity	42	50,019
Data set 3	More implicit relations (punctuation)	25	36,189
Data set 4	More implied relations (numbers)	37	43,851

Table 2. Configuration of experimental environment.

Hardware Environment	CPU	Intel(R) Xeon(R) CPU E5-2620 v4 @2.10GHz × 32
	GPU	GeForce GTX 1080 Ti/PCIe/SSE2
	Memory	64G
	Video Memory	10.91G
Software Environment	Operating System	Ubuntu20.04
	Deep Learning Framework	Tensorflow 2.0
	Tool Kit	CUDA 9.0.176
	Development Language	Python 3.6
	Development Tool	JetBrains PyCharm Community Edition 2020.2
	Operating System	Ubuntu20.04

Table 3. BiLSTM-CRF parameter.

Parameter	Choice
Token dimension	50
POS tag dimension	30
Learning rate	0.1
Decay rate	0.1
Gradient clipping	3
Peepholes	No
Batch size	256
Max number of epochs	20

Table 4. Comparison of named entity recognition models.

	Accuracy	Precision	Recall	F1-score
IDCNN-CRF	89.13%	88.90%	89.60%	89.25%
BiLSTM-CRF	90.88%	91.20%	90.68%	90.94%
CRF++	85.94%	85.89%	87.28%	86.58%
BiLSTM-CRF-KG	96.48%	96.31%	96.17%	96.24%

Table 5. Performance comparison of BiLSTM-CRF-KG under different data sets.

	Accuracy	Precision	Recall	F1-Score
Data set 1	97.12%	97.40%	97.04%	97.22%
Data set 2	95.41%	95.59%	95.68%	95.63%
Data set 3	96.40%	96.27%	96.00%	96.13%
Data set 4	94.97%	95.43%	93.19%	94.30%
Overall	96.48%	96.31%	96.17%	96.24%

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wang, Z.; Pan, J.-S.; Chen, Q.; Yang, S. BiLSTM-CRF-KG: A Construction Method of Software Requirements Specification Graph. Appl. Sci. 2022, 12, 6016. https://doi.org/10.3390/app12126016

AMA Style

Wang Z, Pan J-S, Chen Q, Yang S. BiLSTM-CRF-KG: A Construction Method of Software Requirements Specification Graph. Applied Sciences. 2022; 12(12):6016. https://doi.org/10.3390/app12126016

Chicago/Turabian Style

Wang, Zhengdi, Jeng-Shyang Pan, Qinyin Chen, and Shuangyuan Yang. 2022. "BiLSTM-CRF-KG: A Construction Method of Software Requirements Specification Graph" Applied Sciences 12, no. 12: 6016. https://doi.org/10.3390/app12126016

APA Style

Wang, Z., Pan, J.-S., Chen, Q., & Yang, S. (2022). BiLSTM-CRF-KG: A Construction Method of Software Requirements Specification Graph. Applied Sciences, 12(12), 6016. https://doi.org/10.3390/app12126016

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

BiLSTM-CRF-KG: A Construction Method of Software Requirements Specification Graph

Abstract

1. Introduction

2. Related Work

3. Structural Information Extraction Model of Functional Requirements

3.1. Preprocessing of the Original Corpus

3.2. Named Entity Recognition and Relation Extraction

3.3. Entity Disambiguation and Hidden Relation Learning

3.3.1. Word Sense Disambiguation Model Based on Part of Speech

3.3.2. Entity Generation Based on Regular Matching

3.3.3. Alignment of Abbreviations and Pronouns Based on Context Information

3.3.4. Entity Hiding Relation Learning Based on Punctuation and Number

3.4. Generation of Functional Structure Graph

3.4.1. Generation of Requirement Relation Tree

3.4.2. Embedding of Functional Structure Graph

3.5. Method of Converting Traditional U/C Matrix into S/U/C Matrix

3.6. Generation of S/U/C Matrix

3.7. Generation of Functional-Data Graph

3.8. Generation of Software Requirements Specification Graph

4. Experimental Results and Analysis

4.1. Method of Converting Traditional U/C Matrix into S/U/C Matrix

4.2. Typical Experimental Results and Evaluation

4.3. Analysis of Experimental Results

4.3.1. Effectiveness Analysis

4.3.2. Applicability Analysis

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI