Article

A Corpus-Based Sentence Classifier for Entity–Relationship Modelling

1 Business Department, Polytechnic of Rijeka, Vukovarska 58, 51000 Rijeka, Croatia
2 Department of Informatics, University of Rijeka, 51000 Rijeka, Croatia
* Author to whom correspondence should be addressed.
Electronics 2022, 11(6), 889; https://doi.org/10.3390/electronics11060889
Submission received: 11 February 2022 / Revised: 4 March 2022 / Accepted: 9 March 2022 / Published: 11 March 2022

Abstract

Automated creation of a conceptual data model based on user requirements expressed in the textual form of a natural language is a challenging research area. The complexity of natural language requires deep insight into the semantics buried in words, expressions, and string patterns. For the purpose of natural language processing, we created a corpus of business descriptions and an adherent lexicon containing all the words in the corpus. Thus, it was possible to define rules for the automatic translation of business descriptions into the entity–relationship (ER) data model. However, since the translation rules could not always lead to accurate translations, we created an additional classification process layer—a classifier which assigns to each input sentence some of the defined ER method classes. The classifier represents the formalized knowledge of four data modelling experts. This rule-based classification process is based on the extraction of ER information from a given sentence. After the detailed description, the classification process itself was evaluated and tested using the standard multiclass performance measures: recall, precision and accuracy. The accuracy in the learning phase was 96.77% and in the testing phase 95.79%.

1. Introduction

In our research, we used methods from the NLP (natural language processing) field in the development of an automated (knowledge-based) system to support the creation of ER (entity–relationship) data models. ER conceptual data modelling is part of the design phase of information system development. To analyse the natural language more deeply, a linguistic corpus was created in the previous research phase, which contains the repository of business descriptions (BDs), BD sentences, words and POS (part-of-speech) tags. The purpose of creating the corpus was to define a set of translation rules that enables the translation of text-expressed BDs into a text-expressed (formal language) ER data model. All BD sentences were formulated to be syntactically correct according to the specific grammar rules (presented in Table 1 of Section 2.4). The phrase structure grammar for BDs and the data model language (EARC; entity attribute relationship cardinalities language) are described in papers [1,2].
The language of BD is considered to be a form of controlled natural language (CNL)—a subset of a natural language with restricted grammar and vocabulary that maintains the majority of its natural features. It represents an intermediary between natural and formal languages, reducing the ambiguity and complexity of natural language, and is suitable for knowledge representation and reasoning processes [3,4,5]. Our version of controlled natural English is named CEN (controlled English).
Our research followed DSR (design science research) methodology, and this paper describes the activities conducted in the third phase of the methodology—design and development [6].
The main method of translating one language (BD in CEN) into another (the formal EARC language) is based on translation rules. Translation rules are defined after analysing BD sentences from the linguistic corpus and recognising particular string patterns semantically related to EARC constructs. This is a very specific domain with complex semantics and a task that is difficult to accomplish using standard machine learning methods applied to text. Although the rule-based approach is considered basic and has nowadays largely been replaced by advanced machine learning algorithms such as artificial neural networks, it still has its advantages.
The purpose of this paper is to explain the need for creating a sentence classifier that helps in the successful translation of BDs into the ER model. The research questions are as follows:
(1) Which translation issues occur when translating a BD sentence into an EARC sentence?
(2) Is it possible to build an EARC classifier for natural language sentences?
(3) How does the classifier help in the translation process?
The objective of this article is to present a description of the sentence classifier as well as its development methodology and performance.
After the Introduction, in Background and Related Work, we list some important concepts from the area of classification methods, pattern recognition and text analysis approaches. In addition, a brief overview of existing solutions related to the automated translation of natural language BDs into an ER data model is provided. In the same section, we present some important artefacts created in our previous work that are necessary to understand our methods and techniques. In the section Research Motivation, the issues related to automatic text translation into a data model based on pattern recognition as well as the motivation for conducting this research are presented. A methodology for creating an EARC sentence classifier is provided in the Research methodology. The Results and Discussion section presents the classifier content and the performance evaluation based on testing the classifier. The limitations of the research are documented as well as further research plans. The answers to the research questions are synthesised in the Conclusions.

2. Background and Related Work

This section describes some important concepts such as classification, pattern recognition, and recent text analysis and classification methods, as well as some existing solutions for automated translation of natural language text into an ER data model. Additionally, some fundamental parts of our previous work that contribute to a better understanding of this research are included at the end of the section.

2.1. Classification

Classification can be seen as a process or a product. According to Wheaton [7], classification is a process of systematic allocation of entities to groups or categories according to certain criteria, but it also refers to a formally structured set of classes or categories that have been created in the process of classification. The application of a classification method (e.g., via program code) results in the creation of a classifier, which assigns the inputs (objects of the area of interest) to a certain class. Classifications with a specific aim are typically created by an individual (e.g., a scientist describing the part of reality being explored).
After determining the purpose, it is necessary to choose classification criteria, fundamental characteristics that are used to differentiate, describe and classify entities/objects into groups or classes.
After determining the criteria, it is necessary to specify a classification method, that is, a way in which certain entities can be allocated unambiguously to (become members of) a certain category (qualitative and quantitative) based on a set of definitions and rules [8].

2.2. Pattern Recognition

Patterns are forms of language, and pattern recognition is a part of the machine learning domain. By defining and describing natural language patterns, it is possible to train and teach the computer to comprehend sentences and communicate in natural language [9].
Research on pattern recognition usually utilises knowledge on how recognition problems are solved by humans and then formalised through automated pattern recognition [10].
Pattern classes could be used to classify linguistic elements (text, sentence, string). Pattern recognition could be based on syntactic, morphological or semantic characteristics of the sentence or text, depending on what is observed. However, in discovering the semantics of a linguistic element, the syntax helps as well and may become the indicator of its semantics [11].
Various classification techniques can be used to classify patterns depending on the domain of a problem and the types of inputs. In cases when there is a lack of information in the training data, the best method for successful classification is to incorporate problem domain knowledge [10].

2.3. Related Work

As we base the conceptual data model creation on textual business requirements translation, we begin this section with a discussion of requirements engineering.
There are many systems that provide automated (or semi-automated with user interaction) conversion from natural or a kind of restricted natural language into the ER model (textual or diagram). All the methods are based on finding the elements of the text that can be translated into a certain EARC construct. The first rules regarding such translation were presented in [12]. Lo and Choobineh [13] provide a list of systems for automated data modelling from 1982 to 1998. Thus, we analysed the related works from 1999 onward. Some of them use translation rules based on lexical string patterns, such as ER generator [14], CABSYDD [15], ER-Converter [16], DBDT [17], ERD generator [18] and ER generator using NLP [19]. Some of the tools provide graphical solution support, such as ABCM [20] and KERMIT [21]. The data model validator [22] converts a textual BD to EARC constructs based on reasoning rules, which combine the given data model formalisation and ontology mapping. HBT and EIPW tools [23] use the created entity instance repository (EIR) and the relationship instance repository (RIR) to translate a BD to the ER model. Some of the tools, such as BizData [24], provide pre-processing of initial BDs for applying translation rules. The systems’ targeted users are usually novice designers, students or domain users without data modelling knowledge. In their research, Mich et al. [25] emphasise the demands and benefits of approaches that integrate a linguistic analysis of requirements in order to create conceptual models of them in a more automated way. The same authors synthesise these approaches based on common characteristics: (a) how “natural” the language is, which is usually a kind of controlled language; and (b) the degree of automation in translating the requirements in a conceptual data model.
The gathering of users’ requirements occurs in two phases [26], capturing and analysis–optimisation, which should result in concise, clear and sufficient information to proceed with the software development process. This process is usually performed by communicating and writing in natural language [25], so the use of NLP techniques to automate this requirements specification is crucial. In this way, requirements gathering can be automated or semi-automated (providing support for analyst or business domain specialist), reducing ambiguities, incompleteness and semantic errors. Such a revised, more formalised text of requirements is a prerequisite for automating subsequent phases, such as conceptual data modelling or process modelling.
In their work, Ref. [27] presents a tool for automated conceptual model generation. They employ a user story format for requirements specification, expressing basic elements in a “Who? What? Why?” structure, and minimise human supervision by proposing a fully automated software tool. It is important to emphasise that the authors conclude that it is vital to create highly accurate automated models, defining and using heuristics that capture a holistic view of the requirements while ignoring overly fine-grained details (which result in overall low accuracy) [28].
The limitations of the observed systems have the same denominator: for higher accuracy, they should use very simple forms of sentences and a very restricted vocabulary. We believe that, with an additional processing layer of classification of the sentences, higher accuracy could be achieved. In this case, a classifier should be very reliable.
In recent years, text analytics methods have evolved considerably. However, they offer only a partial implementation possibility for the mentioned problem of automatically creating a data model from a natural language text specification.
For example, Named Entity Recognition (NER) categorizes named entities from a text into defined categories such as the names of people, organizations, places, etc. [29]. Text vectorization methods such as word2vec [30], TF-IDF [31], and GloVe [32] assign a score to each word in the analysed text corpus based on its overall generic relevance. When vectorization is required for a specific, more narrowly defined context (e.g., the field of data modelling), a machine learning algorithm must usually be used that performs the following steps: POS tagging (NLTK or Stanford POS tagger [33,34]), an ML algorithm (usually from the deep learning category), and an optimization phase (another model or vectorization, e.g., inclusion or exclusion of some POS categories, stopwords, etc.). This approach generally provides acceptable accuracy in terms of the amount of semantics extracted but is not suitable for specific problems in certain domains, such as building an automated data model from a natural language text with a business description.
Sentiment analysis as a classification method also captures some general semantic features of a text, usually using dictionaries such as VADER [35], Senti-WordNet [36], Textblob [37], or Flair [38]. For example, VADER sentiment analysis is based on a dictionary that maps lexical features to emotion intensities—sentiment scores [39].
The main drawbacks of this approach to perform a particular classification are the lack of context semantics (“not bad at all” is a positive expression, although VADER would score it as negative for “not” and negative for “bad”), word ambiguity, and existing pre-trained models that are not well suited for other domains or purposes (e.g., only for social media).
The research approach of [40] is aligned with the aforementioned—they use the rule-based approach and emphasize that the effort required for the rule-based approach should be less than the effort required to create a labelled training dataset in the machine learning approaches [41]. The authors of [40,42] also believe that the rule-based approach leads to higher accuracy.
Ref. [43] provides an overview of supervised and unsupervised methods and algorithms for text semantics extraction with their strengths and weaknesses. The same authors propose a novel approach, also used by [44,45], which combines two methods to improve the semantic relevance of words: Word embedding models (e.g., word2vec) with clustering algorithms to better capture the important and relevant parts of the text.
Based on the observed approaches for automatically transforming text into data models, novel text analysis tools, and machine learning algorithms, we consider that applying a rule-based approach (such as our classification proposal) in one of the text analysis steps will increase accuracy in a particular text domain such as data modelling.
The observed systems, tools, and methods cannot extract the relevant ER semantics with acceptable accuracy. None of the observed systems used translation rules based on lexical string patterns over morphosyntactic sentential forms, as our approach does. Furthermore, none of the observed systems used a classifier that classifies each sentence of a business description. The classifier proposed in this paper provides a more focused and precise approach that can capture the appropriate semantics for identifying ER constructs in the sentences.

2.4. Previous Work Artefacts

We use the artefacts (the corpus-based lexicon and the CEN PSG, phrase structure grammar) from previous works [1,46] to obtain valid BD sentences that represent the system input. The MULTEXT-East lexicon and its morpho-syntactic descriptions are used because it covers a broad range of syntax categories and supports the Croatian language (planned in further research). Word classes of the lexicon are represented with MULTEXT-East MSD (morpho-syntactic description) categories [47]. Furthermore, we use a previously created EARC formal language grammar and lexicon to textually express the ER model. CEN PSG is presented in the form of a finite set of productions, using the modified extended Backus–Naur form (EBNF). All production rules are of the following form and are listed in Table 1:
A → β, where A is a nonterminal symbol, and β is a sequence of nonterminal and/or terminal symbols and special (meta) symbols. Meta symbols are:
  • The operator “|” is for α or β;
  • Parentheses ( ) enable a grouping of operators “|”;
  • [α] means that α could be left out or written once;
  • {α} means that α could be left out or written any number of times.
The ER method constructs that are used in the research are Entities, Attributes, Relations and Cardinalities. These will be outlined by providing appropriate terminal and non-terminal symbols in the EARC formal language. By translating the BD into the EARC formal language, a textual representation of an ER model is created, which has the following sentential form (SF):
SEARC → eL((ApL), (AL))(cL, CL) r (cR, CR) eR ((ApR), (AR))
Phrase structure grammar productions of EARC language are given in the following, with element descriptions provided in Table 2:
SEARC → eL ((ApL), (AL)) (cL, CL) r (cR, CR) eR ((ApR), (AR))
ApL → apL | apL, ApL
AL → aL | aL, AL
ApR → apR | apR, ApR
AR → aR | aR, AR
cL → 0 | 1
cR → 0 | 1
CL → 1 | M
CR → 1 | M
M → n | many

3. Research Motivation

The reason for creating the BD, word and sentence corpus was to identify rules for translating CEN into the EARC language. While defining translation rules, it has been noted that, if the rules are based only on syntactic categories of words (e.g., nouns), the translation to EARC language constructs is often incomplete and ambiguous. Hence, to obtain a more accurate translation, it is necessary to determine the semantics of input text beyond its syntax. As stated before, we ground our translation rules on string pattern recognition, where a string is a part of the sentential form. Here, a problem occurs when the same patterns in a different sentence context have different semantics. In that scenario, it is impossible to carry out an unambiguous translation of sentence parts in the EARC language constructs.
When creating a data model, designers/experts identify different information from individual sentences of a BD. By extracting important ER model constructs from the sentences, human experts gradually build a model.
We could say that human experts perceive each sentence as a special category that holds some of the characteristics related to the ER model. Therefore, the experts identify entities and relationships between the entities (possibly even some cardinalities) from one sentence. They also identify the primary key(s) and/or attribute(s) of an entity, cardinalities and information that indicates specialisation or a unary relationship from some other sentence. Hence, if only some constructs can be extracted from every sentence, we could define sentence classes and a method of classifying each sentence into the class. Similarly, identifying relationships between two entities in one sentence does not necessarily mean that the sentence does not provide information about certain cardinalities. Moreover, it is possible to identify both attributes and primary key(s) of a certain entity from only one sentence.
Recognising potential EARC information that a certain sentence carries will allow focused application of patterns and translation rules, which will significantly decrease the complexity of the translation algorithm itself. If we do not observe patterns in a context, then the application of rules such as the following can result in inaccurate translations (see sentence 1 and sentence 2 from Table 3):
Rule 1: “The noun block of a general form ({Nc} [Sp] [Dd | Di]) | ({Af} Nc) before the main verb (Vm) is translated to a left entity, and the first noun block after the main verb is a right entity”.
In addition, observing another generic rule cannot provide an accurate translation (see sentence 3 from Table 3):
Rule 2: “If there are more noun blocks in a row linked with a comma and/or conjunction (formally represented with {[Cc] | [,] {Nc}Nc}), then they are attributes of the entity detected from the opposite part of the main verb”.
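For illustration, Rule 1 can be transcribed as a regular expression over the sentential form, treated as a space-separated string of MSD tags. The following minimal Python sketch is only an illustration of such context-free pattern application, not the system’s actual implementation:

import re

# Noun block ({Nc} [Sp] [Dd | Di]) | ({Af} Nc) as a regex over MSD tags (sketch).
NOUN_BLOCK = r"(?:(?:Nc\s)*(?:Sp\s)?(?:D[di]\s)?Nc|(?:Af\s)*Nc)"

def rule1(sentential_form):
    """Rule 1 sketch: the noun block before Vm is the left entity candidate,
    the first noun block after Vm the right entity candidate."""
    alpha, _, beta = sentential_form.partition("Vm")
    left = re.search(NOUN_BLOCK + r"$", alpha.strip())
    right = re.search(NOUN_BLOCK, beta.strip())
    return (left.group(0) if left else None,
            right.group(0) if right else None)

print(rule1("Di Nc Vm Di Nc"))  # ('Di Nc', 'Di Nc'): eL and eR candidates

As the examples below show, applying the rules out of context in exactly this way is what produces the mistranslations.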
Some of these problems are shown in Table 3 and discussed below.
The first sentence in Table 3 is a regular textbook example and produced an accurate translation. When Rule 1 was applied to sentence 2, it incorrectly detected the right entity and relationship and failed to identify the personal name attribute of the entity person. The application of Rule 2 to the third sentence resulted in the incorrect detection of attributes instead of the entities patient and medical worker and no detection of possible hierarchical relations between person, patient and medical worker. Applied to sentence 4, Rule 2 failed to identify the key = account ID. These are only a small sample of the issues that can occur, so it is evident that, for successful detection of constructs from a certain sentence, the number of different checking patterns should be high.
Identifying the cardinality is more complex still because it demands additional if–then–else rules (we identified the “more than one” string, but an additional rule should translate it to the value M). Similarly, sentence 4 would be translated accurately if it were recognised as an attribute carrier sentence and a primary key carrier sentence. If sentence 3 were recognised as an entities and relationship carrier, the result would not be a list of attributes. In addition, if sentence 2 were perceived as an attribute(s) carrier, the relationships and correct entity would not be extracted.
The authors of the paper believe that classifying a BD sentence, following its lexical and syntactic analysis, can simplify the process of translation and decrease the number of mistakes in creating the ER model. By creating classes, every sentence has a label attached that indicates what EARC constructs could potentially be identified from it. Accordingly, the number of patterns and appropriate rules for translation would be reduced, focusing only on those closely connected to the sentence structure.

4. Research Methodology

The creation of an entity–relationship sentence classifier is based on capturing the knowledge of data modelling experts and representing it formally as logic blocks for each class. In an attempt to follow the reasoning of a human expert, two types of classes were created: primary and translation classes. Translation classes derive from primary classes through the application of some synthesising rules, which are described further in this section. Primary classes (called ER, EA, EID, C, c and D) were created based on the basic constructs of the ER data modelling method, while D was added to classify sentences that have a descriptive character. By observing the classes, it is possible to identify certain semantics necessary for the translation. If a sentence has the characteristics of primary class EA, it expresses attributes of a certain entity. If the same sentence can also be classified as primary class EID, it also expresses identifier(s) of a certain entity.
The idea of creating two related types of classes emerged during sentence classification through the analysis of the reasoning process of four information system design experts. For example, the sentence “Every user can have 0 or more loans” was classified as primary class C (max. Cardinality) and primary class c (min. Cardinality). Then, based on the conclusion that cardinality is associated with entities and entities with relationships, the experts assigned the final label as ECR, so the translation class of the sentence became ECR. The principle that helped the experts to initially notice characteristics of primary classes as well as what they concluded about the final translation class were applied and formalised in the classifier.
In creating these two types of classes, the authors were led by the idea that it is easier to formalise the characteristics of primary classes, which are then, through the use of additional rules, transformed into a final translation class. Translation classes enable the final classification of sentences, after which translation rules can be applied (into the EARC language) for a certain sentence class. A sentence can belong to only one translation class.
To clarify the difference between the two terms translation class and primary class used by authors in developing a classifier, a list of labels for both types of classes is presented below.
The authors have chosen the following labels for possible primary classes:
  • ER (Entity and Relationship),
  • EA (Entity with Attributes),
  • EID (Entity with Identification–Primary key/keys),
  • C (max. Cardinality),
  • c (min. Cardinality) and
  • D (the sentences of a descriptive type could contain relevant semantics of a business description useful for translation, e.g., database needs to store information about employees).
The following labels were chosen for translation classes:
  • ECR (Entity Cardinality Relationship),
  • EA (Entity with Attributes),
  • EID (Entity with Identification–Primary key/keys),
  • EA/EID (EA+EID), and
  • D (Description).
Because of the complexity, the selector of the primary class ER is implemented using the “else” method. It consists of first checking whether a sentence belongs to one of the primary classes whose selectors were easier to implement (EA, EID, C, c, D). This is visible through the LINK-PrimaryClass operator, which assigns an ER primary class in the event that none of the remaining five primary classes were assigned.
Using the rules for determining the final translation class (Class Decision attribute in a classifier), the sentence is assigned to one of the classes. Examples of sentences that belong to a certain translation class are given below:
Translation class label | Example sentence
ECR | One book can have one or more authors.
EA | A local department has a name and address.
EID | Departments are identified by a department ID.
EA/EID | The bill has a unique code, date and a total amount.
D | A company database needs to store information about departments.
A classification model was developed using 50 BDs from the previously defined corpus. Thirty randomly selected descriptions (310 sentences) were used to create the classifier. These 30 BDs were first processed to identify individual sentences: the operator Process documents from files and a tokenizer (linguistic sentences) were applied in RapidMiner. The resulting sentence list was converted to data (operator Wordlist to data) to generate the specific value of the sentence attribute for each sentence, and this output was then exported to an Excel file.
All 310 sentences from the Excel file were analysed by four independent experts. The experts assigned label(s) to every sentence that indicated belonging to one or more translation classes (e.g., in case of belonging to both the EA and EID classes, the experts marked the sentence with the label EA/EID). Experts’ labels were introduced using the attribute expert.
It should be noted that the experts classified sentences exclusively based on data from a sentence and not on other information or conclusions based on previous experiences. Expectedly, the values of the final translation classes that were assigned by the experts were not the same for all sentences. After the group decision-making by all four experts, a consensual solution on belonging to a final translation class for every sentence was created. This represented the value of the attribute expert. An Excel document with a list of sentences (the attribute sentence) and the attribute expert were the inputs for the classifier creation process using RapidMiner software.

5. Research Results and Discussion

Figure 1 shows the classifier developed in RapidMiner. To create the classifier, all 310 sentences were entered for the learning process (operator Learning 310), and the expert attribute was defined as a dependent variable (the Expert column from the Excel spreadsheet) to determine the performance of the classifier. Then, the operator Subprocess was used, renamed to Classifier. To identify the features of the primary classes and the final translation classes, seven new attributes were created (using the Generate Attributes operator). The first five attributes were for the primary classes EA, EID, C, c, and D. The sixth attribute, LINK-PrimaryClass, was created to provide rules for the sentences’ final translation class and to assign the ER category to a primary class resulting from the elimination of other classes. The rules were created in the process of generating the seventh attribute, ClassDecision, and, in this way, a final translation class was obtained.
In the first part of the classifier, the sentences are analysed, and characteristics that define a certain primary class can be observed. Sentence characteristics are detected based on certain strings (one or more words with certain characters, such as blank spaces or commas). The characteristics of classes are described using the logic and textual functions of RapidMiner software. As a result of the analysis, the sentence is classified into one or more primary classes, according to the specific characteristics of a certain primary class.
The expressions that are used to describe belonging to a certain primary class or connecting primary classes (Link-Primary Class attribute) as well as the rules for determining a final class (ClassDecision attribute) for translation based on primary classes and some specific words (e.g., each, every, various…) are given below:
Primary Class EA—membership criteria description
if((((contains(sentence," have") || contains(sentence," has")) && !(contains(sentence,"unique")) && !(contains(sentence,"may")) &&
!(contains(sentence,"could")) && !(contains(sentence,"can "))) ||
contains(sentence,"categor") || contains(sentence,"name") || contains(sentence,"address") || contains(sentence,"date") ||
contains(sentence,"title") ||
((contains(sentence,",") || contains(sentence," and")) &&
(contains(sentence," include") || contains(sentence," has ") ||
contains(sentence," have ") || contains(sentence," described") ||
contains(sentence," record") || contains(sentence," store") ||
contains(sentence,"name") || contains(sentence,"date") ||
contains(sentence,"address") || contains(sentence," type") ||
contains(sentence,"Type")))),"EA","")
Primary Class EID—membership criteria description
if((contains(sentence,"unique") || contains(sentence,"uniquely") ||
contains(sentence," identified ") || contains(sentence," id, ") ||
contains(sentence,"-id") || contains(sentence," id.") ||
contains(sentence," id ") || contains(sentence,"_id") ||
contains(sentence," ID") || contains(sentence,"id,") ||
contains(sentence," identifier") || contains(sentence," identification")),"EID","")
Primary Class C—membership criteria description
if((contains(sentence," most ") || contains(sentence," every") ||
contains(sentence," most") || contains(sentence,"Many ") ||
contains(sentence," many ") || contains(sentence," any ") ||
contains(sentence," more ") || contains(sentence," or more ") ||
contains(sentence," exactly ") || contains(sentence," several") ||
contains(sentence,"only one ") || contains(sentence," exactly one ") ||
contains(sentence," exactly 1 ") || contains(sentence," most 1 ") ||
contains(sentence," most one ") || contains(sentence," maximum 1") ||
(contains(sentence," one") && !(contains(sentence," least ") ||
contains(sentence," most ") || contains(sentence," more ") ||
contains(sentence," one of ") || contains(sentence," maximum "))) ||
(contains(sentence," 1") && !(contains(sentence," least "))) ||
contains(sentence," most ") || contains(sentence," more ") ||
contains(sentence," maximum ") || contains(sentence," various ")),"C","")
Primary Class c—membership criteria description
if(((contains(sentence," one ") && !(contains(sentence," more than one ") ||
contains(sentence,"one of "))) || contains(sentence," none ") ||
contains(sentence," minimum ") || contains(sentence," exactly ") ||
contains(sentence," every") || contains(sentence,"Exactly ") ||
contains(sentence," zero ") || contains(sentence," one or ") ||
contains(sentence," only one ") || contains(sentence,"Only one ") ||
contains(sentence," any ") || contains(sentence," any number ") ||
contains(sentence,"Any number") || contains(sentence," least ") ||
contains(sentence," some ") || contains(sentence,"Some ")),"c","")
Primary Class D—membership criteria description
if((contains(sentence," information") || contains(sentence,"Information ") ||
contains(sentence," to know ") || contains(sentence," be known ") ||
(contains(sentence," keep") && contains(sentence," past ")) ||
contains(sentence," previous ") || contains(sentence," historical ") ||
contains(sentence," history ") || contains(sentence," record ") ||
contains(sentence," document ") || contains(sentence,"Document") ||
contains(sentence," is a ")),"D","")
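Read outside RapidMiner’s expression syntax, each membership test is a plain keyword check. A minimal Python transcription of the EID test (an assumed-equivalent sketch; the cue list is copied from the expression above) is:

def primary_class_eid(sentence):
    """Sketch of the EID membership test above."""
    cues = ("unique", "uniquely", " identified ", " id, ", "-id", " id.",
            " id ", "_id", " ID", "id,", " identifier", " identification")
    return "EID" if any(cue in sentence for cue in cues) else ""

print(primary_class_eid("Departments are identified by a department ID."))  # EID

The remaining primary class selectors translate in the same way, using the EA, C, c and D cue lists.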
A description of the creation of the attribute LINK is given below; it concatenates the labels of the assigned primary classes and assigns the label ER if the concatenation results in an empty string.
A functional expression of the attribute LINK is as follows:
LINK = if(concat(EA, EID, C, c, D) == "", "ER", concat(EA, EID, C, c, D))
Based on the value of the attribute LINK and additional word expressions in a sentence, the expression below unambiguously determines the translation class label to which a given sentence belongs:
if((((LINK=="EAC" || LINK=="EAc" || LINK=="EAEIDC" || LINK=="EACc" || LINK=="EAEIDCc") &&
!(contains(word,"every") || contains(word,"each") || contains(word,"different") ||
contains(word,"several"))) || LINK=="C" || LINK=="Cc" || LINK=="c"), "ECR",
if((LINK=="EAEID" && contains(word,"and") && !(contains(word,"identified"))), "EA/EID",
if((LINK=="EAEID" && contains(word,"and") && contains(word,"identified")), "EID",
if(LINK=="ER", "ECR",
if((LINK=="EIDC" || LINK=="EIDc" || LINK=="EIDCc"), "EID",
if(LINK=="EAD", "EA",
if(((LINK=="EAC" || LINK=="EAc" || LINK=="EAEIDC" || LINK=="EACc" || LINK=="EAEIDCc") &&
(contains(word,"every") || contains(word,"each") ||
contains(word,"different") || contains(word,"several"))),
concat(EA,EID), LINK)))))))
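The nested expression is easier to follow as ordinary conditional logic. A Python sketch of the LINK and ClassDecision rules (a transcription for readability, with words holding the sentence tokens):

def link_primary_classes(EA, EID, C, c, D):
    """LINK: concatenate the assigned primary-class labels; empty means ER."""
    concat = EA + EID + C + c + D
    return "ER" if concat == "" else concat

def class_decision(LINK, EA, EID, words):
    """Sketch of the ClassDecision rules above."""
    quant = {"every", "each", "different", "several"}
    mixed = {"EAC", "EAc", "EAEIDC", "EACc", "EAEIDCc"}
    has_quant = bool(quant & set(words))
    if (LINK in mixed and not has_quant) or LINK in {"C", "Cc", "c"}:
        return "ECR"
    if LINK == "EAEID" and "and" in words:
        return "EID" if "identified" in words else "EA/EID"
    if LINK == "ER":
        return "ECR"
    if LINK in {"EIDC", "EIDc", "EIDCc"}:
        return "EID"
    if LINK == "EAD":
        return "EA"
    if LINK in mixed and has_quant:
        return EA + EID
    return LINK

print(class_decision(link_primary_classes("", "", "C", "c", ""), "", "", []))  # ECR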
In Figure 1, the operator named expert-attribute sets the target role label (representing the dependent, goal variable). The target role of the attribute ClassDecision, which is generated based on the classification conditions, is set as a prediction role. In this way, the classification performance operator can compute multiclass performance using the measures precision, recall and accuracy.
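Outside RapidMiner, the same multiclass measures can be reproduced with standard tooling; the sketch below uses scikit-learn on made-up labels (for illustration only, not the study data):

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Illustrative expert labels vs. classifier predictions (not the study data).
expert    = ["ECR", "EA", "EID", "EA/EID", "D", "ECR"]
predicted = ["ECR", "EA", "EID", "EA",     "D", "ECR"]

classes = ["ECR", "EA", "EID", "EA/EID", "D"]
precision, recall, _, _ = precision_recall_fscore_support(
    expert, predicted, labels=classes, zero_division=0)
print("accuracy:", accuracy_score(expert, predicted))
for cls, p, r in zip(classes, precision, recall):
    print(f"{cls}: precision={p:.2f}, recall={r:.2f}")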
As is clear from Figure 2, the class performance parameters are high (aside from category D, precision and recall are higher than 95%), which shows that the class characteristics are, to a large extent, encompassed by the classifier (for the sentences of the 30 BDs). Although the overall accuracy of the learning process and classifier building is 96.77%, it was not that high in the first version of the classifier. The procedure for tracking the classifier performance results, and subsequently all the adjustments, was carried out in several steps. The initial performance of classifier creation was around 90%. This reduced accuracy was the result of, for example, overlooking some characteristics of a class (a semantic slip), inaccurate description of the observed characteristics (a syntactic slip), failure to include keywords and expressions, or imprecisely determined characteristics due to functional expression errors with logical operators.
As a result, some sentences were completely misclassified, while others were only partially or incompletely assigned to final translation classes.
A list of sentences whose characteristics the classifier did not identify in the same manner as the expert is given in Table 4.
The first four sentences in Table 4 are examples in which the experts classified the sentences as EA and the classifier as ECR. Hence, it can be concluded that the experts, based on the meaning of the noun after the verb, determined that it was not the second entity in the relationship but rather an attribute (that is, they used their previously stored experience and knowledge). Such a semantic characteristic has not yet been implemented in the classifier. It could be identified after all the BD sentences have been classified and translated (when a previously identified “entity” turns out to have no attributes, this “entity” would be transformed into an attribute). The classifier classified the sentences that the experts classified as ECR either as EA (first and foremost due to the verb to have, which characterises attribute sentences) or as D, given that it detected one of the characteristics of the description class (“is a” or “information”). A further source of incompatibility is that the classifier does not recognise plurals, and thus it could not conclude that the combination of the verb “to have” and a plural noun refers to ECR characteristics. The experts considered the sentence “A primary doctor is a doctor” to be ECR because it refers to the entity hierarchy between primary doctor and doctor. The classifier labels this situation as class D (description). Still, the translation rules that would be applied to category D sentences would also detect the hierarchy of the entities doctor and primary doctor. Thus, despite the apparent accuracy of 96.77%, by adding a POS tag as a predictor value and resolving “is a” hierarchical relationships with the rules applied for the D class rather than the ECR class (where other hierarchical relationship rules will be applied, as described in Our System’s Novelties), the classifier learning performance would approach 100%.

5.1. Testing the Sentence Classifier

To test the classifier with new sentences, the remaining 20 descriptions (214 sentences) were used, processed in exactly the same way as when the classifier was created (the input procedure with text processing, the conversion of word lists into a range of attribute values, and the export to an Excel file). In addition, the primary and final classes were assigned by the experts, followed by a group decision and the creation of the attribute named expert, to verify the success of the testing.
The testing was performed by applying the previously created Classifier subprocess to loaded sentences of 20 BDs. The results of the testing are shown in Figure 3 (overall accuracy of 95.79%).
The sentences to which the classifier did not assign the same class as the experts did are shown in Table 5.
As shown in Table 5, the experts classified the first seven sentences as ECR, whereas the classifier (apart from the third sentence, which was classified as D) classified them as EA. In this phase, the classifier did not recognise plurals or the syntactic category of numbers, and thus it did not assign the ECR class as the experts did. The other incompatibilities (sentences 8 and 9) are disputable, given that the structure of the eighth sentence could actually refer to the entity hierarchy in the ECR category, and the ninth sentence could refer to additional information (not registered within the data model) related to the entity visitor.

5.2. An Example of Applying the Translation Rules

The basic idea of the authors in the translation of BDs in CEN into a data model in the EARC language is the application of translation rules based on patterns for a certain sentence class.
An example of implementing translation rules after the classification of the following sentence shows this more clearly:
“A program has a name, a program identifier, the total credit points and the starting year”.
The sentential form of the sentence is:
Di Nc Vm Di Nc, Di Nc Nc, Dd Af Nc Nc Cc Dd Nc Nc
The classifier assigned this sentence to the EA/EID class; thus, it has EA and EID class characteristics.
In relation to the EA class, the following patterns (Table 6) have been identified, which are used in translation into an EARC sentence:
By generalising, the following translation rule for category EA can be established for this type of sentence:

a_i = {Af}{Nc}Nc if S = Vm ? {Af}{Nc}Nc, for i = 1
a_i = {Af}{Nc}Nc if S = Vm a_(i−1) [, | Cc] ? {Af}{Nc}Nc [, | .], for i = 2, …, n

Given that the sentence also possesses characteristics of the EID class, pattern characteristics of that class were observed. Therefore, in this sentence, the word identifier was identified as the predictor value, the characteristic that assigned the sentence to the EID class (Table 7).
Furthermore, since the pattern for identifying attributes does not exclude attribute/attributes that comprise a primary key, there is an additional necessary rule:
If the attribute that makes up the primary key (api) is also in the list of m attributes of a certain entity Ei that are not part of the primary key, then it is excluded from the group of attributes that do not make up the primary key; a set Ai excludes that element api.
if ap_i ∈ A_i = {a_ij : i = 1, …, n, j = 1, …, m}, then A_i = A_i \ {ap_i}
The final translation of sentence (1) into an EARC sentence is provided below. The complete sentence in the EARC language (all EARC constructs are known) takes the following form:
  • eL(ApL, AL)(cL, CL) r (cR, CR)eR(ApR, AR)
  • eL = program
  • ApL = {program identifier}
  • AL = {name, total credit points, starting year}
Therefore, an EARC sentence is:
Program ({program identifier}, {name, total credit points, starting year}) (cL, CL) r (cR, CR) eR (ApR, AR)
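A small helper shows how the translated constructs assemble into this textual form; unknown constructs simply keep their placeholder names (an illustrative sketch, not the system’s code):

def earc_sentence(eL="eL", ApL="ApL", AL="AL", cL="cL", CL="CL", r="r",
                  cR="cR", CR="CR", eR="eR", ApR="ApR", AR="AR"):
    # Unfilled constructs keep their placeholder names, yielding a
    # partially translated EARC sentence.
    return (f"{eL} ({{{ApL}}}, {{{AL}}}) ({cL}, {CL}) {r} "
            f"({cR}, {CR}) {eR} ({ApR}, {AR})")

print(earc_sentence(eL="program",
                    ApL="program identifier",
                    AL="name, total credit points, starting year"))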
An ER model that resulted from sentence classification and the translation are shown in Figure 4.
Parts of an EARC sentence that remained untranslated will be translated in one of the following ways:
  • translating other sentences of a BD and integrating identified/translated EARC constructs,
  • getting users’ responses to specifically formulated targeted questions for every missing EARC construct (our KB system will have a question-answering system),
  • using examples from previous cases (case-based reasoning, explained in the next section, Our System’s Novelties).

5.3. Our System’s Novelties

We agree with Lucassen et al. [27,28], who argue that accuracy is more important than trying to formalise and automate the translation of every detail of the requirements. In this section, we present the main translation problems of our approach—in general, the main problems in translating requirements texts into data models. We also present the proposed translation rules and highlight the features, differences, and innovations of our approach. Our translation and pre-translation rules are expressed using strings of morphosyntactic word descriptions. Some of our rules are similar to those we have collected in [27], but most of them are completely new or have been refined. We present only some of the most important rules and principles, since there are many for each translation class. Attempting to formalise natural (or some kind of constrained natural) language and provide the desirable 100% automation inevitably leads to a model with many language semantics that depend on context, and thus to a huge number of rules. In general, the less constrained a natural language is with respect to the requirements expressed in the input text, the more rules are needed for complete and accurate translation.
In our approach, we followed the principles synthesised in [48] for the automated transformation of requirements descriptions into further software development models. Regardless of which format is used, the requirements should be easy to understand to facilitate communication between different stakeholders (e.g., users, developers):
  • proposed analysis models should be complete;
  • the number of transformation steps should be minimised;
  • the approach should be automated;
  • the approach should support traceability management.
Our KB system is composed of four main subsystems that are specialised to perform its main functions.
1. Textual requirements for data model translation
  • Requirements update;
  • Syntactic analysis and validation of requirements;
  • ER construct identification and ER model formation (textual form);
  • Automated question generation to ask users about missing constructs and integrate the answers into the model.
2. A data model for textual requirements translation
  • Data model update;
  • Syntactic analysis and validation of the data model;
  • Word/phrase identification and textual requirements formation.
3. View, explanation and analysis—provides the user with a complete list of all the steps performed from the input to the output and represents a core part of the learning (educational dimension).
4. Knowledge base updates—as the knowledge base with all its constituents (translating rules, PSGs, lexicons) represents the brain and the engine of the system, it must be monitored and updated by the KB system engineer(s).
The KB system has a system–user interaction module. We expect to reduce the user interaction with the KB system by expanding and updating its knowledge base. Using the classifier described in this paper, we expect that the translation steps should be minimised. We also expect to achieve completeness of the model with knowledge base evolution during the testing and use of the KB system, although on explicitly expressed requirements (in this phase of the research, we cannot translate implicit semantics). Traceability will be enabled by storage of all requirements sentences with their corresponding ER sentences and all translation steps from the original sentence to the ER sentence (specifying which rules were applied).
What follows is a brief description of the main translation issues and how we dealt with them in our approach.
Input format (CEN restricted language). We described our controlled natural language used for BDs with the phrase structure grammar and the 70,000 words of the MULTEXT-East lexicon. However, for the person (an analyst or a business domain specialist) who wants to write requirements (a BD) conforming to our CEN language, we can summarise the restrictions as a simpler, less rigorous set of guidelines:
  • Use sentences with only one main verb in the present or in the past tenses, in the infinitive, passive or conditional. If there is a need to express more relations among more entities, do this with more sentences. Example: the sentence “Students enrol in a course that is taught by only one teacher” can be written as two sentences:
    • Sentence 1: “Students enrol in a course”, and
    • Sentence 2: “A course (or each course) is taught by only one teacher”.
  • Auxiliary and modal verbs are not seen as main verbs, so expressions such as “could be”, “is given”, “is giving”, “can be sold” or “could be selling” are all acceptable.
  • Use expressions that quantify whenever possible: use “at least” or “more than” quantifiers for including semantics that will provide valuable information about cardinalities.
  • Generally, use the singular form when listing attributes and/or primary key(s).
  • Limit expressions to third singular and plural person.
  • Avoid explanations and rich descriptions. If there is some important information, express it with a separate sentence.
Compounds. We call these noun blocks and define them with a string class of the form ({Nc} [Sp] [Dd | Di]) | ({Af} Nc). For example: “name of the student” (Nc Sp Dd Nc), “student name” (Nc Nc), “account development manager” (Nc Nc Nc). Only if various types of entity cases are encountered (e.g., “existing user”, “new user”, “senior user”) will an attribute named type be defined for the entity user.
Attributes. For the purpose of easier rule formulation and code implementation, we introduce a general sentence format α V β, where V represents a verb expression. Then, we define the following rule: attributes are identified from noun blocks separated with a comma and/or conjunction: {[Sp][Dd |Di] {Nc}Nc [,| Cc]}, which can be in:
  • α part (then, a noun block from β represents the entity whose attributes are identified),
  • β part (then, a noun block from the α part represents the entity whose attributes are identified).
Example: “A customer has a name, a surname and a date of birth”. The sentence is classified as EA class, and the attributes are found in the β part. Applying the rule, we receive a part of an EARC sentence: eL(ApL, AL), customer (ApL,{name, surname, date of birth}).
Primary key(s). The rules for primary key(s) identification are the same as for attributes. Hence, if the sentence is classified as EID, we can encounter one or more primary key elements. If there is more than one, they can be in the α or in the β part as noun blocks separated with a comma and/or conjunction. Example: “Every room is identified with a floor and a number”. The sentence is classified as EID class, and the primary keys are found in the β part. Applying the rule, we receive a part of an EARC sentence: eL(ApL, AL), room({floor, number}, AL).
Verbs. In the grammar production of the CEN language, verbs can have the following form: [Vo | [Vo] Va [Rmp]] Vm. Thus, the following combinations are all supported: Vo Vm (“can call”), Vo Va Vm (“could have called”), Vo Va Rmp Vm (“could have totally sold”), Va Vm (“is calling” or “is sold”), Vo Sp Vm (“need to sell”) and Vo Sp Va Vm (“need to be stored”).
Conjunctions. Two different situations can be described with conjunctions:
  • listing attributes; for example: “The loan has a title, starting day and an expiration day” (where attributes of the entity loan are title, starting day, expiration day)
  • listing entities (requires a rule of sentence splitting, sketched in code after this list); for example, splitting the sentence “The catalogue contains food and beverages” is performed by keeping the α V part of the sentence and adding one noun block (from the β part), separated by a comma or conjunction, to each new sentence:
    • Split sentence 1: “The catalogue contains food”.
    • Split sentence 2: “The catalogue contains beverages”.
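A minimal Python sketch of this splitting rule (assuming the main verb has already been identified by the lexical analysis):

import re

def split_entity_listing(sentence, verb):
    """Keep the 'α V' part and emit one sentence per listed noun block."""
    alpha_v, beta = sentence.split(verb, 1)
    blocks = [b.strip(" .") for b in re.split(r",| and ", beta) if b.strip(" .")]
    return [f"{alpha_v.strip()} {verb} {b}." for b in blocks]

print(split_entity_listing("The catalogue contains food and beverages.", "contains"))
# ['The catalogue contains food.', 'The catalogue contains beverages.']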
Adverbs and adjectives. They are supported within the CEN language, but adjectives are rarely used (as they should represent values of some attribute, we assume that important attributes should be declared specifically in the requirements). Adverbs, however, are frequently used (see the cardinalities issue below).
Syntax errors. To check for syntactically correct sentences in our PSG, we implement finite state automata transition tables (using Python). Syntax errors are attributed to specific sentence parts, so the user has the ability to rewrite the sentence and then repeat the syntax analysis.
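A minimal sketch of such a transition-table automaton over MSD tags follows (the states and transitions here are illustrative, not the full CEN PSG):

# Transition table: (state, tag) -> next state; illustrative subset only.
TRANSITIONS = {
    ("S0", "Di"): "S1", ("S0", "Dd"): "S1", ("S0", "Nc"): "S1",
    ("S1", "Nc"): "S1", ("S1", "Vm"): "S2",
    ("S2", "Di"): "S3", ("S2", "Dd"): "S3", ("S2", "Nc"): "S4",
    ("S3", "Nc"): "S4", ("S4", "Nc"): "S4",
}
ACCEPTING = {"S4"}

def accepts(tags):
    """Return (accepted, error_position); the position localises the error."""
    state = "S0"
    for i, tag in enumerate(tags):
        state = TRANSITIONS.get((state, tag))
        if state is None:
            return False, i  # syntax error attributed to this sentence part
    return state in ACCEPTING, None

print(accepts(["Di", "Nc", "Vm", "Di", "Nc"]))  # (True, None)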
Hierarchy cases. The hierarchy-capturing implementation follows in the next research phase, although some of the cases are described with rules. We present three hierarchy scenarios:
(First scenario)
IF
α = [Every | Each | Di | Dd] {Nc} Nc
V = ((is a) | (is also a) | (is one of [Dd]) | (can | could) be)
β = [Dd |Di] {Nc} Nc [Sp [Rmp] Mc]{Nc}
THEN
eL = {Nc} Nc from α part
eR = [Dd |Di] {Nc} Nc [Sp] {Nc} Nc from β part
Cardinalities are: cL = 0, CL = 1, cR = 1, CR = 1.
The relationship name: (is a |is also a| is one of [Dd] | (can | could) be).
EXAMPLE: “The doctor is also a staff of the hospital”.
(Second scenario)
IF
α = {Nc} Nc
V = (are | are also)
β = {Nc} Nc, where the nouns in both the α and β parts are in the plural
THEN
eL = {Nc} Nc from α part
eR = {Nc} Nc from β part
Cardinalities are: cL = 0, CL = 1, cR = 1, CR = 1.
IF α = “Some [of the] {Nc} Nc”, β and V are translated in the same manner; only the cardinalities are changed to: cL = 1, CL = 1, cR = 0, CR = 1.
EXAMPLE: “Some employees are also managers”.
(Third scenario)—Hierarchy sentences with conjunctions must first be processed with the rule for sentence splitting (mentioned above), and then the appropriate hierarchy rule should be applied to each split sentence. Those sentences take the form [Dd | Di] {Nc}Nc (are | is) {[Cc] [Dd |Di] {Nc}Nc }, where the number of derived sentences equals the number of Cc-conjunctions in the original sentence. For each split-derived sentence, the second scenario for entities and relationship is applied, and the cardinalities are: cL = 1, CL = 1, cR = 0, CR = 1.
Example of a sentence: “A staff member is either a professor or an administrator”.
Split sentence 1: “A staff member is a professor”.
Split sentence 2: “A staff member is an administrator”.
Ternary relationships. In the sentences where three entities are encountered (one entity in the α and two in the β part or vice versa), we can observe string classes such as E = “[Dd | Di] {Nc} Nc Sp [Rmp Mc] [Dd | Di] {Nc} Nc”, where entities are identified before and after Sp. We now substitute parts of E with E1 and E2, so E = E1 Sp [Rmp Mc] E2. The original sentence must be split into two sentences based on the following rules:
(a) If E is in β, then the first sentence has the form α V E1 and the second sentence has the form E1 V E2, where V is derived from the verb “to be” + Sp.
Example sentence: “An instructor can be the head of only one department”.
Split sentence 1: “An instructor can be the head”.
Split sentence 2: “The head is of only one department”.
(b) If E is in the α part, then the first sentence has the form E1 V E2, where V is derived from the verb “to be” + Sp, and the second sentence has the form E1 V β, where V is the main verb of the original sentence.
Example sentence: “A customer of the bank requires a loan”.
Split sentence 1: “A customer is of the bank”.
Split sentence 2: “A customer requires a loan”.
Aggregation (resolving many to many relationships). We propose a general principle to resolve M:M relationships. We introduce a new entity “Agg”, which is composed of left and right entities (Agg = eL_eR), and then we split the original sentence into two sentences assigning cardinalities and primary keys and attributes to a new entity applying ER modelling theory (Agg has a primary key composed of eL and eR primary keys). The first sentence is of the form eL V Agg, with cardinalities cL = 1, CL = 1, cR = 1, CR = M.
The second sentence is of the form Agg V eR, with cardinalities cL = 1, CL = 1, cR = 1, CR = M.
Example sentence: “Products are sold to customers”. The relationship is of the M:M type, so we apply the aggregation rule:
First sentence: “Product is sold to product_customer”.
Second sentence: “Product_customer is sold to customers”.
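The aggregation rule can be sketched as follows (an illustrative Python sketch; entity names only, cardinalities fixed as in the rule above):

def resolve_m_to_m(eL, r, eR):
    """Split an M:M relationship by introducing Agg = eL_eR (sketch)."""
    agg = f"{eL}_{eR}"
    first = {"eL": eL, "r": r, "eR": agg, "cL": 1, "CL": 1, "cR": 1, "CR": "M"}
    second = {"eL": agg, "r": r, "eR": eR, "cL": 1, "CL": 1, "cR": 1, "CR": "M"}
    return first, second

for part in resolve_m_to_m("product", "is sold to", "customer"):
    print(part)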
Cardinalities. We identified many rules for cardinalities, and listing them all is beyond the scope of this paper. The semantics of cardinalities are categorised in three main classes: those that carry information about minimum cardinalities, maximum cardinalities and both. We mention only a few rules of each class (a matching sketch in code follows the list):
Minimum cardinalities:
(a) the word least generally appears in a string E = Sp least Mc or E = Mc Sp least (at least four, five at least) → Rule: if E is in α, then cL = Mc; if E is in β, then cR = Mc.
(b) the word more generally appears in a string E = more Cc | Cs Mc or E = Mc Cc more (more than one, one or more) → Rule: if E is in α, then cL = Mc, CL = M; if E is in β, then cR = Mc, CR = M.
Maximum cardinalities:
(a) the word most generally appears in a string E = Sp most Mc or E = Mc Sp most (at most four, four at most) → Rule: if E is in α, then CL = Mc; if E is in β, then CR = Mc.
(b) the words approximately, around and nearly generally appear in a string E = approximately | around | nearly Mc (nearly three professors) → Rule: if E is in α, then CL = Mc; if E is in β, then CR = Mc.
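Because these string classes mix literal cue words with MSD tags, they can be matched directly on the token sequence; an illustrative subset in Python:

import re

# Illustrative subset of the cardinality cue patterns (sketch).
CARDINALITY_CUES = [
    (re.compile(r"Sp least Mc|Mc Sp least"), "minimum = Mc"),
    (re.compile(r"more C[cs] Mc|Mc Cc more"), "minimum = Mc, maximum = M"),
    (re.compile(r"Sp most Mc|Mc Sp most"), "maximum = Mc"),
]

def cardinality_cues(tokens):
    """tokens: the sentential form with cardinality cue words kept literal."""
    return [kind for pattern, kind in CARDINALITY_CUES if pattern.search(tokens)]

print(cardinality_cues("Di Nc Vm Sp least Mc Nc"))  # ['minimum = Mc']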
Domain-specific. This NLP approach is not restricted to a specific domain and could be applied in any business or scientific domain.
Case-based. Our approach supports case-based reasoning in a sentential form matching manner. For every translated sentence, its sentential form and the EARC translated sentence are stored in a repository. Thus, when a sentence is entered that is syntactically correct but contains some “out of dictionary” words, if its sentential form corresponds to the sentential form that is already stored, the sentence is translated according to existing translation. An example of an existing and stored tuple of (sentence, sentential form, EARC translated constructs) is as follows:
Sentence: “A borrower can have more than one loan request”.
Sentential form: Di Nc Vo Vm Rmp Cs Mc Nc Nc.
EARC constructs: eL = borrower, r = can have, cR = 1, CR = M, eR = loan request.
Case-based reasoning for a new case:
Sentence: “A student could have more than one thesis mentor”.
Sentential form: Di Nc Vo Vm Rmp Cs Mc Nc Nc.
EARC constructs based on a previous stored case: eL = student, r = could have, cR = 1, CR = M, eR = thesis mentor.
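A minimal sketch of this repository lookup follows; it is our illustration, assuming one word per tag, with a stored case that records at which tag positions each EARC construct was found:

```python
# Case repository: sentential form -> positions of the EARC constructs.
# The described system fills this with every successfully translated
# sentence; here we preload the stored case from the example above.
CASE_REPOSITORY = {
    "Di Nc Vo Vm Rmp Cs Mc Nc Nc": {
        "eL": [1],       # Nc     -> left entity
        "r": [2, 3],     # Vo Vm  -> relationship name
        "eR": [7, 8],    # Nc Nc  -> right entity
        "cards": {"cR": 1, "CR": "M"},
    },
}

def translate_by_case(words, tags):
    """Translate a sentence that may contain out-of-dictionary words,
    provided its sentential form matches a stored case."""
    case = CASE_REPOSITORY.get(" ".join(tags))
    if case is None:
        return None  # no stored case: fall back to rule-based translation
    return {
        "eL": " ".join(words[i] for i in case["eL"]),
        "r": " ".join(words[i] for i in case["r"]),
        "eR": " ".join(words[i] for i in case["eR"]),
        **case["cards"],
    }

words = "A student could have more than one thesis mentor".split()
tags = "Di Nc Vo Vm Rmp Cs Mc Nc Nc".split()
print(translate_by_case(words, tags))
# {'eL': 'student', 'r': 'could have', 'eR': 'thesis mentor', 'cR': 1, 'CR': 'M'}
```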
User intervention. Our approach supports user intervention as a guided process of targeted questions formulated by the system and the integration of the users’ answers.

6. Research Limitations and Further Research Activities

There were some research limitations, which include the following:
  • the classifier was built from 310 sentences drawn from 30 BDs,
  • a BD is lexically and grammatically restricted (the controlled natural language CEN),
  • the classifier does not utilise all word-category tags for sentence classification.
For the next research phase, the authors have set the following goals:
  • further improvement of the classifier by extending it with expressions that use POS tags (from the previously performed lexical analysis of sentences), which will require expanding the grammar with a number attribute (singular/plural) for the syntax category Nc (common noun), e.g., Ncnp (common noun, neutral, plural),
  • definition of some new rules for every translation class that will enable more accurate translation,
  • definition of integration and optimisation rules that enable revision of all translated EARC sentences of a BD (e.g., multiple occurrences of entities) to obtain an optimised ER model, and determination of the best way to integrate users’ responses (for every unidentified construct) into the ER model.

7. Conclusions

The paper presents a process for creating a sentence classifier that reproduces the classification performed by human experts. This is a possible direction for solving the problems of directly translating BDs into an ER model. The work covers the current research activities and results in the development of a knowledge-based system to support the creation of ER models. We focused on translation from one language (a controlled natural language) to another (the formalised language of an ER data model) to obtain an ER model from a textual description. The translation method consists of a set of rules for translating parts of the sentential form into ER model constructs, depending on the textual and character patterns recognised in a BD.
We expect the problem of direct translation of BD sentences to be solved by first classifying those sentences according to the EARC information they may carry.
Our approach starts from the idea of formulating a kind of lean requirement specification, i.e., a business description (BD), in which each sentence can express some important ER semantics. Examples are: which entities (or main objects) are involved, with their numerical participation in certain relations (ECR class); which characteristics must be stored for each entity (EA class); how each of the entities is uniquely identified (EID class); and additional descriptions, such as the hierarchical relationship of the entities involved or some other clarification (personal nouns or expressions that use generic nouns such as system, database, or enterprise) (D class).
We observed that experts gradually create the ER model from a BD by identifying EARC constructs in single sentences based on sentence content (one word or a string) in a specific context. That knowledge-intensive human task was the methodological basis for developing the sentence classifier. The research questions (Q1, Q2 and Q3) are answered (A1, A2 and A3) as follows:
A1. Before describing the classifier, we listed in Table 3 some issues we encountered in translating CEN to the EARC language, using concrete examples. These issues provided the research motivation for developing the classifier.
A2. The data modelling experts’ reasoning processes and knowledge about particular sentence types in relation to ER model constructs were observed and formalised into a set of class characteristics. During that process, we found a method to identify the features of each EARC class. We expressed and documented all recognised class features as a set of expressions and rules using textual and logical functions (Table 6). The classifier was created using 310 sentences from 30 randomly selected BDs. The features of the primary classes and the rules used to determine the final translation class are given, as well as the multiclass classifier performance results (total accuracy of 96.77%). The classifier was then tested on the sentences of the remaining 20 BDs (214 sentences), reaching a final classification accuracy of 95.79%.
A3. After creating the classifier, we demonstrated the translation process with an example that resulted in an accurate translation into EARC elements. The problems and errors shown in Table 3 were eliminated by using the classifier, and the translation allows accurate identification of EARC elements.
Overall, the use of the sentence classifier is expected to result in a higher number of correctly translated BD sentences and fewer ambiguous translations. We expect that, with the classes defined and sentences classified, experts (or other users, e.g., students) will identify EARC constructs faster by paying attention only to the relevant parts of each sentence.
Applying sentence classification prior to translation would allow a future knowledge-based (KB) system to analyse each sentence in an isolated, class-specific context. In this way, translation rules specific to the class would be applied to each sentence, and the translation process would likely be faster and more precise.

Author Contributions

Conceptualization, S.Š.; methodology, S.Š., S.Č. and A.J.; software, S.Š. and S.Č.; validation, S.Š., S.Č. and A.J.; formal analysis, S.Š., S.Č. and A.J.; investigation, S.Š., S.Č. and A.J.; resources, S.Š., S.Č. and A.J.; data curation, S.Š.; writing—original draft preparation, S.Š. and S.Č.; writing—review and editing, S.Š. and S.Č.; visualization, S.Š.; supervision, S.Č. and A.J.; project administration, S.Č. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

Not applicable.

Acknowledgments

This work has been fully supported by the University of Rijeka under the project number uniri-drustv-18-140.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Šuman, S.; Jakupović, A.; Kuljanac Gržinić, F. Knowledge-based systems for data modelling. Int. J. Enterp. Inf. Syst. 2016, 12, 1–18.
  2. Šuman, S.; Jakupović, A.; Pavlić, M. Knowledge-Based Systems for Data Modelling: Review and Challenges. In Enterprise Information Systems and the Digitalization of Business Functions; Tavana, M., Ed.; IGI Global: Hershey, PA, USA, 2017; pp. 354–374.
  3. Schwitter, R. Controlled Natural Languages for Knowledge Representation. In Proceedings of the COLING ’10 23rd International Conference on Computational Linguistics, Beijing, China, 23–27 August 2010; pp. 1113–1121.
  4. Njonko, P.B.F.; Cardey, S.; Greenfield, P.; El Abed, W. RuleCNL: A controlled natural language for business rule specifications. In Proceedings of the 4th International Workshop, CNL 2014, Galway, Ireland, 20–22 August 2014; pp. 66–77.
  5. Fuchs, N.E.; Kaljurand, K.; Kuhn, T. Attempto Controlled English for knowledge representation. In Proceedings of the 4th International Summer School 2008, Venice, Italy, 7–11 September 2008; pp. 104–124.
  6. Hevner, A.; Chatterjee, S. Design Research in Information Systems—Theory and Practice; Springer: Boston, MA, USA, 2010; Volume 22, p. 320.
  7. Wheaton, G.R. Development of Taxonomy of Human Performance: A Review of Classificatory Systems Relating to Tasks and Performance; Technical Report No. 726-12/68-TR-1; AIR: Washington, DC, USA, 1968.
  8. Vessey, I.; Ramesh, V.; Glass, R.L. A unified classification system for research in the computing disciplines. Inf. Softw. Technol. 2005, 47, 245–255.
  9. Kocaleva, M.; Stojanov, D.; Stojanovik, I.; Zdravev, Z. Pattern Recognition and Natural Language Processing: State of the Art. TEM J. 2016, 5, 236–240.
  10. Duda, R.O.; Hart, P.E.; Stork, D.G. Pattern Classification, 2nd ed.; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2001.
  11. Brody, S. Cluster-Based Pattern Recognition in Natural Language Text. 2005. Available online: http://www.cs.huji.ac.il/labs/learning/Theses/Brody_MSc.pdf (accessed on 30 October 2021).
  12. Chen, P.P.S. English sentence structure and entity-relationship diagrams. Inf. Sci. 1983, 29, 127–149.
  13. Lo, A.W.; Choobineh, J. Knowledge-Based Systems as Database Design Tools: A Comparative Study. In Intelligent Support Systems Technology: Knowledge Management; IRM Press: Hershey, PA, USA, 2002.
  14. Gomez, F.; Segami, C.; Delaune, C. A system for the semiautomatic generation of E-R models from natural language specifications. Data Knowl. Eng. 1999, 29, 57–81.
  15. Choobineh, J.; Lo, A.W. CABSYDD: Case-Based System for Database Design. J. Manag. Inf. Syst. 2004, 21, 281–314.
  16. Omar, N.; Hanna, P.; Mc Kevitt, P. Acquisition of Entity-Relationship Models from Natural Language Specifications using Heuristics. In Proceedings of the 2005 International Conference on Information Technology and Multimedia at UNITEN (ICIMU’05), Selangor, Malaysia, 1 November 2005.
  17. Al-Safadi, L.A.E. Natural Language Processing for Conceptual Modeling. Int. J. Digit Content. Technol. Appl. 2009, 3, 47–59.
  18. Shahbaz, M.; Ahsan, S.; Shaheen, M.; Nawab, R.M.A.; Masood, S.A. Automatic Generation of Extended ER Diagram Using Natural Language Processing. J. Am. Sci. 2011, 7, 1–10.
  19. Btoush, E.S.; Hammad, M.M. Generating ER Diagrams from Requirement Specifications Based on Natural Language Processing. Int. J. Database Theory Appl. 2015, 8, 61–70.
  20. Lee, S.; Kim, N.; Moon, S. Context-adaptive approach for automated entity relationship modeling. J. Inf. Sci. Eng. 2010, 26, 2229–2247.
  21. Suraweera, P.; Mitrovic, A. An intelligent tutoring system for entity relationship modelling. Int. J. Artif. Intell. Educ. 2004, 14, 375–417.
  22. Kazi, Z.; Kazi, L.; Radulovic, B. Analysis of data model correctness by using automated reasoning system. Tech. Technol. Educ. Manag. 2012, 7, 1090–1100.
  23. Thonggoom, O.; Song, I.; An, Y. Semi-automatic Conceptual Data Modeling Using Entity and Relationship Instance Repositories. In Conceptual Modeling—ER 2011; Volume 6998, pp. 219–232.
  24. Kim, N.; Lee, S.; Moon, S. Formalized Entity Extraction Methodology for Changeable Business Requirements. J. Inf. Sci. Eng. 2008, 24, 649–671.
  25. Mich, L.; Franch, M.; Inverardi, P.N. Market Research for Requirements Analysis Using Linguistic Tools. Requir. Eng. 2004, 9, 40–56.
  26. Génova, G.; Fuentes, J.M.; Lorens, J.; Hurtado, O.; Moreno, V. A framework to measure and improve the quality of textual requirements. Requir. Eng. 2013, 18, 25–41.
  27. Lucassen, G.; Robeer, M.; Dalpiaz, F. Extracting conceptual models from user stories with Visual Narrator. Requir. Eng. 2017, 22, 339–358.
  28. Lucassen, G.; Dalpiaz, F.; van der Werf, J.M.E.M.; Brinkkemper, S. Improving agile requirements: The Quality User Story framework and tool. Requir. Eng. 2016, 21, 383–403.
  29. NER. Available online: https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da (accessed on 28 February 2022).
  30. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; Dean, J. Distributed Representations of Words and Phrases and Their Compositionality. In Proceedings of the Neural Information Processing Systems Conference, Lake Tahoe, NV, USA, 5–10 December 2013; pp. 3111–3119.
  31. TF-IDF. Available online: https://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html (accessed on 28 February 2022).
  32. GloVe. Available online: https://nlp.stanford.edu/projects/glove/ (accessed on 28 February 2022).
  33. NLTK. Available online: https://www.nltk.org/ (accessed on 1 March 2022).
  34. Stanford Tokenizer. Available online: https://nlp.stanford.edu/software/tokenizer.shtml (accessed on 1 March 2022).
  35. Hutto, C.; Gilbert, E. VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text. In Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media, Ann Arbor, MI, USA, 1–4 June 2014.
  36. SentiWordNet. Available online: https://github.com/aesuli/SentiWordNet (accessed on 1 March 2022).
  37. TextBlob. Available online: https://textblob.readthedocs.io/en/dev/ (accessed on 2 March 2022).
  38. Flair NLP. Available online: https://github.com/flairNLP/flair (accessed on 2 March 2022).
  39. Polonijo, B.; Suman, S.; Simac, I. Propaganda Detection Using Sentiment Aware Ensemble Deep Learning. In Proceedings of the 44th International Convention on Information, Communication and Electronic Technology (MIPRO), Opatija, Croatia, 27 September–1 October 2021; pp. 199–204.
  40. Choi, Y.; Nguyen, M.D.; Kerr, T.N. Syntactic and Semantic Information Extraction from NPP Procedures Utilizing Natural Language Processing Integrated with Rules. Nucl. Eng. Technol. 2021, 53, 866–878.
  41. Zhou, P.; El-Gohary, N.M. Ontology-Based Automated Information Extraction from Building Energy Conservation Codes. Autom. Constr. 2017, 74, 103–117.
  42. Zhang, J.; El-Gohary, N.M. Semantic NLP-Based Information Extraction from Construction Regulatory Documents for Automated Compliance Checking. J. Comput. Civ. Eng. 2016, 30, 04015014.
  43. Gagliardi, I.; Artese, M.T. Semantic Unsupervised Automatic Keyphrases Extraction by Integrating Word Embedding with Clustering Methods. Multimodal Technol. Interact. 2020, 4, 30.
  44. Comito, C.; Forestiero, A.; Pizzuti, C. Word Embedding Based Clustering to Detect Topics in Social Media. In Proceedings of the 2019 IEEE/WIC/ACM International Conference on Web Intelligence (WI), Thessaloniki, Greece, 14–17 October 2019.
  45. Hu, J.; Li, S.; Yao, Y.; Yu, L.; Yang, G.; Hu, J. Patent Keyword Extraction Algorithm Based on Distributed Representation for Patent Classification. Entropy 2018, 20, 104.
  46. Jakupović, A.; Pavlić, M.; Dovedan Han, Z. Formalisation method for the text expressed knowledge. Expert Syst. Appl. 2014, 41, 5308–5322.
  47. Erjavec, T. MULTEXT-East—Morphosyntactic Specifications (Version 4). Published 2010. Available online: http://nl.ijs.si/ME/V4/msd/html/msd-en.html (accessed on 20 December 2015).
  48. Yue, T.; Briand, L.C.; Labiche, Y. A systematic review of transformation approaches between user requirements and analysis models. Requir. Eng. 2011, 16, 75–99.
Figure 1. An overview of a subprocess—Classifier with performance.
Figure 2. Classifier performance—Learning phase.
Figure 3. Classifier testing performance using sentences from 20 business descriptions.
Figure 4. A part of the ER model that resulted from sentence classification and translation.
Table 1. CEN production rules and terminal tags with descriptions.

CEN PSG productions:
S → NP VP
NP → [D|NPs] N|Np|Pro|Ni
D → Dd|Di|Dg|Poss
Poss → NPs|Ds
NPs → ([D|NPs] N|Np|Pg) [’s|’]
N → {AP} Nc {PP}
AP → {AP} ([Rs|AdvP] Af)
AdvP → [Rs] Rm
PP → {PP} ([PSpec] Sp (NP|PP))
PSpec → Rmp|Mc
Ni → {NP} [Cc] NP
Pro → Pp|Pg
VP → V [NP [NP|PP|AP [PP]] | AP [PP] | PP [PP]]
V → [Vo | [Vo] Va [Rmp] | Vm Sp] Vm

POS Tag | POS Category Description | Multext MSD
Af | qualificative adjective | Af *
Cc | coordinating conjunction | Cc *
Dd | demonstrative determiner | Dd *
Dg | general determiner | Dg *
Di | indefinite determiner | Di *
Ds | possessive determiner | Ds *
Mc | cardinal numeral | Mc
Nc | common noun | Nc *
Np | proper noun | Np *
Pg | general pronoun | Pg *
Pp | personal pronoun | Pp *
Rm | modifier adverb | Rm *
Rmp | positive modifier adverb | Rmp
Rs | specifier adverb | Rs *
Sp | preposition (adposition) | Sp
Va | auxiliary verb | Va *
Vm | main verb | Vm *
Vo | modal verb | Vo *

The sign “*” in the MSD column represents all the possible attributes of a specific word category.
Table 2. EARC language elements.

Label | Description
SEARC | The form of every EARC sentence
eL | Left entity type
eR | Right entity type
AL = {aL1, …, aLn} | Set of “non-primary key” left entity attributes
ApL = {apL1, …, apLn} | Set of left entity attributes that represent the primary key
AR = {aR1, …, aRn} | Set of “non-primary key” right entity attributes
ApR = {apR1, …, apRn} | Set of right entity attributes that represent the primary key
r | Relationship name
cL | Minimum cardinality of left to right entity (the minimum number of eR that relates with one of the eL)
cR | Minimum cardinality of right to left entity (the minimum number of eL that relates with one of the eR)
CL | Maximum cardinality of left to right entity (the maximum number of eR that relates with one of the eL)
CR | Maximum cardinality of right to left entity (the maximum number of eL that relates with one of the eR)
Table 3. Problems without sentence classification approach.

No | Sentence (BD CEN POS tags) | Pattern | Sentential form part | EARC category | Sentence/Expression
1 | Every person can rent more than one city car. (Dg Nc Vo Vm Af Cc Mc Nc Nc) | *Nc*Vm* | Nc | eL | person
 | | * [Vo] [Va] Vm* | Vo Vm | r | can rent
 | | *Vm*{Nc}Nc | {Nc}Nc | eR | city car
 | | *Vm Af {Cc}Mc *Nc | Af {Cc}Mc | CL | more than one (M)
2 | Every person is described with a personal name. (Dg Nc Va Vm Sp Af Nc) | *Nc*Vm* | Nc | eL | person
 | | * [Vo] [Va] Vm* | Va Vm | r | is described
 | | *Vm*{Af}Nc | {Af}Nc | eR | personal name
3 | A person is either a patient or a medical worker. (Di Nc Vm Cc Di Nc Cc Di Af Nc) | *Nc*Vm* | Nc | eL | person
 | | * [Vo] [Va] Vm* | Vm | r | is (when there are attributes, the relationship is ignored)
 | | *Vm*[, | Cc] {Nc}Nc | Nc | aL1 | patient
 | | | Nc Nc | aL2 | medical worker
4 | A customer account has an account ID, description and date of creation. (Di Nc Nc Vm Di Nc Nc, Nc Cc Nc Sp Nc) | *Nc{Nc}*Vm* | Nc Nc | eL | customer account
 | | * [Vo] [Va] Vm* | Vm | r | has (when there are attributes, the relationship is ignored)
 | | *Vm*{Nc}[Sp]Nc* | Nc Nc | aL1 | account ID
 | | | Nc | aL2 | description
 | | | Nc | aL3 | date of creation
Table 4. Sentences that the classifier classified differently than the experts—Learning phase.

No | Sentence | Expert | Classifier Decision
1 | A system presents an amount to pay. | EA | ECR
2 | The home page lists the organisational purpose. | EA | ECR
3 | The website contains a contact page. | EA | ECR
4 | The website contains a member page. | EA | ECR
5 | A hospital has patients. | ECR | EA
6 | A primary doctor is a doctor. | ECR | D
7 | EGAS has products. | ECR | EA
8 | For each department, there is a head. | ECR | D
9 | Hospitals have employees. | ECR | EA
10 | The existing users access the loan information. | ECR | D
Table 5. Differently classified sentences—Testing phase.

No | Sentence | Expert | Classifier Decision
1 | A bookstore has 2 employees. | ECR | EA
2 | A library stores books and documents. | ECR | EA
3 | Each dean is a member of administrators. | ECR | D
4 | Patients have treatments. | ECR | EA
5 | The company has 50 plants. | ECR | EA
6 | The company has five worker unions. | ECR | EA
7 | The company has 500 work stations. | ECR | EA
8 | Artwork groups are portraits, still life, works by Picasso and works of the last century. | EA | ECR
9 | Visitors with an annual card are customers. | D | ECR
Table 6. Identified translation patterns for the EA class applied to the sentence given in (1) and (2).

Construct | Sentential Form Part | Pattern | Matched (Associated) BD Part
E | Di Nc | ?Nc*Vm* | program
a1 | {Af}{Nc}Nc | *Vm?{Af}{Nc}Nc* | name
a2 | {Af}{Nc}Nc | *Vm*a1,?{Af}{Nc}Nc* | program identifier
a3 | {Af}{Nc}Nc | *Vm*a2,?{Af}{Nc}Nc,* | total credit points
a4 | {Af}{Nc}Nc | *Vm*a3 Cc ?{Af}{Nc}Nc. | starting year
Table 7. Translation considering belonging to the EID class with detected class characteristics.

Construct | Sentential Form Part | Pattern | Matched (Associated) BD Part
E | [Di|Dd] Nc | ?Nc*Vm* | program
ap1 | {Nc} identifier | *Vm*?{Nc} identifier* | program identifier
