Linguistic Patterns and Linguistic Styles for Requirements Specification: Focus on Data Entities

Abstract: Requirements specification includes technical concerns of an information system and is used throughout its life cycle. It allows for sharing the vision of the system among stakeholders and facilitates its development and operation processes. Natural languages are the most common form of requirements representation; however, they also exhibit characteristics that often introduce quality problems, such as inconsistency, incompleteness, and ambiguousness. This paper adopts the notions of linguistic pattern and linguistic style and discusses their relevance to producing better technical documentation. It focuses on the textual specification of data entities, which are elements commonly referred to throughout different types of requirements, like use cases, user stories, or functional requirements. This paper discusses how to textually represent the following elements: data entity, attribute, data type, data entity constraint, attribute constraint, and even cluster of data entities. This paper shows concrete examples and supports the discussion with three linguistic styles, represented by a rigorous requirements specification language and two informal controlled natural languages, one with a more compact and another with a more verbose, expressive, and complete representation. We analyzed how other languages cope with the representation of these data entity elements and complemented that analysis and comparison based on the PENS classification scheme. We conducted a pilot evaluation session with nineteen professional subjects who participated and provided encouraging feedback, with positive scores in all the analyzed dimensions. From this feedback, we preliminarily conclude that the adoption of these linguistic patterns would help to produce better requirements specifications written more systematically and consistently.


Introduction
Requirements engineering (RE) is a discipline that intends to provide a shared vision and understanding of socio-technical systems among the involved stakeholders throughout their life-cycle [1,2]. The negative consequences of ignoring these early RE activities are extensively reported and discussed in the literature [3,4].
A system requirements specification (SRS, or just "requirements specification") is an important document that structures the concerns of such systems from the RE perspective. In many cases, these systems are only focused on software applications and respective databases, while in other circumstances, they may involve other elements and concerns. A good SRS offers several benefits, as reported in the literature [2,5,6]: it contributes to the establishment of an agreement and business contract between customers and suppliers; it provides a common ground for supporting the project budget and schedule estimation and plan; and it supports the project scope validation and verification and may support deployment and future maintenance activities. It is usually recommended that an SRS be defined according to a predefined SRS template, together with a set of recommendations on how to customize and use it. An SRS template prescribes a given document structure with supplementary practical guidelines. In general, these templates recommend the use of various views and constructs (e.g., actors, use cases, user stories) that might be considered "modular artifacts" in the sense of their definition and reuse. Because there are dependencies among these constructs, some authors argue that it is essential to minimize or prevent them, and some templates give support in this respect [7].
On the other hand, and because both technical and business stakeholders commonly use SRSs, they tend to be specified mostly in natural languages [8]. Indeed, natural languages reach an adequate communication level as they are flexible and universal, and humans are proficient at using them to communicate with each other. Additionally, they have minimal adoption resistance as a requirements documentation technique. However, they also exhibit some intrinsic features that frequently put them as the source of many quality problems such as inconsistency, incompleteness, and ambiguousness [1,2,9,10].
Due to these problems, some authors have proposed practical recommendations for writing better requirements [1,11-15], including general guidelines like the following: the language should be simple, clear, and precise; it should follow a standardized format to give coherence and uniformity to all sentences; the sentences should be simple, short, and written in an affirmative and active voice style; the vocabulary should be limited. However, these recommendations are better supported by the guidance of controlled natural languages (further information in Section 2).
In this scope, linguistic patterns are grammatical rules that allow their users to write correctly in a common language (additional details in Section 3). From a linguistic perspective, grammar is a collection of rules, but also a set of blueprints that guide speakers in producing more comprehensible and predictable sentences [16]. Despite the diversity of terms found in the literature (for instance, "syntactic requirements pattern" [1], "requirements template" [17], or "standardized format" [15]), we adopt in this paper the terms "linguistic pattern" and "linguistic style" as discussed originally by Silva to mean, respectively, the definition and the representation of such grammatical rules [18]. In that paper, Silva focused on business-level constructs, namely: glossary terms, stakeholders, business goals, processes, events, and flows [18]. We also use the terms "linguistic style" and "writing style" as synonyms; both provide a concrete and consistent representation of a given pattern. Although inspired by that earlier work, this paper is unique because it focuses on the textual specification of data entities, which was not addressed in that former work or by others, as discussed in the related work (see Section 5).
Data entities denote conceptual or domain objects of the system-of-interest and they are commonly referred to in various types of requirements like use cases, user stories, functional, or even informal requirements. As suggested in Figure 1, data entities and associated elements (e.g., attributes, attribute values, enumerations of concrete values, data constraints) are usually referred to in requirements scattered throughout the text and, unfortunately, represented inconsistently with different writing styles. For instance, as shown in Figure 1, the names "Products" (in sentence S3) and "product" (in S5) should refer to the same data entity; or "rate type" (in S4) and "rate" (in S4 and S4) are represented inconsistently and ambiguously because they should denote the same attribute.
In requirements specifications, data entities can be defined in a general way (e.g., by only referring to their names) or more exhaustively by enumerating their attributes, types, and even constraints. It may also be necessary to specify the data entity's attributes, either by only defining their names or by also specifying their types and constraints. An attribute type might be just a predefined type, like Integer, Double, String, Date, or Boolean, or it might involve a data enumeration or a customized type. Additionally, an attribute constraint might define that the attribute shall be unique, not null, a foreign key to another data entity, or even a customized check constraint.
Appl. Sci. 2021, 11, x FOR PEER REVIEW 3 of 34
The key contribution of this paper is the proposal and discussion of an integrated set of linguistic patterns, which allow requirements engineers and product designers to write systematically and consistently the data-related aspects commonly found in requirements specifications, including data entities, attributes, related constraints, and even clusters of data entities, if relevant. To support the discussion, the application of these linguistic patterns is shown with a simple yet illustrative example, represented according to different linguistic or writing styles, namely the RSL, a rigorous requirements specification language, and two informal controlled natural languages, the CNL-A and the CNL-B.
Although they are significant, it is not in the scope of this paper to discuss other linguistic patterns and related concepts such as use cases, user stories, or functional requirements. It is also out of scope to discuss the approaches that exist to transform text specifications of data entities into domain models (e.g., represented with visual notations like ER diagrams [19,20] or UML class diagrams [21]), ontologies (e.g., formally represented with OWL, which is a family of knowledge representation languages [22,23]) or even a system's database schemas (commonly represented with the SQL language [24]), and vice versa. These aspects can be considered for future work as the foundation for the design of an intermediate language (IL) that would support automatic transformations from natural language to IL sentences, e.g., using NLP (natural language processing) techniques [10] and, on the other hand, from IL to visual models or other representations (such as ontologies) using model-driven techniques [25,26]. Some of these aspects were surveyed, for example, by Bozyiğit et al. [27], Bork et al. [28], Schön et al. [29], and Cabot [30].
This paper is structured into seven sections. Section 2 introduces the relevance of controlled natural languages to the writing of requirement specifications and introduces the languages used in this research: CNL-A, CNL-B, and RSL. Section 3 defines the core notions of linguistic pattern and linguistic style. Section 4 discusses the proposed data entities' linguistic patterns and supports the debate with an easy-to-understand running example. Section 5 refers to and discusses the related work. Section 6 presents the evaluation of the proposed linguistic patterns and styles conducted with a pilot user session. Lastly, Section 7 presents the conclusion and discusses open issues.
In addition, Appendix A describes the running example used in Section 4 and includes a draft representation of that example with the discussed writing styles as well as with a UML class diagram. Appendix B summarizes the linguistic patterns and their recommended vocabulary, and provides a specification of the linguistic patterns represented with the CNL-B notation.

Controlled Natural Languages
The specification of requirements based on natural languages is very expressive and easy for humans to write and read, but not very precise, because natural languages are ambiguous and inconsistent by nature and hard for computers to manipulate automatically [31]. The usage of formal language methods could overcome these problems [32]. However, this only addresses part of the question, because formal methods are difficult to apply to not-so-critical systems: they require specialized training and are time-consuming. In an attempt to get the best from both worlds, i.e., the familiarity of natural language and the rigor of formal language, some approaches have proposed controlled natural languages, which are engineered to resemble natural languages [33].
A controlled natural language (CNL) defines a constrained use of a natural language's grammar (syntax) and a set of terms (comprising the semantics of these terms) to be used in the constrained grammar [1,2]. The adoption of CNLs delivers the following benefits: CNL sentences are easy to understand since they are like sentences in natural language; they are less ambiguous than expressions in natural language since they have a simplified grammar and a predefined vocabulary with precise semantics; and they are semantically correct and can be computationally manipulated since they may have a formal grammar and predefined terms.
For the sake of the explanation of this paper, we consider three different linguistic styles represented by two informal CNLs (named CNL-A and CNL-B) and a rigorously defined CNL (the RSL language).

CNL-A and CNL-B Languages
CNL-A and CNL-B are informally defined as follows. CNL-A is intended to express statements in a more compact writing style, namely according to the following meta-pattern: "<id> is a <type> <element> [, extends <isA>]?.", in which "<id>", "<type>", "<element>", and "<isA>" are template fragments that are replaced by concrete text fragments. Some example sentences in this language are: KeyUser is a Person Stakeholder. Customer is a User Actor. CustomerVIP is a User Actor, extends Customer.
On the other hand, CNL-B intends to define statements in a more verbose, expressive, and complete writing style, based on the following meta-pattern: "<element> <id> [(<name>)]? is a <type> [ and a <isA>]? [, described as <description>]?.". Some equivalent sentences written with this language can be the following: Stakeholder KeyUser (Key User) is a Person, described as a user representative. Actor Customer is a User, described as a user that buys products from the eShop. Actor CustomerVIP (Customer VIP) is a User and a Customer, described as a user that buys products from the eShop with a special discount program.
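To illustrate why the two writing styles can be regarded as equivalent, the following minimal sketch recognizes sentences in either style and extracts the same template fragments from them. The regular expressions are our own informal reading of the two meta-patterns (they assume single-word identifiers and types), and the function name `parse` is hypothetical; this is not a definition of CNL-A or CNL-B.

```python
import re

# Informal reading of "<id> is a <type> <element> [, extends <isA>]?."
CNL_A = re.compile(
    r"(?P<id>\w+) is a (?P<type>\w+) (?P<element>\w+)"
    r"(?:, extends (?P<isA>\w+))?\."
)

# Informal reading of "<element> <id> [(<name>)]? is a <type> [ and a <isA>]?
# [, described as <description>]?."
CNL_B = re.compile(
    r"(?P<element>\w+) (?P<id>\w+)(?: \((?P<name>[^)]*)\))? is a (?P<type>\w+)"
    r"(?: and a (?P<isA>\w+))?(?:, described as (?P<description>.+?))?\."
)


def parse(sentence):
    """Return (style, fragments) for a recognized sentence, or None."""
    for style, pattern in (("CNL-A", CNL_A), ("CNL-B", CNL_B)):
        match = pattern.fullmatch(sentence.strip())
        if match:
            # Keep only the fragments that actually occur in the sentence.
            fragments = {k: v for k, v in match.groupdict().items() if v is not None}
            return style, fragments
    return None


compact = parse("CustomerVIP is a User Actor, extends Customer.")
verbose = parse(
    "Actor CustomerVIP (Customer VIP) is a User and a Customer, described as "
    "a user that buys products from the eShop with a special discount program."
)
print(compact)
print(verbose)
```

Both calls yield the same values for id, type, element, and isA; the CNL-B sentence additionally carries the name and the description.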

RSL Language
RSL is a controlled natural language designed for the rigorous specification of requirements and tests [18,34]. RSL includes several constructs logically classified according to two dimensions [34]: abstraction level and specific concerns. The abstraction levels are Business, Application, Software, and Hardware levels. The concerns are active structure (subjects), behavior (actions), passive structure (objects), requirements, tests, relations and sets, and other concerns. From a syntactical perspective, any construct can be used in any type of system regardless of its abstraction level. That means it is possible to use a DataEntity construct at application or software levels but also business or hardware levels. However, the use of a DataEntity at the business level will be more general in comparison with its use at application or software levels, which shall be more detailed. Figure 2 shows the RSL partial metamodel that involves the constructs discussed in this paper. For example, it shows that: (1) a DataEntity aggregates a set of Attributes, which are classified as a primitive type or a data enumeration; (2) data constraints can be assigned to DataEntities or Attributes; (3) DataEntityCluster defines a set of inter-related data entities, each one with a predefined role such as main, child, or uses roles (further details on these concepts in Section 4).
The examples illustrated above can be represented with RSL as:
Stakeholder KeyUser "Key User": Person [description "a user representative"]
Actor Customer: User [description "a user that buys products from the eShop"]
Actor CustomerVIP "Customer VIP": User [isA Customer description "a user that buys products from the eShop with a special discount program"]

Linguistic Patterns and Linguistic Styles
As discussed originally by Silva [18], a linguistic pattern is a set of rules that defines the elements and the vocabulary that will be used in the sentences of these requirements technical documents. An element rule consists of a set of element attributes (e.g., <id>, <name> or <type>) defined by the following properties: name, type, and multiplicity. On the other hand, a vocabulary rule defines a set of literal terms (e.g., "User", "ExternalSystem") used to categorize some element attributes and to restrict the use to a limited number of terms. For example, the linguistic pattern Actor is defined by a set of two rules (i.e., the Actor element rule and the ActorType vocabulary rule). The Actor element rule defines its element attributes (e.g., <id>, <name>, <type>, <isA>, <stakeholder>, <description>) and for each attribute the respective name (e.g., name, type, isA), type (e.g., ID, String, ActorType, Stakeholder) and multiplicity (e.g., '?') properties. The multiplicity of an attribute is not represented by default (and in this case means "1", a mandatory attribute), or can be represented by the following characters '?', '+', and '*' meaning, respectively, 0..1 (optional), 1..* (one or many), and 0..* (zero or many). The type of an attribute can be a primitive type (e.g., ID, String, Boolean); an element type (e.g., the Actor of the isA attribute); or a vocabulary type (e.g., the ActorType of the type attribute).
The ActorType vocabulary rule is prefixed with the "enum" tag and defines the values of its parts, namely the literals 'User', and 'ExternalSystem'.
As shown in this simple example, a linguistic pattern defines two key aspects: a set of element attributes with respective name, type, and multiplicity; and a vocabulary, with a limited number of terms.
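The two key aspects of a linguistic pattern can be made concrete with a small data structure. The sketch below is only illustrative: the class and function names are hypothetical, and the multiplicities given to the Actor attributes are our assumptions (the text only lists the attribute names, their types, and the meaning of the markers).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Tuple


class ActorType(Enum):
    """Vocabulary rule: the literal terms mentioned in the text."""
    USER = "User"
    EXTERNAL_SYSTEM = "ExternalSystem"


@dataclass(frozen=True)
class AttributeRule:
    """One element attribute: name, type, and multiplicity."""
    name: str
    type: str
    multiplicity: str = "1"  # "1" (default, mandatory), "?", "+", or "*"


# (min, max) occurrences per multiplicity marker; None means unbounded.
BOUNDS: Dict[str, Tuple[int, Optional[int]]] = {
    "1": (1, 1), "?": (0, 1), "+": (1, None), "*": (0, None),
}

VOCABULARIES = {"ActorType": {t.value for t in ActorType}}

# Assumed Actor element rule; the '?' markers are illustrative guesses.
ACTOR_RULE = [
    AttributeRule("id", "ID"),
    AttributeRule("name", "String", "?"),
    AttributeRule("type", "ActorType"),
    AttributeRule("isA", "Actor", "?"),
    AttributeRule("stakeholder", "Stakeholder", "?"),
    AttributeRule("description", "String", "?"),
]


def violations(rule: List[AttributeRule], values: Dict[str, List[str]]) -> List[str]:
    """List multiplicity and vocabulary problems of one element instance."""
    problems = []
    for attr in rule:
        given = values.get(attr.name, [])
        low, high = BOUNDS[attr.multiplicity]
        if len(given) < low or (high is not None and len(given) > high):
            problems.append(
                f"{attr.name}: {len(given)} value(s), multiplicity '{attr.multiplicity}'"
            )
        allowed = VOCABULARIES.get(attr.type)
        if allowed is not None:
            problems += [
                f"{attr.name}: '{v}' is not in {attr.type}" for v in given if v not in allowed
            ]
    return problems


print(violations(ACTOR_RULE, {"id": ["CustomerVIP"], "type": ["User"], "isA": ["Customer"]}))
print(violations(ACTOR_RULE, {"id": ["Robot1"], "type": ["Robot"]}))
```

The first call reports no problems, while the second reports that 'Robot' is not a term of the ActorType vocabulary.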
Furthermore, as suggested in Figure 3, a linguistic pattern can be represented in multiple manners depending on the needs and interests of its users. In this context, a linguistic style is a concrete representation of a linguistic pattern, which means that a linguistic style is a specific template into which the attributes of the linguistic pattern can be substituted. Thus, a linguistic pattern can be represented by multiple linguistic styles, such as RSL, CNL-A, CNL-B, and others.
A linguistic style is an ordered set of two types of text fragments: literal text fragments (e.g., 'actor', '[', ']', ', described as'); and other template text fragments that are represented by the pattern "<element_name.attribute_name>", i.e., the element name followed by its attribute name, delimited between '<' and '>' (e.g., <actor.type>). For brevity, it is possible to omit the reference to the "element_name.". The multiplicity constraints are represented by the same characters referred to above (i.e., '?', '+' and '*').
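Reading a linguistic style as a template suggests a simple substitution procedure. The following sketch is a hypothetical illustration, not the authors' tooling: the function `render` and the two style strings are our own, only the optional marker '?' is handled, and the whitespace of the meta-patterns from Section 2 has been moved inside the optional groups so that omitted parts leave no stray spaces.

```python
import re

# One token is either an optional group "[ ... ]?" or a template fragment
# "<attribute>" / "<element.attribute>"; everything else is literal text.
TOKEN = re.compile(r"\[([^\[\]]*)\]\?|<(?:\w+\.)?(\w+)>")
FRAGMENT = re.compile(r"<(?:\w+\.)?(\w+)>")

# Adapted style strings (assumption: spacing moved inside the optional groups).
CNL_A_STYLE = "<id> is a <type> <element>[, extends <isA>]?."
CNL_B_STYLE = (
    "<element> <id>[ (<name>)]? is a <type>[ and a <isA>]?"
    "[, described as <description>]?."
)


def render(style: str, values: dict) -> str:
    """Substitute attribute values into a style; drop unfilled optional groups."""

    def fill(text: str) -> str:
        return FRAGMENT.sub(lambda m: str(values[m.group(1)]), text)

    def replace(match):
        optional_body = match.group(1)
        if optional_body is None:
            # Mandatory fragment: a missing value raises KeyError.
            return str(values[match.group(2)])
        names = [m.group(1) for m in FRAGMENT.finditer(optional_body)]
        return fill(optional_body) if all(values.get(n) for n in names) else ""

    return TOKEN.sub(replace, style)


customer = {
    "element": "Actor",
    "id": "Customer",
    "type": "User",
    "description": "a user that buys products from the eShop",
}
print(render(CNL_A_STYLE, customer))
print(render(CNL_B_STYLE, customer))
```

The same attribute values thus produce one sentence per style, which is the sense in which a single pattern can have several concrete representations.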
The syntactical rules that define RSL constructs are compliant with a general linguistic style, expressed as a set of formal rules defined by an Extended BNF grammar. This grammar is supported by the Xtext framework [35], which provides a ready-to-use integrated authoring tool, built on top of the Eclipse IDE. Different formats can be used to define the linguistic styles: a verbose representation, which uses the complete names of the attributes, or an equivalent, more compact representation, which uses their unqualified names. For the sake of simplicity, the compact format is the one adopted to represent the linguistic styles CNL-A and CNL-B throughout this paper, while a separate type of representation is used for the definition of the RSL.

Data Entities' Linguistic Patterns and Linguistic Styles
Data entities denote structural entities that exist in information systems, commonly associated with data concepts captured and identified from requirements elicitation and domain analysis processes. (These entities can later be represented by data structures, like tables or collections of some databases, but that is out of the scope of this paper.) As illustrated in the mind map of Figure 4, this suggests the existence and relationship of the following concepts: data entity, data attribute, data entity constraint, data attribute constraint, cluster of data entities, data enumeration, and data primitive type. Often these concepts appear in many specifications but scattered throughout the text and inconsistently with different names or different writing styles, as briefly introduced above in Section 1 and depicted in Figure 1. In this section, we argue that for all these concepts, there are patterns that might have diverse but consistent representations.
This paper uses a running example to support the explanation and discussion. This example refers to the requirements of a fictitious information system called "BillingSystem", which is an invoice management system inspired by and adapted from the example initially introduced by Silva [18]. Figure 5 depicts the main data entities and relationships of this example with a simplified UML class diagram: the BillingSystem must manage data entities like invoices with invoice lines, customers, users, products, etc. For the sake of simplicity, the examples shown in this section are only in CNL-A and CNL-B notations. However, Appendix A shows a complete description of these specifications in CNL-A, CNL-B and RSL, as well as an equivalent UML model.

Data Entity
The construct DataEntity denotes an individual structural entity that might include the specification of attributes and data constraints. A type and an optional subtype categorize a DataEntity. DataEntity type values can be the following, as proposed for instance by Microsoft [36]: "Parameter", "Reference", "Master", "Document" and "Transaction", based on their purposes and the type of data that they serve (see Table 1 for further information). Moreover, DataEntity subtype values can be defined as "Regular" and "Weak" (as suggested originally by Chen [19]), meaning, respectively, that the data entity has its own relevance and identification (e.g., the Invoice entity) or that its existence depends on another entity (e.g., the InvoiceLine entity that depends on the Invoice entity). A DataEntity can also include a set of attributes (each denoting a particular structural property) as well as a set of data constraints.

Type: Description
Parameter: Data entity used to define functional or behavioral parameters, for instance, required to set up a deployment or a module for a specific customer; usually, data that contain only one item, where the attributes are values for settings. Common examples of this type are company configuration or application-specific parameters.
Reference: Data entity used to define simple reference data of small quantities, which is required to operate a business process. Common examples of this type are units, dimensions, postal codes, tax codes.
Master: Data entity used to define data assets of the business; in general, these are the "nouns" of the business, which typically fall into categories such as people, places, and concepts.

Linguistic Pattern
The following rules define the linguistic pattern DataEntity (lp1) and Table 1 summarizes the possible values used to classify a DataEntity. Considering the BillingSystem example (see Appendix A for a complete description) and following a text analysis, we identify and annotate the data entities with bold text, and hence capture concepts like Invoice, Customer, Customer VIP, Product, VAT.
A VAT consists of the following information: VAT code, rate, name, and value. The VAT rates consider the following values: standard, reduced, and special.
[ . . . ] The system will maintain the following information for products: name, price, VAT rate, VAT value, and size category; a size category consists of one of the following sizes: Small, Regular, Large, ExtraLarge. [ . . . ] The system will maintain the following information for customers: name, fiscal id, image, bank information, and additional information such as address and personal contacts. A customer can also be defined as a customer VIP, and in this situation, there is a discount tax that can change over time. The operator shall create invoices (with respective details defined as invoice lines). An invoice will have the following information [ . . . ].
Furthermore, we classify each of these data entities according to some taxonomy (e.g., the one presented in Table 1). For example, with the knowledge obtained from the domain analysis, we can classify Invoice as a Document entity, Customer as a Master entity, etc. From this analysis and considering the linguistic styles defined above, the data entities captured from the text can be represented as follows: With the CNL-A Style (ls1-cnl-a):
A DataAttribute denotes a particular structural property of a data entity. A data attribute has the following properties: identifier (id), name, type, an optional default value (i.e., the value assigned by default), and some predefined constraints (DataAttributeConstraint), such as NotNull, Unique, Derived, ReadOnly, Encrypted properties, multiplicity, or even other customized constraints.
A DataEnumeration denotes a customized data type with a predefined set of string values. A DataType (i.e., an attribute type) might be a primitive type (DataPrimitiveType, e.g., Integer, Double, String, Date, DateTime) or a customized data enumeration (DataEnumeration).
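As an illustration only, the concepts above (data entity, attribute, data type, and enumeration) can be sketched as a small Python data model. This is not the paper's metamodel: the class and field names are assumptions chosen to mirror the terms in the text, the constraint names are kept as plain strings, and the capitalized spelling of the VAT rate values is assumed from the running example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Set, Tuple, Union


class DataEntityType(Enum):
    PARAMETER = "Parameter"
    REFERENCE = "Reference"
    MASTER = "Master"
    DOCUMENT = "Document"
    TRANSACTION = "Transaction"


class DataEntitySubType(Enum):
    REGULAR = "Regular"
    WEAK = "Weak"


class DataPrimitiveType(Enum):
    INTEGER = "Integer"
    DOUBLE = "Double"
    STRING = "String"
    DATE = "Date"
    DATETIME = "DateTime"


@dataclass(frozen=True)
class DataEnumeration:
    """Customized data type with a predefined set of string values."""
    id: str
    values: Tuple[str, ...]


# A DataType is either a primitive type or a customized enumeration.
DataType = Union[DataPrimitiveType, DataEnumeration]


@dataclass
class DataAttribute:
    id: str
    name: str
    type: DataType
    default_value: Optional[str] = None
    # Constraint names as plain strings, e.g. {"NotNull", "Unique"} (assumed encoding).
    constraints: Set[str] = field(default_factory=set)


@dataclass
class DataEntity:
    id: str
    name: str
    type: DataEntityType
    subtype: DataEntitySubType = DataEntitySubType.REGULAR
    attributes: List[DataAttribute] = field(default_factory=list)


# The VAT entity of the BillingSystem running example.
vat_rate_kind = DataEnumeration("VATRateKind", ("Standard", "Reduced", "Special"))
e_vat = DataEntity(
    id="e_VAT",
    name="VAT Category",
    type=DataEntityType.REFERENCE,
    attributes=[
        DataAttribute("VATCode", "VAT Code", DataPrimitiveType.INTEGER),
        DataAttribute("Rate", "Rate", vat_rate_kind),
        DataAttribute("VATName", "VAT Class Name", DataPrimitiveType.STRING,
                      constraints={"NotNull"}),
    ],
)
```

Keeping the enumeration as a value of the attribute's type, rather than as a separate kind of attribute, follows the statement that a DataType is either a primitive type or a data enumeration.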

Examples
Considering the BillingSystem example, we can capture text fragments that might suggest attributes and other aspects associated with them. We identify and annotate these fragments as follows: data entity attributes (text marked with light gray), e.g., product name, product values, VAT code and value, and data enumeration values (text marked with light green), for instance, Small, Regular and Large in the text fragment below: A VAT consists of the following information: VAT code, rate, name, and value.

Linguistic Pattern
A data constraint denotes a particular constraint on the values of an attribute or the data entity as a complete element. Some of these constraints can be assigned directly to a given attribute by setting a predefined property (e.g., NotNull or Unique); other constraints correspond to referential integrity among data entities; and still others can express customized constraints. The following rules define the linguistic patterns DataEntityConstraint and DataAttributeConstraint (lp3). DataEntityConstraint involves the setting of the following data entity's constraints: read-only (isReadOnly) and encrypted (isEncrypted); also customized constraints (checks) can be set.
ForeignKey is commonly used to specify a one-to-many relationship to another data entity (e.g., a Product shall have one VAT code, and a VAT is used to categorize several Products, so we shall define in the Product data entity the VATcode attribute, which will have a ForeignKey constraint to the VAT entity). The ForeignKey specification requires two properties: the data entity to which the attribute is related (targetEntity), and an onDelete option (onDelete). The onDelete property defines what shall happen when an object referenced by a ForeignKey is deleted. Common strategies to deal with this situation are represented by the values of ForeignKeyOnDeleteType, namely: (1) CASCADE: cascade the deletion, which emulates the behavior of the SQL constraint ON DELETE CASCADE and also deletes the object containing the ForeignKey; (2) PROTECT: prevent the deletion of the referenced object by raising an exception; (3) SET_NULL: set the ForeignKey to null, which is only possible if the attribute is allowed to be null.
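The three onDelete strategies can be made concrete with a small in-memory sketch in Python. It is only an illustration under stated assumptions: the function `delete_vat`, the exception `ProtectedError`, and the dictionary-based representation of VAT and Product objects are hypothetical and are not part of any of the languages discussed here.

```python
from enum import Enum
from typing import Dict, List


class ForeignKeyOnDeleteType(Enum):
    CASCADE = "CASCADE"
    PROTECT = "PROTECT"
    SET_NULL = "SET_NULL"


class ProtectedError(Exception):
    """Raised when PROTECT forbids deleting a referenced object."""


def delete_vat(vat_code: int, vats: Dict[int, str], products: List[dict],
               on_delete: ForeignKeyOnDeleteType, nullable: bool = True) -> None:
    """Delete a VAT and handle the Products whose VATCode references it."""
    referencing = [p for p in products if p["VATCode"] == vat_code]

    if on_delete is ForeignKeyOnDeleteType.PROTECT:
        # Refuse to delete a VAT that is still referenced.
        if referencing:
            raise ProtectedError(f"VAT {vat_code} is still referenced")
    elif on_delete is ForeignKeyOnDeleteType.SET_NULL:
        # Only valid when the foreign key attribute may hold null.
        if not nullable:
            raise ValueError("SET_NULL requires a nullable attribute")
        for product in referencing:
            product["VATCode"] = None
    elif on_delete is ForeignKeyOnDeleteType.CASCADE:
        # Also delete every Product that references the VAT (in place).
        products[:] = [p for p in products if p["VATCode"] != vat_code]

    del vats[vat_code]
```

All checks run before any object is removed, so a rejected deletion leaves both collections unchanged.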

Linguistic Styles
The following statements define different concrete representations for the DataConstraint patterns according to the CNL-A, CNL-B, and RSL languages.
With the Style of the CNL-B:
DataEntity e_VAT (VAT Category) is a Reference attribute VATCode "VAT Code" is a Integer (PrimaryKey), attribute Rate "Rate" is a DataEnumeration VATRateKind, attribute VATName "VAT Class Name" is a String(30) (NotNull), attribute VATValue "VAT Class Value" is a Decimal(2.1) (NotNull).

Cluster of Data Entities
A DataEntityCluster denotes a group of data entities that have logical arrangements among themselves. A DataEntityCluster can be a construct particularly relevant when associated with use cases, as suggested in Figure 6. Each data entity that belongs to a cluster has a specific role, namely: "main", "child", or "uses". The "main" role identifies the principal DataEntity of that cluster (e.g., the "Invoice" main entity as suggested in Figure 6); the "child" role identifies a part-of or child DataEntity (e.g., the "InvoiceLine" weak entity as suggested in Figure 6); and the "uses" role identifies a dependent or used DataEntity (e.g., the Customer entity as suggested in Figure 6). A DataEntityCluster aggregates exactly one data entity with the "main" role, and optionally other data entities with "child" and "uses" roles. A DataEntityCluster might be classified according to the type DataEntityClusterType, whose values can be the same as those discussed above for data entities (see Table 1), and which usually corresponds to the type of its "main" data entity. When mapping conceptual or logical data models into physical data models, DataEntityClusters are usually converted into SQL "views" or "virtual tables" [37,38].

Linguistic Pattern
The following rules define the linguistic pattern DataEntityCluster (lp4), and Table 1 summarizes the types used to categorize it.

Examples
From the analysis of the BillingSystem example and considering the linguistic styles defined above, the clusters of data entities can be represented as follows:
With the Style ls4-cnl-a:
ec_Product Master DataEntityCluster with main e_Product, uses e_VAT. ec_Invoice Document DataEntityCluster with main e_Invoice, child e_InvoiceLine, uses e_Customer.

With the Style ls4-cnl-b:
DataEntityCluster ec_Product is a Master with e_Product as the main entity, e_VAT as the uses entity. DataEntityCluster ec_Invoice is a Document with e_Invoice as the main entity, e_InvoiceLine as child entity, e_Customer as the uses entity.
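To make the role structure of a cluster explicit, the following Python sketch models a DataEntityCluster and renders it in the compact CNL-A style shown above. It is a minimal illustration, not the RSL or CNL tooling: the class layout and the method name `to_cnl_a` are assumptions, and the rule that a cluster has exactly one "main" entity is encoded simply by making `main` a single required field.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataEntityCluster:
    id: str
    type: str                      # e.g. "Master" or "Document"
    main: str                      # exactly one entity with the "main" role
    children: List[str] = field(default_factory=list)  # "child" role
    uses: List[str] = field(default_factory=list)      # "uses" role

    def __post_init__(self) -> None:
        if not self.main:
            raise ValueError("a cluster needs one main data entity")

    def to_cnl_a(self) -> str:
        """Render the cluster as a single CNL-A style sentence."""
        parts = [f"main {self.main}"]
        parts += [f"child {c}" for c in self.children]
        parts += [f"uses {u}" for u in self.uses]
        return f"{self.id} {self.type} DataEntityCluster with " + ", ".join(parts) + "."


ec_product = DataEntityCluster("ec_Product", "Master", main="e_Product", uses=["e_VAT"])
ec_invoice = DataEntityCluster("ec_Invoice", "Document", main="e_Invoice",
                               children=["e_InvoiceLine"], uses=["e_Customer"])

print(ec_product.to_cnl_a())
print(ec_invoice.to_cnl_a())
```

The same object could be rendered in the more verbose CNL-B style by a second method, which is the point of separating the pattern (the fields) from the style (the rendering).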

Related Work
The process of producing requirements specifications consists of creating descriptions of the application domain as well as a prescription of what the system should do, and other organizational, legal, or technological constraints [39]. In general, these requirements are specified in natural languages because of their higher expressiveness and ease of use. However, natural languages present drawbacks like ambiguity, inconsistency, and incompleteness [1,2,9,40], and so, their specifications are usually complemented by other requirements and models that use CNLs, formal languages, or modeling languages. These languages provide constructs (e.g., data entity, actor, use case, or requirement) that define their abstract syntax and semantics, and address diverse concerns and abstraction levels. Also, these languages provide different notations or concrete syntaxes, such as textual, graphical, tabular, or form-based representations [28,41,42].
On the other hand, data modeling involves the definition of different data models or data schemas defined at different levels of abstraction [37,38]. In this respect, Ribeiro et al. [42] present an extensive discussion and give examples of data modeling and data analytics processes. They discuss data modeling at different abstraction levels (i.e., conceptual, logical, and physical data models) considering not only operational but also decision support and big data technologies [42].
Moreover, ontology engineering has proposed methods for building ontologies, which are formal representations of a set of concepts within a domain and the relationships between those concepts. For instance, as Tudorache discussed [43], ontology engineering has been strengthened by the adoption of standards about ontologies, by the development or extension of ontology software tools, and by the wider recognition of the importance of standardized vocabularies and formalized semantics. However, despite these advancements, ontology engineering is still a difficult process, and many challenges remain to be solved.
For instance, semantic web languages, especially OWL, have a high learning curve [44], even for technical people with software engineering or database backgrounds [43].
Although this paper is focused on CNLs and their intrinsic linguistic patterns, we discuss below other approaches with concrete languages and their key characteristics concerning the specification of data entities and data modeling. Tables 2 and 3 summarize the comparison of the languages analyzed below. That includes the classification of each language grouped by the following language categories: natural language, formal language, controlled natural language, and modeling language. Table 2 compares these languages based on the following aspects: (1) the language application scope (e.g., general communication, software engineering, system engineering, or requirements engineering); (2) the language definition technique (e.g., grammar, metamodeling, or implicitly); (3) the language concrete syntax (e.g., textual, graphical, tabular, or form-based); and (4) the language classification based on the PENS scheme.
PENS is a CNL classification framework proposed by Kuhn [33], based on the following dimensions: Precision, Expressiveness, Naturalness, and Simplicity, with each dimension classified on a scale of classes from 1 to 5. These PENS dimensions have the following meaning: Precision captures the degree to which the meaning of a text in a certain language can be directly retrieved from its textual form, i.e., the sequence of language symbols. Expressiveness describes the range of propositions that a given language can express. Naturalness describes how close the language is to a natural language in terms of readability and understandability to speakers of the given language. Simplicity is a measure of how simple or difficult it is to comprehensively describe a given language, which covers the description of its syntax and semantics; it is not a measure of the effort needed by a human to learn the language, but rather it is closely related to the effort needed to fully implement the syntax and semantics of a language in a computer program (because of that, the name "simplicity" can be misunderstood, and maybe "computability" could be a better name for this dimension; nevertheless, we keep the original name as proposed by Kuhn with this caveat).
For example, for Precision and Simplicity, English is at the bottom end of the scale in class 1, represented as P1 and S1. At the opposite end of the scale, a formal language like B is in classes 5 and 4, represented as P5 and S4. For Expressiveness and Naturalness, the roles are switched: English is at the top end (E5 and N5) and the B formal language at the bottom (E1 and N1). As pointed out by Kuhn, in general, more is better for each of the PENS dimensions; however, this does not necessarily hold in practice. Indeed, a certain level in any of the dimensions is often good enough for a given application domain and going beyond that level will not bring additional benefit [33]. Apart from a few languages (i.e., English, B, and ACE), which were previously discussed and evaluated by Kuhn, the PENS-based evaluation of the other languages was performed by us in a reflective approach and is, therefore, subject to some subjectivity and debate.
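A PENS classification is just four class numbers, one per dimension, so it can be handled as a small record. The Python sketch below is only an illustration of the notation: the names `Pens` and `parse_pens` are hypothetical, and the parser assumes labels written as the four letters P, E, N, S each followed by a class from 1 to 5, with optional spaces.

```python
import re
from typing import NamedTuple


class Pens(NamedTuple):
    precision: int
    expressiveness: int
    naturalness: int
    simplicity: int


_PENS_LABEL = re.compile(
    r"\s*P\s*([1-5])\s*E\s*([1-5])\s*N\s*([1-5])\s*S\s*([1-5])\s*"
)


def parse_pens(label: str) -> Pens:
    """Parse a label such as 'P1 E5 N5 S1' into its four class numbers."""
    match = _PENS_LABEL.fullmatch(label)
    if match is None:
        raise ValueError(f"not a PENS label: {label!r}")
    return Pens(*(int(group) for group in match.groups()))


english = parse_pens("P1 E5 N5 S1")   # natural language, as classified in the text
b_language = parse_pens("P5 E1 N1 S4")  # B formal language, as classified in the text
```

With this representation, comparing two languages on a single dimension is a plain attribute comparison, e.g. `english.naturalness > b_language.naturalness`.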
In addition, Table 3 summarizes the key constructs included in each language's abstract syntax that are most closely related to the scope of this paper, namely by analyzing whether they support the specification of data entities, attributes, clusters of data entities, and enumerations, including the specification of additional aspects as discussed above in Section 4, such as data entity types, data entity constraints, and attribute constraints. We denote "NA" for some value or situation that is "not applicable" or "does not satisfy" some criteria, and "Y" or "N", respectively, if the criteria are (Yes) or are not (No) satisfied by that language.

Natural Languages
The process of producing requirements documentation has consisted of creating natural language descriptions of what the system of interest should do and of related constraints. A natural language is any language commonly used by humans in their daily communication, which evolves in an unplanned manner through time, driven by its users' needs and creativity. However, although natural languages are in practice the most frequently used to document requirements, because they are very expressive and easy for humans to write and read, it is well known that such specifications are ambiguous and inconsistent by nature, and hard for computers to interpret and check automatically [1,2]. Concerning the documentation of data entities, natural language-based specifications tend to be inconsistent and incomplete: they often define only the names of data entities and their attributes, omit the explicit definition of the involved types and constraints, and even write these names inconsistently, as discussed in Section 1 and illustrated in Figure 1. For these reasons, a natural language like English is evaluated as P1 E5 N5 S1 (see Table 2).

Formal Method Languages
Formal methods were expected to solve some of the drawbacks attributed to natural language approaches, namely by dealing with ambiguity and enabling the specification of software systems or just their components. Many formal methods (e.g., Z, VDM, B, Alloy) follow state-based formalisms [45], in which their languages provide constructs for defining modular abstractions that have an internal state. These methods were also expected to exhibit an increased expressivity beyond algorithmic descriptions. For instance, the B Method [32,46] is a popular representative of this class of approaches. It is a formal method that employs mathematical formalisms (such as set theory and logic) for iteratively developing software through stepwise refinements, from its initial abstract specification to its source code implementation. The B language is based on abstract machine notation, which is a specification language embedded with several high-level programming language notions. According to B specifications, the system of interest is represented as a set of abstract machines. The notion of the abstract machine allows one to encapsulate both the state and operations (on the state) of a system component and assign it a meaningful name. This notation allows one to define an invariant based on the machine's internal variables, which must not be violated regardless of the abstract machine's state. Besides, B supports the definition of constraints, which express requirements related to the application domain; and assertions, which consist of a set of theorems that support the process of discharging proof obligations and are essential to ensure that the resulting source code can be proven to be consistent with the original specification. These invariants are equivalent to the (data entity and attribute) constraints discussed in this paper.
However, the constraints discussed in this paper do not depend on the behavior of the involved data entity and are more focused on static or structural aspects. Despite the benefits arising from the usage of a sound theoretical foundation, formal methods present some limitations, namely the difficulty of properly using such formal languages, even for users with a technical background but lacking the necessary mathematical foundation. In general, the complexity of these languages often prevents these approaches from being cost-effective for documenting and developing software systems other than mission-critical ones [47,48]. For these reasons, B is evaluated as P5 E1 N1 S4 (see Table 2).

Controlled Natural Languages
A controlled natural language (CNL) is a constructed language that is based on some natural language, being more restrictive concerning lexicon, syntax, and semantics while preserving most of its natural properties [33]. CNL approaches have emerged in different application domains and various disciplines such as computer science, philosophy, linguistics, and engineering. CNLs have been designed to better support technical writing or knowledge engineering. CNLs have been applied to improve communication among humans, to improve translation, or to provide natural and intuitive representations for formal notations. CNLs can be classified into two general categories: human-oriented and machine-oriented CNLs. Human-oriented CNLs intend to improve the readability and comprehensibility of technical documentation and to simplify and standardize human-human communication for specific purposes. On the other hand, machine-oriented CNLs intend to improve the translatability of technical documents and the acquisition, representation, and processing of knowledge. Since the structure of CNLs is usually simpler than the structure of natural language, CNLs are easier for a computer to process and more natural for humans to understand. An ideal CNL should also be effortless to write and expressive enough to describe the problem at hand.
Schwitter surveys machine-oriented CNLs that can be used for knowledge representation and can serve as high-level interface languages to knowledge systems, namely [49]: Attempto Controlled English (ACE), Processable English (PENG), and Computer Processable Language (CPL). Also, based on the PENS classification scheme, Kuhn surveys an extensive list of 100 English-based CNLs, including, for example, Basic English, Caterpillar Fundamental English, SBVR Structured English, ACE, and Gherkin.
Among these CNLs, we highlight ACE as a relevant representative of this category of languages. ACE is a precisely defined subset of English that can automatically and unambiguously be translated into first-order logic [31]. ACE may appear to be completely natural but is a formal language; concretely, it is a first-order logic language with English syntax. ACE was first proposed as a language for software specifications, but later, its focus shifted towards knowledge representation and the semantic web. The essential features of ACE include complex noun phrases, plurals, anaphoric references, subordinated clauses, modality, and questions [31]. So, contrary to this paper, which is focused on the definition of specific data entity elements that exist in requirements specifications, ACE is a more general-purpose language dedicated to knowledge representation that, of course, can represent all the discussed concepts, like a natural language. As shown in Table 2, ACE was originally classified by Kuhn as P4 E3 N4 S3 [33].
RNL-ER [50] is a CNL for specifying entity-relationship models that not only solves the verbalization problem for these models but also provides the benefits of automatic verification and validation, and of semantic round-tripping, which makes the communication process between the domain experts and the knowledge engineers transparent. However, the RNL-ER language is particularly aligned with ER models and hence simpler in comparison with the linguistic patterns and styles discussed in this paper, e.g., it does not allow a complete classification of data entities or data attributes, and it does not provide concepts like data enumeration, more complete constraints, or data entity clusters. For these reasons, RNL-ER is classified as P3 E2 N4 S3.
SilabREQ [51] is another CNL, designed for use case specification, that allows describing user and system actions in a precise way. SilabREQ enables the automated analysis and processing of software requirements and the generation of different parts of information systems through the adoption of model-driven techniques. SilabREQ is formally defined by a BNF grammar that includes constructs like actors, use cases, data concepts, and state machines. Contrary to other languages, SilabREQ proposes constructs to textually specify the details of use cases, including scenarios and use case actions. As shown in Table 3, SilabREQ also allows the specification of data entities with its construct "Concept" but only in a simple way. For instance, it does not support concepts like a cluster of data entities, data enumeration, or data constraints. For these reasons, SilabREQ is classified as P4 E2 N2 S4.

Modeling Languages
Requirements specifications have been often documented according to a blended approach, i.e., by using natural language (or CNL) textual sentences complemented with graphical models. These models are represented with general-purpose modeling languages, such as UML [21] or SysML [52], or with more specific modeling languages, such as RML [53] or XIS* [54][55][56]. Although these models can be more precise than plain textual specifications, they still leave room for ambiguity and inconsistency; thus, they are usually considered as being semi-formal.
UML (Unified Modeling Language) [21] is the general-purpose modeling language promoted by OMG and considered by many as the de facto standard for modeling and documenting object-oriented software systems. UML is defined explicitly by the MOF (OMG Meta Object Facility) and provides many types of diagrams, such as class, object, sequence, use cases, state machine, component diagrams, etc., with several of them applied at different abstraction levels. Considering its application to the specification of requirements, UML provides relevant constructs such as actors, use cases, classes/objects, and state machines, which are commonly used as complementary representations.
SysML (Systems Modeling Language) [52] is another general-purpose modeling language mainly targeted at systems engineering. It supports the specification of an extensive range of systems and systems-of-systems, involving not only software but also hardware and other physical elements. SysML is defined as an extension of UML, using the UML profile mechanism, and mainly introduces new diagram types such as block definition, internal block, and requirements diagrams. SysML requirements diagrams allow requirements to be illustrated graphically (i.e., represented as a particular box with a name and a stereotype) together with their relationships, which are divided into two categories: relationships between requirements (e.g., containment, derive, and copy) and relationships between requirements and other model elements (e.g., satisfy, verify, refine, and trace). Although SysML includes this new requirements diagram, it suffers from the same drawbacks referred to above for UML, namely concerning the multiple interpretations of its expressive but semi-formal models.
Concerning the data entity constructs, UML and SysML allow a flexible definition of entities, attributes, data enumerations, and even some simple and predefined constraints. However, foreign keys and more complex constraints are not easily defined graphically and are sometimes expressed as complementary annotations; moreover, the categorization of data entities is not very common, though stereotypes may be used for that purpose. On the other hand, the models produced with these languages are not natural and easy to understand by non-technical users, namely in comparison with CNL-based approaches. For these reasons, both UML and SysML are classified as P3 E3 N1 S3 (see Table 2).
RML (Requirements Modeling Language) [53] is one of the largest RE-specific languages, gathering many constructs and types of models, some of them found in popular modeling languages such as UML or SysML, e.g., business data diagram or state diagram, or in earlier languages like Structured Analysis, e.g., data dictionary, system flow, state diagram. Contrary to these languages, RML is not just a graphical language but also provides other representations, such as tabular (e.g., data dictionary, roles and permissions matrix, state table) and form-based (e.g., use cases). RML models are classified into the following categories: objectives, people, systems, and data. The most related to this paper are the business data diagram (BDD) and the data dictionary (DD). BDDs look very similar to ER diagrams and show business data objects and respective relationships from a business stakeholder's perspective. In turn, DDs define the fields that make up the business data objects by presenting in a tabular format the list of the respective attributes (fields), with the detail of their properties, such as id, name, data type, valid values, and data attribute constraints. However, RML is defined implicitly, and consequently, its models/specifications are simple and intuitive but less rigorous and precise in comparison with UML, SysML, or RSL (which are defined by metamodels or grammars). Also, RML does not consider the definition of data enumerations, data entity level constraints, or data entity clusters, as discussed in this paper. For these reasons, the data-specific models of RML (i.e., BDD and DD) are classified as P2 E3 N2 S2 (see Table 2).
XIS* represents a set of modeling languages designed to model web and mobile applications in a platform-independent way, sharing a common structure with concrete languages like XIS-Mobile [54,55], XIS-Web [56], and XIS [57]. These languages have been implemented as UML Profiles and so their models can be considered UML extensions based on specific stereotypes. XIS* models are defined around four views [55,56]: entities view, architectural view, use cases view, and user-interfaces view. Regarding the focus of this paper, the relevant view for our analysis is the "entities view", in which entities and business entities and respective relationships are graphically represented. As shown in Table 3, many of the concepts discussed in this paper are provided by XIS*; however, XIS* languages do not allow the categorization of data entities or clusters of data entities, nor the specification of constraints as discussed in this paper. XIS* models are not very natural for non-technical users; they are more precise than the equivalent UML models but also less expressive. For these reasons, XIS* languages are classified as P4 E2 N1 S3 (see Table 2).

CNLs Used in This Paper
To support the explanation in this paper, we introduced two informal CNLs, CNL-A and CNL-B, and the RSL language. Both CNL-A and CNL-B were informally defined in Section 2.1 and extensively used to discuss different writing styles close to natural language. CNL-A allows sentences to be expressed in a compact writing style, and despite its simplicity and compactness, it can represent most of the aspects of the discussed linguistic patterns, except the details of constraints, the classification of data attributes, and some other details (e.g., it does not represent the elements' description and alternative name). On the other hand, CNL-B allows sentences to be defined in a more expressive and complete writing style according to the various aspects of the discussed linguistic patterns. We do not claim that they should be "the" informal CNLs to represent data entity aspects most naturally, but they are good starting points for consideration and customization by different teams and individuals. According to the PENS classification scheme, as shown in Table 2, CNL-A and CNL-B are classified respectively as P3 E3 N4 S3 and P3 E4 N4 S3; their main difference is in the expressiveness dimension because CNL-B allows producing more expressive and complete sentences than CNL-A.
Finally, RSL [18,34] shares common concepts with SilabREQ, XIS* [55,56], and ASL [58], namely concerning the definition of use cases and data entities. However, SilabREQ, XIS*, and ASL are focused on software models, while RSL is focused on the specification of requirements. RSL addresses RE aspects by providing business- and system-level constructs and supporting multiple requirement types such as use cases, goals, quality requirements, or user stories. Concerning the definition of data entity elements, in the scope of requirements specification, RSL proves to be a rich language by supporting most aspects of the discussed linguistic patterns. Also, RSL shows a higher level in the precision and simplicity (i.e., computability) dimensions because it is a rigorous language and, hence, easy to implement in a software program. Due to its higher expressiveness and the fact that it is not easy to write RSL sentences without appropriate tool support, RSL is classified as P4 E4 N3 S4 according to the PENS scheme.

Other Approaches
The linguistic patterns discussed in this paper can be represented according to different concrete syntaxes as discussed above, mostly for the sake of the requirements quality, and hence, for the sake of their writers and readers. Recent work in the areas of ontology engineering [43] and knowledge graphs [59,60] deserves to be mentioned too.
A knowledge graph (KG) is a multi-relational graph composed of entities (nodes) and relations (different types of edges), usually defined at the data level [59]. Each edge is represented as a triple of the form (head entity, relation, tail entity), also called a fact, indicating that two entities are connected by a specific relation. Their uses commonly range from intelligent search to analytics, cataloging, data integration, and more, including scenarios that require exploiting diverse, dynamic, large-scale collections of data. Hogan et al. [60] provide a very recent and extensive primer on KGs. They discuss various graph-based data models and query languages that are used for KGs and debate the roles of schema, identity, and context. Moreover, they review methods for the creation, enrichment, quality assessment, refinement, and publication of KGs, and overview popular open and enterprise KGs, their applications, and how to use the techniques. Wang et al. present a survey of approaches and applications of KG embedding [59], which embeds the components of a KG, including entities and relations, into continuous vector spaces, simplifying their manipulation while preserving the inherent structure of the KG.
Some work on ontologies and KGs has been conducted in the RE domain, especially on the specification of requirements. For instance, Pan et al. [61] propose the adoption of a Pattern-Based KG-Embedding to represent non-functional requirements and respective relationships. Guo et al. [62] develop the KARA tool that analyses the result of automatic testing of Android applications to generate KGs of these tests, which can then be customized and enriched by users with their domain knowledge. Yanuarifiani et al. [63] propose a rule-based ontology framework for generating textual requirements specifications and BPMN models and present a simple example to support the discussion of their approach. Kim and Lee [64] propose a cognitive three-layered framework that integrates different modeling methodologies and knowledge sources related to cyber-security; their framework helps in understanding essential components of security and making recommendations for security requirements regarding threat analyses and risk assessments. Likewise, Gonçalves and Silva [65] discuss an approach that allows the definition and specification of security-specific concerns like security requirements but also vulnerabilities, risks, or threats; first, they discuss the extensions of the RSL language to support such security-specific concepts, and then they discuss the relevance of a library of reusable security-specific specifications with concrete examples. Liu et al. [66] propose an approach that efficiently uses the data of application descriptions in the RE stage, with the initial requirements used to retrieve additional information from the model constructed by mining the domain knowledge from app descriptions.
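To make the triple representation concrete, the following minimal Python sketch stores a few facts as (head entity, relation, tail entity) tuples and groups the outgoing edges of one entity. The entity identifiers, the relation names, and the `neighbors` helper are hypothetical; they are loosely inspired by the BillingSystem example and are not taken from the cited works.

```python
from collections import defaultdict
from typing import NamedTuple


class Fact(NamedTuple):
    """One edge of the knowledge graph: (head entity, relation, tail entity)."""
    head: str
    relation: str
    tail: str


# Hypothetical data-level facts (concrete objects, not entity types).
facts = [
    Fact("invoice_1", "issued_to", "customer_7"),
    Fact("invoice_1", "has_line", "line_1"),
    Fact("invoice_1", "has_line", "line_2"),
    Fact("line_1", "refers_to", "product_3"),
]


def neighbors(all_facts, head):
    """Group the tail entities reachable from `head` by relation type."""
    grouped = defaultdict(list)
    for fact in all_facts:
        if fact.head == head:
            grouped[fact.relation].append(fact.tail)
    return dict(grouped)


print(neighbors(facts, "invoice_1"))
```

A plain list of tuples is enough for illustration; a real KG would typically rely on a dedicated graph store and query language, as discussed by Hogan et al. [60].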
The analysis of these approaches raises new challenges for future work. For instance, we understand that the discussion around the specification of "data entities" can be extended to support the specification of "concrete data" or "data objects", and respective relationships, such as commonly required in domains of ontologies and KGs.

Evaluation Based on a Pilot User Session
To preliminarily evaluate the proposed linguistic patterns and styles and receive feedback from people not directly involved in this work, we conducted a pilot user session. This session involved a group of 19 participants (see Figure 7 for a graphical demographic analysis) with at least a BSc degree, namely 6 with a BSc, 9 with an MSc, 2 with a Ph.D., and 2 at post-Ph.D. level. All participants had professional experience, namely 5 participants with less than 3 years, 7 participants with between 3 and 5 years, and 7 participants with more than 5 years of experience. Participants reported experience with the following IT roles: software developer (14), business analyst (6), quality assurance engineer (6), requirements engineer (4), and software tester (4).

User Session Setup
The user session was conducted under the following conditions: tests took place in the participant's environment (e.g., office or home environment); the tasks were carried out without previous use and learning of the proposed patterns; the main information was sent by email; users were free to think and share ideas if they wanted.
The user session and respective preparation processes involved the following tasks:
Task-1: (Preparation) We prepared a PDF document that described the proposed linguistic patterns; this document also included a shorter version of the "BillingSystem" example as discussed in the paper.
1. The first three questions were focused on the general characterization of the participant.
2. Five questions directly related to the assessment of the proposed patterns and their styles; we asked participants to rate, on a 5-point Likert scale (i.e., from 0 to 5: 0-Do not know, 1-Very Low, 2-Low, 3-Medium, 4-High, and 5-Very High), the proposed linguistic patterns and each linguistic style (i.e., CNL-A, CNL-B, RSL) with respect to different specific qualities (i.e., simplicity, expressiveness, readability, and completeness).
3. Two additional questions were more time-consuming: the participant was challenged to specify some entities informally referred to in the case study (and not included in the PDF sent), namely the "Invoice", "InvoiceLine" and "Customer" data entities; finally, the participants were invited to briefly explain their decisions and responses.
4. The final question asked whether the participant had previous contact with text-based UML notations (e.g., TextUML, Umple, yUML) used for describing domain entities and whether she saw advantages in the proposed linguistic styles and notations.
Task-3: (Preparation) We prepared the list of participants and invited them to participate in the user session; an email with clear instructions on how to complete the survey was sent to each participant, who was asked to fill in the evaluation questionnaire.
Task-4: (Execution) Each invited participant read the PDF document with the additional instructions and autonomously performed the proposed tasks; in the end, they answered the evaluation questionnaire.
Task-5: (Analysis) We collected the responses submitted by the participants and analyzed their results.

Questionnaire Analysis
As introduced above, the questionnaire involves several questions grouped into four dimensions of analysis.
First, after the questions related to the general characterization of each participant, the first question that required the participants to give a rating was Q3:
Q3: How do you rate the proposed linguistic patterns (Data Entities, Data Attribute, Data Constraints, Cluster of Data Entities)?
Responses to this question revealed good to very good grades, as summarized in Table 4. On the 5-point Likert scale (in which 1 is Very Low and 5 is Very High), the Data Constraints patterns were rated slightly below 4 (i.e., 3.74), while the others had average ratings at or above 4, namely 4.58 for DataEntity, 4.53 for DataEntityCluster, and 4.0 for DataAttribute patterns. We may also conclude that the patterns that are more difficult to understand and apply (i.e., the DataConstraint and DataAttribute patterns) were rated slightly lower than the others.
Concerning the simplicity of CNL-A, CNL-B, and RSL (see Table 5 and Figure 8), participants gave the highest score to the CNL-A notation (an average score of 4.47) and the lowest score to the RSL linguistic style (3.53), with CNL-B in between (exactly 4). When expressiveness is considered, RSL was the best rated, with a score of 4.21. CNL-A and CNL-B have similar average ratings with regard to readability (4.32), while RSL is rated as the best one where completeness is concerned (4.47).
According to the survey, ten participants (52.6%) responded that CNL-B was the most appropriate notation for describing the proposed patterns, followed by six participants (31.6%) who voted for CNL-A and three (15.8%) who voted for RSL. This result is a little surprising, considering that the average grades of all three notations are very close, but it is also understandable since RSL sentences are less simple and less easy to read than their equivalents in CNL-A or CNL-B.
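The vote shares reported above follow directly from the vote counts and the 19 participants. As a small consistency check, the following Python sketch recomputes them; the variable names are illustrative only.

```python
# Vote counts for the most appropriate notation, as reported in the survey.
votes = {"CNL-B": 10, "CNL-A": 6, "RSL": 3}

total = sum(votes.values())  # number of participants who voted

# Percentage share of each notation, rounded to one decimal place.
shares = {style: round(100 * count / total, 1) for style, count in votes.items()}

for style, share in shares.items():
    print(f"{style}: {votes[style]} of {total} votes ({share}%)")
```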
Third, the questionnaire included two open questions:
Q9: According to the description of "BillingSystem" (See details in Appendix A) please specify the remaining entities (including the respective attributes, constraints, etc.), namely for "Invoice", "InvoiceLine" and "Customer", using your preferred notation. In particular, you may choose to use CNL-A, CNL-B, RSL, or even other textual notation.
Q10: Please, justify your decision regarding the previous question.
Not all participants specified the remaining entities described in the case study, but most did (15 of them). A careful analysis of the submitted specifications showed that participants mostly used the CNL-B and RSL linguistic styles. All of these responses specified the three requested entities: Customer, Invoice, and InvoiceLine. The chosen style was generally followed in full, although a few participants did not complete the specification of constraints.
The analysis of the comments led to the conclusion that it would be helpful for the participants to have an appropriate tool editor that would better support the writing of the chosen style.
Fourth, the questionnaire included a final question:
Q11: Comparing to other text-based UML notations (TextUML, Umple, yUML, etc.) for describing the domain model do you give advantages to the proposed linguistic styles and notations (CNL-A, CNL-B, or RSL)?
Slightly more than half of the participants (11 of them) found the proposed linguistic styles and notations advantageous compared with other text-based UML notations. Furthermore, 7 participants answered that the differences between these two groups are not that significant, while 1 participant was not sure about it. Also, some participants mentioned that they like to use a graphical notation to specify the domain model and that it would be a plus to have a software tool to automatically produce such visual models from textual specifications, and vice versa.
To summarize, the results gathered from this initial evaluation were generally encouraging, with positive scores in all the analyzed dimensions. Although the number of participants in the session is relatively small, we consider it sufficient to draw meaningful conclusions because usability experts have noted that a group of 5 testers is enough to uncover over 80% of the usability problems [67]. Also, since our questionnaire focuses on the usability of the patterns and writing styles, we think that 19 participants is a reasonable number for an exploratory assessment, at least to identify challenges on the usability of these aspects.
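The "5 testers, over 80%" figure is usually derived from the problem-discovery model attributed to Nielsen and Landauer, in which n testers uncover a share 1 - (1 - L)^n of the usability problems. The sketch below assumes this model and the commonly quoted per-tester detection rate L = 0.31; both are assumptions for illustration and are not values measured in this study.

```python
def share_of_problems_found(testers: int, detection_rate: float = 0.31) -> float:
    """Expected share of usability problems found by `testers` evaluators.

    Assumes each evaluator independently finds a given problem with
    probability `detection_rate` (0.31 is an assumed, commonly quoted value).
    """
    return 1.0 - (1.0 - detection_rate) ** testers


for n in (1, 5, 19):
    print(f"{n:2d} testers: {share_of_problems_found(n):.1%}")
```

Under these assumptions, five testers already exceed the 80% mark, and additional testers yield diminishing returns.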

Conclusions
The specification of requirements describes technical concerns of a system and is used throughout several stages of its life cycle. A requirements specification document helps to share the system vision among the main stakeholders, as well as to facilitate the development and operation processes. The use of natural language for these specifications is widespread and effective because humans can easily read and write them. However, natural language also exhibits inherent characteristics that are often the root cause of many quality problems such as inconsistency, incompleteness, and ambiguousness.
Due to these problems, natural language specifications are often complemented by other kinds of requirements and models that use controlled, formal, or semi-formal modeling languages, as extensively discussed in the related work section. These languages provide a set of constructs (e.g., data entity, actor, use case, or requirement) that explicitly or implicitly define their abstract syntax and semantics and address different concerns and abstraction levels. Also, these languages provide different notations or concrete syntaxes, such as textual, graphical, tabular, or even form-based representations. Furthermore, some authors have proposed recommendations for writing better requirements; however, many of them are just general and abstract guidelines that are hard to follow in practice.
We claim that better and more specific guidance is needed to improve the way requirements engineers and product designers write and validate their specifications. We argue that improving the awareness and knowledge of linguistic patterns, like the ones discussed in this paper, may help to improve the current situation. Moreover, as shown throughout this paper with the CNL-A, CNL-B, and RSL languages, these linguistic patterns may be represented in practice by different writing styles, being compact (as with CNL-A), verbose and expressive (as with CNL-B), or even based on a rigorous requirements specification language (as with RSL). This discussion was supported by a representative yet simple and straightforward running example that illustrates both the proposed patterns and the associated writing styles. Additionally, Appendix A includes a visual representation of this example using the UML notation. This discussion was further complemented and validated in a pilot user session with 19 participants, IT professionals not directly involved in the research. The feedback received was generally encouraging, with positive scores in all the analyzed dimensions.
For future work, we plan to extend our research on linguistic patterns, namely for use cases, user stories, goals, and other types of requirement-related elements. Also, different writing styles shall be explored, including mixing textual with tabular and visual representations. We intend to explore and implement document automation features in this scope of RE, namely by automatically generating requirements specifications with different writing styles from intermediate formats like the one defined with RSL or similar. Moreover, the patterns discussed shall be extended to consider data- and knowledge-driven approaches, such as those in the areas of knowledge graphs or linked data as discussed in the related work section.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. The BillingSystem Running Example
This paper uses a running example to support the explanation and discussion of the proposed linguistic patterns and linguistic styles. This example refers to the requirements of a fictitious information system called "BillingSystem", which is a simple invoice management system. The following text partially describes a diversity of its informal requirements as adapted from the example initially introduced by Silva [18].

Appendix A.1. Informal and Annotated Informal Description
The following sentences informally describe the requirements of the BillingSystem: The BillingSystem is a system that allows its users to manage customers, products, and invoices. A user is someone that has a user account and is assigned to a user role, namely as operator, manager, administrator, and customer [ . . . ]. The administrator may register users, which involves the following information: first and last names, email address, username, password, and user role [ . . . ]. A VAT consists of the following information: VAT code, rate, name, and value. The VAT rates consider the following values: standard, reduced, and special.
[ . . . ] A product consists of the following information: name, price, VAT rate, VAT value, and size category; a size category consists of one of the following values: Small, Regular, Large, ExtraLarge. Product names are strings and prices are decimals. [ . . . ] A customer consists of the following information: name, fiscal id, image, bank information, and additional information such as address and personal contacts. A customer can also be defined as a customer VIP, and in this situation, there is a discount tax that can change over time. The operator shall create invoices with respective details defined as invoice lines. An invoice shall have the following information: customer id, dates (e.g., of creation, approval, and paid), status (e.g., created, approved, rejected, paid, deleted), total value with and without VAT. Also, an invoice line shall include product id, number of items, product value with and without VAT [ . . . ].
For the sake of readability, we may highlight special text fragments of these sentences as follows: candidate data entities are marked with bold text; data entity attributes are marked with light gray background color (e.g., product name, product values, VAT code, and value); data enumeration values are marked with light green background color (e.g., Small, Regular and Large):
The BillingSystem is a system that allows its users to manage customers, products, and invoices. A user is someone that has a user account and is assigned to a user role, namely as operator, manager, administrator, and customer [ . . . ]. The administrator may register users, which involves the following information: first and last names, email address, username, password, and user role [ . . . ].
A VAT consists of the following information: VAT code, rate, name, and value. The VAT rates consider the following values: standard, reduced, and special. [ . . . ] A product consists of the following information: name, price, VAT rate, VAT value, and size category; a size category consists of one of the following values: Small, Regular, Large, ExtraLarge. Product names are strings and prices are decimals. [ . . . ] A customer consists of the following information: name, fiscal id, image, bank information, and additional information such as address and personal contacts. A customer can also be defined as a customer VIP, and in this situation, there is a discount tax that can change over time.
The operator shall create invoices with respective details defined as invoice lines.
An invoice shall have the following information: customer id, dates (e.g., of creation, approval, and paid), status (e.g., created, approved, rejected, paid, deleted), total value with and without VAT. Also, an invoice line shall include product id, number of items, product value with and without VAT [ . . . ].
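As an informal illustration of how the annotated fragments map to data entities, attributes, data types, and enumerations, the following Python sketch models the VAT and Product entities. It is not one of the notations discussed in this paper (CNL-A, CNL-B, RSL, or UML); the class and field names, the reading of the VAT value as a percentage, the choice to reference a VAT object instead of separate VAT rate and value attributes, and the non-negative price check are all assumptions made for the example.

```python
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum


class VATRate(Enum):
    """Enumeration values taken from the informal description."""
    STANDARD = "standard"
    REDUCED = "reduced"
    SPECIAL = "special"


class SizeCategory(Enum):
    SMALL = "Small"
    REGULAR = "Regular"
    LARGE = "Large"
    EXTRA_LARGE = "ExtraLarge"


@dataclass(frozen=True)
class VAT:
    code: str
    rate: VATRate
    name: str
    value: Decimal  # assumption: a percentage, e.g. Decimal("23")


@dataclass(frozen=True)
class Product:
    name: str        # "Product names are strings"
    price: Decimal   # "prices are decimals"
    vat: VAT         # assumption: stands for the VAT rate and VAT value
    size_category: SizeCategory

    def __post_init__(self):
        # Hypothetical attribute constraint, not stated in the description.
        if self.price < 0:
            raise ValueError("price must not be negative")

    def price_with_vat(self) -> Decimal:
        """Price including VAT, under the percentage assumption above."""
        return self.price * (Decimal(1) + self.vat.value / Decimal(100))


standard_vat = VAT("VAT-STD", VATRate.STANDARD, "Standard VAT", Decimal("23"))
product = Product("Notebook", Decimal("10.00"), standard_vat, SizeCategory.REGULAR)
print(product.price_with_vat())
```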
The following text fragments describe the data entity's related elements specified with the CNL-A, CNL-B, RSL, and UML languages. For the sake of conciseness, these fragments present a complete specification for only the VAT, Product, Invoice, and InvoiceLine data entities (the remaining elements are suggested as an exercise to the interested reader).