Semantic Clustering of Functional Requirements Using Agglomerative Hierarchical Clustering

Abstract: Software applications have become a fundamental part of the daily work of modern society, as they meet different needs of users in different domains. Such needs are known as software requirements (SRs), which are separated into functional (software services) and non-functional (quality attributes). The first step of every software development project is SR elicitation. This step is a challenging task for developers, as they need to understand and analyze SRs manually. For example, the collected functional SRs need to be categorized into different clusters to break down the project into a set of sub-projects with related SRs and to assign each sub-project to a separate development team. However, functional SR clustering has never been considered in the literature. Therefore, in this paper, we propose an approach to automatically cluster functional requirements based on a semantic measure. An empirical evaluation is conducted using four open-access software projects to evaluate our proposal. The experimental results demonstrate that the proposed approach identifies semantic clusters according to well-known measures used in the field.


Introduction
Software has become an essential and fundamental part of modern society. Every software system is built to meet a set of needs and conditions. This set is known as Software Requirements (SRs) [1]. Generally, there are two types of SRs [2,3]. The first type concerns the functional behavior of the software (software services) and is called functional requirements (FRs). An example of an FR from MHC-PMS (Mental Health Care-Patient Management System) is [1]: "The MHC-PMS shall generate monthly management reports showing the cost of drugs prescribed by each clinic during that month". Non-functional requirements (NFRs), the second type, focus on system restrictions and everything that is not related to software functionality (quality attributes), such as reliability, performance and portability.
The very first step in developing a software system is SR elicitation [4]. In this step, developers work with the customer to find out what services the software should provide [5]. It is very important to carry out the requirements elicitation step carefully because the overall development process can be seen as a derivation of the collected SRs [6]. Therefore, SRs should be solid and valid so that later modifications are limited, which saves time and cost in the development process. Any change in the SRs after source code implementation has started is considered a very time-consuming and costly operation [7,8].
SR elicitation is a challenging task for developers and engineers [9][10][11][12]. SRs are written in natural language and describe different aspects of the target software [1,13,14]. Software developers mainly need to understand, analyze and validate SRs manually. For example, the collected SRs need to be categorized and grouped into different clusters in order to break down the target project into a set of sub-projects [9]. Eventually, each sub-project covers a set of related SRs and is developed by a separate, specialized development team. Moreover, these groups of SRs need to be analyzed to generate Software Requirement Specifications (SRSs), which can be used for validating and verifying the final software product. A software developer needs to extract all related SRs before preparing the final specifications (SRS) that describe them. As a result, grouping SRs into a set of clusters helps developers to better understand the target software project. In addition, it helps to design the initial architecture of the software product, as these clusters represent components or sub-systems that should be implemented and reused [1,15].
There is a lack of tools and methodologies to automatically classify and cluster the collected SRs. On the one hand, some works in the literature address non-functional SR classification [16][17][18]. On the other hand, functional SR clustering has never been considered in the literature. In this paper, a generic framework to automatically cluster functional SRs based on their semantics is proposed. The proposed framework can be used with any clustering algorithm and distance measure.
This paper presents a hierarchical clustering approach to semantically cluster functional SRs. The proposed approach takes as input the FR statements of a software product as a flat list, which is documented in an SRS document. This document is an official statement of what developers should implement [19]. The approach returns as output a set of semantic clusters based on a semantic similarity measure. First, the input is preprocessed to increase the accuracy of the clustering algorithm. Then, the Agglomerative Hierarchical Clustering (AHC) algorithm is applied to cluster the target functional SRs into a set of clusters. During the clustering process, a dendrogram report is generated to visualize the progressive clustering of the functional SRs. This can help software engineers to get an idea of a suitable number of clusters into which the functional SRs can be categorized. In other words, it can help software engineers to decide how to break down the target project into different sub-projects.
Our proposal makes the following contributions:
- A hierarchical clustering approach to clustering software FRs based on their semantics.
- A dynamic clustering framework, which can be extended to include different clustering algorithms and distance measures.
- An empirical study to evaluate the proposed automatic clustering using four open-access software projects.
The remainder of this paper is organized into six main sections. Section 2 introduces a motivational example of our proposal. Section 3 presents the most recent work on the subject. The proposed approach is detailed in Section 4. In Section 5, we describe the case studies and our research questions. In Section 6, we analyze and discuss the experimental results. Finally, conclusions and future work are stated in Section 7.

Motivational Example
To better understand the benefits of clustering FRs into a set of semantic clusters, we introduce the E-Store software as an example. This software product provides FRs and features that allow for online sales, distribution and marketing of electronics. In addition, it focuses on the company, stakeholders and applications. This software product provides 62 FRs. Figure 1 shows the FRs and their corresponding groups as documented manually in the SRS document of the E-Store software [20]. During the requirement engineering phase, the collection of unstructured requirements of the E-Store software (62 FRs) is manually grouped into coherent clusters and then written down in an SRS document (as shown in Figure 1b). For example, FRs number 6 and 16 ["The system shall display detailed information of the selected products", "The system shall provide browsing options to see product details"] are manually grouped together as a cluster with the title "product details". Performing such a task is time-consuming and error-prone, especially in a large software project, as the FRs are organized into different sub-systems that make up the software.

Literature Review
This section presents the research work conducted thus far that is closely related to the work presented in this paper. Before going further, it is important to distinguish between two main concepts: clustering and classification of software requirements. Firstly, clustering is the process of grouping or organizing requirements into an undefined number of groups (clusters). To the best of our knowledge, no research work has studied FR clustering. Secondly, classification is the process of assigning requirements to a predefined number of classes. Requirements are broadly classified into two categories: functional and non-functional requirements [21,22]. In the following, we present research work on both categories.

Functional Requirements' Classification
The classification of FRs is seldom considered in the requirement engineering literature. Sommerville, in [1], classifies FRs into two levels: user requirements and system requirements. In [23], Ghazarian et al. manually classified FRs into only five classes. These classes include input data, output data, data persistence, application rules and actions. In [24], Ghazarian also used the FR classes in [23] as a starting point to manually assign a type to each system FR. Then, he expanded and modified the classes as necessary. The resulting classification scheme consists of 12 classes. This classification was limited to web-based enterprise systems.

Non-Functional Requirements' Classification
In contrast to FR classification, most existing approaches focus on NFR classification. Moreover, NFR classifications are very detailed. For instance, NFRs in Sommerville's textbook are classified into three main types [1]: external, product and organizational requirements. Then, product requirements are further classified into four categories: efficiency, usability, dependability and security requirements. In addition, the efficiency requirements are divided into space and performance requirements. The rest of the main NFR classes are likewise refined into lower levels. Consequently, the resulting classification scheme is a hierarchy consisting of three levels.
A large number of approaches have been proposed to automate the process of identifying and classifying NFRs. We present here the most recent work on the subject. In [25], Rashwan et al. proposed an ontology-based technique to detect and classify requirements sentences into different types of NFRs (maintainability, reliability, portability, security and usability/utility) using a Support Vector Machine (SVM) classifier. In [26]

Other Related Work
Indeed, the distinction between functional and non-functional requirements is not clear-cut: a non-functional requirement, such as security, may generate multiple functional requirements related to security services [1]. Therefore, a number of attempts have been proposed to automate the identification of this distinction [18,27,28,29].

The Proposed Approach
In this section, we detail our proposal to identify semantic clusters from a given set of FRs. We first give an overview of the steps of our proposal. Then, we explain each step in detail. Figure 2 presents a dynamic framework for our identification process. This framework consists of four main steps. The first step takes the SRS document as input and parses it to extract a list of FR statements. In the second step, each FR statement undergoes a set of preprocessing tasks. Then, the semantic similarity among FRs is computed in the third step. Finally, these FRs are organized into semantic clusters by a clustering algorithm. In this framework, steps three and four can be seen as switching points to bind different semantic measures and clustering algorithms, respectively.

Parsing Functional Requirements
As FRs describe in detail what the system should do and what developers should implement, the statements of these requirements are documented in a well-structured format in an SRS document. These statements are listed in a section under the title "Functional" in the SRS document [19]. Therefore, this step parses the SRS document to locate the "Functional" section and then reads each statement in this section as a functional requirement.

Pre-Processing Functional Requirements
In order to find the best matching among FRs, which is an important step for identifying semantic clusters, the text of each statement is normalized. Such normalization is achieved by performing three tasks: tokenization, frequent token removal (including stop words) and stemming. These tasks are executed in the following order.

Tokenization
In this task, each FR is divided into individual statements using comma and dot delimiters. Then, each statement is divided into tokens based on white space. In addition, each compound word is divided into simple tokens (e.g., interDepartment is divided into inter and department) using Camel case notation.
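The tokenization task described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `tokenize`, the lowercasing of tokens and the regular expression used for camel-case splitting are assumptions made for the example.

```python
import re

def tokenize(fr_text):
    """Split one FR into lowercase tokens (hypothetical helper)."""
    tokens = []
    # Step 1: divide the FR into statements using comma and dot delimiters.
    for statement in re.split(r"[,.]", fr_text):
        # Step 2: divide each statement into words based on white space.
        for word in statement.split():
            # Step 3: split compound words written in camel case, e.g.
            # interDepartment -> inter, Department. Runs of capitals
            # (acronyms) and digits are kept as single tokens.
            parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", word)
            # Lowercasing is an assumption; the paper does not mention it.
            tokens.extend(part.lower() for part in parts)
    return tokens


print(tokenize("The system shall approve interDepartment requests, then notify users."))
```

Because the regular expression only collects runs of letters and digits, punctuation such as hyphens and quotes also acts as a token boundary in this sketch.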

Frequent Tokens Removal
Rare tokens are more informative than frequent tokens in free text because frequent tokens carry little meaning and have no discriminative power [30,31]. In this task, we remove two types of frequent tokens. The first type represents stop-words such as "of", "the" and "to". The second type represents a set of tokens with high frequency across all FR statements. This set depends on the case study of interest. For example, the tokens "system" and "user" exist across all FR statements of the E-Store system case study. Thus, these tokens have a high frequency and should be removed.
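A minimal sketch of this task is given below. The stop-word list is a small illustrative sample, and the cut-off `max_doc_ratio` (the fraction of FR statements a token must appear in to count as frequent) is a hypothetical parameter: the paper does not state how "high frequency" is decided.

```python
from collections import Counter

# Illustrative sample only; a real run would use a full stop-word list.
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}

def remove_frequent_tokens(tokenized_frs, max_doc_ratio=0.8):
    """Drop stop-words and tokens occurring in most FR statements.

    max_doc_ratio is an assumed threshold, not a value from the paper.
    """
    n = len(tokenized_frs)
    # Count in how many FR statements each token occurs.
    doc_freq = Counter(tok for tokens in tokenized_frs for tok in set(tokens))
    frequent = {tok for tok, df in doc_freq.items() if df / n >= max_doc_ratio}
    return [
        [tok for tok in tokens if tok not in STOP_WORDS and tok not in frequent]
        for tokens in tokenized_frs
    ]


frs = [
    ["the", "system", "shall", "display", "product", "details"],
    ["the", "system", "shall", "allow", "user", "to", "update", "profile"],
    ["system", "shall", "authenticate", "user"],
]
print(remove_frequent_tokens(frs))
```

In this toy input, "system" and "shall" occur in every statement and are removed as case-study-specific frequent tokens, while "the" and "to" are removed as stop-words.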

Stemming
Reducing tokens to their stems is known in information retrieval as stemming [30]. For example, "compressing" and "compresses" are reduced to "compress". We rely on Porter's algorithm [32], the most commonly used English stemmer, to perform this task.

Computing Semantic Similarity
As our proposal aims at constructing semantic clusters of FRs by grouping FRs that are semantically close, a semantic similarity measure is needed. Our proposal relies on the following heuristic to determine such a similarity measure:
- Heuristic [semantic similarity]: it indicates text matching between tokens derived from FR statements. Such tokens record important domain knowledge, which represents one or more functionalities. Therefore, when two or more FR statements share many tokens, it is expected that those FRs are related to the same domain task, especially when the analysts use the same vocabulary across those FR statements.
The semantic similarity between two FRs is computed based on the vector space model (VSM) [33], a well-known technique in information retrieval (IR). In the VSM space, each FR is represented as a token vector. The semantic similarity (semanticSim) between two FRs is defined in Equation (1) using cosine similarity, as it is a well-known measure in our subject [34,35]. For two given FRs, this metric is used to determine how much relevant semantic information is shared among their corresponding token vectors:
$$\mathrm{semanticSim}(FR_a, FR_b) = \frac{\vec{v}_a \cdot \vec{v}_b}{\lVert \vec{v}_a \rVert \, \lVert \vec{v}_b \rVert} \tag{1}$$
where $\vec{v}_a$ and $\vec{v}_b$ denote the token vectors of $FR_a$ and $FR_b$, respectively.
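As a small sketch of the cosine-based semanticSim measure, the function below compares two preprocessed FRs given as token lists. It assumes raw term counts as vector weights, which is an assumption of this example: the paper does not state the weighting scheme (for instance, term frequency or TF-IDF).

```python
import math
from collections import Counter

def semantic_sim(tokens_a, tokens_b):
    """Cosine similarity between two FRs given as token lists."""
    # Assumption: each FR vector holds raw term counts.
    vec_a, vec_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(count * vec_b[tok] for tok, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty FR shares nothing with any other FR
    return dot / (norm_a * norm_b)


fr6 = ["display", "product", "detail"]
fr16 = ["browse", "product", "detail"]
print(semantic_sim(fr6, fr16))  # two of three tokens shared
```

Here the two token lists share two of their three tokens, so the similarity is 2/3; FRs with no token in common get a similarity of 0.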

Clustering Similar FRs into Clusters
In this final step, we employ an algorithm to group together FRs that are semantically similar. Among the different possible algorithms, we choose clustering algorithms, as their function serves the goal of this study.
Our clustering algorithm is based on Agglomerative Hierarchical Clustering (AHC) [36]. However, this step is not limited to AHC; any algorithm supporting cluster analysis can be used. Generally, AHC starts with singleton clusters such that each cluster is a single object. Then, the two most similar clusters are merged in each pass. When several clusters of FRs have the same similarity to one cluster, the algorithm considers the first cluster that it encounters. Such binary merging continues in each pass until a single large cluster is obtained. This cluster represents a dendrogram tree, which includes all candidate clusters of these objects. In our work, the singleton clusters initially consist of individual FRs, and later clusters of FRs are formed recursively during each pass. Below, we illustrate how a dendrogram tree of FRs is built and parsed to extract semantic clusters of a given set of FRs.

Building a Dendrogram Tree of FRs
A dendrogram tree is a representation that illustrates the arrangement of clusters generated by AHC [36]. In this work, we adapt AHC to build a dendrogram tree for a given set of FRs based on Algorithm 1. This algorithm starts by creating a cluster for each individual FR. Then, the two most similar clusters are merged in each iteration. In our work, this similarity refers to semantic similarity, which is computed between the term vectors of the two compared clusters using the semanticSim equation. Then, the two most similar clusters are replaced with a new cluster, which is the result of merging those clusters. This process continues until a single cluster is obtained. This cluster is a dendrogram that represents a set of nested clusters. Figure 3 displays an example of a dendrogram tree. At the bottom level, a cluster is created for each FR. At the top level, all FRs belong to the single large cluster. The internal nodes (internal clusters) refer to the new clusters, which are formed by merging the clusters that appear as children of those new clusters.
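The construction described above can be sketched as follows. This is an illustration of the described procedure, not a transcription of Algorithm 1. The `Node` class, the use of raw term counts, the choice to represent a merged cluster by the sum of its members' term vectors and the similarity of 0.0 stored on leaves are all assumptions of this example.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    members: list                   # indices of the FRs in this cluster
    vector: Counter                 # term vector of the cluster (assumed: summed counts)
    similarity: float = 0.0         # similarity of the two merged children (0.0 for leaves)
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def cosine(a, b):
    dot = sum(count * b[tok] for tok, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def build_dendrogram(tokenized_frs):
    if not tokenized_frs:
        raise ValueError("at least one FR is required")
    # Start with one singleton cluster per FR.
    clusters = [Node([i], Counter(toks)) for i, toks in enumerate(tokenized_frs)]
    while len(clusters) > 1:
        best_sim, best_i, best_j = -1.0, 0, 1
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i].vector, clusters[j].vector)
                if sim > best_sim:  # strict '>' keeps the first pair encountered on ties
                    best_sim, best_i, best_j = sim, i, j
        a, b = clusters[best_i], clusters[best_j]
        merged = Node(a.members + b.members, a.vector + b.vector, best_sim, a, b)
        # Replace the two most similar clusters with their merge.
        del clusters[best_j]
        clusters[best_i] = merged
    return clusters[0]


frs = [["product", "detail", "display"], ["product", "detail", "browse"],
       ["profile", "create"], ["profile", "update"]]
root = build_dendrogram(frs)
print(root.left.members, root.right.members)
```

This naive version recomputes all pairwise similarities in every pass, which is adequate for requirement sets of the size discussed here (tens of FRs) but would need a cached similarity matrix for much larger inputs.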

Identifying Semantic Clusters
As a dendrogram tree is a hierarchy of nested clusters, cutting this hierarchy at a specific point based on predefined criteria yields a set of clusters (see Figure 3). Each one is a candidate semantic cluster. Therefore, in this step, we propose Algorithm 2, which is based on a depth-first search, to determine the appropriate cutting point. The input of this algorithm is a dendrogram tree and the output is a set of semantic clusters. Starting from the root, the algorithm compares the semantic similarity value of each node (parent node) with that of its child nodes (its left and right nodes). If the similarity value of the parent node is less than the average similarity value of its children, the algorithm moves on to these immediate children. Otherwise, the parent node is identified as a semantic cluster and added to the accumulator (semCluster), and the algorithm goes to the next node in the stack (pile). In this way, the most relevant semantic clusters are identified as the traversal proceeds. Figure 3 is a simple example that illustrates the execution of Algorithm 2. In this figure, the horizontal line crosses the hierarchy at two points (called cutting points). Therefore, two semantic clusters are identified. The first cluster has seven FRs {FR2, FR10, FR5, FR8, FR9, FR1, FR4} while the second cluster consists of three FRs {FR3, FR6, FR7}.
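The traversal described above can be sketched as follows. This is an illustration under stated assumptions rather than a transcription of Algorithm 2: the `Node` class is hypothetical, a leaf (a single FR) is assumed to carry a similarity of 0.0, and a leaf reached by the traversal is assumed to be returned as a singleton cluster, since the paper does not say how leaves are handled.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    members: list                   # FRs contained in this cluster
    similarity: float = 0.0         # similarity of the merged children (assumed 0.0 for leaves)
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def identify_semantic_clusters(root):
    sem_clusters = []               # accumulator of identified clusters
    stack = [root]                  # explicit stack for the depth-first search
    while stack:
        node = stack.pop()
        if node.left is None or node.right is None:
            # Assumption: a single FR that is reached becomes its own cluster.
            sem_clusters.append(node.members)
            continue
        children_avg = (node.left.similarity + node.right.similarity) / 2
        if node.similarity < children_avg:
            # The children are more cohesive than the parent: descend.
            stack.append(node.right)
            stack.append(node.left)  # pushed last so the left child is visited first
        else:
            sem_clusters.append(node.members)
    return sem_clusters


leaves = [Node([f"FR{i}"]) for i in range(1, 5)]
n12 = Node(["FR1", "FR2"], 0.67, leaves[0], leaves[1])
n34 = Node(["FR3", "FR4"], 0.50, leaves[2], leaves[3])
root = Node(["FR1", "FR2", "FR3", "FR4"], 0.0, n12, n34)
print(identify_semantic_clusters(root))
```

In this toy tree, the root (similarity 0.0) is less cohesive than the average of its children (0.585), so the traversal descends and both children are returned as semantic clusters.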

Experimental Evaluation Settings
The goal of this section is to describe case studies and the research questions being investigated in this work.

Case Studies
The effectiveness of our proposed approach is evaluated using SRS documents of four open-access software products from different domains with different sizes. These products are: E-Store system [20], WASP system [37], UUIS system [20] and MHC-PM system [38].
The E-Store allows for online sales, distribution and marketing of electronics. The WASP system is web architectures for services platforms. The UUIS system is a unified university inventory system to access and manage the integrated inventory. The MHC-PM system is a mental health care patient management system. Table 1 shows statistical information of each software product of interest in terms of number of clusters, size of these clusters and number of FRs.

Investigation Research Questions and Evaluation Procedure
As this work aims at organizing a given set of FRs into clusters that are semantically correct, we investigate two main research questions, which collectively meet the objective of this study: RQ1 concerns the semantic correctness of the identified clusters, and RQ2 concerns the gap between the number of identified clusters and the number of actual clusters.
To address the first research question (RQ1), we need a measurement to assess the semantic clustering of the identified clusters. For this, we rely on two well-known measures in the Information Retrieval (IR) field: Precision and Recall [30]. Both Precision and Recall take values in a range between 0 and 1. The ideal value for Precision and Recall is 1. For each identified cluster, we evaluate the cluster correctness by executing the following steps:
- We match the identified cluster (i.e., its FRs) with all actual clusters of the software product of interest. The actual clusters are manually clustered and documented by analysts in the SRS document of each case study. Let X be the identified cluster. The actual cluster that has the largest number of matches with X in terms of FRs is called the reference cluster of X.
- We compute Precision and Recall values based on the reference cluster according to the following equations:
$$\mathrm{Precision} = \frac{|\mathrm{REF\_CLUSTER} \cap \mathrm{IDE\_CLUSTER}|}{|\mathrm{IDE\_CLUSTER}|}$$
$$\mathrm{Recall} = \frac{|\mathrm{REF\_CLUSTER} \cap \mathrm{IDE\_CLUSTER}|}{|\mathrm{REF\_CLUSTER}|}$$
where REF_CLUSTER and IDE_CLUSTER refer to the FRs of the reference and identified clusters, respectively. If the Precision value is equal to 1, the FRs of the identified cluster are a subset of the reference cluster's FRs. If the Recall value is equal to 1, the FRs of the reference cluster are a subset of the identified cluster's FRs. Therefore, Precision and Recall should be used together as complementary parts to evaluate the correctness of the identified cluster.
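The two evaluation steps above can be sketched as follows. The function names are hypothetical, clusters are assumed to be non-empty collections of FR identifiers, and ties between actual clusters with the same number of matches are resolved in favor of the first one, which is an assumption of this example.

```python
def reference_cluster(identified, actual_clusters):
    """Return the actual cluster sharing the most FRs with the identified one."""
    ide = set(identified)
    # max() keeps the first cluster on ties (assumed tie-breaking rule).
    return max(actual_clusters, key=lambda cluster: len(ide & set(cluster)))

def precision_recall(identified, actual_clusters):
    """Precision and Recall of one identified cluster against its reference cluster."""
    ide = set(identified)
    ref = set(reference_cluster(identified, actual_clusters))
    shared = len(ide & ref)
    return shared / len(ide), shared / len(ref)


actual = [["FR1", "FR2", "FR3"], ["FR4", "FR5"]]
precision, recall = precision_recall(["FR1", "FR2", "FR4"], actual)
print(precision, recall)
```

In this toy case, the reference cluster is {FR1, FR2, FR3}: two of the three identified FRs belong to it, and two of its three FRs were found, so both Precision and Recall are 2/3.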
To answer the second research question (RQ2), we need to compare the number of identified clusters with the actual number of clusters of the case study of interest. Such a comparison represents a clustering gap metric. For simplicity, we call this metric C_Gap. The value of this metric is computed by applying the following equation: where #I_CLUSTER and #A_CLUSTER represent the number of identified and actual clusters, respectively. The ideal value of C_Gap is 0, which indicates that the number of identified clusters is equal to the number of actual clusters of the case study of interest.
In summary, the results are mainly evaluated by Precision, Recall and C_Gap. These metrics work together as an integrated group. Precision and Recall are used to evaluate each identified cluster individually, while C_Gap is used to evaluate the number of identified clusters.

Results Analysis and Discussion
In this section, our research questions are answered and the experimental results of our proposed approach are analyzed.

Semantic Clustering (RQ1)
In order to answer the first research question (RQ1), we analyze and evaluate the semantic clustering of the identified clusters. This is performed using the Precision and Recall metrics presented in Section 5.2. Table 2 shows the average Precision and Recall values of the identified clusters from each software product. These values indicate that the FRs of these identified clusters are semantically grouped together to form semantic clusters. Indeed, Precision values fall in a high range (0.72-0.83) and Recall values fall in a reasonable range (0.54-0.61) across the different software products. In Figure 4, we graphically display the average Precision and Recall values of the identified clusters from each software product against the number of FRs. As shown in this figure, the average values for Precision are relatively close to each other, and so are the average values for Recall. This indicates that our proposed approach works effectively regardless of the number of FRs. Table 3 shows statistics for the Precision and Recall values of the identified clusters from each software product. It is noticeable that the standard deviation (StdDev) of these values is low. In addition, the maximum values for Precision and Recall reach the ideal value (1.0). This is evidence that our approach identifies semantic clusters with consistently high Precision and reasonable Recall values across software products of different sizes. In summary, the answer to the first research question (RQ1) is that our approach identifies clusters that can be regarded with confidence as semantic clusters, regardless of the number of FRs provided by a software product. This answer is based on the average Precision and Recall values and their statistics shown in Table 3.

Clustering Gap (RQ2)
Before going further, it is important to provide statistics about the clusters identified from the software products of interest. Table 4 shows, for each software product, statistical information in terms of the number of identified clusters and the sizes of these clusters (minimum, average and maximum sizes). By comparing it with Table 1 (statistics about actual clusters), we note that the number of identified clusters is close to the actual number of clusters of each software product. In addition, the size of the identified clusters is similar to that of their corresponding actual clusters. Consequently, these statistics are an indicator of the effectiveness of our proposed approach. Figure 5 visualizes the relationship between the number of identified clusters and the number of FRs across all software products of interest. In addition, this figure visualizes the relationship between the number of identified clusters and the number of actual clusters. Based on this figure, it is clear that the number of identified clusters is very close to the number of actual clusters across the different software products, regardless of the number of FRs provided by each product. This is confirmed by Table 5, which describes the minor gaps between the number of actual and identified clusters for each software product.
In summary, the answer to the second research question (RQ2) is that the number of identified clusters for a given software product is the same as or very close to the number of its actual clusters. This answer is based on Tables 4 and 5.
We present different reference and identified clusters for all studied case studies (WASP System, E-Store System, UUIS System and MHC-PM System) in Tables 6-9, respectively. The shaded FRs in each identified cluster are irrelevant FRs in that cluster.

Table 6. An example of a semantic cluster identified from the WASP system.
Semantic Cluster Members | Reference Cluster Members
The WASP platform MUST allow end-users to set an alert on an event | The WASP platform MUST allow end-users to set an alert on an event
The WASP platform SHOULD allow the end-user to specify the notification type when setting an alert | The WASP platform SHOULD allow the end-user to specify the notification type when setting an alert
The WASP platform MUST maintain a list of events the end-user can be notified about | The WASP platform MUST maintain a list of events the end-user can be notified about
The WASP platform SHOULD be able to decide how to notify the user of an alert for which an event was set | The WASP platform SHOULD be able to decide how to notify the user of an alert for which an event was set
The WASP platform MUST actively monitor all events | The WASP platform MUST actively monitor all events
The WASP platform MUST allow the end-user to remove previously set alerts on events | The WASP platform MUST allow the end-user to remove previously set alerts on events
If the user cannot be notified of the event the first time, the WASP platform SHOULD retry to notify the user of the occurrence of the event, until the user has been notified or a specified time-out elapses | If the user cannot be notified of the event the first time, the WASP platform SHOULD retry to notify the user of the occurrence of the event, until the user has been notified or a specified time-out elapses
The WASP platform MUST notify the end-user about the occurrence of an event for which an alert was set, as soon as the event occurs | The WASP platform MUST notify the end-user about the occurrence of an event for which an alert was set, as soon as the event occurs
The WASP platform SHALL allow end-users to maintain a buddy list

Table 7. An example of a semantic cluster identified from an E-Store System.
Semantic Cluster Members | Reference Cluster Members
The system shall allow user to create profile and set his credential | The system shall allow user to create profile and set his credential
The system shall authenticate user credentials to view the profile | The system shall authenticate user credentials to view the profile
The system shall allow user to update the profile information

Table 8. An example of a semantic cluster identified from a UUIS system.
Semantic Cluster Members | Reference Cluster Members
Any DA group member or authorized inventory group member asset is owned by the department
Inter departments: request must be approved by a DA group member and faculty group member unless it came from a higher level group
A bulk entry can be used to add many assets

Table 9. An example of a semantic cluster identified from MHC-PM system.
Semantic Cluster Members | Reference Cluster Members
System will generate a daily list of patients who missed their appointments and email/SMS to the clinician responsible for the patient's care

Threats to Validity
In this section, we discuss the limitations of our proposal in terms of internal and external threats. These threats are as follows:
- As the identification process mainly relies on textual matching among FRs to identify semantic clusters, our approach is sensitive to the vocabulary used in FR statements. Thus, it is not surprising that the successes and failures of our proposal depend on the vocabulary used. The obtained results cannot be generalized to all software products: the use of a consistent vocabulary yields the best results, while the use of different vocabularies represents an internal threat. However, this threat is shared by all research work that depends on textual matching between SR artifacts.
- An important external threat to validity is the use of a limited number of case studies to evaluate the effectiveness of our proposal. Despite the fact that the case studies used are sufficient to validate our proposal, a larger number of case studies is needed for a more thorough test.

Conclusions
In this paper, we present an approach to group the functional requirements of a given software product into semantic clusters. The identification process employs textual similarity between functional requirement statements. Such similarity reflects the domain functions and tasks embedded in the vocabulary of the functional requirements. The proposed approach relies on an agglomerative hierarchical clustering algorithm. However, our proposal is a dynamic clustering framework, which can be extended to include different clustering algorithms and distance measures. The empirical study conducted using four open-access software products shows that our proposal achieves high performance according to well-known measures in the field. Moreover, the experimental results show that the proposed approach identifies semantic clusters that reflect the domain functionalities embedded in the given functional requirements.

Future Work
We plan to conduct a comparative study in order to investigate the results of different clustering algorithms for clustering functional requirements. In addition, we plan to document each cluster by extracting keywords from the FR statements of that cluster. Such words describe the domain knowledge embedded in that cluster.