1. Introduction
Requirement elicitation is the process of searching and acquiring stakeholders’ interests in a software development [
1,
2,
3]. This aims to help produce new requirements from relevant stakeholders [
4], which can be taken from various sources and are available in a range of formats. Sources of requirements can be from any relevant parties, experts, current systems, and competitor systems [
1]. Stakeholders may comprise system end-users, top management, or external interested parties, such as regulators [
5]. Techniques that are commonly used in requirement elicitation are open/closed interviews, scenarios, use cases, observations, questionnaires, brainstorming, prototyping, focus group discussions, and joint application development (JAD) workshops [
6]. It is widely recognized among researchers and industry practitioners that software projects are very prone to failure when the requirements elicitation process is not carried out properly [
7,
8]. In the software development life cycle (SDLC), this undertaking is challenging because it involves socio-technical characteristics and changing requirements [
9,
10], and because there may be problems of ambiguity and tacit knowledge transfer [
11,
12]. Such problems, if found too late, can lead to high rework costs because they necessitate modifications in other processes [
4,
13].
Different software products in the same area often have similar requirements. The reuse of these requirements can reduce costs in the elicitation process [
14]. Requirements reuse could be from formal [
15] or informal sources [
16], product description sources [
17], user reviews [
18,
19,
20], expert reviews [
18], social media [
21,
22], and online news [
5,
23]. Aside from developing the mobile application ecosystem, user reviews on Google Play are becoming increasingly popular to gather requirement reuse [
18,
24], requests for improvements [
25], and feedback on specific features [
26]. App users continually post a large number of reviews per day, addressed to both developers and user communities. When addressed to developers, reviews could be either appreciations, complaints, bug reports, or new feature requests [
27]. User reviews are written in natural language and so natural language preprocessing (NLP) is used to extract software features from user reviews [
28].
Although extensive research has been carried out to analyze user reviews, only limited studies use sentiment analysis in the software feature extraction. Sentiment analysis can understand the interests and emotional responses from product users [
29,
30]. Zang et al. [
31] used sentiment analysis in analyzing reviews of mobile users, whereas Panichella et al. [
32] classified application reviews relevant to software maintenance and evolution. Chen et al. [
33] mined mobile app reviews to extract the most informative information from users. Guzman and Maalej [
26] assisted developers in assessing user reviews about application features in order to identify new requirements or plan for future releases.
However, none of the research above looked into the use of sentiment analysis in examining the extracted features’ relevance and its influence on the decision-making of new requirements. System analysts’ domain knowledge vary in levels and this may jeopardize the requirements’ completeness. This information is vital for system analysts in the requirements elicitation process as they can use the results to verify the requirements and improve the completeness.
Using user reviews data, this study aims to use sentiment analysis to obtain a comprehensive point of view of requirement reuse. The research questions are as follows RQ1: What is the correlation between feature extraction and sentiment analysis in terms of acquiring relevant requirements? RQ2: What is the correlation between feature extraction and sentiment analysis in informing system analysts in the decision-making of new requirements? RQ3: What is the correlation between feature extraction and sentiment analysis pertaining to verbatim reuse of extracted features as new requirements?
2. Materials and Methods
This study aims to provide a comprehensive requirement report by using sentiment analysis based on user reviews on Google Play. We use the initial requirements made by the system analyst as the system input. The requirements are then optimized by using the outcomes of feature extraction from user reviews available on Google Play.
2.1. Research Framework
The research procedure is shown in
Figure 1. The pre-processing steps for handling the initial requirements are carried out by tokenizing, case folding, and removing stop words. The pre-processing stage for preparing user review data includes data cleaning, spelling normalization, sentence tokenization, and part-of-speech (POS) tagging. Software feature extraction with POS chunking is carried out based on the POS tag pattern. Subsequent to this, the data were treated with case folding, lemmatization, and stop words elimination. Sentiment analysis is conducted on the data by calculating the polarization and subjectivity. The similarities between features from the initial requirements list and user reviews are then calculated to produce the requirements report. This report provides a general outline of the new relevant features and their sentiment analysis. The system analysts could then use this report as a basis for the decision-making.
2.2. Data
The data required in this study consisted of the initial requirements created by the system analysts and the extracted features from user reviews on the Google Play. We used the earthquake and the e-marketplace apps as case studies. Data were collected from the requirements document and user reviews.
2.2.1. Software Requirements Documents
A list of requirements was used as an initial input to the system. Ten system analysts composed the requirements document using specific cases, which resulted in a list of functional requirements. The information used to make this document was obtained from several sources.
Table 1 shows an example of a list of requirements in the requirements document.
2.2.2. User Reviews on Google Play
User reviews were sourced from similar applications relevant to the case study. Keywords for the query were ‘earthquake’ and ‘e-marketplace’, entered into the search feature on the Google Play. From the list of applications that were most relevant to these keywords, the top 10–15 applications for each case study were selected. This search was conducted on 8 July 2020 for the earthquake app, and 18 November 2020 for the e-marketplace app. The list of applications that were used in this study is shown in
Table 2.
User reviews from 15 applications on Google Play were collected using the Python library google-play-scraper 0.1.1 (
https://pypi.org/project/google-play-scraper/, (accessed on 1 March 2021)). Data extracted from user reviews in each application was stored in plain text format. From the 15 earthquake-related applications extracted, and 17,133 user reviews were gathered.
2.3. Pre-Processing
The data collected were raw data that need processing. Since the requirements document and user reviews had different data characteristics, it was necessary to carry out the data pre-processing separately.
2.3.1. The Requirement Document
The pre-processing for the requirement document involved sentence tokenizing, case folding, and stop word eliminating. A sentence tokenizer was used to separate functional requirements written in the requirement document. Case folding was used to change all letters in the requirement list data into lowercase. Stop words are the most common words in languages, such as ‘the’, ‘a’, ‘on’, ‘is’, and ‘all’. These words do not have any significance and are removed from the text. We used the stop word list provided by the Natural Language Toolkit (NLTK) library [
34]. Examples of the pre-processing for the requirement document are shown in
Table 3.
2.3.2. User Reviews
The pre-processing of user reviews data was performed through data cleaning, spelling normalization, tokenization, POS tagging, and stop word removal. Data cleaning was performed to remove unnecessary characters to reduce noise, such as hashtags (#), URLs, numbers, certain punctuation marks, and other symbols. Data cleaning was performed because many user reviews contain irrelevant characters, such as emojis or excessive dots. This was done by using the string encoding function with the parameters ‘ignore’ and ‘ascii’ so that characters outside of ASCII were eliminated from the text.
Table 4 shows an example of the data cleaning implementation.
Spelling normalization is the process of correcting words that are nonetheless incorrectly written or abbreviated. For example, the words ‘notification’ and ‘notif’ have the same meaning. However, because ‘notif’ is an abbreviation that is not recorded in dictionaries, the word ‘notif’ is considered non-valuable. Therefore, the word ‘notif’ needs to be rewritten as ‘notification.’ Abbreviations such as ‘aren’t’ also need to be normalized into ‘are not.’ Spelling normalization is done using the spell function in the NLTK library. The example of spelling normalization is shown in
Table 5.
In this study, the sentiment value was calculated in each sentence. The sentence tokenization divided user reviews into several sentences using the sent_tokenize function from the NLTK library, which splits texts into sentences. Meanwhile, POS tagging functions in NLP automatically labeled words into their parts of speech. For example, in the sentence “I build applications”, there is a label PP = pronoun, VV = verb, and NN = noun. The system receives the input in the form of a sentence, and the output is be “I/PP build/VV applications/NN.” POS tagging was performed using the pos_tag function in the NLTK library.
Table 6 shows an example of POS tagging process in the sentences.
2.4. Sentiment Analysis
Sentiment analysis gives the quantitative values (positive, negative, and neutral) to a text representing the authors’ emotions. User reviews are calculated for polarity and subjectivity using the sentiment feature. We employed the TextBlob library in the analysis sentiment to determine the value. The polarity value lies in the range of [−1, 1], where 1 means a positive sentiment and −1 means a negative sentiment. Meanwhile, the value of subjectivity lies in the range of [0, 1], where 1 means an absolute subjective statement or in the form of opinion, and 0 means an absolute objective statement or in the form of facts [
35]. An example of value assignment can be seen in
Table 7.
2.5. Software Feature Extraction
Software feature extraction was performed to obtain the requirements/features from user reviews. This was done by performing POS chunking. The results were then processed again by case folding, lemmatization, and stop word elimination.
2.5.1. POS Chunking
POS chunking is a phrase in a sentence extracted using the POS tag pattern with a regular expression. The POS tag pattern containing nouns, verbs, and adjectives is extracted into a phrase to bring out the features. For example, the terms ‘voice alerts’ and ‘satellite maps’ represent a software feature. In this study, the POS tag pattern used to extract the software features (SFs) are shown in
Figure 2. After SFs were extracted, we also needed to define VERB (a group of verbs), NC (noun phrases), and NP (a group of NC). Explanation of the POS tag set, and regular expression code used in this study is shown in
Table 8. Penn Treebank POS [
36] was used in this study.
An example of a parse tree from POS chunking results can be seen in
Figure 3. We considered all phases with SF tags. In the example, there are two phrases with SF tags, namely ‘have the ability’ and ‘set multiple alarms’. The examples of software feature extraction are shown in
Table 9.
2.5.2. Case Folding, Lemmatization, and Stop Words
The feature extraction results from the POS chunking process were then processed by case folding, lemmatization, and stop word elimination. Case folding aims to change all letters in software feature data to lowercase, by calling the lower() function. Lemmatization aims to group words with different forms into one item. In this process, word affixes are removed to convert words into the basic word forms. For example, the words ‘mapping’, ‘maps’, and ‘mapped’ are changed to their root form ‘map’. This process uses the lemmatize function in the WordNet Lemmatizer from the NLTK library. Examples of the application of case folding, lemmatization, and stop word elimination are shown in
Table 10. The frequency of software features is recorded to know how many times a feature appears. This determines the importance of a software feature.
2.6. Software Feature Similarity
Similarity calculation aims to determine which software features extracted from user reviews are related to the features in the requirement document. The similarity has a value range of 0–1, indicating the similarity or closeness of two words, phrases, or sentences. The similarity was assessed using the similarity function from the spaCy library [
37]. SpaCy uses the word2vec approach to calculate the semantic similarity of words [
38]. This process is done by finding the similarity between word vectors in the vector space. The ‘en-core-web-md’ vector model from the spaCy library was used in this process. If the similarity value reaches the threshold of 0.5, then the extracted software feature is considered related to the existing feature. Please note that this threshold may need to be adjusted based on the data characteristics and the real-world conditions.
2.7. Requirement Report
Requirement report displays the results of the extraction in a structured format, generated using a data frame from the Pandas library. The requirement report shows the software features, the polarity values, the subjectivity values, the similarity values, and the software feature frequencies. Polarity and subjectivity values were further divided into minimum, maximum, and average. The collected software features may come from different user review texts and may have different sentiment values. The sentiment scores were distributed by groups by function on the data frame. The argument’s minimum, maximum, average values are displayed in the polarity and subjectivity columns. An example of a requirement report can be seen in
Table 11.
2.8. Evaluation
The evaluation seeks to analyze the extracted software features’ characteristics based on the polarity and subjectivity values. We also assessed the similarity value as a comparison to extend the analysis results. The focus of evaluation in this study is to quantify:
RQ1: The correlation between feature extraction and sentiment analysis in terms of acquiring relevant requirements.
RQ2: The correlation between feature extraction and sentiment analysis in informing system analysts in the decision-making of new requirements.
RQ3: The correlation between feature extraction and sentiment analysis pertaining to verbatim reuse of extracted features as new requirements.
To compare the data, a questionnaire was used. Systems analysts were asked to evaluate whether or not the system’s software features are relevant to the requirement documents, whether the extraction results prompt to new features addition, and whether the features are to be use verbatim as new requirements. We employed ten system analysts who previously composed the requirements document. The questionnaire was prepared based on the requirements report. This questionnaire consists of extracted software feature with positive and negative sentiments. The sentiments with high similarity values were also included in the comparison. The questionnaire displays the results of extraction—ten positive phrases, ten negative phrases, and ten phrases with the best similarity values. We generate a questionnaire according to the software feature that the application has (15 questionnaires for the earthquake app and ten questionnaires for the e-marketplace app). We ask three system analysts to evaluate each software feature based on research questions (RQ1–RQ3). The consensus was reached based on a majority assessment results. The data were then be further analyzed to see the sentiment correlation (polarity and subjectivity), the relevance, the decision to change the requirements, and the list of new features to be added.
3. Results
3.1. Software Feature Extraction Methods
We compared the approaches used to decide which one was the best for the software feature extraction. The approaches were POS chunking, Textblob, and Spacy (see
Table 12). We applied the POS tag pattern; the noun phrases function provided by the Textblob and SpaCy libraries; and syntactic dependency attributes, such as dobj and pobj in SpaCy. We compared the results from these with the results from manual tagging. We asked two experts to tag software features from sample user reviews. The comparison results show that the software feature produced by POS chunking has the best F-measure value compared to other approaches, which indicates the effectiveness of the POS tag pattern (See
Figure 2). As such, we used POS chunking as the software feature extraction method in this study.
We also attempted to corroborate the comparisons by using other approaches. However, it should be noted that the data used for comparisons are those reported by the study, so this data cannot be used as an apple-to-apple comparison. We used these comparisons to check whether our approach has delivered similar results compared to other approaches. The results show that the POS chunking is suitable for software feature extraction. Another important point to note is that we encountered the same problem as the previous study [
17,
26], namely the low value of precision. The user review data has a non-standard language format and does not always refer to software features.
3.2. Questionnaire Results
The software features results were separated based on their polarity (positive sentiment and negative sentiment). We added features with high similarity value as the benchmarks as per standard of requirement reuse. The questionnaire required the respondents to assess whether the feature extraction results are relevant, prompt the addition of new requirements, and verbatim reuse of extracted features as new requirements. We assume that the extracted features have some degree of relevancy to the initial requirements and to new features addition.
Figure 4 shows the questionnaire results.
From the highest relevance results, the features with high similarity have an average of 95%, followed by positive sentiment at 90%. Negative sentiment scores are lower at 63%. Whether or not the results prompt system analysts to update the existing requirements list, the high similarity and the positive sentiment score an average value of 78% and 74%, respectively. Meanwhile, the negative sentiment scores only 47%. In terms of adding new requirements, the high similarity has the highest value at 82%, whereas the positive and negative sentiment values score significantly lower at 55% and 22%, respectively.
The results of the questionnaire can be seen in
Table 13, which maps the results of feature extraction with positive, negative, and high similarity sentiment categories along with the actual values of polarity, subjectivity, and similarity. We compared the value of feature extraction that was rated ‘yes’ or ‘no’ by respondents to show their assessment on the relevance value, the influence in the decision-making of adding new features, and the verbatim addition of the outcomes as new features. In
Figure 5, we compared the polarity, subjectivity, and similarity values of the relevant features, prompt new requirements, and verbatim reuse of new requirements. Based on
Table 13 and
Figure 5, an analysis of the feature extraction sentiment analysis was carried out. The ‘yes’ response tends to have a higher polarity value for features extracted compared to the ‘no’ responses in all three categories. However, this does not apply in the high similarity category. High similarity features tend to have a polarity value of 0.06–0.187 (having a positive sentiment value but close to a neutral value). Additionally, there was a slight shift in the polarity value towards a more neutral value for the positive sentiment, negative sentiment, and high similarity when viewed regarding the relevance, the prompting of new requirements, and the addition of features as new requirements. For example, in the positive sentiment group, the polarity value for the relevance was 0.731, and this was lower at 0.715 in the prompting new requirement category.
There was no clear pattern found in terms of relationship between relevance, prompting new requirements, and adding features verbatim as new requirements in the subjectivity value. However, the high similarity group’s value had a higher objectivity value than the positive and negative sentiment groups.
Meanwhile, the comparison of similarity values does not have a significant difference. For example, the relevance value of high similarity was 0.769, while the similarity value for the positive and negative sentiment group was 0.664 and 0.638, respectively. This can be used as a basis to determine a similarity threshold value for the requirement reuse from positive and negative sentiments.
4. Discussion
4.1. Findings Related to the Research Question
The following are the answers to the research questions.
4.1.1. RQ1: What Is the Correlation between Feature Extraction and Sentiment Analysis in Terms of Acquiring Relevant Requirements?
The features with positive sentiments, negative sentiments, and high similarity have a high relevance level based on the questionnaire results; this means that all extracted features are relevant to the initial requirements. This is because all extracted features have been filtered by applying a specific similarity threshold. The average similarity value for all extracted features is above 0.5. The average subjectivity value between features with positive and negative sentiment is at least above 0.5. Meanwhile, the subjectivity value of features with high similarity is consistently below 0.5. This means that the high similarity feature has better objectivity than the features with positive and negative sentiments.
4.1.2. RQ2: What Is the Correlation between Feature Extraction and Sentiment Analysis in Informing System Analysts in the Decision-Making of New Requirements?
This research question means whether the extracted features can inspire new ideas for app features. As expected, the value is lower in comparison with the relevance value. Features with positive sentiment and high similarity experienced an expected decline, but negative sentiment features experienced a drastic decline. The average value of similarity did not change significantly compared to RQ1. Meanwhile, the average subjectivity value has a broader range than RQ1, which indicates a higher value of subjectivity, especially for the positive and negative sentiment category.
4.1.3. RQ3: What Is the Correlation between Feature Extraction and Sentiment Analysis Pertaining to Verbatim Reuse of Extracted Features as New Requirements?
This question seeks whether or not the systems analyst would be willing to reuse the extracted features to revise the requirement document. For positive and negative sentiment categories, the percentage results are lower compared to RQ2. Most the features with negative sentiment score considerably lower, below 50%. Likewise, the features with positive sentiment also score lower, although not as significant as that of the negative sentiment. Meanwhile, the high similarity feature has a consistent value compared to the RQ2. The subjectivity and similarity have the same pattern as RQ2.
4.2. Main Finding
The main finding from this research show that features with a high similarity outperformed the positive and negative sentiments in terms of acquiring relevant features. The high similarity category has a consistent value of above 70% for relevancy; the positive sentiment group can only offset this result. The value is lower in the prompting new requirements category and in the category of using the features verbatim as new requirement. Meanwhile, the value for the negative sentiment category is deficient, which means that although it is possible to obtain relevant features based on negative sentiment, it is unlikely to produce significant results. However, it should be noted that the results of feature extraction can have negative or positive sentiment values; this applies to the relevant feature category, prompting a similar new feature category, and adding verbatim as a new feature category. To obtain the best results, filtration with a certain similarity threshold is recommended, then, categorizing them under high similarity, positive sentiment, and negative sentiment. Thus, feature extraction for the requirement reuse that considers polarity and subjectivity values can be shown to the analysis system. However, since each project is unique, the primary determinant of feature reuse should involve human intervention, hence comparing the results with the system analysts’ assessments.
4.3. Related Studies
Based on the previous studies, user reviews can provide excellent feedback. User reviews may contain appreciations, bug reports, and new feature requests, which could be useful for a software and requirements engineering development [
27]. User reviews in software development can be used in many ways. Jiang et al. [
24] used online reviews to conduct requirements elicitation. Guzman et al. [
26] used user reviews to analyze an application feature systematically. Bakar et al. [
18] extracted software user reviews to reuse software features. Keertipati et al. [
25] extracted application reviews to prioritize feature improvements. In this study, user reviews are used to assist systems analysts in extending the scope of the existing software requirements.
Previous research analyzes in more depth the opinions or sentiments of user reviews. Jiang et al. [
24] adopted a mining technique to extract opinions about software features. Keertipati et al. [
25] found how negative sentiment in software features helped developers prioritize features that needed improvement. Meanwhile, in this study, we provide subjectivity and polarity scores for each extracted feature. The system analysts can choose which features to be used for requirements reuse to expand existing software requirements.
Table 14 shows a comparison of this study with related research.
4.4. Thread to Validity
The results of our study have limitations and should be considered within the study context. The extracted software features have low precision, recall, and F-measure values, which has been a problem in many studies. For this reason, we used selected features for the respondents’ data to determine relevance, inspiring new features, and adding them verbatim as new features.
We only measured a few samples from user reviews, which precludes us from filtering the extracted software features based on their importance level. Another aspect to note is the manual tagging done by the system analysts, which will vary when carried out in other studies and by different people. This is because the definition of software feature will differ from one to another depending on the research objectives. In our study, this issue may impact the software feature’s extraction results, which are evaluated by the experts to measure the relevancy. To solve this issue, in this study, we selected the ten best extraction results with positive and negative sentiments, and with the best similarity compared to the existing requirements.
To some extent, this study’s results still rely on human involvement as the main determining factor in software feature selection. The domain may also become a limitation as different domains will have different results. Finally, this study uses only two case studies with different domains. The results are similar but it cannot be guaranteed that the results will be the same in other domains.
5. Conclusions
In this research, we performed software feature extraction from user reviews to extend the existing software requirements by using sentiment analysis. The extraction results were grouped based on the polarity, subjectivity, and similarity value. We evaluated the correlation between software feature extraction results with regards to the polarity, subjectivity, and similarity. This was done to enhance the requirements elicitation process.
Both extracted features with positive and negative sentiments can be used as a requirement reuse, primarily to expand the existing requirements. However, to obtain the best results, we recommend that all extracted features are filtered based on the similarity value by using a certain threshold. This is crucial in filtering out features that are not relevant to the existing requirements. From this study, it can be concluded that features that have positive or negative sentiments can be used to acquire the relevant requirements, to inform the system analysts in determining new requirements, and reuse the extraction results verbatim as new requirements. However, the positive sentiment group yields a better performance than the negative sentiment group. In fact, the positive sentiment group has a value close to the features with high similarity. The primary source for requirement reuse is the feature extraction with a high similarity value. However, this does not mean we should overlook the feature extraction based on the positive and negative sentiments because it still provides the relevant information to help the determination of requirement reuse.
Further research should be undertaken to investigate the specific roles of the sentiment analysis in the feature extraction and analysis. This includes the semantic meaning of a feature, such as whether a feature has negative sentiment when the feature is not working, incomplete, or has other issues.