Enhancing Software Feature Extraction Results Using Sentiment Analysis to Aid Requirements Reuse

Recently, feature extraction from user reviews has been used for requirements reuse to improve the software development process. However, sentiment analysis has yet to be incorporated into such extraction, so its contribution is not well understood. The aim of this study is to improve software feature extraction results by using sentiment analysis. Our study's novelty lies in the correlation between feature extraction from user reviews and sentiment analysis results for requirements reuse. This study can inform system analysts in the requirements elicitation process. Our proposal uses user reviews for software feature extraction and incorporates sentiment analysis and similarity measures in the process. Experimental results show that features used to expand existing requirements may come from both positive and negative sentiments. However, features extracted from positive sentiment overall score better than those from negative sentiment: 90% versus 63% for relevance, 74% versus 47% for prompting new features, and 55% versus 26% for verbatim reuse as new requirements.


Introduction
Requirement elicitation is the process of searching for and acquiring stakeholders' interests in software development [1][2][3]. This aims to help produce new requirements from relevant stakeholders [4], which can be taken from various sources and are available in a range of formats. Sources of requirements can be any relevant parties, experts, current systems, and competitor systems [1]. Stakeholders may comprise system end-users, top management, or external interested parties, such as regulators [5]. Techniques commonly used in requirement elicitation are open/closed interviews, scenarios, use cases, observations, questionnaires, brainstorming, prototyping, focus group discussions, and joint application development (JAD) workshops [6]. It is widely recognized among researchers and industry practitioners that software projects are very prone to failure when the requirements elicitation process is not carried out properly [7,8]. In the software development life cycle (SDLC), this undertaking is challenging because it involves socio-technical characteristics and changing requirements [9,10], and because there may be problems of ambiguity and tacit knowledge transfer [11,12]. Such problems, if found too late, can lead to high rework costs because they necessitate modifications in other processes [4,13].

Data
The data required in this study consisted of the initial requirements created by system analysts and the features extracted from user reviews on Google Play. We used the earthquake and e-marketplace apps as case studies. Data were collected from the requirements document and user reviews.

Software Requirements Documents
A list of requirements was used as an initial input to the system. Ten system analysts composed the requirements document using specific cases, which resulted in a list of functional requirements. The information used to make this document was obtained from several sources. Table 1 shows an example of a list of requirements in the requirements document.

Table 1. Examples of requirements in the requirements document.

#  | Earthquake App                                                   | E-Marketplace App
4  | Information about the earthquake location.                       | Users can add a product to the shopping cart.
5  | Information of the earthquake magnitude.                         | Users can see the total amount to pay before checkout.
6  | Display the time of the earthquake.                              | Users can track delivery status.
7  | Educational about actions to be taken when an earthquake occurs. | Users can interact with the seller using the chat room.
8  | Display data of the previous earthquakes.                        | Users can see purchase history.
9  | Information about first aid.                                     | Users can give a product review.
10 | Information on how to deal with post-earthquake trauma.          | Users can report frauds or scams.
11 | Latest news related to the earthquake.                           |
12 | Display the gathering point/safe zone in the user's area.        |
13 | Shows victim data (lost and found).                              |
14 | Displays emergency numbers.                                      |
15 | Fundraising and donation information.                            |

User Reviews on Google Play
User reviews were sourced from similar applications relevant to the case studies. The query keywords, 'earthquake' and 'e-marketplace', were entered into the Google Play search feature. From the list of applications most relevant to these keywords, the top 10-15 applications for each case study were selected. This search was conducted on 8 July 2020 for the earthquake app and on 18 November 2020 for the e-marketplace app. The list of applications used in this study is shown in Table 2. User reviews from 15 applications on Google Play were collected using the Python library google-play-scraper 0.1.1 (https://pypi.org/project/google-play-scraper/, accessed on 1 March 2021). Data extracted from the user reviews of each application were stored in plain text format. From the 15 earthquake-related applications, 17,133 user reviews were gathered.
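As a sketch of this collection step (assuming the google-play-scraper package; the package name of the queried app and the output file path below are placeholders), reviews can be fetched and stored one per line in plain text:

```python
def save_reviews_as_text(review_entries, path):
    """Write the 'content' field of each review entry to a plain text file,
    one review per line (newlines inside a review are flattened)."""
    with open(path, "w", encoding="utf-8") as f:
        for entry in review_entries:
            f.write(entry["content"].replace("\n", " ") + "\n")

# Fetching requires network access; the app id below is a placeholder.
FETCH = False
if FETCH:
    from google_play_scraper import Sort, reviews
    result, _token = reviews("com.example.earthquake",  # hypothetical app id
                             lang="en", country="us",
                             sort=Sort.NEWEST, count=200)
    save_reviews_as_text(result, "earthquake_reviews.txt")
```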

Pre-Processing
The data collected were raw data that needed processing. Since the requirements document and user reviews had different data characteristics, it was necessary to carry out the data pre-processing separately.

The Requirement Document
The pre-processing for the requirement document involved sentence tokenization, case folding, and stop word elimination. A sentence tokenizer was used to separate the functional requirements written in the requirement document. Case folding was used to change all letters in the requirement list data into lowercase. Stop words are the most common words in a language, such as 'the', 'a', 'on', 'is', and 'all'. These words carry little significance and were removed from the text. We used the stop word list provided by the Natural Language Toolkit (NLTK) library [34]. Examples of the pre-processing for the requirement document are shown in Table 3.
Table 3. Examples of implementing pre-processing in the requirements document.

# | Requirements                             | Case Folding                             | Stop Words Removal
1 | Notification when earthquake happened.   | notification when earthquake happened.   | notification earthquake happened
2 | Notification of the aftershocks          | notification of the aftershocks          | notification aftershocks
3 | Information of disaster relief.          | information of disaster relief.          | information disaster relief
4 | Information of the earthquake location.  | information of the earthquake location.  | display map earthquake location
5 | Information of the earthquake magnitude. | information of the earthquake magnitude. | information earthquake magnitude

User Reviews
The pre-processing of user reviews data was performed through data cleaning, spelling normalization, tokenization, POS tagging, and stop word removal. Data cleaning was performed to remove unnecessary characters to reduce noise, such as hashtags (#), URLs, numbers, certain punctuation marks, and other symbols. Data cleaning was performed because many user reviews contain irrelevant characters, such as emojis or excessive dots. This was done by using the string encoding function with the parameters 'ignore' and 'ascii' so that characters outside of ASCII were eliminated from the text. Table 4 shows an example of the data cleaning implementation.
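A sketch of this cleaning step: the ascii/ignore encoding round-trip is the mechanism the text describes, while the extra regular expressions for URLs, hashtags, and runs of dots are illustrative assumptions:

```python
import re

def clean_review(text):
    text = text.encode("ascii", "ignore").decode("ascii")  # drop emojis etc.
    text = re.sub(r"https?://\S+", " ", text)  # URLs (assumed pattern)
    text = re.sub(r"#\w+", " ", text)          # hashtags
    text = re.sub(r"\.{2,}", " ", text)        # runs of dots
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace
```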

Spelling normalization is the process of correcting words that are incorrectly written or abbreviated. For example, the words 'notification' and 'notif' have the same meaning. However, because 'notif' is an abbreviation that is not recorded in dictionaries, the word 'notif' is considered non-valuable. Therefore, the word 'notif' needs to be rewritten as 'notification.' Abbreviations such as 'aren't' also need to be normalized into 'are not.' Spelling normalization was done using the spell function in the NLTK library. An example of spelling normalization is shown in Table 5.
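A simplified, dictionary-based sketch of this normalization; the study used a spell-correction function instead, and the mapping entries below are illustrative:

```python
# Illustrative abbreviation dictionary; a real pipeline would use a
# spell checker rather than a fixed mapping.
NORMALIZATION_MAP = {"notif": "notification", "aren't": "are not", "pls": "please"}

def normalize_spelling(text):
    return " ".join(NORMALIZATION_MAP.get(token.lower(), token)
                    for token in text.split())
```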


Table 4. Examples of the data cleaning implementation.

Original Sentences | After Data Cleaning
Love this app ❤ very accurate! | Love this app very accurate!
This is an excellent app. Clear presentation and easy to set up. Notifications are good. Plenty of information about each quake. Recommended . | This is an excellent app. Clear presentation and easy to set up. Notifications are good. Plenty of information about each quake. Recommended.
Excellent! I downloaded this right after my local news announced an earthquake, and sure enough, it showed up as the most recent quake plus all the pertinent details!! Clear, concise, vivid colors, super easy to use. Many thanks Developers, important in my area as we have had several small ones the past few years. Highly recommended!! . | Excellent! I downloaded this right after my local news announced an earthquake, and sure enough, it showed up as the most recent quake plus all the pertinent details!! Clear, concise, vivid colors, super easy to use. Many thanks Developers, important in my area as we have had several small ones the past few years. Highly recommended!!
Good for visual people like me.... gives you a good idea of energy movement . | Good for visual people like me gives you a good idea of energy movement.
In this study, the sentiment value was calculated for each sentence. Sentence tokenization divided user reviews into sentences using the sent_tokenize function from the NLTK library. Meanwhile, POS tagging automatically labels words with their parts of speech. For example, in the sentence "I build applications", the labels are PP = pronoun, VV = verb, and NN = noun. The system receives the input in the form of a sentence, and the output is "I/PP build/VV applications/NN." POS tagging was performed using the pos_tag function in the NLTK library. Table 6 shows an example of the POS tagging process.

Sentiment Analysis
Sentiment analysis assigns quantitative values (positive, negative, and neutral) to a text, representing the author's emotions. User reviews were scored for polarity and subjectivity using the sentiment feature. We employed the TextBlob library for sentiment analysis to determine these values. The polarity value lies in the range [−1, 1], where 1 means a positive sentiment and −1 means a negative sentiment. Meanwhile, the subjectivity value lies in the range [0, 1], where 1 means an absolutely subjective statement (an opinion) and 0 means an absolutely objective statement (a fact) [35]. An example of value assignment can be seen in Table 7.


Software Feature Extraction
Software feature extraction was performed to obtain the requirements/features from user reviews. This was done by performing POS chunking. The results were then processed again by case folding, lemmatization, and stop word elimination.

POS Chunking
POS chunking extracts phrases from a sentence using a POS tag pattern with a regular expression. A POS tag pattern containing nouns, verbs, and adjectives is extracted into a phrase to bring out the features. For example, the terms 'voice alerts' and 'satellite maps' represent software features. The POS tag patterns used to extract the software features (SFs) in this study are shown in Figure 2. After SFs were extracted, we also needed to define VERB (a group of verbs), NC (noun phrases), and NP (a group of NCs). The POS tag set and regular expression code used in this study are explained in Table 8. The Penn Treebank POS tag set [36] was used. An example of a parse tree from the POS chunking results can be seen in Figure 3. We considered all phrases with SF tags. In the example, there are two phrases with SF tags, namely 'have the ability' and 'set multiple alarms'. Examples of software feature extraction are shown in Table 9.

Table 9 (excerpt). Examples of software feature extraction.

# | User Review | Extracted Software Features
4 | This application helps us identify the ferocity of these quakes and puts us at ease knowing the aftershocks are lower than the initial quake. | Identify the ferocity of these quakes; At ease knowing the aftershocks are lower than the initial quake
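The chunking step can be sketched with NLTK's RegexpParser. The cascaded grammar below (a noun chunk, optionally preceded by a verb and determiner) only approximates the actual pattern in Figure 2 and is an illustrative assumption; pre-tagged tokens are supplied directly, so no tagger model is needed:

```python
import nltk

# Cascaded chunk grammar: NC = noun chunk, SF = software feature.
# This approximates, but is not, the exact pattern from Figure 2.
GRAMMAR = r"""
  NC: {<JJ.*>*<NN.*>+}
  SF: {<VB.*><DT>?<NC>}
      {<NC>}
"""
parser = nltk.RegexpParser(GRAMMAR)

def extract_features(tagged_sentence):
    tree = parser.parse(tagged_sentence)
    return [" ".join(word for word, _ in subtree.leaves())
            for subtree in tree.subtrees() if subtree.label() == "SF"]
```

For example, `extract_features([("set", "VB"), ("multiple", "JJ"), ("alarms", "NNS")])` yields `['set multiple alarms']`.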

Case Folding, Lemmatization, and Stop Words
The feature extraction results from the POS chunking process were then processed by case folding, lemmatization, and stop word elimination. Case folding aims to change all letters in the software feature data to lowercase by calling the lower() function. Lemmatization aims to group words with different forms into one item. In this process, word affixes are removed to convert words into their basic forms. For example, the words 'mapping', 'maps', and 'mapped' are changed to their root form 'map'. This process uses the lemmatize function in the WordNet Lemmatizer from the NLTK library. Examples of the application of case folding, lemmatization, and stop word elimination are shown in Table 10. The frequency of each software feature is recorded to know how many times it appears; this determines the importance of a software feature.
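A minimal sketch of this normalization; the toy suffix-stripping rules below stand in for the WordNet lemmatizer, and the stop-word list is an illustrative subset:

```python
STOP_WORDS = {"the", "a", "of", "are", "than"}  # illustrative subset
SUFFIX_RULES = [("ping", ""), ("ped", ""), ("ing", ""), ("ed", ""), ("s", "")]

def toy_lemmatize(word):
    """Crude stand-in for WordNet lemmatization: strip common suffixes."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word

def normalize_feature(phrase):
    tokens = phrase.lower().split()             # case folding
    lemmas = [toy_lemmatize(t) for t in tokens] # lemmatization
    return " ".join(t for t in lemmas if t not in STOP_WORDS)
```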

Software Feature Similarity
Similarity calculation aims to determine which software features extracted from user reviews are related to the features in the requirement document. The similarity has a value range of 0-1, indicating the closeness of two words, phrases, or sentences. The similarity was assessed using the similarity function from the spaCy library [37]. SpaCy uses the word2vec approach to calculate the semantic similarity of words [38]. This process is done by finding the similarity between word vectors in the vector space. The 'en_core_web_md' vector model from the spaCy library was used in this process. If the similarity value reaches the threshold of 0.5, then the extracted software feature is considered related to the existing feature. Please note that this threshold may need to be adjusted based on the data characteristics and real-world conditions.

Table 10 (excerpt). Example of case folding, lemmatization, and stop word elimination: 'at ease knowing the aftershocks are lower than the initial quake' (case folding), 'at ease knowing the after shock are lower than the initial quake' (lemmatization), 'ease knowing after shock lower initial quake' (stop word removal).
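What the spaCy similarity call computes can be illustrated with cosine similarity over word vectors; the tiny 3-dimensional vectors below are fabricated for demonstration (en_core_web_md provides 300-dimensional vectors):

```python
import math

TOY_VECTORS = {  # made-up 3-d vectors, for illustration only
    "earthquake": [0.9, 0.1, 0.2],
    "quake":      [0.8, 0.2, 0.1],
    "cart":       [0.1, 0.9, 0.4],
}
SIMILARITY_THRESHOLD = 0.5  # threshold used in the study

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Here `cosine_similarity(TOY_VECTORS["earthquake"], TOY_VECTORS["quake"])` exceeds the 0.5 threshold, so the two words would be considered related, while 'earthquake' and 'cart' would not.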

Requirement Report
The requirement report displays the extraction results in a structured format, generated using a data frame from the Pandas library. The report shows the software features, polarity values, subjectivity values, similarity values, and software feature frequencies. Polarity and subjectivity values were further divided into minimum, maximum, and average. The collected software features may come from different user review texts and may therefore have different sentiment values. The sentiment scores were aggregated per feature using the groupby function on the data frame, and the minimum, maximum, and average values are displayed in the polarity and subjectivity columns. An example of a requirement report can be seen in Table 11.
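A sketch of this aggregation with a Pandas groupby; the feature names and sentiment values below are illustrative:

```python
import pandas as pd

rows = [  # illustrative per-sentence extraction results
    {"feature": "set multiple alarms", "polarity": 0.8, "subjectivity": 0.6},
    {"feature": "set multiple alarms", "polarity": 0.4, "subjectivity": 0.9},
    {"feature": "track delivery status", "polarity": -0.2, "subjectivity": 0.3},
]
report = (
    pd.DataFrame(rows)
    .groupby("feature")
    .agg(
        polarity_min=("polarity", "min"),
        polarity_max=("polarity", "max"),
        polarity_avg=("polarity", "mean"),
        subjectivity_avg=("subjectivity", "mean"),
        frequency=("polarity", "size"),  # how often the feature appeared
    )
    .reset_index()
)
```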

Evaluation
The evaluation seeks to analyze the extracted software features' characteristics based on the polarity and subjectivity values. We also assessed the similarity value as a comparison to extend the analysis results. The focus of evaluation in this study is to quantify:

RQ1:
The correlation between feature extraction and sentiment analysis in terms of acquiring relevant requirements.

RQ2:
The correlation between feature extraction and sentiment analysis in informing system analysts in the decision-making of new requirements.

RQ3:
The correlation between feature extraction and sentiment analysis pertaining to verbatim reuse of extracted features as new requirements.
To compare the data, a questionnaire was used. System analysts were asked to evaluate whether the extracted software features are relevant to the requirement documents, whether the extraction results prompt the addition of new features, and whether the features can be used verbatim as new requirements. We employed the ten system analysts who had previously composed the requirements document. The questionnaire was prepared based on the requirements report and consists of extracted software features with positive and negative sentiments; features with high similarity values were also included in the comparison. The questionnaire displays the extraction results: ten positive phrases, ten negative phrases, and ten phrases with the best similarity values. We generated questionnaires according to the software features of each application (15 questionnaires for the earthquake app and ten for the e-marketplace app). Three system analysts evaluated each software feature based on the research questions (RQ1-RQ3), and consensus was reached by majority vote. The data were then further analyzed to see the sentiment correlation (polarity and subjectivity), the relevance, the decision to change the requirements, and the list of new features to be added.
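The consensus rule can be sketched as a simple majority vote over the three analysts' labels:

```python
from collections import Counter

def majority_label(labels):
    """Return the most common label; with three analysts and two options
    ('yes'/'no'), a strict majority always exists."""
    return Counter(labels).most_common(1)[0][0]
```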

Software Feature Extraction Methods
We compared several approaches to decide which was best for software feature extraction: POS chunking, TextBlob, and spaCy (see Table 12). We applied the POS tag pattern; the noun phrase functions provided by the TextBlob and spaCy libraries; and syntactic dependency attributes, such as dobj and pobj, in spaCy. We compared the results from these with the results of manual tagging, for which we asked two experts to tag software features from sample user reviews. The comparison shows that POS chunking produced the best F-measure among the approaches, which indicates the effectiveness of the POS tag pattern (see Figure 2). As such, we used POS chunking as the software feature extraction method in this study. We also attempted to corroborate the comparisons against other approaches from the literature. However, the data used for these comparisons are those reported in the respective studies, so they cannot serve as an apples-to-apples comparison; we used them only to check whether our approach delivered similar results. The results show that POS chunking is suitable for software feature extraction. Another important point to note is that we encountered the same problem as previous studies [17,26], namely low precision: the user review data has a non-standard language format and does not always refer to software features.
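The scoring against the experts' manual tags follows the usual precision/recall/F-measure definitions over sets of features; the feature sets below are illustrative:

```python
def precision_recall_f1(extracted, reference):
    """Score an extracted feature set against a manually tagged reference set."""
    extracted, reference = set(extracted), set(reference)
    true_positives = len(extracted & reference)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```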

Questionnaire Results
The software feature results were separated based on their polarity (positive sentiment and negative sentiment). We added features with high similarity values as benchmarks, per standard requirements-reuse practice. The questionnaire required the respondents to assess whether the feature extraction results are relevant, prompt the addition of new requirements, and can be reused verbatim as new requirements. We assume that the extracted features have some degree of relevance to the initial requirements and to the addition of new features. Figure 4 shows the questionnaire results.
From the relevance results, the features with high similarity have the highest average at 95%, followed by positive sentiment at 90%; negative sentiment scores lower at 63%. Regarding whether the results prompt system analysts to update the existing requirements list, the high similarity and positive sentiment groups score average values of 78% and 74%, respectively, while negative sentiment scores only 47%. In terms of adding extracted features verbatim as new requirements, high similarity has the highest value at 82%, whereas the positive and negative sentiment values are significantly lower at 55% and 22%, respectively.
The results of the questionnaire can be seen in Table 13, which maps the feature extraction results in the positive sentiment, negative sentiment, and high similarity categories along with the actual values of polarity, subjectivity, and similarity. We compared the values of features rated 'yes' or 'no' by respondents to show their assessment of relevance, of the influence on the decision to add new features, and of the verbatim addition of the outcomes as new features. In Figure 5, we compared the polarity, subjectivity, and similarity values of the features judged relevant, prompting new requirements, and reused verbatim as new requirements. Based on Table 13 and Figure 5, an analysis of the sentiment of the extracted features was carried out. Features rated 'yes' tend to have a higher polarity value than those rated 'no' in all three categories. However, this does not apply in the high similarity category: high similarity features tend to have a polarity value of 0.06-0.187 (a positive sentiment value, but close to neutral). Additionally, there was a slight shift of the polarity value towards neutral for the positive sentiment, negative sentiment, and high similarity groups when viewed in terms of relevance, prompting new requirements, and adding features as new requirements. For example, in the positive sentiment group, the polarity value for relevance was 0.731, which dropped to 0.715 in the prompting-new-requirements category.
The software feature results were separated based on their polarity (positive sentiment and negative sentiment). We added features with a high similarity value as benchmarks, following standard requirement-reuse practice. The questionnaire asked the respondents to assess whether the extracted features are relevant, whether they prompt the addition of new requirements, and whether they can be reused verbatim as new requirements. We assume that the extracted features have some degree of relevancy to the initial requirements and to new feature additions. Figure 4 shows the questionnaire results.

For relevance, the features with high similarity score the highest, with an average of 95%, followed by positive sentiment at 90%; negative sentiment scores lower at 63%. Regarding whether the results prompt system analysts to update the existing requirements list, the high-similarity and positive-sentiment features score average values of 78% and 74%, respectively, while negative sentiment scores only 47%. In terms of adding new requirements verbatim, high similarity again has the highest value at 82%, whereas the positive and negative sentiment values are significantly lower at 55% and 22%, respectively.
The results of the questionnaire can be seen in Table 13, which maps the feature extraction results in the positive sentiment, negative sentiment, and high similarity categories to their actual polarity, subjectivity, and similarity values. We compared the extracted features rated 'yes' or 'no' by respondents on three points: relevance, influence on the decision to add new features, and verbatim addition of the outcomes as new features. In Figure 5, we compared the polarity, subjectivity, and similarity values of features judged relevant, features that prompt new requirements, and features reused verbatim as new requirements.

Based on Table 13 and Figure 5, we analyzed the sentiment of the extracted features. Features rated 'yes' tend to have a higher polarity value than those rated 'no' in all three categories, except in the high similarity category, where features tend to have a polarity value of 0.06–0.187 (positive but close to neutral). Additionally, there was a slight shift in polarity towards neutral across the positive sentiment, negative sentiment, and high similarity groups when moving from relevance, to prompting new requirements, to adding features verbatim as new requirements. For example, in the positive sentiment group, the polarity value for relevance was 0.731, which dropped to 0.715 in the prompting-new-requirements category. No clear pattern was found in the subjectivity values across relevance, prompting new requirements, and adding features verbatim as new requirements; however, the high similarity group was consistently more objective than the positive and negative sentiment groups.
Meanwhile, the similarity values do not differ greatly between groups. For example, the similarity value for relevant features in the high similarity group was 0.769, while the values for the positive and negative sentiment groups were 0.664 and 0.638, respectively. This can be used as a basis for determining a similarity threshold for requirement reuse from positive and negative sentiments.

Figure 5. The relationship between polarity, subjectivity, and similarity of extracted features can be compared between case studies using this figure: for RQ1, the correlation between feature extraction and sentiment analysis in terms of acquiring relevant requirements; for RQ2, the correlation between feature extraction and sentiment analysis in informing system analysts in the decision-making of new requirements; and for RQ3, the correlation between feature extraction and sentiment analysis pertaining to verbatim reuse of extracted features as new requirements.
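The threshold-based filtering that underlies this comparison can be sketched as follows. This is a minimal illustration using bag-of-words cosine similarity over plain-text feature and requirement strings; the paper does not specify its exact similarity measure, so the function names, the example strings, and the 0.5 threshold here are illustrative assumptions:

```python
from collections import Counter
from math import sqrt

def cosine_similarity(a: str, b: str) -> float:
    # Bag-of-words cosine similarity between two short texts.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def filter_by_threshold(features, requirements, threshold=0.5):
    # Keep extracted features whose best match against any existing
    # requirement meets the similarity threshold.
    kept = []
    for feature in features:
        best = max(cosine_similarity(feature, r) for r in requirements)
        if best >= threshold:
            kept.append((feature, round(best, 3)))
    return kept

kept = filter_by_threshold(
    ["export report to pdf", "dark mode theme"],
    ["export the report as pdf file"],
)
# keeps only "export report to pdf" (similarity ~0.612)
```

In practice, a TF-IDF or embedding-based similarity would likely replace the raw bag-of-words counts, but the thresholding step works the same way.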

Findings Related to the Research Questions
The following are the answers to the research questions.

RQ1: What Is the Correlation between Feature Extraction and Sentiment Analysis in Terms of Acquiring Relevant Requirements?
Features with positive sentiment, negative sentiment, and high similarity all have a high relevance level based on the questionnaire results, meaning that all extracted features are relevant to the initial requirements. This is because all extracted features were filtered with a specific similarity threshold; the average similarity value for all extracted features is above 0.5. The average subjectivity value for features with positive and negative sentiment is also above 0.5, whereas the subjectivity value of features with high similarity is consistently below 0.5. This means that the high-similarity features are more objective than features with positive or negative sentiment.

RQ2: What Is the Correlation between Feature Extraction and Sentiment Analysis in Informing System Analysts in the Decision-Making of New Requirements?
This research question asks whether the extracted features can inspire new ideas for app features. As expected, the value is lower than the relevance value. Features with positive sentiment and high similarity declined as expected, but features with negative sentiment declined drastically. The average similarity value did not change significantly compared to RQ1. Meanwhile, the average subjectivity value has a broader range than in RQ1, indicating higher subjectivity, especially in the positive and negative sentiment categories.
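The polarity and subjectivity values discussed throughout this section (polarity in [-1, 1], subjectivity in [0, 1]) match the scales produced by lexicon-based tools such as TextBlob. The paper does not restate its scoring tool here, so the following stand-alone scorer is only an illustrative stand-in, with an invented lexicon and weights:

```python
# Illustrative sentiment lexicon: word -> (polarity, subjectivity).
# A real scorer (e.g., TextBlob) ships a much larger lexicon.
LEXICON = {
    "great": (0.8, 0.75),
    "love": (0.5, 0.6),
    "crash": (-0.7, 0.2),
    "slow": (-0.3, 0.4),
}

def score(text):
    # Average the polarity/subjectivity of known words; (0, 0) if none match.
    hits = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    if not hits:
        return 0.0, 0.0
    polarity = sum(p for p, _ in hits) / len(hits)
    subjectivity = sum(s for _, s in hits) / len(hits)
    return polarity, subjectivity
```

A subjectivity above 0.5 marks an opinion-heavy review sentence, while values near 0 indicate the more objective phrasing seen in the high similarity group.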

RQ3: What Is the Correlation between Feature Extraction and Sentiment Analysis Pertaining to Verbatim Reuse of Extracted Features as New Requirements?
This question asks whether the systems analyst would be willing to reuse the extracted features to revise the requirements document. For the positive and negative sentiment categories, the percentages are lower than for RQ2. Features with negative sentiment score considerably lower, below 50%. Features with positive sentiment also score lower, although the drop is not as sharp as for negative sentiment. Meanwhile, the high similarity features have a consistent value compared to RQ2. The subjectivity and similarity values follow the same pattern as in RQ2.

Main Finding
The main finding from this research shows that features with high similarity outperformed those with positive and negative sentiments in acquiring relevant features. The high similarity category has a consistent value of above 70% for relevancy; only the positive sentiment group approaches this result. The values are lower in the prompting-new-requirements category and in the category of reusing features verbatim as new requirements. Meanwhile, the value for the negative sentiment category is low, which means that although it is possible to obtain relevant features based on negative sentiment, it is unlikely to produce significant results. However, it should be noted that the results of feature extraction can have negative or positive sentiment values; this applies to the relevant-feature category, the prompting-a-similar-new-feature category, and the adding-verbatim-as-a-new-feature category. To obtain the best results, we recommend filtering with a certain similarity threshold and then categorizing the features under high similarity, positive sentiment, and negative sentiment. Thus, feature extraction for requirement reuse that considers polarity and subjectivity values can be presented to the system analysts. However, since each project is unique, the primary determinant of feature reuse should involve human intervention, hence the results should be compared with the system analysts' assessments.
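The recommended workflow of filtering by a similarity threshold and then grouping features can be expressed compactly. The 0.75 high-similarity cutoff below is an assumption for illustration; the study leaves the exact cutoff to the analyst:

```python
def categorize(polarity: float, similarity: float, high_sim: float = 0.75) -> str:
    # Assign an already-filtered feature to one of the three groups
    # used in this study. high_sim is an assumed cutoff; tune per project.
    if similarity >= high_sim:
        return "high similarity"
    return "positive sentiment" if polarity >= 0 else "negative sentiment"

# 0.731/0.664 mirror the study's positive-sentiment relevance scores;
# the negative-sentiment example values are hypothetical.
categorize(0.731, 0.664)   # -> "positive sentiment"
categorize(-0.4, 0.638)    # -> "negative sentiment"
```

Checking the high-similarity branch first reflects the finding that high-similarity features are the most reliable group, with sentiment used only as a secondary signal.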

Related Studies
Based on the previous studies, user reviews can provide excellent feedback. User reviews may contain appreciations, bug reports, and new feature requests, which can be useful for software development and requirements engineering [27]. User reviews in software development can be used in many ways. Jiang et al. [24] used online reviews to conduct requirements elicitation. Guzman et al. [26] used user reviews to analyze application features systematically. Bakar et al. [18] extracted software user reviews to reuse software features. Keertipati et al. [25] extracted application reviews to prioritize feature improvements. In this study, user reviews are used to assist systems analysts in extending the scope of the existing software requirements.
Previous research has analyzed the opinions or sentiments in user reviews in more depth. Jiang et al. [24] adopted a mining technique to extract opinions about software features. Keertipati et al. [25] found that negative sentiment on software features helped developers prioritize features needing improvement. Meanwhile, in this study, we provide subjectivity and polarity scores for each extracted feature, and the system analysts can choose which features to use for requirements reuse to expand existing software requirements. Table 14 shows a comparison of this study with related research.

Threats to Validity
The results of our study have limitations and should be considered within the study context. The extracted software features have low precision, recall, and F-measure values, which has been a problem in many studies. For this reason, we presented a selected set of features to the respondents to determine their relevance, whether they inspire new features, and whether they can be added verbatim as new features.
We only measured a few samples of user reviews, which precludes us from filtering the extracted software features based on their importance level. Another aspect to note is the manual tagging done by the system analysts, which will vary when carried out in other studies and by different people, because the definition of a software feature differs from one study to another depending on the research objectives. In our study, this issue may impact the software feature extraction results, which are evaluated by the experts to measure relevancy. To address this issue, we selected the ten best extraction results with positive and negative sentiments, along with those with the best similarity to the existing requirements.
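Selecting the ten best extraction results per sentiment group, as described above, amounts to ranking by similarity within each group. The function and tuple layout below are illustrative assumptions, not the study's actual implementation:

```python
def top_n_per_group(scored, n=10):
    # scored: list of (feature_text, polarity, similarity) tuples.
    # Rank by similarity within the positive and negative groups and
    # keep the n best of each, mirroring the selection described above.
    positive = [f for f in scored if f[1] >= 0]
    negative = [f for f in scored if f[1] < 0]
    by_similarity = lambda f: f[2]
    return (sorted(positive, key=by_similarity, reverse=True)[:n],
            sorted(negative, key=by_similarity, reverse=True)[:n])
```

Ranking within each sentiment group separately guarantees that negative-sentiment features are represented even when positive-sentiment features dominate the overall similarity ranking.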
To some extent, this study's results still rely on human involvement as the main determining factor in software feature selection. The domain may also become a limitation, as different domains will have different results. Finally, this study uses only two case studies.