Towards a Semi-Automated Data-Driven Requirements Prioritization Approach for Reducing Stakeholder Participation in SPL Development

Abstract: Requirements prioritization (RP), part of requirements engineering (RE), is an essential activity in the Software Product-Line (SPL) paradigm. As in standard systems, the identification and prioritization of user needs are relevant to software quality, and they are especially challenging in SPL because of common requirements, increasing dependencies, and the diversity of stakeholders involved. The prioritization process might become impractical when the number of derived products grows, while the use of Artificial Intelligence (AI) techniques in different areas of RE has recently grown rapidly. The present research proposes a semi-automatic multiple-criteria prioritization process for functional and non-functional requirements (FR/NFR) of software projects developed within the SPL paradigm, with the aim of reducing stakeholder participation.


Introduction
Requirements prioritization (RP) is an important activity of requirements management. However, it can become a complex process in product-family projects, due to common requirements, increasing dependencies, and the diversity of stakeholders involved. In most prioritization methods, such as Hundred Dollar, MoSCoW, and the Numerical Assignment Technique (NAT), the participation of stakeholders is essential to provide the prioritization criteria based on their expertise [1]. In this respect, Hujainah et al. [2] suggest excluding users from tasks that can be automated and including them only in important tasks that generate value.
In recent years, the application of AI techniques in several stages of software engineering has been increasing and is expected to keep growing [3]. We argue that it is possible to take advantage of these techniques to exploit information and discover new criteria, and thereby decrease stakeholder participation.
In this paper, we focus mainly on those activities that can be automated for identifying a set of prioritization criteria and generating a list of ranked requirements. We also analyze the available datasets and discuss their main limitations. The next section describes the proposed process in detail.

A Semi-Automated Data-Driven Requirements Prioritization Process
The proposed process consists of two phases: the Criteria Identification Phase and the Requirements Prioritization Phase. A summary of the proposal is shown in Figure 1.

Criteria Identification Phase
In this first phase, multiple prioritization criteria are identified with minimal stakeholder participation. The phase starts with the selection of data sources, carried out by the analyst, and, optionally, the loading of new requirements and criteria. The data is then collected automatically by extracting and analyzing content from several sources, such as reviews from app marketplaces and formal requirements documents. After collection, Natural Language Processing (NLP) techniques can be used to identify features (distinctive characteristics or properties of a family of systems) and to associate them with existing features in feature models. Feature models are diagrams in SPL projects that show features in a hierarchical structure, together with the conceptual relationships among them [4]. The existing features may already have been prioritized, so new prioritization criteria can be obtained by associating the new features with the existing ones. Moreover, through sentiment analysis, we aim to identify sentiment and deontic modality (expressions of obligation or necessity) in user reviews, which can provide another type of prioritization criterion. A supervised or unsupervised classification algorithm is then used to classify requirements as FR or NFR. This classification can serve as a further criterion, given the importance of some NFRs, such as security or performance, that are considered crucial to system quality. All these criteria can be obtained automatically, without the participation of stakeholders.
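As a rough illustration of how criteria could be derived from a single user review, the following minimal sketch computes a sentiment score, a deontic flag, and an FR/NFR label. It is not the proposed implementation: the word lists, the `Criteria` type, and the `extract_criteria` function are hypothetical, and simple keyword matching stands in for the NLP, sentiment analysis, and classification models that the process would actually rely on.

```python
import re
from dataclasses import dataclass

# Hypothetical toy lexicons (assumptions for illustration only);
# a real pipeline would use trained NLP and classification models.
POSITIVE = {"love", "great", "useful", "fast"}
NEGATIVE = {"crash", "slow", "broken", "hate"}
DEONTIC = {"must", "should", "need", "needs", "shall"}
NFR_HINTS = {"secure", "security", "performance", "slow", "fast", "crash"}


@dataclass
class Criteria:
    sentiment: int  # positive minus negative word count
    deontic: bool   # True if an obligation marker is present
    kind: str       # "NFR" or "FR" (keyword heuristic, not a trained classifier)


def extract_criteria(review: str) -> Criteria:
    """Derive three illustrative prioritization criteria from one review."""
    tokens = re.findall(r"[a-z]+", review.lower())
    positives = sum(t in POSITIVE for t in tokens)
    negatives = sum(t in NEGATIVE for t in tokens)
    deontic = any(t in DEONTIC for t in tokens)
    kind = "NFR" if any(t in NFR_HINTS for t in tokens) else "FR"
    return Criteria(positives - negatives, deontic, kind)


if __name__ == "__main__":
    print(extract_criteria("The app must not crash when I export a report"))
    # prints: Criteria(sentiment=-1, deontic=True, kind='NFR')
```

The three fields correspond to the sentiment, deontic, and FR/NFR criteria described above; criteria derived from feature models are left out of this sketch.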

Requirements Prioritization Phase
In the second phase, requirements are prioritized based on the previously identified criteria. All these criteria need to be unified and summarized in order to provide more understandable information. At this point, stakeholders can review the prioritization criteria and confirm those that are relevant for the project. Once the criteria are selected, the prioritization is performed automatically by means of a machine learning algorithm. Candidate algorithms include machine-learned ranking, classification algorithms such as Decision Tree or Random Forest, and even deep learning algorithms in combination with other algorithms. Finally, the output of this process is a list of ranked requirements, which is saved as historical data for future use.
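The flow of this phase (criteria confirmed by stakeholders, then an automatic ranking) can be sketched as follows. Note that this sketch deliberately swaps the machine learning algorithm for a plain weighted sum, which is only a stand-in to show the inputs and the output of the step. The function name `rank_requirements`, the criterion names, and all numeric values are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_requirements(
    scores: Dict[str, Dict[str, float]],
    confirmed_weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Rank requirements by a weighted sum over stakeholder-confirmed criteria."""
    total_weight = sum(confirmed_weights.values())
    if total_weight <= 0:
        raise ValueError("at least one confirmed criterion needs a positive weight")
    ranked = []
    for req_id, criteria in scores.items():
        # Criteria that stakeholders did not confirm are ignored;
        # a missing value for a confirmed criterion counts as 0.
        value = sum(w * criteria.get(name, 0.0) for name, w in confirmed_weights.items())
        ranked.append((req_id, value / total_weight))
    # Highest score first; ties are broken by requirement id for determinism.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical criterion values in [0, 1] for three requirements.
    scores = {
        "R1": {"negative_sentiment": 0.9, "deontic": 1.0, "is_nfr": 1.0},
        "R2": {"negative_sentiment": 0.2, "deontic": 0.0, "is_nfr": 0.0},
        "R3": {"negative_sentiment": 0.5, "deontic": 1.0, "is_nfr": 0.0},
    }
    # Stakeholders confirmed two of the three criteria and weighted them.
    confirmed = {"negative_sentiment": 2.0, "deontic": 1.0}
    for req_id, score in rank_requirements(scores, confirmed):
        print(req_id, round(score, 3))
```

A learned ranking model would replace the weighted sum, but it would consume the same inputs (criterion values restricted to the confirmed criteria) and produce the same kind of output (an ordered list of requirements).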

Datasets
Datasets are an essential component of any machine learning model. PROMISE [5] is the dataset used in most of the research on FR/NFR classification. It has 625 requirement sentences, of which 255 are identified as functional and 370 as non-functional requirements. The NFRs are labeled with the following types: Availability, Legal, Look and feel, Maintainability, Operational, Performance, Scalability, Security, Usability, Fault tolerance, and Portability. However, the NFR categories are unbalanced. Unbalanced data can affect the precision and recall of several classification algorithms and yield a biased model. There are several ways to address this problem. Down-sampling the majority classes is one technique, but it can discard valuable data. Synthetic data generation (up-sampling) is another technique, which uses algorithms to create data that follow the trend of the minority classes. A third option is balanced ensemble learning, which combines the outputs of multiple learners to obtain a better prediction.
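The two resampling directions mentioned above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the `resample` function and the toy labels are hypothetical, and up-sampling is done here by duplicating existing minority samples, which is a simpler stand-in for true synthetic data generation.

```python
import random
from collections import Counter, defaultdict
from typing import List, Tuple

Sample = Tuple[str, str]  # (requirement text, label)


def resample(samples: List[Sample], target: int, seed: int = 0) -> List[Sample]:
    """Return a new list in which every class has exactly `target` samples."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))
    balanced: List[Sample] = []
    for items in by_label.values():
        if len(items) >= target:
            # Down-sampling: keep a random subset (this discards data).
            balanced.extend(rng.sample(items, target))
        else:
            # Up-sampling by duplication: keep everything, then draw
            # the missing samples with replacement from the same class.
            balanced.extend(items)
            balanced.extend(rng.choices(items, k=target - len(items)))
    return balanced


if __name__ == "__main__":
    # Hypothetical toy data, not taken from PROMISE.
    data = [(f"security req {i}", "Security") for i in range(6)]
    data += [("portability req 0", "Portability"), ("portability req 1", "Portability")]
    print(Counter(label for _, label in resample(data, target=4)))
```

Duplicated samples add no new information, so resampling should be applied only to the training split to keep evaluation metrics meaningful.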
For requirements prioritization methods based on supervised algorithms, RALIC [6] is a dataset used in some research. The RALIC dataset contains data about stakeholders' ratings and recommendations of requirements. It is used both in traditional methods and in machine learning methods for predicting the value of stakeholders' ratings.
Both datasets are in English. They are good references, but more datasets, especially in Spanish, are needed. This implies collecting and labeling historical requirements, obtaining balanced and standardized data, and ensuring enough quantity for training, testing, and validation.

Conclusions
In this article, we presented a data-driven requirements prioritization process that can be used in SPL projects. The proposed process aims mainly to reduce stakeholder participation through the identification of additional criteria, in order to avoid risks such as disagreement between stakeholders and lack of time. We rely on AI techniques, such as NLP and machine learning algorithms, mainly to optimize criteria identification by exploiting information from different data sources. We also reviewed two datasets that are used for FR/NFR classification and for requirements prioritization. As a result of this review, some of their limitations (e.g., imbalanced data) and the need for new datasets were identified.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Figure 1. Data sources and AI techniques used for prioritizing requirements of SPL projects.