Performance Evaluation Metrics for Multi-Objective Evolutionary Algorithms in Search-Based Software Engineering: Systematic Literature Review

Abstract: Many recent studies have shown that various multi-objective evolutionary algorithms have been widely applied in the field of search-based software engineering (SBSE) to find optimal solutions. Most of them focused either on solving newly re-formulated problems or on proposing new approaches, while a number of studies performed reviews and comparative studies on the performance of proposed algorithms. To evaluate such performance, it is necessary to consider a number of performance metrics that play important roles during the evaluation and comparison of investigated algorithms based on their best-simulated results. While there are hundreds of performance metrics in the literature that can be used for such tasks, there is a lack of systematic reviews that provide evidence of how these performance metrics are used, particularly in the software engineering problem domain. In this paper, we aimed to review and quantify the types of performance metrics, the number of objectives, and the applied areas in software engineering that are reported in primary studies. This will eventually inspire the SBSE community to further explore such approaches in depth. To perform this task, a formal systematic review protocol was applied for planning, searching, and extracting the desired elements from the studies. After considering all the relevant inclusion and exclusion criteria for the searching process, 105 relevant articles were identified from the targeted online databases as scientific evidence to answer the eight research questions. The preliminary results show that a remarkable number of studies were reported without considering performance metrics for the purpose of algorithm evaluation.
Based on the 27 performance metrics that were identified, hypervolume, inverted generational distance, generational distance, and hypercube-based diversity metrics appear to be widely adopted in most of the studies in software requirements engineering, software design, software project management, software testing, and software verification. Additionally, there is increasing interest in the community in re-formulating many-objective problems with more than three objectives; yet, current studies are dominated by re-formulations with two to three objectives.


Introduction
Tackling problems in the software engineering (SE) discipline (i.e., regarding products, processes, and resources) has commonly been characterized as complex, error-prone, and expensive. While there is thus a need to make problem-solving less complex, less failure-prone, and less costly, objectively achieving these conflicting goals within existing constraints is difficult for decision-makers. However, many software engineering problems can be specified as optimization problems, and to solve such problems, practitioners use optimization techniques (metaheuristics) to search for the best (i.e., optimal) solutions. In the specialized literature, SBSE is the common term used for this approach [1], and it has been successfully applied in practice to SE areas such as software requirements, design, testing, and many more. This has led to many SE problems being re-formulated as search problems [1,2]. For example, test case prioritization in regression testing aims to maximize coverage criteria while minimizing a set of given constraints, such as cost and time; this makes the decision-making process a challenging task.
In the course of finding quality solutions to support the decision-making (DM) process, several techniques are used. There are techniques (algorithms) that can improve and maintain a single solution at a time, and those that can maintain multiple solutions (a population) at once [3]. Most of these methods are inspired by the intelligence that has evolved in nature in living things, as exemplified in biology through genetics and the movement of animals (i.e., insects, birds, fish, etc.) for their survival.
In the initial stages of SBSE research, single-objective problems (SOPs) were re-formulated and solved using Simulated Annealing and Tabu Search. However, these methods can only maintain and improve a single solution at a time [3,4]. In contrast, multi-objective evolutionary algorithms (MOEAs) are improved versions of the initial methods and can tackle multi-objective problems (MOPs) with no more than three objectives simultaneously. There are also other improved methods that can handle many-objective optimization problems (MaOPs) with more than three objectives. These methods are also called many-objective evolutionary algorithms (MaOEAs) [5]. Both MOEAs and MaOEAs produce sets of solutions with different trade-offs (Pareto optimal solutions).
However, in the available literature, there appears to be no agreement on the number of objectives that separates 'multi' from 'many' [6], and the use of these terms may create confusion.
In practical approaches, these multi/many-objective methods deal with large numbers of conflicting objectives, for which the best solutions may not be easy to identify. For such tasks, they can find multiple Pareto optimal solutions and perform better global searches of the search space [7]. However, the solution sets obtained by these methods, which present different trade-offs among the objectives, have to be quantitatively assessed in a meaningful way using a number of measurement scales [8]. Such measurements have different purposes: some are specific to problems (e.g., some SE testing papers use the average percentage of faults detected [APFD]), some are statistical measures, and others use the calculated execution time as a performance measure. However, our study is focused on the metrics used to evaluate solvers. These metrics are called performance metrics, also known as quality indicators.
On the other hand, solving an MOP requires the use of metaheuristic solvers to optimize a number of conflicting objectives or functions and provide a set of solutions (Pareto optimal set) to the decision-maker. However, there is no single solution that is better than the others with respect to all objectives; thus, these solvers provide an approximation of the Pareto front [9]. In the literature, several metrics have been proposed to evaluate and compare these approximation sets [10].
In our context, we review the use of performance metrics that are used to evaluate the quality of solver (algorithm) outputs, or to compare them with solution sets obtained by other solvers. In general, there is no universally good or bad algorithm; however, one algorithm may perform well for a specific problem, and thus such metrics are needed in the studies.
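As a minimal illustration of why a solver returns a set of trade-off solutions rather than a single best one, the following sketch filters a list of candidate solutions down to its non-dominated subset. It assumes that all objectives are minimized; the function names and the (cost, time) values are hypothetical and are not taken from any reviewed study.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a Pareto-dominates b (assumption: all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (cost, time) values for five candidate solutions.
candidates = [(1.0, 9.0), (2.0, 7.0), (3.0, 8.0), (4.0, 3.0), (5.0, 4.0)]
front = non_dominated(candidates)
print(front)  # [(1.0, 9.0), (2.0, 7.0), (4.0, 3.0)]
```

None of the three remaining solutions is better than the others in both objectives, which is why the choice among them is left to the decision-maker.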
To fill this gap, researchers have developed a number of performance metrics [11,12]. In the specialized literature [13][14][15], these metrics are roughly grouped into capacity, convergence, diversity (distribution, spread), or combination (convergence, diversity) metrics [13]. It is worth noting that some studies have reported that no single performance metric is enough to assess all the qualities of such methods [8,9,15] because each metric can only assess one or two desirable properties of the solution sets (e.g., convergence, diversity, or both) [16]. However, the rapid growth in the usage and development of multi/many-objective methods, performance metrics, and comparison of algorithms has received little attention [15].
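To illustrate why a single metric may not capture all desirable properties, the sketch below computes generational distance (GD) and inverted generational distance (IGD) in one common simplified form, namely the mean Euclidean distance to the nearest point of the other set. This form is an assumption made for illustration; the cited references give the formal definitions and their variants, and the reference front used here is hypothetical.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def _nearest(p: Point, others: List[Point]) -> float:
    """Euclidean distance from p to its closest point in others."""
    return min(math.dist(p, q) for q in others)


def generational_distance(approx: List[Point], reference: List[Point]) -> float:
    """Mean distance from each approximation point to the reference front."""
    return sum(_nearest(a, reference) for a in approx) / len(approx)


def inverted_generational_distance(approx: List[Point], reference: List[Point]) -> float:
    """Mean distance from each reference point to the approximation set."""
    return sum(_nearest(r, approx) for r in reference) / len(reference)


# Hypothetical reference front and two approximation sets.
reference_front = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
clustered = [(0.5, 0.5)]                               # converged, but no diversity
spread_out = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]      # converged and diverse

print(generational_distance(clustered, reference_front))            # 0.0
print(inverted_generational_distance(clustered, reference_front))   # greater than 0
print(inverted_generational_distance(spread_out, reference_front))  # 0.0
```

Under these assumptions, GD rates the single clustered point as perfect because it only measures convergence, while IGD penalizes the missing coverage of the front.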
Apart from the performance metrics, there are studies that re-formulate problems with different numbers of objectives and target specific areas in the SE domain. For the formal definitions of these metrics, their critical analysis, and the comparison of different metrics, which are not covered here, readers are referred to [9,13,14,17].
To the best of our knowledge, despite the overwhelming number of studies in the field of SBSE, there have been few systematic literature reviews on the use of such metrics, the number of objectives, and the sub-fields of SE addressed over the last decade. This served as a motivation to explore, investigate, and interpret the relevant studies (based on our research questions). This research will serve as a guide and reference for researchers and practitioners to obtain new knowledge and see the possible gaps that might exist in this area, which will lead to improving the current practice in this niche, such as increasing the number of performance metrics, increasing the number of objectives, and applying the methods to new SE areas.
This research is meant to analyze how the SBSE community progressively evolved the performance metrics, number of objectives, and applied SE areas. The rest of this paper is organized as follows: Section 2 mainly provides an overview of the existing secondary studies (related works) in the field. Section 3 defines the review method, including the study research questions, search strategies, study selection criteria, and data extraction process. In Section 4, the analysis and results of the study are detailed, and finally, Section 5 provides a discussion and summary of the findings as a conclusion of the study.

Related Work
This section describes the secondary studies that have been conducted within the context under consideration.
Ramirez et al. [18] conducted a study that relied on a guided review motivated by the growing attention being focused on many-objective problems. The research sought to discover the limitations on problem formulation, algorithm selection, experimental design, and industrial applicability. In the findings, it was agreed that multi and many-objective EAs use the same indicators, but no quantifiable results were obtained or objectively stated in the study on the distribution of these metrics.
An early informal review conducted by Sayyad and Ammar [19] aimed to collect data on the algorithms, tools, quality indicators, and number of objectives used in the SBSE community. The researchers concluded that the use of MOEAs was becoming a new trend and found that many articles used single algorithms. They also reported that only a few articles employed performance metrics, and that HV was the most used one. However, research in the community has increased, and the use of multi/many-objective methods is on the rise. Hence, conducting an up-to-date SLR is advisable to accommodate new trends and challenges.
Colanzi et al. [20] focused on the Brazilian authors and their contribution to the field of SBSE. Some of their objectives included tallying the number of publications of the community, the areas in which they focused, and their optimization techniques, as well as identification of the authors and their levels of collaboration. However, although their defined research questions were aimed at the SBSE field, the research was limited in scope, and their results are not generalizable.
Assunção et al. [21] also reported a similar mapping study targeting Brazil to expose existing research groups within the Brazilian SBSE community. It mainly discussed questions addressing the problems tackled in SE, the techniques used in solutions, and the number of researchers, institutions, and regions involved in these areas. Their findings showed significant growth in the community.
Another recent critical review was conducted by Chen et al. [17] within a limited timeframe (2009-2019). The study analyzed the quality indicators (performance metrics), problems involved in subfields of SE, and overall general issues in SBSE. The review finally provided methodological guidance on how to select and use evaluation methods in different scenarios. However, the study did not discuss how the community was currently using the metrics, which metrics were most used or least used, or how many metrics each study involved. Such discovery is our objective, and it may lead the community to better understand current practices.
The above-mentioned studies focused on SBSE but with regard to different or specific subjects, such as specific location [20,21], or specific techniques [18], while some of them are old [19], and some are new [17]. Hence, our study with its guided protocol is mainly meant to provide a generalizable result to the SBSE community by discovering how the community practitioners are employing the above-mentioned metrics, and this might eventually reveal pertinent issues and future opportunities.
It is worth noting that in the accumulated literature, we found a number of review studies targeting specific contexts of SBSE, such as different areas in testing, requirements, design, software refactoring, and many others, but their results may not be generalizable, and they pay little attention to the broader collection of literature in this field, especially regarding performance metrics. Given our limited scope, readers are referred to references [22][23][24][25][26][27][28][29][30][31][32][33].
Since performance metrics are equally used to evaluate algorithm performance, especially when new algorithms are proposed, and to keep the SBSE community up to date on this subject, we also consider studies from other communities that mainly discussed issues related to the performance metrics used. One such study, reported by Jiang et al. [13], grouped the performance metrics in the literature into four main classes and then analyzed the relationships between representative metrics from the groups. However, the study was limited to investigating the performance metrics categorized in the literature and their relationships on symmetric and continuous Pareto fronts (PFs). The authors suggested further investigation of these relationships from other geometric perspectives, such as asymmetric and discrete PFs, and also highlighted the need for appropriate metrics to use with hypervolume (HV) for concave shapes.
Riquelme et al. [14] conducted a small informal review focused on the usage frequency of 54 performance metrics in MOEAs, together with their advantages and disadvantages, drawing only on studies published in five editions (2005, 2007, 2009, 2011, and 2013) of the biennial evolutionary multi-objective optimization (EMO) conference.
A recent review study conducted by Li and Yao [34] categorized and analyzed the weaknesses and strengths of 100 state-of-the-art performance metrics with their desirable properties. On that basis, they concluded that there is no perfect metric to measure the solution sets, since different metrics are appropriate in different situations. Another research direction suggested was to design new performance metrics suited to the preferences of decision-makers (DMs).
Okabe et al. [8] reviewed the existing performance metrics by categorizing them into a number of groups based on their functionalities, then showing the advantages and disadvantages of performance metrics. Thus, a comparative study was done that discovered some of the metrics were misleading. Therefore, their point of discussion appeared to be that no single metric alone can quantify the qualities of the solution sets obtained by solvers.
Laszczyk and Myszkowski [35] presented a taxonomy-based survey of 38 existing performance metrics and their definitions, along with their advantages and disadvantages. They claim that their proposed complementary set of metrics can produce meaningful results when used on solution sets obtained by solvers.
Audet et al. [36] is another recently published review of performance metrics, which focused on 57 metrics grouped into four categories: cardinality, convergence, distribution, and spread. The research gap reported in this paper is the need for new metrics that can tackle the limitations faced by the HV.
While these papers similarly discuss and focus on discovering the weaknesses of the existing performance metrics in use by the EMO community, no paper tracked how current SBSE research practice uses these metrics (i.e., their applications to real-world problems instead of artificial problems). This research will, therefore, discover whether enough practitioners are using these metrics, how the metrics are distributed, and which types of metrics are employed.
For further performance metric analysis on their strengths and weaknesses with practical guidance, readers are referred to references [37][38][39][40].

Research Methodology
A systematic literature review (SLR) is a process of identifying the relevant research questions, collecting the relevant secondary data, and evaluating and interpreting such data.
To obtain a good sample of primary studies, several approaches are discussed in the literature, such as the standard SLR, the Systematic Mapping Study (SMS), Snowballing, or Quasi-Gold Standard (QGS) methods. Among these methods, SMS is mainly employed when the number of primary studies is huge or when broad topics are covered; however, the cost of assessing all the studies would be unreasonably high [41]. To reduce this cost, the classification process has to be stopped at a certain level, which may lead to missing important articles. On the other hand, Snowballing (backward and forward) is also used to find primary studies; however, whether it is necessary depends on whether well-defined, reliable, and efficient search strings are used in the digital libraries. Some studies use the concept of QGS to improve the search steps, but this depends too much on a good QGS [41].
In contrast, the standard SLR is driven by a very specific research question that is used to identify, analyze, and interpret the relevant studies [41,42]. In an SLR, the primary studies are identified with the help of the search process and the data extraction process (such as inclusion and exclusion criteria) [42]. We believe the standard SLR methodology used in our study is essential to support this research constructively.
In this regard, we describe the methodology of this SLR, guided by Kitchenham et al. [43], to systematically collect, analyze, and summarize the quantifiable data obtained from the specialized literature. In the following subsections, we discuss our research questions, search strategies, study selection, and data extraction process.

Research Questions
To define what we are trying to answer, it is essential to direct our research questions (RQs) at the studies that quantitatively evaluate the Pareto-based methods using performance metrics, number of objectives, and applied SE areas. Thus, we consider the following RQs:
RQ1: What are the studies that applied none or one or more performance metrics?
To answer this RQ, we investigate the number of performance metrics that the SBSE community employed over the years. We aim to check the number of performance metrics each study employed by adopting a grouping strategy. This discovery will help practitioners to understand how the existing studies measure the quality of the solution sets obtained by the solvers.
RQ2: What metrics are most or least used in the studies?
In this question, we aim to identify the metrics that are reported as most or least used. From this point of discovery, by tallying and grouping (adopting the previous grouping strategy), the different sets of metrics in each group and their frequencies are discussed.
RQ3: What is the rank order of the metrics most or least used in the studies?
To see the overall metrics and their ranks, we calculate their total frequencies. In RQ2, the overall rank of these metrics is not discussed. In this RQ, however, we intend to identify the metric frequencies, group the metrics by their frequencies, and calculate their percentages. In this case, we do not use the previous grouping strategy.
RQ4: How do the top popular metrics (>5%) increase or decrease in the studies?
It is beneficial to see how a set of metrics is distributed over the study groups (adopting the previous grouping strategy), especially those that account for more than five percent. This investigation will help us to increase our knowledge about how a set of metrics became increasingly popular or decreased in use across the study groups.
RQ5: How well do the current studies in SBSE use performance metrics?
In this RQ, we further investigate the study groups by showing the total number of studies in each study group, their total number of unique metrics, and their total metric frequency.
RQ6: What are the number of objectives used in the studies?
In this regard, we show the number of objectives that the community employs in practice by grouping the studies based on the objective count. This will reveal the current practice of SBSE practitioners and the future direction of the research.
RQ7: What are the applied areas in SE of the studies?
In this RQ, we investigate the most and least commonly investigated software engineering (SE) areas by showing the distribution of the studies among them. Previously, software testing was dominant, but recently, many areas in SE have been investigated in the SBSE community. In this case, we also adopt the grouping made by the previous studies [19,20]. This might help and lead current practitioners to further investigate these areas.
RQ8: What are the performance metrics distribution in each SE applied area?
In this RQ, we also identify how the previously investigated performance metrics (based on the grouping strategy) are distributed across the applied SE areas. This will help SBSE practitioners to understand which SE areas are employing more or fewer metrics.
Answering the above RQs will help practitioners to understand how studies have measured the quality of the solution sets obtained by solvers. There may be philosophical ideas among members of SE communities that might be revealed through answering these RQs, thus giving them meaning. To bound the scope of the RQs, we limit the research papers to those published from 2000 to 2020. We selected only publications related to SBSE, especially those involving multi/many-objective methods, their employed performance metrics, the number of objectives utilized in their evaluation setups, and their applied areas of SE.

The Search Strategy
To avoid missing relevant studies, we used a manual search method on the four most suitable digital libraries:
I. IEEExplore
II. Scopus
III. Web of Science
IV. Science Direct
To avoid covering a limited number of articles, we selected a set of digital libraries that can cover a large number of articles. We constructed detailed search strings relevant to our topic to collect a significant number of studies. The construction of these strings was inspired by several literature reviews [17,20,23,26,44,45]. These queries are sufficient to cover a wide range of articles and match the article title, abstract, and keywords.
To identify publications, we used a set of keyword strings in our search parameters, as shown in Table 1. These keywords are categorized into those related to SE, Search Based, and performance metrics. This group organization was inspired by that found in Reference [44].
The keywords related to SE field areas, Search Based, and performance metrics were extracted and then combined using Boolean operators, such as "OR" and "AND." All the search parameters targeted article titles, abstracts, and keyword sections. Finally, because some of the targeted databases would not fully accept long strings, these strings were executed by splitting them into shorter segments to avoid unsatisfactory results.
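The construction described above can be sketched as follows: terms inside a keyword group are joined with OR, the groups are joined with AND, and a long query is split into shorter ones by chunking one group. The keywords shown are hypothetical placeholders, since the actual terms are those listed in Table 1, and the helper names are illustrative only.

```python
from typing import Dict, List


def or_group(terms: List[str]) -> str:
    """Join the terms of one keyword group with OR, quoting each phrase."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_query(groups: Dict[str, List[str]]) -> str:
    """Join the keyword groups with AND."""
    return " AND ".join(or_group(terms) for terms in groups.values())


def split_queries(groups: Dict[str, List[str]], split_key: str, chunk_size: int) -> List[str]:
    """Build several shorter queries by chunking the terms of one group."""
    terms = groups[split_key]
    queries = []
    for start in range(0, len(terms), chunk_size):
        reduced = dict(groups)
        reduced[split_key] = terms[start:start + chunk_size]
        queries.append(build_query(reduced))
    return queries


# Hypothetical keywords; the real ones are those of Table 1.
keyword_groups = {
    "software_engineering": ["software testing", "software design"],
    "search_based": ["multi-objective", "many-objective"],
    "performance_metrics": ["hypervolume", "quality indicator"],
}

query = build_query(keyword_groups)
print(query)
for short_query in split_queries(keyword_groups, "software_engineering", 1):
    print(short_query)
```

The union of the results returned by the shorter queries corresponds to the result of the full query, because the chunked group is an OR group.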

Study Selection
To select the candidate papers, we employed inclusion and exclusion criteria, as we mainly aimed not to miss any beneficial articles that matched our research objectives and were written in the English language in sources from the specified publishers (IEEExplore, Science Direct) or indexers (Scopus, Web of Science). To finally select the desired studies, we filtered the fetched articles by carefully and iteratively reading the titles, abstracts, keywords, and body texts. The steps in the process of searching and selecting are illustrated in Figure 1. In these steps (from top to bottom), Step 1 shows the number of studies returned from each database, with a total of 699 studies. In Step 2, we excluded those mismatched by title. In Step 3, the duplicates were deleted. In Step 4, the abstracts were read, and those that were out of our scope were excluded. Finally, in Step 5, a full reading of the remaining articles was performed. We also applied the inclusion and exclusion criteria at every step where applicable.
The inclusion criteria are as follows:
1. The study must be related to the topic (SBSE) and must use multi-objective or many-objective methods.
2. The study must be written in the English language.
3. The study must be available online and in electronic format.
And the exclusion criteria are:
1. Studies not related to SBSE;
2. Not written in the English language;
4. Not available online.
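As a rough sketch of how such criteria can be applied mechanically, the code below removes duplicate titles and then keeps only the records that satisfy the three inclusion criteria listed above. The record fields and sample entries are hypothetical; in the review itself, these judgments were made by reading titles, abstracts, and full texts.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Record:
    """Hypothetical bibliographic record; the field names are illustrative."""
    title: str
    related_to_sbse: bool
    uses_multi_or_many_objective: bool
    language: str
    available_online: bool


def is_included(record: Record) -> bool:
    """Apply the three inclusion criteria."""
    return (
        record.related_to_sbse
        and record.uses_multi_or_many_objective
        and record.language == "English"
        and record.available_online
    )


def select(records: List[Record]) -> List[Record]:
    """Drop duplicate titles, then keep the records that meet every criterion."""
    seen_titles = set()
    selected = []
    for record in records:
        key = record.title.strip().lower()
        if key in seen_titles:
            continue
        seen_titles.add(key)
        if is_included(record):
            selected.append(record)
    return selected


records = [
    Record("Study A", True, True, "English", True),
    Record("study a ", True, True, "English", True),     # duplicate of Study A
    Record("Study B", True, False, "English", True),     # single-objective only
    Record("Study C", False, True, "English", True),     # not related to SBSE
    Record("Study D", True, True, "Portuguese", True),   # not written in English
]
selected = select(records)
print([record.title for record in selected])  # ['Study A']
```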

Data Extraction Process
In this step, after full-text reading, we extracted the desired data from the final selected studies that satisfied our criteria. To review the primary studies, three researchers were randomly assigned to assess the relevant papers; the researchers then extracted the data from the relevant studies, and the obtained data were cross-checked. The data were then stored in Excel spreadsheets for further analyses. The extracted parameters included the names and number of performance metrics, the number of objectives, and the subfield of SE. This process facilitated easy classification and analysis to answer our research questions.
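One simple way to picture the cross-checking step is to compare the fields extracted by two reviewers for the same study and list the fields on which they disagree. The sketch below is only an assumption about how such a comparison could look; the field names and values are hypothetical, and the review itself used Excel spreadsheets.

```python
from typing import Dict, List

Extraction = Dict[str, object]


def disagreements(first: Extraction, second: Extraction) -> List[str]:
    """Return the names of the fields whose extracted values differ."""
    fields = sorted(set(first) | set(second))
    return [name for name in fields if first.get(name) != second.get(name)]


# Hypothetical extractions of the same study by two reviewers.
reviewer_a = {"metrics": ["HV", "IGD"], "num_objectives": 3, "se_area": "software testing"}
reviewer_b = {"metrics": ["HV"], "num_objectives": 3, "se_area": "software testing"}

to_resolve = disagreements(reviewer_a, reviewer_b)
print(to_resolve)  # ['metrics']
```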

Results
To provide a detailed explanation at this stage, we analyzed the data extracted from the final 105 studies, after applying the inclusion and exclusion criteria, to answer the research questions. To start, we first summarized all the 27 unique metrics used in the studies, as shown in Table 2. Figure 2 shows the number of studies by publication year.
RQ1: What are the studies that applied none or one or more performance metrics?
In order to see how the existing studies used performance metrics, we grouped our collected studies based on the number of metrics used. After full reading, we found that the maximum number of metrics used in the selected studies was six. Thus, our grouping strategy adopted these abbreviations: M0 means zero metrics, M1 one metric, M2 two metrics, M3 three metrics, M4 four metrics, M5 five metrics, and M6 six metrics. This means that articles that had not employed or not reported the defined performance metrics are listed in the M0 group, those with one metric in M1, and so on. Figure 3 shows that the largest group of studies, based on this grouping, used zero metrics, which accounted for 37 articles, and the second-ranked group deployed two metrics. Meanwhile, the graph shows a decline in the number of studies that employed more than two metrics. It is worth noting that only four studies employed six metrics (M6), and four others used five metrics (M5). However, to make this more meaningful, we needed to identify the dominant metrics by showing how the 27 metrics were used over the 105 studies. This answers another RQ.
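The grouping strategy can be sketched as a simple labeling rule: a study that reports k distinct performance metrics is assigned to group Mk. The study identifiers and metric lists below are hypothetical and serve only to show the mechanics.

```python
from collections import Counter
from typing import Dict, List


def group_label(metrics: List[str]) -> str:
    """Label a study by the number of distinct metrics it reports (M0, M1, ...)."""
    return f"M{len(set(metrics))}"


# Hypothetical studies and the metrics each one reports.
studies: Dict[str, List[str]] = {
    "S1": [],
    "S2": ["HV"],
    "S3": ["HV", "IGD"],
    "S4": ["HV"],
    "S5": [],
}

group_sizes = Counter(group_label(metrics) for metrics in studies.values())
print(dict(group_sizes))  # {'M0': 2, 'M1': 2, 'M2': 1}
```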
RQ2: What metrics are most or least used in the studies?
From this point of discovery, the tallying of the used metrics is discussed. It is not surprising that some of the employed metrics were selected due to their broad usage in the literature [10,46]. We only tallied the studies that used performance metrics; hence, those using zero metrics (the M0 set) were excluded. With the help of Excel spreadsheet visualization, the obtained results are presented in figures. Figure 4 shows the distribution of the metrics for the M1 group, which had a total of 19 articles. Each of these studies employed one metric, and only two unique metrics were involved, namely HV and IGD. However, IGD was used only once, while HV was used 18 times in the M1 group.

Table 2 (rows 6 to 27). Unique performance metrics used in the studies.
No.  Metric                                          Abbreviation
6    Generalized spread                              GS
7    Error ratio                                     ER
8    Inverted generational distance                  IGD
9    Generational distance                           GD
10   R-metric                                        R2
11   Maximum spread                                  MS
12   Contribution metric                             -
13   Maximum Pareto front error                      MPFE
14   Hypercube-based diversity metric                -
15   Spread: Delta measure                           ∆
16   Convergence metric                              CM
17   Coverage difference                             D
18   Two set coverage                                C
19   Euclidean distance                              ED
20   Epsilon family                                  ε
21   Spacing                                         S
22   Inverted generational distance                  IGD+
23   Overall nondominated vector generation          ONVG
24   Percentage                                      P
25   Lp-norm-based diversity                         Lp-norm
26   Number of solutions in the region of interest   Proi
27   Convergence measure                             ρ

Figure 5 shows that a total of 24 articles employed sets of two metrics. The number of studies in this group (M2) was higher than in the previous one (M1), and the group comprised a good diversity of metrics (11 unique metrics in total); yet HV and IGD were the leading ones, which means they were the most used metrics.
HV was used 18 times, IGD was used six times, and there were nine additional metrics in this set: the hypercube-based diversity metric was used five times, and delta spread (∆) and ED were used four times each. The remaining metrics were used as follows: ε and S were used three times each, GD two times, and finally, HVR, NDS, and ρ (convergence measure) were used only one time each.
For the M3 group, Figure 6 shows a total of 16 metrics that appeared in the publications. After comparison according to their usage, HV was found to be the highest in total with seven cases, while NDS and the hypercube-based diversity metric were used five and four times, respectively. GD, the contribution metric, ∆, and CM were used two times each, and the remaining nine metrics in this list had the lowest values, having been used only once.
In the M4 group, as Figure 7 shows, GD is the most used metric for the first time, having been used five times, and HV and IGD are in second position, having been used four times each. In this graph, 11 unique metrics are involved, with a total of six articles in the set (M4). The two-set coverage (C) and spacing (S) metrics were used three and two times, respectively, while the rest of the metrics (HVR, PFS, GS, the hypercube-based diversity metric, ∆, and ED) were used once each.
In Figure 8, for the M5 group, although the articles utilizing more than two metrics are lower in number, the number of unique metrics is high. The figure indicates 15 unique metrics with the following frequencies: HV was used three times, and MS, ε, and S were used two times each, while the rest (HVR, GS, ER, IGD, GD, R2, ∆, CM, D, C, and IGD+) were reported only one time each.
Finally, articles that employed six metrics (M6) were also few in number (four in total), and they used ten unique metrics. Figure 9 shows the frequencies of these metrics.
HV and IGD had four each, and the remaining eight metrics were reported with three different scores: PFS, GS, and ε had three each, ER and GD had two each, while the contribution metric, MPFE, and S were reported once each. Although the total number of unique reported metrics across the 105 articles is 27, they are repeatedly used in some of the articles. From this perspective, we can answer another research question on the overall ranks of these 27 metrics over the studies.
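A quick consistency check on such tallies is that, in group Mk, the metric frequencies must sum to k times the number of articles in the group. The sketch below applies this check to the M1 and M4 frequencies reported in the text above; the short labels "Hypercube" and "Delta" are our own shorthand for the hypercube-based diversity metric and ∆, and the helper name is illustrative.

```python
from typing import Dict

# Frequencies as reported in the text for the M1 and M4 groups.
reported = {
    1: {"articles": 19, "frequencies": {"HV": 18, "IGD": 1}},
    4: {
        "articles": 6,
        "frequencies": {
            "GD": 5, "HV": 4, "IGD": 4, "C": 3, "S": 2, "HVR": 1,
            "PFS": 1, "GS": 1, "Hypercube": 1, "Delta": 1, "ED": 1,
        },
    },
}


def summarize(metrics_per_study: int, articles: int, frequencies: Dict[str, int]) -> Dict[str, object]:
    """Count unique metrics and check that the frequencies sum to k * articles."""
    total = sum(frequencies.values())
    return {
        "unique_metrics": len(frequencies),
        "total_frequency": total,
        "consistent": total == metrics_per_study * articles,
    }


summaries = {
    f"M{k}": summarize(k, group["articles"], group["frequencies"])
    for k, group in reported.items()
}
for label, summary in summaries.items():
    print(label, summary)
```

For M4, for example, the 11 unique metrics have a total frequency of 24, which matches six articles with four metrics each.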
RQ3: What is the rank order of the metrics most or least used in the studies?
The analysis above determined how the study groups employed these metrics in separate representations. In this section, the most or least used metrics (high or low in frequency) are described using their total frequencies, which gives the overall ranks of the metrics. Table 3 shows the 27 unique metrics grouped based on their frequencies in column two. This means that those that received the same value are in the same set or rank. Column three shows the percentage calculated for each metric as its frequency divided by the total frequency of 168, multiplied by 100. Please note that when a cell contains a set of metrics, the percentage applies to each metric in the set, and the contributions must be summed to reach a total of 100%. For example, the frequency value five is shared by three metrics (PFS, GS, C), each of which has the value 3.0%, which means their total is 3.0 + 3.0 + 3.0 = 9.0%, and the rest should be calculated in the same manner. The result is that HV has the highest frequency of 54 and is thus in the first position in terms of the number of times used; in percentage terms, this metric accounts for 32.1%. In the second position is IGD, which has a frequency of 17 and accounts for 10.1% of the total. In positions three and four, GD and the hypercube-based diversity metric (also called spread [S]) scored 12 and 10 in frequency and 7.1% and 6.0% in the percentage column, respectively. The remaining metrics are each below 6.0%; hence, they are the least used metrics in this report. In positions five (∆, ε, S) and six (NDS, ED), there are sets of metrics in the cells with frequency scores of 8 and 6 and percentages of 4.8% and 3.6% per metric, respectively. As shown in Table 3, the rest of the metrics account for less than 3.6% each.
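The percentage calculation described above can be reproduced with a few lines of code. The sketch below uses the frequencies and the total of 168 stated in the text; rounding to one decimal place is our assumption about how the reported percentages were obtained.

```python
TOTAL_FREQUENCY = 168  # total frequency of all metrics, as stated in the text

# Frequencies of the four most used metrics, as reported in the text.
frequencies = {
    "HV": 54,
    "IGD": 17,
    "GD": 12,
    "Hypercube-based diversity metric": 10,
}


def share(frequency: int, total: int = TOTAL_FREQUENCY) -> float:
    """Percentage of the total frequency, rounded to one decimal place."""
    return round(100 * frequency / total, 1)


shares = {name: share(frequency) for name, frequency in frequencies.items()}
popular = [name for name, value in shares.items() if value > 5.0]

print(shares)
print(share(8), share(6), share(5))  # shares for frequencies 8, 6, and 5
print(popular)
```

Under this assumption, the four metrics above are the only ones whose share exceeds 5%, since a frequency of 8 already corresponds to 4.8%.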
A further research question concerns how the use of the top-ranked metrics varies across the study groups. RQ4: How do the most popular metrics (>5%) increase or decrease in the studies?
To answer this question, we need to show the distribution of the metrics with a share of more than 5%, in order to understand in which groups these metrics became more popular and in which their use declined. We made a bar chart to visualize the distribution of these metrics over the above-mentioned groups (M1 to M6). Figure 10 shows that for M1, M2, M3, and M5, the HV metric was the most used, and for the rest, it was used less than in M1 and M2. For the M4 and M5 groups, HV and IGD were comparable. In M2, IGD, the hypercube-based diversity metric, and GD appeared with the next-highest values, in that order. Although the remaining groups contain fewer publications than the previous ones, some of the metrics also decreased in use: the hypercube-based diversity metric declined after its first appearance in M2 and was not reported in M1, M5, or M6, while IGD was represented in all of the groups and GD is present in M2-M6. The graph also shows that most of the studies relied on HV in M1 and in M2, where studies started combining it with other metrics, such as IGD, GD, and the hypercube-based diversity metric. In all of the groups, HV was top-ranked except in M4, where GD was highest in frequency, and M6, where it was equal with IGD. It must be noted, however, that there are more studies in M1 and M2 (43 articles in total) than in M3, M4, M5, and M6. This means most of the studies (according to M1-M6) employed one or two metrics, as shown in Figure 3 and Table 4. In short, this graph shows HV is preferred for DM when only one metric is used, while the rest of these metrics only become desirable when more than one metric is used.
RQ5: How well do the current studies in SBSE use performance metrics? Table 4 reports all 105 articles and the study groups (M0 to M6) together with their references.
The total number of articles in each set (as mentioned earlier), the total number of unique metrics employed in each set, and the total frequencies of use shown in Figure 10 together emphasize that more metrics are used in sets M2, M3, M4, M5, and M6. Thus, although fewer articles are involved, more diverse metrics were used.
Regarding the data presented in Figure 11, it is worth mentioning that the M1 group has a total of 19 studies that used a single metric, comprising two unique metrics (HV and IGD) with a total frequency of 19; 18 of these studies used HV, while the remaining one used IGD. In the same figure, the rest of the groups maintain a good diversity of metrics. This indicates that although few studies utilized multiple metrics, the total number of unique metrics they used was high. For example, the groups that employed three or more metrics used the highest numbers of unique metrics: M3 = 16, M4 = 11, M5 = 15, and M6 = 10 metrics.
RQ6: What is the number of objectives used in the studies? Figure 12 shows the number of objective functions formulated in the community. As mentioned above, there are MOPs with no more than three objectives (2 and 3) and MaOPs with more than three objectives. As shown in Figure 12, problems with two and three objectives are the most frequently formulated, while there is increasing interest in the community in formulating MaOPs compared to a previous review [19]. A number of studies also formulated different numbers of objectives within a single study, such as References [75,76,86,91,95,96,102,108,109,117,122–125,133,138,146,147]. Table 5 lists the references for these numbers of objectives.
To answer this question, we adopt the grouping made by the previous related studies [19,20]. Figure 13 shows the distribution of the studies over common software engineering areas. As shown in the graph, software testing is the most applied area, followed in decreasing order by software design, requirements, management, and verification. This also indicates that the SBSE community has applied search-based methods to many areas of SE, and we believe some are still not mature, yet they are gaining popularity in the community. On the other hand, some questions may arise regarding the popularity of software testing or design. There are several plausible explanations, some of which are historical. For example, early SBSE studies were on software testing, which may have opened research gaps or discoveries that prompted further investigation. Another explanation is that software testing is a natural fit for automation, and therefore for SBSE, while testing activities are also considered the most expensive in SE in terms of time, cost, and resources. Given this expense, practitioners might prefer to optimize conflicting objectives, while SBSE pioneers believe the metric richness of SE fields makes them a perfect fit for applying search-based methods. However, it is unknown whether testing and design have more metrics than other fields. Table 6 lists the references for applied areas in SE.
Table 6. List of references of the applied areas.
In this section, we show how the different performance metrics, based on the grouping (M0-M6), are distributed over the applied areas reported in Figure 13. Figure 14 shows that requirement-based studies are the fewest in the M0 group (studies that reported zero metrics), with only one study, while design-based publications are the most numerous in the same group, with a total of 17 studies. Both requirement and design studies appear consistently in the remaining groups (M1 to M6), except that design-based studies do not appear in the M5 group. On the other hand, studies in the management area are the second-lowest in the M0, M2, and M3 sets, but also appear in M4 and M5. Although testing studies are the most numerous overall and verification studies the fewest, testing is only the second highest in M0 and ranks second-lowest in M2 and M3; there are also studies in the M4 and M5 groups, but none in the M6 group. Generally, the graph shows a decline in the number of studies that employed more than two metrics.

Discussion and Conclusions
Performance metrics have been identified as having a promising role in assessing the quality of the solutions provided by evolutionary algorithms and in comparing those algorithms, thus becoming a key ingredient in supporting the preferences of decision-makers. In this paper, the aim was to show the current practice, that is, how SBSE practitioners use these metrics. We believe such findings will eventually highlight the possible sets of metrics, objective functions, or new areas in SE to explore in the future. To achieve this, we carried out a systematic review with a guided protocol to systematically plan, collect, and present the main results in detail. We defined the relevant research questions and conducted a manual search of a set of digital libraries to select the candidate papers. Inclusion and exclusion criteria were applied, and finally, the extracted data were stored in Excel worksheets. We then discussed the outcomes using tables and graphs to make the data easier to digest. As a result, the final 105 relevant publications revealed that (based on the groupings) several studies employed zero metrics (solver metrics). In the SBSE community, it is preferable to use more metrics, yet it is worth noting that only four studies employed six metrics, and four others used five metrics. In addition, the analysis identified the sets of metrics used in the studies and their ranks over the study groups. HV was the most widely used individual performance metric, while overall, HV, IGD, GD, and the hypercube-based diversity metric are top-ranked, in that order (with frequencies of 10 or more). Furthermore, there is increasing interest in the community in re-formulating MaOPs with more than three objectives, and software testing was the most applied area in software engineering.
Furthermore, we addressed some of the open issues found in our study, which relate to three main areas: performance metrics, number of objectives, and SE application areas. All of these issues should be addressed in the future.
Performance metric: We found that a remarkable number of studies did not employ performance metrics, while those that used two metrics increased in number, and the remaining studies, specifically in sets M3, M4, M5, and M6, used more diverse metrics. However, in the literature, most researchers did not agree on evaluating their algorithms with a high number of performance metrics, although they agreed that no single performance metric alone can assess all the qualities of the solution sets, since each metric can only target one or two desired properties.
Another issue is metric preference among researchers. We observed that some of the studies justify their choice of metrics either by a metric's popularity (i.e., its usage or the related work in which it was used) or by how well it fits their choice of algorithm (e.g., References [6,86,87]), while other studies adopted some of these metrics because they are hybrids. For example, HV can cover both convergence and diversity [87]. Some others avoided using more metrics because doing so might have led to different conclusions or threatened the validity of their results [122]; such gaps therefore remain for their future work [122]. Some avoid these metrics altogether, which leaves a visible gap for their future work [75]. Such practice makes it clear that the use of performance metrics has received little attention. Since current practice is dominated by the use of zero to two metrics, comparisons between algorithms might be unfair. Thus, we recommend employing more diverse sets of metrics, since they are currently used by only a small number of studies.
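To illustrate why HV is regarded as a hybrid metric, the following minimal sketch computes the hypervolume of a two-objective front under minimization, relative to a reference point. The function name `hypervolume_2d`, the three example fronts, and the reference point (5, 5) are hypothetical and chosen only for illustration; they are not data from the reviewed studies.

```python
def hypervolume_2d(front, ref):
    """Area dominated by `front` and bounded by `ref` (both objectives minimized)."""
    # Keep only the points that are strictly better than the reference point.
    points = sorted(p for p in front if p[0] < ref[0] and p[1] < ref[1])
    area = 0.0
    best_f2 = ref[1]
    # Sweep in increasing order of the first objective and add the new
    # rectangle contributed by each point that improves the second objective.
    for f1, f2 in points:
        if f2 < best_f2:
            area += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area


reference_point = (5, 5)               # hypothetical reference point
spread_front = [(1, 3), (2, 2), (3, 1)]  # well converged and well spread
shifted_front = [(2, 4), (3, 3), (4, 2)]  # same spread, worse convergence
clustered_front = [(2, 2)]              # converged, but no diversity

hv_spread = hypervolume_2d(spread_front, reference_point)
hv_shifted = hypervolume_2d(shifted_front, reference_point)
hv_clustered = hypervolume_2d(clustered_front, reference_point)
```

In this toy setting the well-converged, well-spread front obtains the largest hypervolume, while both the shifted front (poorer convergence) and the single-point front (poorer diversity) score lower, which is the sense in which one HV value reflects both properties.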
Other possible research gaps that deserve further investigation include complementary measures, such as statistical measures. The automated performance metrics produce sets of numerical values, and these data require further statistical analysis; however, it is debatable which statistical model best describes these data.
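As one possible illustration of such a statistical measure, the sketch below computes the Vargha-Delaney A12 effect size, a non-parametric statistic that estimates the probability that a value drawn from one sample is larger than a value drawn from another. It is shown only as an example and not as the recommended model; the function name `a12` and the hypervolume values are hypothetical and do not come from the reviewed studies.

```python
def a12(xs, ys):
    """Vargha-Delaney A12: estimated probability that a value in xs exceeds one in ys.

    Ties count as one half. A value of 0.5 indicates no difference.
    """
    greater = sum(1 for x in xs for y in ys if x > y)
    equal = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * equal) / (len(xs) * len(ys))


# Hypothetical hypervolume values from five repeated runs of two algorithms.
hv_algorithm_a = [0.71, 0.74, 0.69, 0.73, 0.72]
hv_algorithm_b = [0.66, 0.70, 0.68, 0.71, 0.65]

effect_size = a12(hv_algorithm_a, hv_algorithm_b)
print(f"A12 = {effect_size:.2f}")
```

An effect size of this kind is usually reported alongside a significance test rather than in place of one, because it describes the magnitude of a difference and not whether the difference could have arisen by chance.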
The number of objectives: We believe that advancing the current practice of defining objective functions in SBSE will eventually reveal new research gaps. There is increasing interest in the community in re-formulating MaOPs with more than three objectives. We found that not all research studies define new problems; some apply existing problems [102,122,125]. This depends on the objective of the paper: some papers intend to formulate new problems, while others only propose a new algorithm or compare existing algorithms, either by applying existing problem formulations or by considering a new one. However, this does not indicate that practitioners are relying on existing problem formulations, since the majority of them formulate new problems with several objective functions, and we have seen studies employing different numbers of objectives in a single study [75,76,86,91,95,96,102,108,109,117,122–125,133,138,146,147]. The practice of formulating a limited number of objectives suggests that practitioners are either facing difficulties in formulating more objective functions, which normally need a mathematical definition, or are defining a small number of objectives because these are less expensive and easier to handle. It is worth mentioning that the community lacks theoretical studies or discussions. Traditionally, the SBSE community formulated single-objective problems, and current practice is dominated by two to three objectives; this indicates that exploring a wider range of objectives remains an open issue.
Software engineering application areas: We observed that some of the applied areas in SE are less explored, such as requirements, management, and verification, while some areas are highly explored, such as software testing and software design. It is worth noting that the less explored areas suggest there are limited problems to solve in those areas, whereas the dominant areas are considered to have a greater diversity of problems to solve.
It is also interesting to address why some software engineering areas are less applied than others. Several factors can be linked to this. First, some areas of software engineering, such as software testing, are expensive in terms of cost and time, and decision-makers may therefore prefer to optimize under such constraints to obtain the best alternative solutions. Second, software testing was among the first areas to which SBSE practitioners applied search-based methods; over the years the discussion grew significantly, new research gaps became easier to find, and interest in addressing the resulting future work further contributed to its popularity. Third, software testing is easy to automate; automated problems are easy to measure, and such measurements are used to guide the fitness functions. SBSE pioneers argue that the software engineering field is rich in metrics, which makes many software engineering subfields suitable for re-formulation as multi-objective problems and for the application of search-based methods. However, future investigation is required to determine whether software testing and software design have more metrics than other subfields such as management, requirements, and verification. In addition, we recommend exploring the least applied software engineering fields and re-formulating their problems (e.g., formal methods).