Techniques for Calculating Software Product Metrics Threshold Values: A Systematic Mapping Study

Abstract: Several aspects of software product quality can be assessed and measured using product metrics. Without software metric threshold values, it is difficult to evaluate different aspects of quality. To this end, interest in research that focuses on identifying and deriving threshold values is growing, given the advantage of applying software metric threshold values to evaluate various software projects during the phases of their software development life cycle. The aim of this paper is to systematically investigate research on software metric threshold calculation techniques. In this study, electronic databases were systematically searched for relevant papers; 45 publications were selected based on inclusion/exclusion criteria, and the research questions were answered. The results demonstrate the following important characteristics of the studies: (a) both empirical and theoretical studies were conducted, the majority of which depend on empirical analysis; (b) the majority of papers apply statistical techniques to derive object-oriented metric threshold values; (c) Chidamber and Kemerer (CK) metrics were studied in most of the papers and are widely used to assess the quality of software systems; and (d) a considerable number of studies have not validated metric threshold values in terms of quality attributes. From both the academic and practitioner points of view, the results of this review present a catalog and body of knowledge on metric threshold calculation techniques. The results also set new research directions, such as conducting mixed statistical and quality-related studies, studying a more extensive set of metrics and the interactions among them, studying more quality attributes, and considering multivariate threshold derivation.


Introduction
Software engineering aims to determine and advance the quality of software throughout the software development life cycle. To achieve this aim, software metrics are applied to consider the qualitative aspects of software components quantitatively [1]. Arar and Ayan [1] observed that software metrics can measure software from different perspectives, but mainly embody the internal quality of software systems. Software metrics without threshold values do not help much in evaluating the quality of software components [2]. To tackle this problem, it is crucial that dependable threshold values be selected [3]. The application of metrics encompasses thresholds to ascertain whether a specific value is acceptable or abnormal and thus provides meaning to metrics, enabling them to become a decision-making device pre- and post-software release [4]. However, determining appropriate threshold values is a rigorous task, as thresholds obtained through correlation analysis hold only for a defined group of identical software systems [5]. This is further

Object-Oriented Metrics Thresholds
Despite the importance of object-oriented metrics in software engineering, they have not been adequately applied in the software industry [2,22,23]. The reason is that the thresholds of a majority of metrics are not identified. Thresholds are described as "heuristic values used to set ranges of desirable and undesirable metric values for measured software" [4]. Identifying these values is crucial because they allow the metrics to be applied to evaluate quality issues in software. Furthermore, without insight into these thresholds, one cannot answer questions such as "which class has the largest number of methods?" or "which method has a large number of parameters?" [24].
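The quoted definition treats a threshold as a boundary between desirable and undesirable metric values. A minimal sketch of that usage follows; the metric names are standard object-oriented metrics, but the numeric threshold values and the `flag` helper are purely illustrative, not values endorsed by any of the surveyed studies.

```python
# Minimal sketch: using a threshold to label a metric value as acceptable
# or abnormal, per the quoted definition. The threshold values below are
# hypothetical examples, not recommendations from the surveyed papers.
EXAMPLE_THRESHOLDS = {"NOM": 20, "CBO": 14, "WMC": 34}  # illustrative only

def flag(metric, value, thresholds=EXAMPLE_THRESHOLDS):
    """Return 'abnormal' when the value exceeds the metric's threshold."""
    return "abnormal" if value > thresholds[metric] else "acceptable"

# A class with 25 methods exceeds the example NOM threshold of 20.
print(flag("NOM", 25))
```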
In recent years, several studies have been performed to derive threshold values of object-oriented metrics. These studies use different methods to determine threshold values for various metric categories. Shatnawi used log transformation to derive threshold values for CK metrics [25]; alternatively, Malhotra and Bansal [26] used a statistical model derived from logistic regression to identify the thresholds of CK metrics. These are only two examples of the large number of research papers that will be analyzed and classified in the present study.

Related Work
There have been some secondary studies in the area of software metrics. Beranic and Hericko [3] presented a preliminary review of approaches for software metric threshold derivation from 14 primary studies. According to their observations, the majority of the studies proposed new approaches and generally applied statistical techniques for threshold derivation. In addition, they identified the programming languages for which the metrics were derived (i.e., Java, C, C++, and C#). They answered four research questions related to the approaches and methods used for software metric threshold derivation, the objective for which software metric thresholds are derived, and the available tools for metric collection and threshold derivation. In another study, Ronchieri and Canapro [27] provided a preliminary mapping study of software metric thresholds, which was limited to the Scopus database and two research queries: which papers are currently most important in the software metric threshold research community, and whether software metric threshold papers can be meaningfully aggregated. This study was a first attempt, but its scope was very limited in terms of topic and venue of publications (i.e., journals and conferences). Kitchenham conducted a preliminary mapping study of software metrics research to identify influential software metrics research published between 2000 and 2005. In addition, the author aimed to investigate the possibility of using secondary studies to integrate the research results [28].
Tahir and MacDonell [29] published a systematic mapping study of research on dynamic metrics. They identified and evaluated 60 papers, which were classified based on three different aspects: research focus, research type, and research contribution. Vanitha and ThirumalaiSelvi [30] performed a literature review to identify and analyze the metrics for some quality factors. In addition, their review aimed to provide a general overview of the software metrics to guide researchers in identifying which metrics can be applied to assess different software quality attributes. Despite their contributions to the field, none of these studies focused on object-oriented metrics, nor did they pay attention to identifying the threshold values.

Systematic Mapping Process
Systematic mapping studies are applied to structure a research area and to provide a classification (i.e., structure) of the types and results of the articles that have been published. The main steps of this systematic mapping research are: defining the research questions, conducting the search for relevant papers, screening papers, keywording abstracts, data extraction, and mapping [31]. Figure 1 illustrates the steps of a systematic mapping study. In this article, systematic mapping is applied to software quality in studies that focus on threshold values of object-oriented metrics.

Definitions of Research Questions
The main objective of the present systematic mapping study is to understand, clarify, and identify the threshold values of object-oriented metrics that are most used in the relevant published papers. Based on the area and application of object-oriented metrics, the research questions are formulated in Table 1.


Search Strategy
Based on the current trend in software metrics, we extracted keywords to query for primary studies. The essential keywords were initially identified as "threshold values", "object-oriented metrics", and "software metrics". To build search strings, we used different combinations of keywords with potential synonyms and the required logical operators. The resulting search strings were: "Threshold values" AND ("Object-Oriented metrics" OR "Object-Oriented measurements" OR "Software metrics"), and ("Threshold values of object-oriented metrics"). These search strings were applied in five digital libraries, namely IEEE Xplore (Institute of Electrical and Electronics Engineers), Science Direct, Springer, Wiley, and ACM. Papers published in the last 20 years were included.

Screening
In this phase, we focused on the title, abstract, and keywords. However, for some papers, the entire text had to be read. After this step, a set of inclusion/exclusion criteria was applied to select the relevant papers. The third column of Table 2 reports the articles that were initially selected. Table 3 indicates the selection criteria for articles; these are mainly inclusion and exclusion criteria used to select or reject an article for our study. Papers were included if:
•	They mention object-oriented metrics and thresholds in the title, abstract, or keywords;
•	The threshold values are the main topic in the study; and
•	They provide approaches and applications to calculate the threshold values.
Papers were excluded if:
•	They do not identify threshold values of object-oriented metrics;
•	They explain object-oriented metrics without focusing on thresholds; and
•	They are duplicate papers.

Papers on which reviewers disagreed were examined by two members of the team to reach a better decision on whether to include or exclude them. After the screening phase, the number of papers was reduced to 45 for final inclusion in this mapping study.

Classification Scheme
The papers were classified based on the general expert knowledge (i.e., attributes) and specific information extracted during the screening phase. We defined twelve classification attributes. The last column of Table 2 represents the number of final studies that were selected. The following information was recorded for each study as general properties:

•	Paper title, sorted from A to Z.
•	Year when the paper was published.

In addition to the classification of research works, this study also addresses the following issues in detail:
•	Threshold calculation methods: to determine the calculation methods used in each paper.
•	The quality attributes for threshold calculation methods: to determine the quality attributes targeted by each calculation method.
•	Metrics categories used: to identify which metrics sets are used in the research papers.
•	The most relevant and least relevant metrics: to list the metrics relevant and irrelevant to the method used in each study.
•	The types of projects used in the studies.

Extraction
The actual mapping of relevant papers to the categories of the classification scheme is executed in this phase. An Excel sheet was used to document the data extraction process based on the attributes provided in the classification scheme. When the reviewers filled in the data of a paper into the scheme, they provided a short rationale as to why the paper should be in a certain category (e.g., why the paper was considered to follow an empirical approach). From the final table, the frequencies of publications in each category could be calculated.

Final Pool of Research Studies
After the primary search and the analysis to exclude unrelated works, the pool of the selected papers was finalized to 45 studies as listed in Appendix A. The classification of the selected publications according to the classification scheme is described in Section 3.4.
We present the annual trend of the included studies and the number of studies according to the publication type. Figure 2 shows the annual trend of publications included in our mapping study. We can observe that the earliest research paper was published in 2000. Figure 3 shows the number of research studies based on the publication type (e.g., conference, journal, workshop). We can observe that most of the studies were published in conferences (45%), and the fewest appeared in workshops, book chapters, and theses (2%).
Table 4 shows the evaluation phases and the corresponding scope.

Quality Assessment
After primary papers were selected based on the study selection criteria, we performed a quality assessment for the selected studies. We used eight quality criteria derived from Kitchenham et al. [6] and another SLR study [32]. Each quality assessment question was answered based on the following options: yes = 1, no = 0, somewhat = 0.5. If the total score of the paper received was less than or equal to 4, it was excluded from the list because the quality threshold was set to a value of 4. All authors were involved in the quality assessment. When there was a conflict about the scoring per paper, an additional meeting was held. Figure 4 shows the quality distribution of selected papers. Details of the quality assessment are provided in Table A2 in Appendix B.
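The scoring scheme described above (yes = 1, no = 0, somewhat = 0.5, with papers scoring at most 4 excluded) can be sketched as follows; the `assess` function name and the example answers are hypothetical, not taken from the study's data.

```python
# Sketch of the quality-assessment scoring scheme described above.
# The function name and example answers are illustrative assumptions.
SCORES = {"yes": 1.0, "somewhat": 0.5, "no": 0.0}
QUALITY_CUTOFF = 4.0  # papers with a total score <= 4 were excluded

def assess(answers):
    """Score the eight yes/no/somewhat answers and decide inclusion."""
    total = sum(SCORES[a] for a in answers)
    return total, total > QUALITY_CUTOFF

# Example: five 'yes', two 'somewhat', one 'no' gives 6.0 -> included.
print(assess(["yes", "yes", "somewhat", "yes", "no", "yes", "somewhat", "yes"]))
```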
The quality criteria are listed as follows:


Systematic Mapping Process
The results are shown in this section, and each research question is answered in a separate subsection.


Q1 What Kind of Threshold Calculation Methods Exist in the Literature?
In these studies, a variety of threshold calculation methods have been used. We categorized papers into the following categories: programmer experience, statistical properties, quality-related, and review studies, as shown in Table 5.
Programmer experience: Sodiya et al. [33] proposed a survivability model for object-oriented design, aimed at fixing the problem of software degradation caused by the increasing growth in classes and methods; the model is based on a code rejuvenation technique triggered by a threshold value. Li and Shatnawi [34] found a positive correlation between source code and bad smells that were identified using programmers' thresholds; bad smells are considered patterns associated with bad programming practices and bad design decisions. Alan and Catal [35] presented an outlier detection algorithm using object-oriented metric thresholds. In another paper, Suresh et al. [36] used the numerical thresholds specified for every metric as given by Rosenberg and Stapko [37]. Catal et al. [13] applied clustering algorithms and metric threshold values to the software fault prediction problem.
Statistical properties: Vale et al. [38] compared three methods, from Alves et al. [7], Ferreira et al. [2], and Oliveira et al. [39], to calculate thresholds for software product lines. Paper [25] aimed to provide a technique to calculate thresholds in the SPL (software product line); the proposed method relies on data analysis and considers the statistical properties of the metrics. Herbold et al. [11] defined a data-driven methodology to calculate thresholds: they used machine learning, dividing the dataset into a training set (50%), a selection set (25%), and an evaluation set (25%), and proposed an algorithm called rectangle learning for calculating the thresholds. Alves et al. [7] designed a method that derives metric thresholds from measurement data with respect to the statistical properties of the metric. Shatnawi [25] used log transformation for metric threshold calculation: all metric data are transformed using the natural logarithm, and the mean and standard deviation of the transformed data are then used to derive thresholds. A statistical model derived from logistic regression (LR) was used in paper [40] to calculate object-oriented threshold values. Ferreira et al. [2] derived thresholds based on a statistical analysis of the data and threshold values commonly used in practice. In the paper by Singh and Kahlon [41], a logistic regression model was used to derive metric threshold values for the identification of bad smells. Benlarbi et al. [42] presented a statistical technique for calculating and evaluating thresholds of object-oriented metrics. Lavazza and Morasca [43] used three distribution-based statistical methods, namely adapted distribution-based (ADB), Alves et al. [7], and Vale et al. [38]. Shatnawi et al.'s [17] method selects, as the threshold, the value that maximizes sensitivity and specificity. Catal et al. [44], however, chose the threshold value that maximizes the area under the ROC curve and passes through the three points (0, 0), (1, 1), and (PD, PF), because a threshold value that maximizes both sensitivity and specificity might not always exist. Other researchers (Bansiya and Davis [45]; Martin [46]; Tang et al. [47]; Henderson-Sellers [48]) used the statistical properties of metrics to derive thresholds from a large corpus of systems. Lochmann [49] applied the benchmarking approach of Lanza and Marinescu [9] to determine thresholds. Veado et al. [50] implemented a threshold calculation tool based on the approaches of Alves et al. [7], Ferreira et al. [2], Oliveira et al. [22,39], and Vale et al. [38]. In Oliveira et al.'s [22] paper, the authors suggested the concept of relative thresholds to evaluate metrics that follow heavy-tailed distributions.
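The log-transformation technique attributed to Shatnawi [25] above (take natural logs, compute mean and standard deviation, derive a threshold) can be sketched as follows. The choice of mean + 1 standard deviation on the log scale and the back-transformation via the exponential are assumptions for illustration; the original paper's exact rounding and cut-off rules may differ.

```python
import math

def log_transform_threshold(values):
    """Sketch of a log-transformation threshold in the spirit of [25]:
    log-transform the metric values, compute mean and standard deviation
    of the transformed data, and set the threshold at mean + 1 std on the
    log scale, transformed back (back-transform rule is an assumption)."""
    logs = [math.log(v) for v in values if v > 0]  # log undefined at <= 0
    n = len(logs)
    mean = sum(logs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in logs) / n)
    return math.exp(mean + std)

# e.g. a right-skewed sample of WMC values collected from a corpus of classes
print(log_transform_threshold([1, 2, 2, 3, 4, 5, 8, 13, 40]))
```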
Quality related: A survivability approach was proposed in [26] for object-oriented systems to solve the problem of software degradation commonly caused by the increasing growth in classes and methods. In [17], the threshold values of the metrics were used to detect code bad smells; these thresholds were presented in Gronback [51]. Al Dallal applied a significance threshold (α = 0.05) in his studies of object-oriented metrics [24,52,53]; this threshold is used to determine whether a metric is a significant fault predictor. Fontana et al. [14] suggested a data-driven method that is repeatable and transparent and enables the extraction of thresholds; this method is also based on the metric's statistical distribution. Threshold values are derived for five different bad smells; for example, WMC has separate thresholds for the God class and the data class. Moreover, the receiver operating characteristic (ROC) curve is a technique that was used for the first time to identify object-oriented metric threshold values in [17]. Oliveira et al. [39] presented a tool to calculate relative thresholds using measurement data. Arar and Ayan [1] proposed a hybrid model that uses the approaches of Shatnawi et al. [17] and Bender [54]. Hussain et al. [55] used Bender's [54] approach to detect object-oriented threshold values. Catal et al. [44] proposed a new threshold calculation by modifying Shatnawi's threshold method [25]. Shatnawi and Althebyan [56] proposed a statistical power-law distribution to calculate threshold values. In the work of Singh and Kahlon [41], Shatnawi et al.'s [17] method, which is based on ROC analysis, was applied to calculate metric threshold values.
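Bender's approach [54], used by several of the studies above, derives a threshold from a fitted logistic regression model as the metric value at which the predicted risk reaches an acceptable level p0 (the "value of an acceptable risk level", VARL). A minimal sketch follows; the coefficient values are hypothetical, and the single-metric form is an assumption for illustration.

```python
import math

def varl(beta0, beta1, p0):
    """Sketch of Bender's VARL ('value of an acceptable risk level') [54]:
    given logistic-regression coefficients beta0 (intercept) and beta1
    (slope of the metric) and an acceptable risk level p0, the threshold
    is the metric value at which the predicted risk equals p0:
        VARL = (ln(p0 / (1 - p0)) - beta0) / beta1
    """
    return (math.log(p0 / (1 - p0)) - beta0) / beta1

# Hypothetical coefficients for one metric, with an acceptable risk of 5%.
print(varl(beta0=-3.5, beta1=0.1, p0=0.05))
```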
Bender [54] derived thresholds for change-proneness using the ROC technique. In the paper of Malhotra et al. [15], the threshold values obtained by Shatnawi et al. [17] were used as-is.

Review studies: Beranic and Hericko [3] performed a review study that showed the importance of software metrics in software engineering. Ronchieri and Canapro [27] provided a preliminary mapping study of software metric threshold calculation papers based on the Scopus database.
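The ROC-based selection used by Shatnawi et al. [17] picks, from the candidate cut-offs, the one that maximizes sensitivity and specificity. A minimal sketch under stated assumptions follows: the strict "value above cut-off predicts faulty" rule and the tie-breaking (first maximum wins) are illustrative choices, not details confirmed by the paper.

```python
def roc_threshold(values, faulty):
    """Sketch of ROC-based threshold selection in the spirit of [17]:
    for each candidate cut-off, classes with a metric value above it are
    predicted faulty; return the cut-off maximizing sensitivity +
    specificity (tie-breaking rule is an assumption).
    values: metric value per class; faulty: matching booleans."""
    positives = sum(faulty)
    negatives = len(faulty) - positives
    best_t, best_score = None, -1.0
    for t in sorted(set(values)):
        tp = sum(1 for v, f in zip(values, faulty) if v > t and f)
        tn = sum(1 for v, f in zip(values, faulty) if v <= t and not f)
        score = tp / positives + tn / negatives  # sensitivity + specificity
        if score > best_score:
            best_t, best_score = t, score
    return best_t

# Perfectly separable toy data: faulty classes all have values above 10.
print(roc_threshold([2, 4, 6, 10, 15, 20], [False, False, False, False, True, True]))
```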
The review of the selected papers yielded a taxonomy of three types of research in threshold calculation, as shown in Table 5. The taxonomy is based on the evolution of threshold studies. In early works, researchers considered metrics as simple measures of quality and therefore proposed thresholds based on their experience; however, these thresholds were not shown to have any link with quality. A later set of studies began to link thresholds to quality attributes such as fault-proneness and change-proneness. Although these studies found such a link, they could not report consistent thresholds across different systems. Researchers therefore proposed deriving thresholds from a large corpus of systems. Linking thresholds to a quality attribute is limited by the number of systems that report bug data, and the use of prediction models was deemed complex and time-consuming on a large corpus. To resolve this complexity, the statistical properties of metrics were proposed as a means of finding thresholds from a corpus of systems.

Q2 What Are the Quality Attributes for Threshold Calculation Methods?
Different quality attributes have been addressed by threshold calculation methods in the research studies. As shown in Table 6, we divided them into five categories: fault detection, bad smell detection, reuse-proneness, other design problems, and none (when no quality attribute was targeted). The method suggested by Vale et al. [38] has various applications, including fault prediction and bad smell detection. Software system design, maintenance, and testing are the quality attributes for the object-oriented metrics in the paper by Shatnawi [62], which proposes a method based on regression models. Li and Shatnawi [17] studied the evolutionary design process and investigated the relationships between bad smells and class-error probability. Alan and Catal [35] provided an outlier detection algorithm used to improve the performance of fault predictors in the software engineering domain. The application domains for the threshold calculation method in the paper by Fontana et al. [14] are software quality and maintainability assessment, and the method is especially useful for code smell detection rules. The approach of Herbold et al. [11] can help determine the sources of code discrepancies.
Al Dallal introduced a new cohesion metric (SCC) for software; this metric allows design artifacts to be evaluated at early phases. Moreover, in another research article [52], Al Dallal proposed a model to predict classes that require Extract Subclass Refactoring (ESR) and presents these classes as candidates for improvement. In 2011, this author published two papers [24,53]. In the first, he defined the values of cohesion metrics, which considerably improved the detection of fault-prone classes. In the second, he extended the definition of the lack of cohesion (LCOM) metric as transitive LCOM (TLCOM), which improves its quality in predicting faulty classes and in indicating class cohesion. As for industrial applications, the thresholds derived with the method in the paper by Alves et al. [7] have been successfully utilized for software analysis, benchmarking, and certification.
Shatnawi [25] proposed an approach using log transformation to reduce data skewness; log transformation affects how metrics are used in assessing software quality, such as in threshold derivation and fault prediction. In another study, Shatnawi et al. [17] identified threshold values of CK metrics that classify classes into several risk levels. Suresh et al. [36] analyzed a subset of object-oriented metrics as well as traditional metrics to compute system reliability and complexity for real-life applications. The prediction of fault-prone classes based on object-oriented metric threshold values in the earlier phases of the software development life cycle was the subject of study by Malhotra and Bansal [26]; this early prediction helps developers to allocate the necessary resources accordingly.
Ferreira et al. [2] suggested threshold values that identify problematic classes with respect to design flaws. The work in [41] derives metric threshold values against bad smells using risk analysis at five different levels. The authors in [39] presented their tool (RTTOOL) as a way to calculate metric thresholds. Benlarbi et al. [42] applied statistical testing to examine whether thresholds of OO measures have an effect on fault prediction and discussed principles for dealing with such metrics.
Lavazza and Morasca [43] showed that thresholds derived with distribution-based techniques are not practical for identifying fault-prone components. Vale et al. [38] defined metric threshold values to detect God classes in software product lines. Arar and Ayan [1], Hussain et al. [55], and Shatnawi and Althebyan [56] derived metric thresholds for predicting faults in Eclipse projects. Catal et al. [44] used threshold values to identify noisy modules in software fault prediction datasets. Singh and Kahlon [41] aimed to predict faulty classes. Prioritization of classes for refactoring was targeted in the paper by Malhotra et al. [15]. Lochmann's [49] research focused on quality assessment. Veado et al. [50] proposed a practical tool to measure threshold values in software projects. Oliveira et al. [22,39] aimed to control and monitor technical debt.
Malhotra and Bansal [40] replicated the derivation of thresholds using ROC curves on the change-proneness of modules; the derivation and validation of thresholds were confirmed on five versions of the Android operating system. Filo et al. [58] used a cumulative relative frequency graph to identify thresholds; the technique was validated on the detection of large classes (one of the code bad smells). Stojkovski [23] selected the mean as the threshold value if the distribution is normal; otherwise, percentiles (70% and 90%) were used to select thresholds from a large corpus of Android applications. Vale et al. [63] proposed a benchmark-based threshold identification technique for a large corpus of systems; the authors identified thresholds at different risk assessment levels (i.e., very low, low, moderate, high, and very high), and the thresholds were used to detect two bad smells (i.e., large classes and lazy classes). Mori et al. [60] derived thresholds from a large benchmark composed of more than 3000 systems, flagging the top-ranked modules (above the 90% and 95% ranks) as problematic; the problematic modules were considered bad smells. Boucher and Badri [64] replicated the use of ROC curves, VARL, and Alves rankings, comparing the performance of these techniques in detecting faults in twelve open-source systems; the results showed that the ROC curve outperforms the others in fault detection. Table 6 summarizes the studies by quality attribute; as shown there, fault detection was the most studied quality attribute.
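Several of the studies above (e.g., Stojkovski's 70% and 90% percentiles, and the benchmark-based risk levels of Vale et al.) derive thresholds as quantiles over a benchmark of measured entities. A minimal sketch follows; the nearest-rank interpolation rule and the example data are assumptions for illustration, not the exact procedure of any single study.

```python
import math

def percentile_thresholds(values, percentiles=(70, 90)):
    """Sketch of percentile-based threshold derivation over a benchmark,
    as in several studies above (e.g., 70% and 90% cut-offs). Uses the
    nearest-rank method; the interpolation rule is an assumption."""
    ordered = sorted(values)
    n = len(ordered)
    result = {}
    for p in percentiles:
        rank = max(1, math.ceil(p / 100 * n))  # nearest-rank percentile
        result[p] = ordered[rank - 1]
    return result

# WMC values collected from a hypothetical benchmark of classes.
print(percentile_thresholds([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
```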

Q3 Which Metrics Combinations Have Been Used?
In the research papers collected in this study, various categories of metrics are proposed to evaluate software systems. Paper [25] uses four different metrics that belong to the CK metrics and other traditional metrics; these metrics capture different attributes, i.e., size, coupling, complexity, and refinement. In addition, papers [16,23,24,26,36,49,57,58,60,61,63,65,66] utilized CK metrics for analysis and measurement and identified threshold values for these metrics. The McCabe complexity metrics were applied in papers [7,18,26,41,63]. The work in [17] applied six of the bad smells identified by using a metrics model proposed by Gronback [51].
Al Dallal [24,35,53] focuses on cohesion metrics. Moreover, he used more than one category in another study [52]: existing size, cohesion, and coupling metrics. Paper [14] applied cohesion and complexity metrics, which include a code-smell metrics set.
In [11], metrics for maintainability and classes were identified. CK metrics were also used in paper [25]. The CK metrics suite was additionally used to compute system reliability in another research paper [36]; this paper also used Martin's metrics suite and traditional metrics such as cyclomatic complexity to compute software complexity. Because CK metrics address several aspects of software design, Malhotra et al. [26] used them as independent variables. Apart from the CK metrics suite, "lines of code" (LOC) is used to measure size because it is the most common metric for this purpose. Shatnawi et al. [17] used 12 metrics from different categories, including CK metrics, Li's metrics [20], and Lorenz and Kidd's metrics [4]. These metrics are categorized as inheritance, coupling, cohesion, polymorphism, class-size, and class-complexity metrics. Six object-oriented software metrics were studied in [2]. Paper [41] selected a set of metrics characterized as inheritance, coupling, class complexity, cohesion, class size, information hiding, and encapsulation. Metrics with heavy-tailed distributions were used in [39], which presented a tool called RTTOOL for deriving relative thresholds.
Research studies on thresholds seek a link between metrics and software quality attributes. Researchers validate metrics to find this link in two scenarios: individually and in combination. Combinations of metrics have consistently been found better at predicting quality; therefore, researchers have validated many combinations to find the smallest set of metrics that provides the best prediction performance. The CK metrics are one of the best metric combinations for predicting quality. Research on thresholds replicates studies on fault prediction, and researchers therefore select a combination of metrics to validate their threshold calculation methodology. The CK metrics are the most frequently reported metrics in the quality-prediction literature, and this is reflected in threshold studies. This question reports the most commonly used metric combinations. The number of metrics used in research is very large, and some metrics have variations, so it is difficult to report each metric individually. We have therefore summarized metric combinations based on the use of CK metrics in Table 7. Table 7. A summary of CK metrics suites.

Q4 What Are the Types of Studies?
We observed that not all studies are empirical. Some researchers identified thresholds based on experience; for example, McCabe identified thresholds this way. Empirical studies evaluate existing thresholds or methods for their calculation, while theoretical studies explain and review issues related to software engineering and software metrics thresholds in any form. Other researchers, such as Malhotra and Bansal [40], studied thresholds of CK and other metrics in predicting the change-proneness of code. Filó et al. [58] studied the use of CK and size metrics thresholds in detecting code bad smells. Stojkovski [23] derived thresholds for only five metrics (NOM, RFC, CBO, NOC, DIT), four of which are from the CK suite. Vale et al. [63] derived and validated metrics (i.e., LOC, CBO, WMC, NCR) on the detection of bad smells. In [18,60], thresholds of the CK and LOC metrics were also derived and validated for bad smell and fault detection.
Given the nature of object-oriented metrics, which rely on measurement and practical methods, most papers in this study are empirical. Other papers are theoretical or review in nature, while some include both empirical and theoretical content. Table 8 categorizes the paper types into empirical, both empirical and theoretical, and theoretical and review. Due to space limitations, we did not list all papers under these categories; our aim was to show the available categories. This question is answered with respect to the metrics most closely connected or appropriate to each model or algorithm presented in the papers. Metrics are considered relevant if the research questions are answered based on them. The relevant and irrelevant metrics are listed in Table A3 in Appendix C. All studies included one or more of the many software metrics proposed and validated in the literature; for example, the CK metrics have been studied in many research papers.

Discussion
• Q1 (Type of threshold calculation methods): Studies of thresholds were categorized into four groups: programmer experience, statistical properties, quality-related, and review studies. The results for this question show that the most common approach to identifying threshold values for object-oriented metrics is statistical analysis [7,43,59]. Additionally, log transformation and receiver operating characteristic (ROC) curves are two other methods used in some studies [17,44]. The number of studies that use purely statistical techniques to derive thresholds is the largest among all groups (i.e., 22 studies). In addition, quality-related studies were conducted in many papers [2,17,26]. The use of programmer experience for threshold identification was considered but limited to a few studies [33][34][35]. There were some review studies on threshold derivation; however, these studies were not as comprehensive as ours and did not answer our research questions.
Statistical methods are the most used among these techniques because they depend on data that can be collected easily using static analysis tools, which are widely available. However, few researchers compare multiple techniques (a notable exception is [50]). Therefore, there is a need for research that puts all previous techniques in one study, on the same datasets, to obtain a fair comparison among them.
Quality-related techniques require collecting extra data, such as faults, changes, and code refactorings, over the lifetime of software. This extra information is not always available and ready to use for most software systems. Collecting quality information requires additional effort and cost, and developers may not have time to record it in the code repository. Such data are mostly available in the open-source domain, which also suffers from problems such as incorrectly recorded or missing information. Therefore, statistical techniques are more common than quality-related techniques. We note that no technique mixes a large corpus of systems, as in statistical techniques, with quality data, because collecting quality data for a large corpus of systems is time-consuming. Future work can focus on building tools that collect quality data and static metrics at the same time for a large corpus of systems.
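To make the ROC-based derivation mentioned above concrete: a common approach is to scan candidate thresholds and keep the one maximizing Youden's J (TPR − FPR) against known fault data. The sketch below shows that general idea, not the exact procedure of any cited study; the data at the end are fabricated for illustration, and the function assumes both faulty and non-faulty modules are present.

```python
import numpy as np

def roc_threshold(metric_values, faulty):
    """Return the threshold t maximizing Youden's J = TPR - FPR,
    where 'metric >= t' predicts that a module is faulty."""
    x = np.asarray(metric_values, dtype=float)
    y = np.asarray(faulty, dtype=bool)
    best_t, best_j = None, -1.0
    for t in np.unique(x):          # each observed value is a candidate cutoff
        pred = x >= t
        tpr = pred[y].mean()        # true positive rate at this cutoff
        fpr = pred[~y].mean()       # false positive rate at this cutoff
        if tpr - fpr > best_j:
            best_j, best_t = tpr - fpr, float(t)
    return best_t, best_j

# Fabricated example: faulty modules tend to have higher metric values
metric = [1, 2, 3, 4, 10, 11, 12, 13]
faults = [0, 0, 0, 0, 1, 1, 1, 1]
```

In a quality-related study, the fault labels would come from the version history or issue tracker, which is exactly the extra data-collection cost discussed above.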

• Q2 (The quality attributes for threshold calculation methods): The papers in this study were categorized by quality attribute as follows: fault detection, bad smell detection, design problems, reuse-proneness, and no application. According to Seliya et al. [73], fault detection is important for assuring the desired software reliability and vital for meeting the anticipated cost and schedule. Reviews on software fault-proneness have shown the significance of this quality attribute for software quality (Kitchenham [28]; Hall et al. [74]; Rathore and Kumar [75]; Radjenovic et al. [76]).
Smell detection and refactoring also have an important effect on software quality, and many research papers discussed their effect on software quality (Fernandes et al. [77]; Zhang et al. [78]; Rasool and Arshad [79]). These two quality attributes are the most reported ones in our mapping study.
Predicting faulty classes based on threshold values received attention in these research papers as a way to improve the quality of class design (i.e., 16 studies). Additionally, threshold values have been used to separate non-faulty classes from risky ones. Some papers focused on detecting bad smells as a methodical approach to classifying problematic classes (i.e., 9 studies). A considerable number of studies have not validated thresholds on quality attributes (i.e., 12 studies) [3,7,13,22,23,27,36,39,49,50,56,57].
Collecting quality data is a time-consuming activity, and open-source systems do not provide enough information to find, for example, the fault-prone modules. The refactoring of code is not systematic in most systems and depends greatly on the effort available for refactoring. Developers do not publish enough information on code refactoring and software processes. Therefore, collecting information about faults and code refactoring is often not feasible; moreover, mixed models that depend on two or more quality attributes have not been studied, as the selected studies show. The results of threshold derivation have yielded different values for the same metric across techniques and across systems, indicating inconsistencies among techniques and among different software systems. Several factors affect consistency in threshold derivation. Techniques that target more than one quality attribute are a potential future research direction that may yield more accurate and consistent thresholds for software metrics.
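As one example of a quality-related technique, the VARL approach (replicated by Boucher and Badri [64], as noted earlier) derives a threshold from a univariate logistic regression relating a metric to fault-proneness: the threshold is the metric value at which the model predicts an acceptable fault probability p0. The sketch below shows only the closed-form step; the coefficients are hypothetical and not taken from any study.

```python
import math

def varl(beta0, beta1, p0=0.05):
    """Value of an Acceptable Risk Level: solve
    p0 = 1 / (1 + exp(-(beta0 + beta1 * x))) for the metric value x."""
    return (math.log(p0 / (1.0 - p0)) - beta0) / beta1

# Hypothetical coefficients from fitting fault data against, e.g., WMC
threshold = varl(beta0=-3.2, beta1=0.15, p0=0.05)
```

In a full pipeline, beta0 and beta1 would be estimated from the fault data discussed above; the choice of p0 (here 5%) sets how conservative the resulting threshold is.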
• Q3 (Metrics categories): Answering Q3 shows that a well-known suite of metrics (CK) has been used in most research papers [25,35,62]. Among the roughly 300 metrics found in the literature (Nuñez-Varela et al. [80]), most studies proposed and validated the CK metrics, as opposed to other metrics such as LOC as a size measure. Srinivasan and Devi [81] observed that metrics are important both for software currently under development and for its future improvement. These metrics are applied to measure different quality factors, such as complexity. Some papers focused on only one of these factors, such as cohesion. In addition to the CK metrics, Martin's metrics, McCabe's complexity, Li's metrics, and Lorenz and Kidd's metrics have also been used in some studies [2,39,41]. A number of studies have not considered the CK metrics at all [14,24,52]; these studies derived thresholds for non-object-oriented systems. However, Shatnawi et al. [17] used 12 metrics from different categories, including CK metrics, Li's metrics [20], and Lorenz and Kidd's metrics [4].
There are inconsistencies in the threshold values identified for the same metric across different software systems and different metrics analysis tools. The definitions of some metrics are not applied consistently across tools, and researchers used different tools, which affects the consistency of thresholds derived for the same system. There is a need to compare metric thresholds using the same analyzer on several systems to determine whether the inconsistencies come from metric definition and collection or from the technique itself. In addition, a large number of metrics in the research literature have not been considered for threshold derivation, which is also potential future research. Threshold derivation techniques should be compared and validated on the same software and using the same tool.
• Q4 (The types of studies): The result of this question was as expected because object-oriented metric thresholds form a practical area based on calculation and experimentation. Consequently, most of the papers included in this research are empirical in nature [7]. However, the types of study also include combined empirical and theoretical studies, as in Ronchieri and Canaparo [27]. Review papers, such as Beranic and Hericko's study [3], further show that this is an active area of interest among researchers.
Thresholds were proposed early, together with metric proposals and definitions. For example, some researchers used percentiles to identify God classes because they were selecting a portion of the system for investigation. However, as connections between metrics and many quality attributes emerged, advanced threshold techniques became a requirement. Thresholds need to be validated, and empirical studies are the most suitable setting. There is a need for comprehensive research on threshold identification that considers all possible techniques and metrics analysis tools to reach valid and more accurate threshold values. Threshold values should be validated both theoretically and empirically. In addition, empirical works should include other important quality attributes such as reusability, maintainability, security, and performance. Furthermore, few research studies have considered multi-valued attributes such as bug severity and priority, how they affect threshold identification, and whether binary classification is better than multivariate classification.
• Q5 (Relevant metrics): We observed that many of the papers did not report the relevant metrics. However, some studies clearly explain whether the metric threshold values were calculated properly. Table A3 in Appendix C lists these metrics per paper.
There are plenty of metrics for measuring internal code properties such as size, inheritance, cohesion, and coupling. Coupling metrics were the most correlated with quality attributes. However, how these metrics interact and how they affect threshold derivation was not studied in previous works. There is a need to understand the relationships between metrics. Although some studies have considered the strong effect of size, more studies are needed on, for example, the interaction between coupling and other metrics. It is well known that low coupling and high cohesion are preferred for systems' quality; however, how each metric affects the thresholds of the other has not been considered in the literature.

Threat to Validity and Limitations
Five well-known electronic databases were investigated to find papers that focus on threshold-related studies: IEEE Xplore, ScienceDirect, ACM Digital Library, Springer, and Wiley. Variations inherent to systematic mapping and experimental studies can impose limitations on our study.
The papers searched in this work are limited to well-known digital libraries; therefore, there is a possibility of missing relevant papers. The authors focused on well-known databases and used synonyms related to threshold values of object-oriented metrics. Threshold calculation papers are usually related to fault prediction papers, and there is no complete separation between threshold papers and quality-related papers.
Bias over publications: There is a human tendency to report favorable outcomes over adverse outcomes in every publication. Negative results have a lower chance of being accepted for publication than positive ones.
We did not limit our search to particular analysis and selection techniques; we searched all types of work related to this review, irrespective of their specific data or methods. The studies selected for the review come from different periods in which various institutions and practitioners utilized such methods. The broad acceptance of these methods influences researchers to use them.
Searching and selecting of studies: Our strategy was designed to find as many articles as possible. We built a wide range of search strings over the digital databases. Although the search outcomes included many irrelevant articles, we kept the search string fixed so as not to miss any relevant papers. We considered adjusting the search string to diversify our search results. We applied inclusion and exclusion criteria to choose relevant papers, and three researchers examined the selection and rejection criteria. Therefore, the possibility of incorrectly excluding a paper was minimized.
Quality assessment: Usually, quality assessment is conducted by one researcher and validated by another. In our review, however, the second researcher performed both data extraction and quality assessment. Most bias was expected in quality assessment because answers to the research questions may be hard to justify, and the scores of articles may vary from one researcher to another.

Implications for Researchers and Practitioners
Researchers and practitioners can focus on building tools that combine collecting quality data and at the same time static metrics for a large corpus of systems.
Threshold techniques for more than one quality attribute are potential future research that may give accurate or consistent thresholds for software metrics.
A large number of metrics exist in research and are not considered for threshold derivation; researchers can focus on this direction. Threshold derivation techniques should be compared and validated on the same software and using the same tool.
There is a need for more studies on evaluating threshold values in the context of the interaction between coupling and other metrics.
There is a need for industrial studies and cases to know practitioners' perspectives on threshold calculation techniques.

Conclusions and Future Work
In this paper, we performed a systematic mapping study on software metric threshold calculation techniques and presented our results. The results show that most of the papers were published in recent years, which means that the field of identifying object-oriented metric threshold values has received more attention recently and is a very active research area among software engineering researchers. The results also demonstrate that most of the methods proposed to calculate thresholds are based on statistical analysis techniques. Another finding is the wide use of the CK metrics in different projects. The use of object-oriented metric threshold values was investigated for a variety of quality attributes and domains, such as software design, software maintenance, and industrial and medical applications. The study found that threshold derivation was not comprehensive in most studies and that further studies are needed to fill the gaps: threshold derivation research has covered neither many quality attributes nor a large number of metrics, and thresholds for severity and priority have not been investigated.
For future work, we recommend extending this study to include papers published after our screening period and to cover other search engines and electronic databases. In addition, a systematic literature review of this field, based on specific research questions, can be conducted on the limitations of software metrics.

Conflicts of Interest:
The authors declare no conflict of interest.
Appendix A
Table A1. Identified Papers in this Systematic Mapping Study.