Software Productivity in Practice: A Systematic Mapping Study

Practitioners perceive software productivity as one of the most important subjects of software engineering (SE) because it connects technical to social and economic aspects. Nonetheless, software processes are complex and productivity means different things to different people. In order to realize the full contribution of software productivity research to the industrial practice of SE, the analysis and synthesis of existing practitioner viewpoints and concerns are required. A systematic mapping study is developed here to investigate the existence of diverse empirical perceptions of productivity within the distinct business sectors and knowledge areas covered by the industrial practice of SE, also identifying the commonalities among them. This study adopts the DBLP and Scopus search engines to identify bibliographic references from 1987 to 2021 related to software productivity. References that do not correspond to complete articles published in peer-reviewed journals and proceedings, or that were later subsumed by other publications, are excluded from the analyses. Only papers reporting on empirical studies based on software industry data or that present industry practitioner viewpoints are included in these analyses. In total, 99 papers are analyzed. The mapping found great variability in study findings, particularly concerning the impacts of agile development practices on software productivity. The systematic mapping also drew methodological recommendations to help industry practitioners address this subject and develop further research.


Introduction
Practitioners perceive software productivity as one of the most important subjects of software engineering (SE), because it connects technical to economic aspects. Ever since the early studies on this subject [1], software productivity measurement has considered the costs of employed personnel, equipment and third-party components as possible inputs, whereas source code, specifications and other produced software artifacts are regarded as possible outputs. However, recent studies point out that the concerns of industry practitioners regarding software productivity go far beyond technical and economic aspects and also embrace social aspects [2,3], such as affects [4], daily practices [5], teamwork [6] and job definitions [7].
Nonetheless, software productivity is not straightforward to understand, since software processes are complex per se [8] and there are complex interactions between process steps, such as requirements engineering and software design [9], and between systems and software. Moreover, software productivity means different things to different people [10] and various terms are used to denote the same productivity factors [11]. Consequently, the meaning of software productivity varies according to perspective and context [12]. In order to realize the full contribution of software productivity research to the industrial practice of SE, it is necessary to analyze and synthesize the existing practitioner viewpoints and concerns. This paper develops a systematic mapping study to investigate the existence of diverse empirical perceptions of productivity within the distinct business sectors and knowledge areas (KAs) covered by the industrial practice of SE, also identifying the commonalities that exist among them. This study is a replication, refinement and extension of an earlier systematic literature review considering a different time frame and methodology [13]. It is noticeable that, since then, relevant studies have been published that significantly impact review findings. Moreover, by revisiting the original research goals of the systematic literature review (producing a broad overview of the subject area, providing research evidence and quantifying this evidence) and adopting an enhanced methodology, the present research effort can be characterized as a systematic mapping study, according to the criteria suggested in [14].
The present study was developed considering the recommendations formulated by Kitchenham and Charters in [14] and the PRISMA methodological guidelines [15] (which prescribe the adoption of the GRADE system [16]). This study adopted the DBLP and Scopus search engines to identify bibliographic references related to software productivity from 1987 to 2021. A total of 99 papers reporting on empirical studies, published in peer-reviewed journals and proceedings, were analyzed. Papers were classified according to their authors' attributes, covered business sectors and KAs, types and goals of reported studies, as well as studied productivity measures. These data were tabulated and study findings analyzed and synthesized.
The distinctive characteristics of the reported research derive from the decision to analyze only empirical studies conducted with software industry data or presenting industry practitioner viewpoints. This design appeared to be adequate because the main goal of the present study is to contribute to the SE industrial practice and, in general, the outcomes of productivity studies are relatively distinct in industrial settings [3,17]. In these environments, empirical studies assume a high degree of relevance since studied settings are not artificial and software developers are professionals [18]. This study is also original and important because it identifies the historical evolution of the software productivity field, contributes to the body of evidence and draws recommendations to help industry practitioners in addressing this subject and developing further research. This paper is organized as follows: Section 2 describes the research methodology; Sections 3 and 4 present the analyses of primary and systematic indirect (secondary and tertiary) studies, respectively; Section 5 presents the findings and recommendations of the present study; and Section 6 discusses the existing validity threats. The last section presents some prospects for future research (Section 7).

Systematic Mapping Methodology
This section describes the adopted literature review protocol and systematic mapping methodology, which follow the guidelines presented in [14,15].

Context Definitions
The focus of the present study is the dependent variable of software productivity. The objects of this study are software engineering processes and organizations wherein productivity can be addressed. The studied subjects are software professionals that conduct software processes and are affiliated with software organizations. Independent variables that capture factors affecting software productivity are also studied, although they are not the strict focus of the present research.
Interventions in software processes that may have cause-effect relationships with productivity are investigated here. Interventions are approaches to software productivity that have the following ultimate goals (as suggested in [18]): observing, analyzing, describing, understanding, predicting and acting on productivity.
In practice, software processes have inputs (observed through independent variables), may receive interventions that result in outcomes, and produce outputs (observed via dependent variables). Outcomes and outputs are connected to interventions and inputs through construct validity. Depending on the studied context, software productivity may have diverse confounding factors, such as developer affects [4] or knowledge [7], making it impossible to distinguish the effects of two interventions from each other.

Systematic Mapping Definitions
Industry practitioners are SE professionals affiliated with private or public administration organizations (software industry). They are essentially distinct from academic practitioners, affiliated with universities and research centers, who are not studied here.
The empirical studies addressed in this systematic mapping are detailed in published papers. The systematic mapping deals both with primary and indirect studies. Primary studies report on scientifically investigating research objects and subjects, whereas indirect studies incorporate results from previous studies in the analysis. Primary and indirect studies are classified as case studies, experiments, simulations, surveys and reviews, possibly using qualifiers. Table 1 presents a detailed definition of this classification.
Table 1. Categories of Empirical Studies Analyzed (adapted from [18]).

Study Type: Description

Case Study
Adopts research questions, hypotheses, units of analysis, logic linking data to hypotheses and multiple criteria for interpreting the findings. If some of these requirements are not satisfied, it is considered an exploratory case study. It is called a case-control study if comparisons are drawn between a focus group and a control group, which has not suffered any intervention.

Experiment
Adopts random assignment(s) of interventions in subjects, large sample sizes, well-formulated hypotheses and the selection of (an) independent variable(s), which is (are) (randomly) sampled. If all these requirements are satisfied, it is considered a controlled experiment; otherwise, it is a quasi-experiment.

Simulation
Adopts models to represent specific real situations/environments or data from real situations as a basis for setting key parameters in models. If the model is used to establish the goal(s) of (an) objective function(s), it is called an optimization model.

Survey
Proposes questions addressed to participants through questionnaires, (structured) interviews, online surveys, focus group meetings and others. Participants may also be approached in a census process or according to random sampling.

Review
Incorporates results from previous studies in the analysis. If the subjects are papers, it corresponds to a literature review. If a well-defined methodology is used to collect references, critically appraise results and synthesize their findings, it is called a systematic literature review. If the purpose is to provide a broad overview of a subject area, mapping the distribution of objects across a conceptual structure, it is called a systematic mapping. If statistical analysis methods are adopted, it is regarded as a meta-analysis.

Review Question Formulation
The goal of the reported research, formulated according to the Goal Question Metric (GQM) methodology [19], is to study the software productivity literature with the purpose of analyzing and synthesizing this subject area, taking into account the diverse underlying notions and definitions that exist across the business sectors and KAs covered in industrial practice by empirical SE. The following research questions are derived from this goal:

Bibliographic Reference Search Strategy
DBLP (dblp.org, accessed on 24 April 2022) [20] has been used as the main tool to obtain bibliographic references for this study, since it is an open and curated tool that covers most of the sources of published scientific research on SE, including publications in the ACM and the IEEE Computer Society Digital Libraries.
The originally adopted search criteria were to find "productivity" in the paper title and "software" either in the paper title or in the publication title (proceedings or journal name). Since DBLP allows the formulation of search queries with implicit conjunctive connectives, the previous review was produced using the search string "software productivity" to obtain all references matching both keywords. However, the most recent DBLP queries missed a few previously recovered references. The root cause of this lack of repeatability was the change implemented by some bibliographic reference suppliers in the presentation style of journal names. Instead of exporting full journal names to DBLP, some publishers now only export abbreviated names. Consequently, the search string has been modified accordingly to "softw productivity".
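The search criteria described above can be read as a simple predicate over reference records. The following sketch is illustrative only: the function name, the record fields (`title`, `venue`) and the sample records are assumptions introduced here, and the code does not reproduce the DBLP search engine or the tooling used in this study. It merely shows why matching on the truncated token "softw" also recovers references whose journal names are exported in abbreviated form.

```python
def matches_search(title: str, venue: str) -> bool:
    """Return True when a reference satisfies the search criteria.

    Criteria taken from the text: "productivity" must occur in the paper
    title, and "software" must occur in the paper title or in the
    publication title. The prefix "softw" is matched instead of "software"
    so that abbreviated venue names are also recovered.
    """
    title_l = title.lower()
    venue_l = venue.lower()
    has_productivity = "productivity" in title_l
    has_software = "softw" in title_l or "softw" in venue_l
    return has_productivity and has_software


# Hypothetical reference records, for illustration only.
records = [
    {"title": "Measuring Productivity in Teams", "venue": "IEEE Trans. Softw. Eng."},
    {"title": "Software Productivity Factors", "venue": "Some Journal"},
    {"title": "Productivity of Crops", "venue": "Agricultural Review"},
]

# Keep only the records that satisfy both conditions.
selected = [r for r in records if matches_search(r["title"], r["venue"])]
```

With the full word "software" as the second keyword, the first record would be missed, which corresponds to the repeatability problem reported above.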
The present study was carried out considering the period 1987-2021 to update the previous study and observe software productivity publications for 35 years. The DBLP query, last performed on 8 January 2021, returned 445 references for this period. The Scopus database (www.scopus.com) was also used as an additional source of bibliographic references. A query on Scopus last performed on the same date resulted in 489 references, but only 182 of these references were not present in the DBLP search result. Inspecting this result, it became clear that Scopus provides additional coverage of regional events and journals not directly connected to SE. However, the respective publications often would not satisfy the adopted exclusion and inclusion criteria. Consequently, just the additional references corresponding to international events and journals directly connected to SE were considered in the present study. The search on Scopus resulted in 50 additional bibliographic references to be analyzed.
Although the obtained set of references may seem small when contrasted to related work, it appears to represent the respective universe adequately since a generic search string and an international publication coverage were adopted. The threats to validity that arise from these settings were treated in the ways discussed in Section 6.

Reference Exclusion Criteria
The present study excluded from the analysis the references that failed to satisfy any of the following conditions:
1. Correspond to complete articles written in English published in peer-reviewed journals and event proceedings: The retrieved references were ignored if they corresponded to books, theses, technical reports, editorials, abstracts and summaries, preventing the analysis of incomplete, partial or not completely validated research results. The few references corresponding to papers written in other languages were also ignored;
2. Correspond to journal papers, book chapters and conference/workshop papers which were not later subsumed: Each retrieved reference was excluded if it was later subsumed by a subsequent publication. Subsumption was chosen as an exclusion criterion to avoid analyzing results that later on appear in modified form or with different contents in relation to previously published versions;
3. Are strictly connected to software productivity: This criterion was posed to avoid analyzing studies related primarily to other subjects (such as SE education and training), or experience reports that study specific subjects (such as productivity software) or methods, techniques and tools addressing software productivity as a secondary subject (such as management techniques and software development environments that ensure higher productivity).
The author of the present study verified compliance with these criteria considering only the information on paper title, authors, abstract and publication media available online. The subsumption of a paper by another one was checked only when both references were obtained as a result of the bibliographic search. From the 495 references resulting from the initial search, only 242 satisfied all these criteria.

Paper Inclusion Criteria
The author attempted to obtain a complete version of each published paper, but only 163 of the papers matching the selected bibliographic references were readily available online. Each obtained paper was read to ensure its compliance with the following inclusion criteria:
1. Reports on at least one empirical study;
2. Has an industry practitioner author or analyzes software industry data (data from the software industry is admitted here in an ample sense, covering raw data and source code from private and public administration organizations, from open databases or closed development projects, so long as they are effectively used/adopted in industry);
3. Describes the adopted methodology;
4. Explains the studied variables and measures;
5. Provides a statement of the main findings.
In particular, the requirement that papers report on at least one empirical study prevented the inclusion of articles with opinionative content, such as position and vision papers and expert opinion texts.
Clearly, although the chosen exclusion and inclusion criteria are objective, their enforcement was based solely on the author's judgment. This poses a relevant validity threat to the findings of the present work, which is discussed in Section 6. Nevertheless, the requirement of compliance of the obtained papers with the inclusion criteria above reduced the scope of this study from 242 references to 97 articles to be analyzed.

Secondary and Tertiary Study Treatment
Among the 97 papers initially included in the analysis, there were indirect studies that analyze the findings of other articles. In order to include one such paper in the present study, the preceding exclusion and inclusion criteria had to be satisfied (in which case, it corresponds to a mixed study, presenting both a primary and an indirect study) or at least one paper referenced therein was required to comply with the criteria above. Indeed, some of the initially included papers have a mixed nature, such as [21-24], whereas nine papers present systematic indirect studies, such as literature reviews and meta-analyses ([8,17,18,25-30]).
A backward snowballing process was performed [31] to take advantage of the required inspection procedure. This technique analyzes the papers referenced in a publication to find relevant studies that had not been discovered using the adopted search strategy. The snowballing technique was applied only to the identified systematic literature reviews, systematic mappings and meta-analyses. Nine additional references were obtained in this way, including [2], which is cited in [8] and is itself a systematic literature review. Consequently, snowballing was applied recursively on the references of [2], resulting in one additional publication to be analyzed [32]. After the inclusion criteria were verified, the backward snowballing process produced only these two extra papers to be analyzed.
Consequently, the selection process resulted in 99 papers to be analyzed in the present study: ten systematic reviews, mappings and meta-analyses and 89 articles that contain other study types. Table 2 presents a summary of the paper selection process.
Table 2 (final row). Number of Analyzed Papers (k = e + j): 73; 99
The reader should not be surprised by the reduced effectiveness of the backward snowballing technique in the present study. This happened due to the adoption of Scopus as an additional bibliographic reference source here. It is also important to mention that the replication of the previous study, considering a different time frame and a slightly modified publication search strategy, makes the results reported in this paper not directly comparable to those previously reported. For example, in the present case, 13 additional references were recovered by the most recent DBLP query between 1987 and 2017. Another 21 papers published in the same period are now available to the author. Moreover, one paper recovered in the previous study has been subsumed. Nevertheless, it is important to present the two studies in comparison to demonstrate the transparency of the adopted procedures in both cases.
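The counts reported throughout this section can be cross-checked with simple arithmetic. The sketch below only restates figures given in the text (445, 50, 495, 242, 163, 97, 2, 99, 10 and 89); the variable names are assumptions chosen here for readability and do not correspond to the row labels of Table 2.

```python
# Reference identification (counts reported in the text).
dblp_refs = 445                # references returned by the DBLP query
scopus_extra_refs = 50         # additional references taken from Scopus
initial_refs = dblp_refs + scopus_extra_refs

# Screening stages (counts reported in the text).
after_exclusion = 242          # references satisfying the exclusion criteria
available_online = 163         # full papers that could be obtained
after_inclusion = 97           # papers satisfying the inclusion criteria

# Backward snowballing contributed two extra papers after verification.
snowballing_extra = 2
analyzed = after_inclusion + snowballing_extra

# Composition of the final set of analyzed papers.
systematic_indirect = 10       # systematic reviews, mappings and meta-analyses
other_study_types = 89

# Share of initially retrieved references that reached the inclusion stage.
inclusion_rate = after_inclusion / initial_refs
```

Here `initial_refs` evaluates to 495 and `analyzed` to 99, in agreement with the totals stated in the text.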

Paper Processing and Treatment
The references and full versions of the selected papers were used to extract the following tabular data:
2. Year of publication;
3. Total number of authors and industry practitioner authors;
5. Number of studies on software productivity;
8. Main SE KA and KA topic(s);
The first two fields were extracted from each bibliographic reference. Author affiliations, numbers of authors and reported studies on software productivity, conflicts of interest, and funding information were obtained from the data included in each paper. Study type and business sector, KAs and productivity approach goal were gathered by the author while reading each paper. The list of non-foundational SWEBOK KAs [33] was used as a coding taxonomy to classify the subjects of included papers. Table 3 presents the description of these KAs. On the other hand, no a priori definition of studied business sectors was chosen, so they are reported here in the way they appear in published papers. The data sources, productivity measures, analysis methods and findings of each study were compiled by inspecting each paper in detail, considering the extensive body of empirical methods in the SE literature (cf. [34]).

Analysis and Synthesis Methods
The main methods used in analysis and synthesis procedures were the visual inspection of included papers and the tabular presentation of collected data. Missing data were noted and presented in this way in connection to each research question. In addition to tabular presentations, textual descriptions of collected data are also presented here, along with the respective occurrence frequencies. Frequencies corresponding to single occurrences are omitted for simplicity of presentation.
Some bar charts are also presented here to show evidence at a high level of granularity and facilitate the development of temporal trend and gap analyses of included studies. However, these charts are not comparable to those shown in the previous study [13] since a period of 35 years is analyzed here, equally divided into five-year periods from 1987 to 2021.
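The division of the 35-year window into equal five-year periods can be made explicit as follows. This is a minimal sketch; the constant and function names are assumptions introduced here and are not taken from the study's own tooling.

```python
from typing import List, Tuple

START_YEAR = 1987      # first year covered by the study
END_YEAR = 2021        # last year covered by the study
PERIOD_LENGTH = 5      # years per period


def period_of(year: int) -> Tuple[int, int]:
    """Return the (first, last) years of the five-year period containing year."""
    if not START_YEAR <= year <= END_YEAR:
        raise ValueError(f"year {year} is outside {START_YEAR}-{END_YEAR}")
    index = (year - START_YEAR) // PERIOD_LENGTH
    first = START_YEAR + index * PERIOD_LENGTH
    return first, first + PERIOD_LENGTH - 1


# All periods used for the temporal breakdown, in chronological order.
periods: List[Tuple[int, int]] = [
    (first, first + PERIOD_LENGTH - 1)
    for first in range(START_YEAR, END_YEAR + 1, PERIOD_LENGTH)
]
```

This yields seven periods, from 1987-1991 to 2017-2021, so a paper published in 1992, for example, falls in the second period.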

Data Analysis and Primary Study Finding Compilation
This section describes the attempts to answer the research questions by analyzing the findings and data collected in primary studies and non-systematic reviews. The findings of other indirect studies are analyzed in Section 4 since the respective papers have distinct structures and adopt different methodologies. The certainty analysis in this body of evidence and a synthesis of the findings of the present systematic mapping study are detailed in Section 5.
The demographics of the analyzed papers are as follows. Concerning authorship, 35% of the papers have industry practitioners among their authors, whereas 65% only have authors affiliated with academic institutions. In terms of gender, 48.5% of the papers have female authors, whereas 51.5% only have male authors. Figure 1 presents the geographic distribution of the authors of analyzed articles.
Figure 2 presents the historical breakdown of the number of studied papers through the KAs of SE. Overall, the figure displays a growth trend in the number of studied papers on software productivity. In the last five years, the number of analyzed papers is three times greater than in the initial period. Some diversification in addressed KAs is noticeable, with general studies (recorded under the tag SWEBOK in this study when many phases were addressed in a single paper) being substituted by specific ones, mainly SE management practices (cf. SEM: [5,6,10,35-47]). In the past, papers addressed more traditional phases of development processes, from design to maintenance (cf. SD, SC, ST, SQ, SCM and SM). Despite the general growth, only a few articles address social aspects (cf. SEPP) and early stages of software development processes (cf. SR).
The most frequent business sectors mentioned in primary studies are: the business of software development (in 23.6% of the papers); other information technology businesses (11.2%); banking, space and commerce (4.5% each); defense and services (3.4% each); and automotive, education and government (2.2% each). Surprisingly, 32.6% of the papers did not mention the target economic sectors of the studied development processes, whereas 11.2% of the papers addressed many different sectors.
Another challenge is understanding the formulation of productivity measures and how they are used in studies for software productivity analysis. Table 4 presents a list of productivity measures extracted from the studied papers. Often, software construction and maintenance measures are expressed as ratios of the outputs to the inputs of software processes [21]. However, some authors prefer a more algebraic formulation, using regression equations ([90,99,100]) or data envelopes ([62,63,70,96]). Although single ratio measures ease data collection and analyses, factors such as elapsed time are not explicitly incorporated in these analyses [22]. Moreover, analyses based on such measures suffer validity threats that are not always easy to counter [29].
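The limitation of single ratio measures noted above can be illustrated with a small example. The sketch below assumes one possible ratio, function points delivered per person-month of effort; the data type, the function name and all project values are hypothetical and are not drawn from the included studies.

```python
from typing import NamedTuple


class Project(NamedTuple):
    name: str
    function_points: float   # output size (hypothetical values)
    person_months: float     # input effort (hypothetical values)
    elapsed_months: float    # calendar duration, ignored by the ratio


def ratio_productivity(project: Project) -> float:
    """Single ratio measure: output size divided by input effort."""
    if project.person_months <= 0:
        raise ValueError("effort must be positive")
    return project.function_points / project.person_months


projects = [
    Project("A", function_points=120.0, person_months=10.0, elapsed_months=5.0),
    Project("B", function_points=240.0, person_months=20.0, elapsed_months=10.0),
]

# Both projects obtain the same ratio although B took twice as long to deliver.
ratios = {p.name: ratio_productivity(p) for p in projects}
```

Both hypothetical projects score 12 function points per person-month, even though their elapsed times differ by a factor of two, which is the kind of information a single ratio does not carry.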
From the point of view of the studied objects, measures based only on source code capture only the productivity of programming, testing and maintenance tasks [32], while other tasks, such as systems analysis and software design, demand measuring the production of more structured artifacts, such as models [75], use cases [85], function points ([35,37,39,42,49,50,53,55,57,58,69,78,97]) and even formal proofs [101]. These measures usually ignore non-functional requirements and practices such as reuse [8].
An additional degree of complexity in measurement is introduced by recent efforts to understand collaborative and distributed development, which adopt elapsed time ([43,45]) or frequency-based ([36,59,95,102]) measures. With the departure from general studies covering the whole development process and technical aspects, productivity measures also diverged from software artifact measurement. Apart from new techniques, such as self-assessments ([5,47,87]), contemporary measures have been devised to consider various constructs and factors such as affects [4] and job enthusiasm [89]. Furthermore, going back to traditional productivity measurement techniques, there are also ways of accounting for software process inputs and outputs based on monetary values. These measures should be used with caution, for example due to the adoption of different currencies in studies, such as the Renminbi in [106,107], the Yen in [100] and the US Dollar in [65,80,90,99]. The possibility of changes in the purchasing power of adopted currencies due to the effects of physical or economic processes on assets, such as depreciation and inflation [54], may hinder study comparability.
The analysis methods most frequently adopted in primary studies were: descriptive statistics (39.3% of the papers); statistical charts (37.1%); linear regression (20.2%); correlation analysis and ANOVA (12.4% each); qualitative analysis methods (7.9%); data envelopment analysis (DEA) and stepwise regression (6.7% each); the Cobb-Douglas model, the least-squares method and Kruskal-Wallis tests (5.6% each); Spearman's rank correlation and Student's t-tests (4.5% each); Pearson's correlation, system dynamics simulation models and Wilcoxon Rank-Sum tests (3.4% each); analogy-based estimation, logistic regression, Mann-Whitney tests, Markov chains, regression trees and structural equation modeling (2.2% each). Some adopted methods have their roots in other disciplines, such as the Cobb-Douglas model and DEA (frequently used in Econometrics) and System Dynamics Models (developed to understand industrial processes). Usually, the adequacy of these methods is justified by practical reasons. For example, [96] mentions that DEA allows for the identification of productivity factors that are under managerial control and have a significant impact on productivity, and, once identified, management can take steps to retain and amplify positive factors and mitigate or eliminate negative ones. Method transference also happens in relation to computer science branches such as machine learning: analogy-based estimation [85,93], regression trees [87,91], Bayesian belief networks [98], thematic network analyses [12] and K-means clustering [85] are also adopted in included studies. They are used to overcome limitations in statistical techniques, such as the requirement of normally distributed variables. The use of analytical methods from other disciplines provides additional evidence of the maturation of software productivity measurement, but suggests that traditional methods have not been entirely effective in approaching this subject.
While it is a good practice to choose the tests and methods that best fit the problem under analysis, the preconditions for their application are frequently not discussed in published papers. For example, random sample selection, missing data, frequency distribution, homoscedasticity, collinearity, goodness of fit, statistical power and effect size have not always been addressed in publications (with few exceptions, such as [22,23]). Such methodological weaknesses partially diminish confidence in some studies and, as a general concern, should be addressed in the future.
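To make one of the listed methods concrete, the sketch below fits a single-input, Cobb-Douglas-style power law, y = a * x^b, by ordinary least squares on log-transformed data. It is a deliberate simplification: the Cobb-Douglas model normally involves several inputs, the function name and the sample values are assumptions introduced here, and none of the preconditions mentioned above (sample selection, homoscedasticity, goodness of fit and so on) is checked.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Fit y = a * x**b by least squares on log(y) = log(a) + b * log(x)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2 points)")
    if any(v <= 0 for v in x) or any(v <= 0 for v in y):
        raise ValueError("all values must be positive to take logarithms")

    log_x = [math.log(v) for v in x]
    log_y = [math.log(v) for v in y]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)

    # Sums of squares and cross-products around the means.
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    if sxx == 0:
        raise ValueError("x values must not all be identical")

    b = sxy / sxx                        # elasticity of output w.r.t. input
    a = math.exp(mean_y - b * mean_x)    # scale factor
    return a, b


# Hypothetical data generated from y = 2 * sqrt(x), for illustration only.
inputs = [1.0, 4.0, 9.0, 16.0]
outputs = [2.0, 4.0, 6.0, 8.0]
a, b = fit_power_law(inputs, outputs)
```

An exponent b below 1 would indicate diminishing returns to the input, whereas b above 1 would indicate increasing returns; in a real study this interpretation only holds if the regression assumptions have been examined.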

Software Productivity Approaches (RQ3)
The studied papers also identify the ultimate goals in software productivity approaches. Table 5 classifies the analyzed articles according to these goals. Figure 3 presents the historical breakdown of the number of studied papers through these goals. Between 40% and 60% of the studied papers have understanding goals, confirming the perception that software productivity has always demanded explanations as to why and how outputs and outcomes are observed in each studied context. It is important to mention that the adopted classification of ultimate goals embodies a notion of subsumption of less demanding goals by more stringent ones. That is why the numbers of papers in the table and figure are slightly skewed towards action goals since they presume the fulfillment of prediction, understanding and other goals.
One might expect that most research on software productivity would have understanding (and measurement) goals, but this is an oversimplification. While, on the one hand, actionable theories, such as optimization models [44], guide interference in software processes, on the other, techniques like structural equation modeling help describe intangible aspects of software development [3].
Although understanding goals were listed in 55% of the papers published in the five years ending in 2021, there is reasonable diversification of study goals. This diversity follows the emergence of new subjects in SE, which require the formulation of distinct productivity study goals. Examples are Software as a Service (SaaS, [90,99]), standardization [11] and agile practices ([6,12]). The development of more studies with observation, analysis, description and action goals should be addressed in the future.
Figure 4 presents the historical breakdown of the number of study types found in analyzed papers. Therein, some variability can be noticed. Experiments and case studies were dominant, corresponding to at least 33% and 14% of the papers in each five-year period, respectively. However, there has been a substantial increase in surveys, reaching 40% of the papers in the last period. The future still holds the promise of more studies on knowledge-based simulation and optimization models of software productivity [103], which correspond to only three included papers. The growth in the number of surveys seems to mirror the emergence of new interests in SE in the last decade. Indeed, four studies address SE management issues, such as daily practices [5], workflows [10], teamwork [6] and global development [42], whereas two others concern human aspects, such as affects [4] and social practices [3], which are typical concerns in agile practices. It is also noticeable in the last decade that many surveys investigate the business of engineering software, such as technical debt [87], job definitions and satisfaction ([7,47]), working environments [76], and health and well-being conditions ([88,102]).

Study Types and Reported Findings (RQ4)
Concerning the analysis of study findings, the hierarchical structure of the non-foundational KA topics and subtopics detailed in the SWEBOK [33] is adopted here, in conjunction with some specific subtopics that do not belong to this hierarchy (OSS, reuse and SE economics databases). The KA subtopics explicitly addressed in the included papers are Software Process Improvement (SPI), Rapid Application Development (RAD), Capability Maturity Model (CMM), Object-Oriented Design (OOD) and Test-Driven Development (TDD). Included papers are grouped according to this classification scheme and their findings are analyzed in the context of each KA.
Special attention is given here to productivity factors studied in most included papers. A structured list of productivity factors appears in [2] and a definition of these factors through theoretical constructs is proposed in [24]. Herein, these definitions are taken for granted and the influence of factors on software productivity is coded considering their directionality, effect and significance whenever possible.
The direction of influence of productivity factors corresponds to positive (↑), conditional (→), indistinguishable (∼) or negative (↓) contributions. The symbols < and > are used to denote lower-than and greater-than relationships, respectively. When the reported results are statistically significant (p-value < 0.05) or strongly statistically significant (p-value < 0.01), the usage of these relational symbols is doubled or tripled, respectively. In order to codify the findings reported in included studies, the square symbol (□) is used as a placeholder for any of the aforementioned relational symbols. Finally, when inconclusive results are reported, the symbol ? is used next to the relational symbols.
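The coding rules above can be illustrated with a minimal sketch. It is not the authors' tooling: the function name `relational_code`, its parameters and the `DIRECTION_SYMBOLS` dictionary are hypothetical names introduced only for this example, and a missing p-value is assumed to mean that no significance was reported.

```python
from typing import Optional

# Direction-of-influence symbols as listed in the text (the dictionary itself is illustrative).
DIRECTION_SYMBOLS = {
    "positive": "↑",
    "conditional": "→",
    "indistinguishable": "∼",
    "negative": "↓",
}


def relational_code(symbol: str,
                    p_value: Optional[float] = None,
                    inconclusive: bool = False) -> str:
    """Build the coded form of a relational finding.

    symbol       -- "<" or ">"
    p_value      -- reported p-value, or None when no significance is reported (assumption)
    inconclusive -- True when the study reports inconclusive results
    """
    if symbol not in ("<", ">"):
        raise ValueError("symbol must be '<' or '>'")

    # Tripled for strongly significant results, doubled for significant ones.
    if p_value is not None and p_value < 0.01:
        repeat = 3
    elif p_value is not None and p_value < 0.05:
        repeat = 2
    else:
        repeat = 1

    code = symbol * repeat
    # Inconclusive results carry a trailing question mark.
    if inconclusive:
        code += "?"
    return code
```

Under these assumptions, `relational_code("<", 0.03)` produces a doubled symbol and `relational_code(">", 0.005)` a tripled one, while a call without a p-value leaves the symbol single.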

Studies Using SE Economics Databases
Many studies analyze the productivity data of software development projects that are contributed by private companies to public databases, as discussed in Section 3.2. These experiments and case studies cover either the entire body of knowledge on SE or only software construction. They analyze two specific subjects, not necessarily in an exclusive way: the adequacy of models and methods for software productivity measurement or prediction and specific software productivity factors. Table 6 summarizes the respective study types and findings, together with the respective KA topics and the total numbers of authors, practitioner authors and reported studies in each paper. There is evidence of improved productivity over time, with variations coming from company and business sector. Insurance and commerce are the least productive, while manufacturing is the most productive sector among the studied projects. There is no significant difference in productivity between new developments and maintenance projects.
The ability of ordinal regression models to classify any future project in one of the predefined categories is high based on the studied databases.
(AsmildPK06) [70] 3/0 1 controlled experiment SWEBOK It is possible to develop proper exponential statistical models to predict productivity, but linear models are inappropriate. DEA can incorporate the time factor in analyses and can be used to determine the best performers for benchmarking purposes.
(MosesFPS06) [51] 4/0 1 case study SC The studied company outperforms those in the ISBSG database by approximately 2.2 times. Possible explanations are that projects are led by staff with knowledge of systems and business processes and that an optimized model-based development process is adopted. Bayesian credible intervals give a more informative form of productivity estimation than would be possible using the usual confidence interval alternative for the geometric mean of the ratio.
(WangWZ08) [52] 3/2 1 exploratory case study SWEBOK Project size, type and business sector are factors that influence software productivity with varying significance levels. There is no evidence that team size and adopted programming languages affect productivity. There is no significant difference in productivity between new developments and redevelopment projects.
(BibiSA08) [91] 3/0 2 quasi-experiment SWEBOK A combination of the methods of association rules and regression trees is prescribed for software productivity prediction using homogeneous datasets. Their estimates are in the form of rules that the final user can easily understand and modify.
(Tsuno09) [53] 5/1 1 quasi-experiment SWEBOK Architecture and team size have a strong correlation with productivity. Business sector, outsourcing and projects skewed towards the implementation ensure moderate productivity.
(GeH11) [90] 2/0 1 quasi-experiment SWEBOK The stochastic frontier approach takes both inefficiency and random noise into account and is a better approach for productivity analysis. It allows the understanding of SaaS company dynamics and catch-up effects by comparison to traditional companies.
(RodriguezSGH12) [39] 4/0 1 controlled experiment SEM Improvement projects have significantly better productivity than new development and larger teams are less productive than smaller ones.
(TsunodaA17) [55] 2/0 1 quasi-experiment SWEBOK The propensity score analysis can determine undiscovered productivity factors. The company business sector and the development platform are significantly related to software productivity.
The adopted primary programming language has a significant effect on the productivity of new development projects. The productivity of enhancement projects appears much less dependent on programming languages. The business area and architecture have significant effect on productivity. No evidence of the impact of CASE tools usage on productivity was determined. The productivity of new development projects tends to be higher than that of enhancement projects.
(LavazzaLM20) [93] 3/1 1 quasi-experiment SC Software enhancement costs more than new software development, at least for projects greater than 300 Function Points. There is a lot of variability in studied data to reach this conclusion.
# of Studies = number of studies.
The included papers investigate the adoption of the following new analytical models and methods:  [55].
Their findings correspond to positive results: the investigated models and methods are considered appropriate for software productivity measurement or prediction.
The included studies based on databases also analyze many different software productivity factors considering diverse contexts. These factors can be classified as organizational/managerial factors or technical factors. On the one hand, the studied organizational/managerial factors are business sector, company, level of outsourcing, project and team size. On the other, the investigated technical factors are software architecture, development platform, adopted programming language and development tools. The contexts of these studies are software development and maintenance projects.
The complete coding of the respective relationships between factors and software project productivity is presented in Appendix A.1. However, some derived relationships are shown below as a way of illustrating the coding process: 1.
The coding of the finding related to outsourcing should be read as "development project productivity decreases as the outsourcing level increases". Moreover, the conclusion on development platforms should be read as "development platform significantly contributes to development project productivity". In addition, the outcome related to development tools should be read as "No evidence was found that the adoption of development tools contributes to development project productivity". Furthermore, the finding related to business sectors should be read as "software (that is, maintenance and new development) project productivity is (significantly, according to [56]) affected by the business sector".

Other Studies Covering the SWEBOK
Among the studies covering the whole SWEBOK that do not adopt databases as a data source, it is possible to find many experiments and surveys that analyze productivity factors or software productivity from regulatory ( [11]) and economic ( [99,104,107]) perspectives. There are also some literature reviews ( [24,103,105]) among these papers. Table 7 presents a summary of the respective study types and findings.
Papers with an economic perspective investigate study subjects in connection to economic measures. The specific topics studied in these papers are: These studies cannot rely on standardized variables defined in public databases. Consequently, it is more difficult to aggregate evidence, but the coding of each derived relationship is presented in Appendix A.2. Productivity factors can again be divided into organizational/managerial and technical ones. Factors in the former category are organizational structure, risk assessment, team experience with users and technology, and technical debt. In the latter category, there are UCPs, FPs, LOCs, the adoption of development platforms and programming languages, and RAD and reuse practices.
Many papers report on subjective factors influencing software productivity [79], which are even more difficult to quantify. They are technical supervision, working conditions, achievement, responsibilities and recognition [109]; motivation, performance, management, compensation and rewards, organizational climate and happiness [32]; job definitions [7]; external interruption, environment adaptation and emotional issues [88]; satisfaction with the work environment and ability to work privately [76]; job enthusiasm, peer support for creativity, and helpful feedback [89].

Requirements Engineering
Just two included papers study software requirements. The first half of Table 8 summarizes the respective study types and findings. Productivity factors studied therein are requirements volatility, engineer communications and management tools adoption. A synthesis of the studied productivity factor relationships appears in Appendix A.3.

Object-Oriented Development
Four included papers study object-oriented analysis and design in connection to software productivity. The second half of Table 8 summarizes their types and findings. The productivity factors studied therein are application domain, project size, mobility incentives, deadline enforcement and effective OOD adoption. Appendix A.4 presents a synthesis of the respective factor relationships.

Software Construction
Many included papers cover specific activities of software construction that refer to the creation of working software through a combination of coding, verification, unit testing, integration testing, and debugging. These activities are generally regarded as software construction in the SWEBOK [33]. The included papers report on experiments [22,23] and case studies [71,84,97] that investigate specific productivity factors, discuss the adequacy of regression models for model-based development productivity prediction [86] and self-reported productivity evaluation [74]. An additional paper proposes a new software effort measurement technique [65]. Table 9 presents a summary of the included study types and findings. Once again, studied factors can be classified as managerial/organizational and technical ones. In the former category, there are formal education, team capabilities and knowledge. In the latter, there are software architecture, requirements volatility, development tools and pair programming. The respective relationships are synthesized in Appendix A.5.

Software Reuse
As already mentioned, one of the most important software construction practices is reuse. That is why included papers addressing this theme are treated separately here. Each respective article is unique in its combination of study type and analysis methods. They all address the practice of reuse as a relevant factor in software development project productivity. Table 10 presents a summary of the corresponding study types and findings. A synthesis of the relationships in individual studies is presented in Appendix A.6.

Open-Source Software
Open-source software is a transversal theme in the SWEBOK. The included papers on OSS are also treated separately here since the respective experiments and case studies analyze specific productivity factors and measures. Table 11 presents a summary of the individual study types and findings.
Interestingly, OSS project productivity factors and measures are substantially different from those in other KAs. Apart from OSS adoption, the studied factors have technical nature. They are software aging, adopted programming language and development increment. The respective relationships are synthesized in Appendix A.7.

Software Testing
Among the included papers, two study testing in connection to software productivity. The first half of Table 12 summarizes the respective study types and findings. The investigated productivity factors are testing project difficulty and task process transferability in software testing projects. A synthesis of the findings related to the factors that influence testing productivity is presented in Appendix A.8.

Software Maintenance
Four included papers study maintenance in connection to software productivity. The second half of Table 12 summarizes the respective experiments and case studies.
Once again, productivity factors can be classified into the organizational/managerial and technical categories. The studied organizational/managerial factors are team capabilities, mentor succession and experience, and offshoring. The technical factors are domain knowledge, workload, development increment and maintenance granularity, as well as artifact coupling and quality control. A synthesis of the respective relationships appears in Appendix A.9.

Software Engineering Management
According to the SWEBOK [33], software engineering management (SEM) is defined as the application of management activities (planning, coordinating, measuring, monitoring, controlling, and reporting) to ensure that software products and services are delivered efficiently and effectively to the benefit of stakeholders. Many included papers cover these concerns through experiments, case studies, surveys and simulation studies. Table 13 presents a summary of the included papers. The studied factors are inherently managerial or related to the management of technical activities. Inherently managerial factors are offshoring, team autonomy, mobility, experience heterogeneity and management, task coordination and completion incentives, and project size. The management of the following aspects is also studied: adoption of process models, RAD, development and testing tools. Appendix A.10 synthesizes the respective relationships.
Many papers report on the existence of social and personal factors that facilitate managing software development professionals and teams for improved productivity. These are usually measured in qualitative or self-reported ways. The reported findings are as follows: the use of the Pareto optimal set in personnel allocation and task scheduling supports better management decisions [44]; each developer has a highly fragmented daily routine, and the factors influencing developer productivity are quite individual [5]; personalized recommendations for improving software developers' work are essential to optimize their productivity [10]; new hires with prior internships tend to perform better than others in the beginning, take several weeks to reach the productivity levels of experienced employees, and their team support effect decreases with time [45]; code-based metrics outperform commit-based metrics in reflecting developer perceived productivity, and triangulation can strengthen organizational confidence in productivity measures [46].

Rapid Application Development
Software construction and engineering management are tightly connected to the specific practices of rapid application development. They cover lean and agile practices, such as the adoption of Scrum, prototyping, Test-Driven Development and pair programming. The first half of Table 14 summarizes these surveys and case studies together with the respective findings. The studied factors are related to the managerial aspects of team management, size, diversity, turnover, as well as to personal capabilities and Scrum adoption. A synthesis of the studied relationships is presented in Appendix A.11.
The structured synthesis method allows inferring the intensity and confidence of the factors affecting software development productivity. It offers an initial theoretical framework for representing the current status of empirical knowledge in software development productivity.
(JohnsonZB21) [76] 3/2 2 online survey, interviews SWEBOK In productivity models, the overall satisfaction with the work environment and the ability to work privately with no interruptions are equally important and significant factors. Private offices were linked to higher perceived productivity across all disciplines. For software engineers, another vital factor for perceived productivity was communicating with the team and leads.
(MurphyHillEA21) [89] 9/9 4 randomized questionnaire-based survey SWEBOK Factors that most strongly correlate with self-rated productivity are non-technical factors, such as job enthusiasm, peer support for new ideas, and receiving helpful feedback about job performance. Compared to other knowledge workers, software developers' self-rated productivity is more strongly related to task variety and working remotely.
(ZhaoWW21) [107] 3/0 1 quasi-experiment SWEBOK There are regional differences in the level of development of local software companies. Different public policy promotion paths should be adopted in each case considering simultaneously all the identified gaps in the degree of higher education, the scale of enterprises and the level of investment in research and development activities and fixed assets, acting on them accordingly.
The governing influence on OOD productivity may be the business workflow, but not the development approach. There is significant evidence that productivity increases as project size increases. Business deadlines may have a strong influence on the overall productivity of projects.
(PortM99) [69] 2/1 1 case study SEMM/OO The adoption of OOD coupled with OOP significantly improves overall project productivity and efficiency, but OO development approaches are less efficient than traditional approaches in the requirements phase.
(SiokT07) [72] 2/1 1 quasi-experiment SWEBOK/OOD Productivity is significantly different for distinct application domains. There is no significant difference in productivity between projects developed using OOA/OOD and SA/SD or programming language. Small projects are slightly more productive than medium and large projects.
The change-point measure permits both combined and individual productivity measurement for design, implementation and test activities. It supports a conceptual approach to productivity measurement at a higher level than in each development activity.
(KitchenhamM04) [22] 2/0 1 controlled experiment SC A software productivity measure related to effort can be formulated when several jointly significant factors are related to effort. The practice of reuse is determined to affect productivity significantly. Executives evaluate that requirements stability, customer satisfaction and customer/staff personality type may contribute to software productivity.
(ParrishSHH04) [97] 4/0 1 case study SC Highly collaborative pairs are dramatically (4 times) less productive than pairs working on the same task but not simultaneously. Programming pairs can learn to work more productively together over time by devising their productive collaboration process. Any productivity gains reported with pair programming are likely due entirely to the role-based protocol rather than to any inherent consequences of working closely in pairs.
(TomaszewskiL06) [71] 2/0 1 case study SC The following are identified as productivity bottlenecks in software construction: unstable requirements and lack of programming tools (large); quality of platform documentation, and too optimistic planning (average). Apart from treating these bottlenecks, higher knowledge of the development language and platform and adoption of reuse practices may improve productivity.
(Tan09) [84] 6/0 1 case study SC The collected data present a clear trend of decreased software productivity over the years. Staff capabilities, software architecture, and other development tasks affect software productivity, either positively or negatively. In incremental development, the assumption that productivity will vary from increment to increment cannot be taken for granted.
(DiesteEtAll17) [23] 8/0 10 controlled experiment SC Familiarity with a unit testing framework or IDEs appears to affect software productivity positively. Years of practical or academic programming experience do not influence programmer productivity, so the routine practice does not appear to lead to improved performance. However, academic learning, which could be considered an instance of deliberate practice, influences quality and productivity.
(AzzehN18) [86] 2/0 2 controlled experiment SC Learning productivity ratios for each project look more reasonable and efficient than using a static ratio for all software organization projects. Using effort regression models based on UCP size variables is more accurate than effort estimation-based productivity models.
(BellerOBZ21) [74] 4/4 1 quasi-experiment SC A simple linear regression model could explain almost half of the variance in self-reported productivity when expressed as a product and process measure. Organizations should be aware of the large conceptual discrepancy between self-reported and measured productivity and that optimizing for individual productivity is different from optimizing for team productivity.
There are clearly identifiable differences between the task processes of high-productivity programmers and the task processes of average-productivity programmers. Task processes of high-productivity programmers were transferred to average-productivity programmers by training them on the key steps missing in their processes but commonly present in the work of their high-productivity peers. A substantial productivity gain was found among average-productivity programmers due to this transfer.
(BankerDK91) [96] 3/0 1 controlled experiment SM High project quality does not necessarily reduce maintenance productivity. A significant positive impact is observed on maintenance productivity by project team capabilities and good response time. A negative significant impact is identified due to the lack of previous experience in the application domain.
(BankerS94) [63] 2/0 1 quasi-experiment SM Project size has an important influence on maintenance productivity. There are significant economies of scale in the studied maintenance projects. There may be significant gains in maintenance productivity by grouping simple modification projects into larger planned releases.
(Mockus09) [59] 1/1 2 controlled experiment SM Larger projects, overload mentors and offshoring succession significantly reduce the productivity ratio. The breadth of mentor experience and succession of mentors' primary product significantly increase productivity.
(BibiAS16) [98] 3/0 1 case study SM Small methods produce nearly maximal productivity in the majority of cases. Tightly coupled systems exhibit low productivity rates, a negative effect of coupling on maintainability. Larger projects are more productive and have lower defect levels than smaller ones. Early prototyping and daily builds promise subsequent work on the features most valued by customers, with a significant positive impact on productivity. Other practices are not correlated to productivity. There is danger in assuming the implementation of more flexible processes piecemeal by picking-and-choosing practices because there are complex interactions among them.
(RamasubbuCBH11) [38] 4/0 1 controlled experiment SEM Firms that distribute software development across long distances benefit from improved productivity. Variations in configurational characteristics of distributed teams lead to different performances. Locally tailored, agile, and interaction-oriented process models are associated with improved productivity. Project configurations that attain high productivity tend to achieve low quality and vice versa. An imbalance in the experiences of personnel significantly decreases productivity.
(Mohapatra11) [37] 1/0 1 quasi-experiment SEM Application complexity affects productivity negatively, and training in the application domain has an opposite effect. Productivity tends to increase with the availability of documentation and testing tools and better client support.
(CataldoH13) [40] 2/1 2 case study SEM Identifying the right set of relevant work dependencies and coordinating accordingly has a significant impact on increasing productivity. When developers' coordination patterns are congruent with their coordination needs, productivity increases.
(PalaciosCSGT14) [42] 5/0 1 questionnaire-based survey SEM Performance in global development projects is lower than in-house projects due to the lack of attention to tasks by software managers. This is due to communication, coordination and control overheads. The management of offshore projects affects their performance in negative ways. Significantly improved performance is perceived in case managers present accessibility, responsivity and neglect their superior roles.
(StylianouA16) [44] 2/0 2 optimization study SEM The Pareto optimal set, which is generated from models, helps managers better decide who will work on what and when.
The perception of the existence of an engineering system, impactful work, autonomy, and capability to complete tasks positively affect self-assessed productivity. In contrast, the possibility of mobility, compensation and job characteristics affect it negatively. The relationships of these factors to job satisfaction are statistically significant in many models for different work contexts.
Software productivity has a multi-factor structure. Productivity is highly associated with social productivity (an intangible asset related to social life, information awareness, fairness, frequent meeting, reputation, social debt, team communication and cohesion) and moderately associated with social capital (intangible resources related to group characteristics, norms, togetherness, sociability, neighborhoods, volunteerism and trust). The productivity of software development was found to be higher for smaller software teams.

Software Engineering Professional Practice
The Software Engineering Professional Practice (SEPP) is concerned with the knowledge, skills, and attitudes that software engineers must possess to practice software engineering in a professional, responsible, and ethical manner [33]. The second half of Table 14 summarizes the findings of the respective experiments and surveys.
The study findings on the professional practice of software engineering are rather qualitative. In [3], software productivity is determined to be highly associated with social productivity (an intangible asset related to social life, information awareness, fairness, frequent meeting, reputation, social debt, team communication and cohesion) and moderately associated with social capital (intangible resources related to group characteristics, norms, togetherness, sociability, neighborhoods, volunteerism and trust). In [4], social productivity is decomposed into affects (emotions, moods and feelings), valence (the attractiveness of an event), dominance (change in the sensation of control of a situation), and arousal (the intensity of emotional activation), although arousal does not provide additional explanatory power in their usage together. However, in [60], high valence, arousal and dominance are positively related to self-assessed productivity in studying the resolution of issue reports stored in source code repositories.

Software Processes, Quality, Models and Methods
Finally, the findings of studies covering the remaining KAs are grouped in this section. They address software engineering processes (SEP), software quality and software engineering models and methods. Table 15 summarizes the corresponding experiments and surveys together with the respective findings. A synthesis of these rather diverse studied relationships is presented in Appendix A.12.
There is no evidence of improved labor productivity or productivity growth in companies with appraised software quality levels. Companies with appraised quality maturity levels are more or less productive depending on their business nature, capital's main origin, and maintained quality level. There is statistically significant evidence that software productivity variance decreases as a company with appraised quality levels moves towards higher levels.
(StaplesEA14) [101] 6/0 1 quasi-experiment SEMM Lines of proof is a problematic measure, and so improved size measures are required. Effort is highly correlated with proof size. Since some proofs are much simpler and less complex than others, it would be expected that effort and productivity depend on proof complexity. Still, empirical data do not provide support for this belief.

Related Work Discussion and Indirect Study Finding Compilation
Many related studies on software productivity have an indirect character, in that they provide systematic reviews and mappings covering third-party studies. Systematic indirect studies are treated separately from primary and mixed-type studies here to prevent double-counting of study results and comparing studies that adopt substantially distinct methodologies. Table 16 presents a synthesis of this related work. Interestingly, the present research is unique in the sense that it is a mixed tertiary study [110], since not only primary but also indirect studies are analyzed here.
After duplicate removal, 240 references were selected. After random sampling, 4 publications were included and analyzed.
No single classification exists for software productivity factors, but they are organized in product, process, project and people categories. The reviewed literature studies 35 influential factors over which organizations must intervene to obtain software productivity improvements.
The focus of the present study on industry data and practitioner views yields a set of specific goals and a choice of a distinguished methodology in comparison to related work. The paper selection criteria adopted here were defined taking this focus into account. Among the papers mentioned in Table 16, only one [2] has a practitioner co-author. Some differentiate academic and laboratory studies from those in industrial settings ([17,25,29]), although studies from both sources are analyzed in each case. Moreover, the review methodology adopted here is slightly different from related work, in the sense that there are no specific concerns with publication media ([27,29]) nor attempts to score individual studies according to pre-established quality criteria ([17,29]), given the assumed practical or industrial relevance of each analyzed paper and the existence of some subjectivity in establishing quality criteria for paper scoring, respectively.
Software productivity is analyzed here considering the evolution of this subject over time. This is not a novelty per se, as shown in the tables and graphs in [2,25–27], but software productivity has not been analyzed elsewhere in connection to KA, study type and study goal breakdowns. This approach enables the development of historical trend analyses of software productivity research, as reported in Section 3.
The present study provides an overview of software productivity with intra- and inter-KA analyses, whereas related work usually focuses on specific KAs. In particular, some related literature reviews are organized in this way, such as [17,25] (which report minor or negative impacts of TDD on software productivity), [26,30] (which report inconclusive results concerning the adoption of Scrum and agile methods) and [18] (on selectively positive impacts of reuse). On the other hand, the approach adopted here enables the identification of gaps in productivity research on specific KAs, as reported in Section 3.1, which should be investigated in future research.
In addition, the present research builds upon the general technical and methodological findings reported in the related work, particularly [2,8,27–29]. A summary of findings and a derivation of methodological recommendations based on included studies are respectively developed in Sections 5.3 and 5.4. The present study also has similarities with the mixed-type study reported in [24]. Whereas a structured synthesis method is employed in [24] to determine the level of certainty attributed to each study finding, here the GRADE system [16] is applied, as described in Section 5.2.

Systematic Mapping Findings and Recommendations
Systematic mappings and literature reviews evaluate individual quality of evidence and provide high certainty in a body of evidence. Following the PRISMA guidelines [15], the GRADE system [16] is adopted in this section to achieve these goals.
The GRADE system prescribes a structured approach to synthesizing evidence in literature reviews and systematic mappings. First, a risk of bias assessment is developed (Section 5.1). Next, an evaluation of certainty in the body of evidence (Section 5.2) is conducted. Finally, an evidence profile and a summary of findings table are constructed, together with a narrative discussion of the main study findings (Section 5.3). In addition, some methodological recommendations are derived here from the lessons learned in the analysis and synthesis of included paper findings (Section 5.4).

Risk of Bias Assessment
Systematic mappings and literature reviews are based on analyses of included primary and indirect studies, which may present risks of research, reporting and other biases. There is a plethora of sources of research bias that may affect the credibility of reviewed studies, such as participant selection and allocation, missing data or non-response, observations and measurements, study performance and withdrawals or exclusions. Moreover, included studies may suffer from reporting biases, which correspond to the publication or not of research findings depending on the nature and direction of results. Specifically, reporting biases can be classified into publication, selective reporting, time-lag, language, citation, multiple publication and location biases. Finally, other sources of bias come from conflicts of interest, which occur whenever professional misjudgment or undue influence happens due to secondary interests, usually from author affiliations, sources of funding, and supply and demand relationships.
Some sources of risk of bias are not present here due to the adoption of stringent search, exclusion and inclusion criteria. For example, a frequent source of risk of reporting biases is multiple publication. However, in the present case, this risk was mitigated by the criterion of excluding subsumed publications. Moreover, language biases are not present here due to the adoption of a search string written in the English language and the exclusion of papers written in other languages.
Despite the risk mitigation procedures adopted by the respective authors, included studies may still present residual biases. That is why a risk of bias assessment was developed for each included paper. These papers were assessed individually, covering each of the three sources of risk mentioned above, namely research, reporting and other sources of bias. The checklists of investigated risks appear in catalogofbias.org/biases/ (accessed on 6 December 2021) and Chapter 7 of [111]. The probabilities of occurrence of individual risks are graded as low, unclear and high depending on notable concerns regarding biases. In individual assessments, the baseline risk level is considered to be low. Depending on any notable concern about biases, the perceived risk is increased by one or two levels. Table A1 in Appendix B provides the overall assessment of the risk of bias in each included paper. Therein, each paper's overall risk of bias is computed as the median of the three individual risk grades. Table A2 in Appendix B presents detailed judgments of the identified risks with quotes extracted from the studied papers or texts produced by the author. Finally, Figure 5 summarizes the risk of bias assessment through a diagram generated using the robvis tool (www.riskofbias.info/welcome/robvis-visualization-tool, accessed on 6 December 2021) [112].
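As an illustration of the overall grading rule, the following minimal Python sketch computes the median of three ordinal risk grades. The function and constant names are hypothetical and are not taken from the study materials; the sketch only assumes the ordering low < unclear < high stated above.

```python
from statistics import median_low

# Ordinal scale described in the text: low < unclear < high.
GRADES = ["low", "unclear", "high"]


def overall_risk(research: str, reporting: str, other: str) -> str:
    """Return the median of the three individual risk grades.

    Each argument is one of "low", "unclear" or "high". The grades are
    mapped to their rank on the ordinal scale, and the middle rank is
    mapped back to its label.
    """
    ranks = [GRADES.index(grade) for grade in (research, reporting, other)]
    # With exactly three values, median_low returns the middle one.
    return GRADES[median_low(ranks)]
```

Under this sketch, a paper graded low for research bias, high for reporting bias and unclear for other sources of bias would receive an overall grade of unclear.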

Evaluation of Certainty in the Body of Evidence
According to GRADE, the evaluation of certainty in a body of evidence is based on the risk of bias in each included paper and on the certainty perceived in the respectively reported findings. At the present stage, all included articles are considered in this evaluation, without exclusions due to perceptions of moderate or high risk, since the findings reported in each included paper are useful for triangulation purposes [29].
The level of certainty in each included paper is evaluated as very low, low, moderate or high. The baseline level of certainty in each included article is moderate or low, depending on the reported study type. Reviews, surveys and experiments, and papers that mix many study types, are regarded as having a moderate baseline level of certainty. Articles with studies of other types are regarded as having low baseline certainty. According to some qualifiers, the initial level of certainty may be increased by one level. The level of certainty is increased if a paper reports indirect studies performed in a systematic manner (that is, it corresponds to a systematic literature review, systematic mapping or meta-analysis). Other papers have their initial level of certainty upgraded by one level if all the reported studies are performed using randomization or adopting control groups or methods. The final evaluation of the level of certainty in each included paper is reached by weighing the computed level of certainty, taking into account the perceived risk of bias in the paper. For example, an article with a high calculated level of certainty and low risk of bias is considered to have a high certainty level. On the other hand, if a paper has a moderate calculated level of certainty and an unclear or high risk of bias, the level of certainty in the paper is downgraded, respectively, to low or very low. The table in Appendix C details the level of certainty computed for each included article.
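The grading rules above can be summarized in a short Python sketch. It is only an approximation under stated assumptions: the names are hypothetical, the study-type labels are simplified, and the downgrade of one level for unclear risk and two levels for high risk is generalized from the moderate-certainty example in the text to all levels, which the text does not state explicitly.

```python
# Ordinal certainty scale described in the text.
LEVELS = ["very low", "low", "moderate", "high"]

# Study types with a moderate baseline; all other types start at low.
MODERATE_BASELINE_TYPES = {"review", "survey", "experiment", "mixed"}

# Assumption: unclear risk costs one level and high risk costs two,
# generalized from the worked example given in the text.
RISK_PENALTY = {"low": 0, "unclear": 1, "high": 2}


def certainty(study_type: str,
              systematic_indirect: bool,
              randomized_or_controlled: bool,
              risk_of_bias: str) -> str:
    """Return the final certainty level of a single paper."""
    level = 2 if study_type in MODERATE_BASELINE_TYPES else 1
    # At most one upgrade: systematic indirect studies, or primary
    # studies that all use randomization or control groups/methods.
    if systematic_indirect or randomized_or_controlled:
        level += 1
    level -= RISK_PENALTY[risk_of_bias]
    # Clamp the result to the ends of the scale.
    return LEVELS[max(0, min(level, len(LEVELS) - 1))]
```

For instance, a survey with no qualifiers and an unclear risk of bias ends at low certainty under this sketch, whereas a randomized experiment with low risk of bias ends at high certainty.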
For many reasons, the certainty gradation used here is different from those adopted in evaluating certainty in other subject areas, such as in Medicine [113]. In other fields, government organizations massively fund scientific research. Consequently, surveys achieve more extensive coverages and experiments present more dramatic effects [110]. In SE, on the other hand, these studies are funded mainly by private organizations or using student and research grants, consequently achieving more modest coverages and more restricted results. Furthermore, in SE, empirical studies such as case studies and simulation studies are important for practitioners, due to budgetary reasons, and they help in investigating hypotheses in specific environments, such as in-house software production processes [103]. As a general rule, in SE, the available objective evidence is comparatively weaker than in other fields and consequently the certainty in individual papers should be evaluated taking this context into account.

Evidence Profile and Summary of Findings Table
Now, an evidence profile and summary of findings table for the whole systematic mapping is developed. An evidence profile is produced to ascertain the quality in the studied body of evidence. This profile records justified perceptions of aggregated study findings quality based on their risk of bias, review limitations, inconsistency, indirectness and imprecision. Quality is graded here in the way described in Section 5.2. The summary of findings table synthesizes the aggregated evidence.
It is crucial to mention that the specific characteristics of the present study lead to customizations in the GRADE instruments, something allowed by the respective guidelines. Indeed, the GRADE handbook recognizes that the importance of outcomes can vary within and across cultures or when considered from different perspectives [16]. In turn, the literature identifies that SE systematic mappings and literature reviews are mostly qualitative [114]. This happens because, e.g., it is very difficult to observe the same SE phenomena on different occasions. The hypotheses and measures formulated in studies are rarely shared and their results are not easily replicable [24]. Moreover, studies with observational or simulation nature are often useful in SE. So, information required by GRADE sometimes does not exist, such as the number of subjects and the significance levels, confidence intervals and error rates adopted in studies.
Consequently, a particular process of aggregating evidence from included papers is performed. This process considers the productivity factors analyzed in Section 3.4, possibly complemented with evidence supplied by systematic indirect studies. Given that the existence, direction and significance of their relationships have been mapped, a particular process of determining the overall quality of aggregated evidence is required. This process takes the following criteria into account:

1. Removal of individual studies with low certainty: Only studies with moderate or high certainty are considered;
2. Inclusion of findings that have been deemed collectively important: Only outcomes determined in at least three high- or moderate-certainty papers are considered;
3. Formulation of each aggregated finding definition: Analysis of individual definitions and formulation of an aggregated relationship, involving the productivity factors mentioned in the original studies, any directionality of effects and significance of results, considering the lowest significance and more general scope conclusively reported;
4. Computation of the numbers of papers and studies leading to the finding;
5. Evaluation of the pooled risk of bias for the finding: Computed by weighing the individual paper risk of bias ratings according to the respective number of reported studies and their assessments of risk of bias, using the same criteria of Section 6.1;
6. Determination of inconsistency, indirectness, imprecision and review limitations related to the finding: Usage of the GRADE criteria for determining these aspects;
7. Computation of the overall quality of the finding: Usage of the lowest quality of evidence level among the respective studies as the baseline quality of the outcome, possibly downgraded (according to what was determined in the previous two steps) or upgraded (depending on the findings reported in any systematic indirect study with the same coverage) by one or two certainty levels;
8. Registration of any relevant comment.
Table 17 presents the derived evidence profile and summary of the systematic mapping findings. In the table, there are two distinct categories of study findings. The first one addresses organizational and managerial aspects related to software productivity. The second one is concerned with technical aspects. There is great diversity in the study findings and the quality of evidence varies from high (less frequent) to low (more frequent), pointing out that more focused and authoritative practice-oriented industrial-scale studies regarding software productivity are required.
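The mechanical parts of the aggregation criteria listed above (criteria 1, 2 and 4, and the baseline of criterion 7) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used to build Table 17: the function name, the tuple layout and the paper identifiers are hypothetical, each (paper, finding) pair is assumed to appear only once, and the judgment-based steps (criteria 3, 5, 6 and 8) are left out.

```python
from collections import defaultdict

LEVELS = ["very low", "low", "moderate", "high"]


def aggregate(papers, min_papers=3):
    """Aggregate findings from (paper_id, certainty, finding) tuples.

    Returns a dict mapping each retained finding to the number of
    supporting papers and its baseline quality of evidence.
    """
    support = defaultdict(list)
    for paper_id, certainty, finding in papers:
        # Criterion 1: keep only moderate- or high-certainty papers.
        if certainty in ("moderate", "high"):
            support[finding].append((paper_id, certainty))

    result = {}
    for finding, backers in support.items():
        # Criterion 2: at least three supporting papers are required.
        if len(backers) >= min_papers:
            # Criterion 7 (baseline only): lowest certainty among backers.
            baseline = min((c for _, c in backers), key=LEVELS.index)
            # Criterion 4: record the number of supporting papers.
            result[finding] = {"papers": len(backers), "baseline": baseline}
    return result


# Hypothetical input: identifiers and findings are invented for illustration.
example = [
    ("P1", "high", "project size lowers productivity"),
    ("P2", "moderate", "project size lowers productivity"),
    ("P3", "moderate", "project size lowers productivity"),
    ("P4", "low", "project size lowers productivity"),
    ("P5", "high", "reuse raises productivity"),
    ("P6", "high", "reuse raises productivity"),
]
summary = aggregate(example)
```

In this invented example, only the first finding is retained: it is supported by three papers of moderate or high certainty, while the second finding has only two supporting papers.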
The findings related to organizational and managerial aspects are confirmatory. The quality of evidence concerning the negative influence of project size on software development productivity is high. It is moderate in the case of the significant negative influence of team size on software project productivity. Professional experience, on the one hand, and technical and managerial capabilities, on the other, are determined to have significant positive impacts on software project productivity with moderate and low levels of certainty, respectively.
The findings related to technical aspects are somewhat more diverse. They cover the positive contribution of adopting development tools, reuse and rapid application development to software productivity, although these results are not determined with standard significance levels and with high quality of evidence. On the other hand, the influence of programming languages and software artifact complexity on software productivity is obtained with the standard significance levels and moderate quality of evidence. Finally, a negative result concerning the productivity of test-driven development is also determined with the same quality of evidence. Interestingly, the adoption of development tools, modern programming languages, intense software reuse, prototyping and test-driven development are characteristics of lean and agile development methods, whose positive and significant contributions to software productivity still need to be confirmed as a whole. Indeed, the relationship between productivity and culture in agile methods merits a thorough future investigation [73].

Methodological Lessons Learned and Recommendations
The methodological lessons reported in the analyzed papers or learned in the process of conducting the present study are now used in the derivation of recommendations for research and practice. These recommendations are derived here from applying the methodology described in the present study or from the contents of the included papers while investigating the systematic mapping research questions. Consequently, they are not a result of any systematic investigation process. Nevertheless, each recommendation is formulated because it was derived in the conduct of the present study or it was suggested or implied by at least three included papers. The importance of deriving methodological recommendations for the research and practice of software productivity is recognized in the literature. Petersen [29] mentions that different approaches should be compared with each other to provide valuable recommendations. Murphy-Hill et al. [89] point out that the impact of productivity research in SE would be improved with a multidimensional toolbox of productivity metrics and instruments, validated through empirical study and triangulation. Despite their generic formulation, these arguments highlight the importance of methodological recommendations as general principles to be considered in software productivity research and practice. In the broader context of SE, Brereton et al. [114] admit that some modifications to standard practices could significantly improve their value as a research tool and a source of evidence for practitioners.

Software Productivity Standards
Standards have paramount importance in ensuring non-ambiguous and uniform understandings of the terms and definitions adopted in the software productivity field. According to Boehm [21], it is vital to establish measurement standards. Indeed, standards related to software productivity provide lists of measures that serve as guidelines for collecting productivity data in different phases of development processes [11].
Two international standards explicitly address the software productivity theme, ISO 9126-4 and IEEE Std. 1045, but, unfortunately, their adoption in the research and practice communities is not widespread. Still, the literature recognizes the necessity of specific standards. For example, Maxwell, Wassenhove and Dutta [108] identify the need for an international standard for lines of code encompassing all procedural languages. Moreover, Trendowicz and Münch [105], in connection to software productivity standardization, mention as a drawback that many organizations assume measuring software productivity is similar to measuring other forms of productivity. In addition, Cheikhi, Al-Qutaish and Idri [11] suggest that standards would bring convergence and consensus on productivity measures and their factors, facilitating benchmarks and the repeatability and reproducibility of software productivity studies.

Lesson 1.
The software productivity community should seek to reduce the uncertainty concerning definitions related to software productivity by participating in standardization initiatives and standardization boards, apart from effectively adopting standards in research and practice.

Practitioner/Industry Involvement and Participation
There are challenges in achieving effective practitioner and industry participation in software productivity studies. Indeed, Bibi, Ampatzoglou and Stamelos [98] recognize that it is difficult to find volunteer professionals for experiments in industrial settings. On the other hand, Kitchenham and Mendes [22] mention the invaluable participation of executives in studies since they may have different perspectives on particular research problems, e.g., by assessing productivity as a more complex attribute than researchers do.
The potential benefits of collaboration must be made clear to attract industry practitioners and researchers. Rubin [78] proposes the implementation of corporate dashboards, presenting the selected measures from technological, business, customer and enterprise shareholder perspectives. Lavazza, Liu and Meli [93], in connection with the function point metric, mention that many public administrations and private organizations adopt contractual cost models based on the size of the software to be delivered as the only independent variable. They suggest that empirical studies can help apply the best practices based on objective knowledge, thus avoiding macroscopic mistakes. Based on comparative performance metrics, Tsunoda et al. [53] suggest that project managers should take into account the balance of delivery date and cost in planning their activities, and consider both a decrease in productivity in replacement projects and the trade-off between loss of productivity and savings in terms of staff costs through the use of outsourcing. These cases illustrate the benefits of participation.
However, there are risks for industry participation in studies, such as disclosing industrial or commercial secrets, and sensitive data or strategies. Software productivity researchers should propose appropriate risk mitigators, such as data anonymization.

Lesson 2.
In order to motivate involvement, software productivity researchers should seek the participation of industry practitioners and researchers in studies by presenting them the potential benefits together with the identified risk mitigators.

Software Productivity Data Collection
According to Scacchi [103], to understand the variables that affect software productivity, one has to answer the questions who and what to measure, as well as how to measure productivity. Section 3.2 focused on answering these questions.
Siok and Tian [72] provide some guidelines for data collection in empirical studies on software productivity: make sure that collected data is verifiable and complete, and understand the macro- and micro-level software processes and their assumptions. However, it is sometimes difficult to follow these guidelines. For example, Petersen [29] points out that studies should become more consistent in the way of describing the context and strive for high coverage of context elements.
It is important to take into account not only the convenience for stakeholders but also the employed resources in data collection. Indeed, Scacchi [103] identifies that programmer and manager self-reported data are the least costly to collect, although they may be of limited accuracy, and that outside observers can often collect such information but at a higher cost than self-report. Moreover, automated tools are recognized to be useful, but require more insight into what should be measured and how [103].
Another concern is the scope of data collection processes. Kitchenham and Mendes [22] mention the definition of random samples from well-defined populations as an outstanding problem. They also identify the need for methods for drawing conclusions from nonrandom and quasi-random datasets.

Lesson 3.
Software productivity data analysts should be concerned with data collection processes and data quality. They should always characterize the context and population under study in a precise way; propose in a justified manner sample, experiment or case study size; describe data sources, studied variables and data collection processes, with their time spans and collection instruments. Whenever possible, randomization should be adopted.

Usage of Productivity and Open-Source Code Databases
The efforts to standardize data definition and collection in public databases and OSS code repositories have been considered relevant and welcome, as mentioned in Section 3.2. According to Maxwell, Wassenhove and Dutta [108], data validity and comparability are maximized when all companies collect data using the same tool and every variable is defined. Lavazza, Liu and Meli [56] also point out that many projects in public databases represent consolidated practices and languages.
However, there are many challenges in adopting databases and repositories for software productivity research. Although some national and international research organizations have created specific projects for establishing databases of productivity measures along with the respective factors, in general, these projects have ended up fading away, as noted by Hernández-López et al. [32]. Indeed, according to Premraj et al. [50], among the challenges that researchers face are the potential risks of analyzing complex datasets without good communication channels with those associated with the actual dataset collection. Moreover, there is suspicion that innovative applications will always be in the minority in these sources, given their recentness [56]. In addition, according to Rodríguez-García et al. [39], extensive reprocessing is required to apply statistical or data mining techniques on public databases due to ambiguities, missing values, unbalanced datasets, etc.
Balancing the challenges and opportunities of public database and repository adoption in software productivity research, there is good potential for practical and useful applications. For example, it would be possible to use them as registers serving as sources of information concerning studies being carried out (before they are published). This practice would facilitate the assessment of risk of bias in studies [111].

Lesson 4.
Software productivity data scientists should seek to adopt and expand the practice of compiling productivity databases towards exploring new and innovative applications, taking into account the best practices and the associated challenges and opportunities.

Software Productivity Measurement and Analysis
As discussed in Section 3.2, it is a good practice to choose data analysis methods that best fit the problem under analysis, but the preconditions for their application are frequently not discussed in published papers on software productivity.
The first and foremost requisite for software productivity analysis is to understand productivity measurement. Measurement may involve several dimensions, and the definition of the respective scales should be an initial concern. According to Scacchi [103], the efforts to develop productivity measures for large-scale systems may lead one away from traditional quantitative measures towards symbolic and qualitative models that incorporate nominal, ordinal, interval and ratio measures. In addition, Cheikhi, Al-Qutaish and Idri [11] argue that productivity measures may be multidimensional and consider quality factors. Furthermore, Storey et al. [47] point out that productivity is multi-faceted (i.e., various factors influence it) and highly perceptual, since capturing developers' views of their own productivity can be a way to measure performance.
According to Hernández-López et al. [32], the analysis level should be considered in defining productivity measures, as they may cover a country, a business sector, an organization, a department, a project, a unit or an individual. Diverse measurement goals may exist in different contextual levels. This classification can be used not only for productivity factors, but also for software productivity itself [105].
Boehm [21] highlights that there are two primary ways of analyzing software productivity: the "black-box" or influence-function approach, and the "glass-box" or cost-distribution approach. Both make sense from a corporate perspective and managers should choose the approach that better serves their needs. Siok and Tian [72] provide guidelines for software productivity measurement and analysis: understand how the applicable analysis methods work and when to use them, and make corporate decisions based on the data and their analysis.
The adopted analysis methods should be understood and discussed to justify any choice and increase confidence in study findings. The best practices should be considered. In case statistical methods are adopted, frequency distribution, missing data, homoscedasticity, collinearity, goodness of fit, statistical power and effect size should be discussed whenever appropriate.
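Two of the diagnostics mentioned above can be illustrated with a minimal Python sketch based only on the standard library. The choice of Cohen's d as the effect size measure and of the Pearson correlation coefficient as a first pairwise check for collinearity between two predictors is an assumption made here for illustration; none of the included papers is claimed to use these exact measures, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Effect size between two samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences.

    A value close to +1 or -1 between two predictors suggests that
    they may be collinear and should not both enter a regression model
    without further analysis.
    """
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    spread = sqrt(sum((xi - mx) ** 2 for xi in x) * sum((yi - my) ** 2 for yi in y))
    return cov / spread
```

For example, with two invented productivity samples such as [2, 4, 6] and [1, 3, 5], `cohens_d` yields 0.5, and `pearson_r([1, 2, 3], [2, 4, 6])` yields 1.0, flagging two perfectly correlated predictors.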

Lesson 5.
Software productivity data analysts should choose productivity measurement and analysis methods considering the problem at hand. They should take into account the measurement level and approach, the corporate goals and the best practices in terms of analysis methods.

Confounding Factors
As identified in Section 6.2, few papers discuss confounding factors related to software productivity. Moreover, Petersen [29] points out that it is not always clear whether or not there are unknown confounding factors that influence study outcomes.
A first step to treat this situation consists in clarifying the distinction between productivity and other SE dimensions in each study. For example, Kitchenham and Mendes [22] mention the possible confounding between productivity and size differences. Cardoso et al. [26] point out that outcomes related to project performance may be confounded to productivity, such as customer satisfaction, product and process quality, team motivation, and cost reduction.
In addition, it is vital to identify confounders using specific knowledge regarding the studied context and problem. Management and process changes and product maturity can be considered confounding software productivity factors in a corporate environment [9]. However, process and product maturity are not entirely orthogonal and may also be confounded [40]. Many other variables may correspond to confounding factors. Tsunoda and Amasaki [55] list development type, unadjusted FP, duration, business sector, development platform, programming language and FP attributes as potential confounders. Mohagheghi and Conradi [18] mention context, size, programming language, complexity, task and methods concurrency, skills and knowledge. Bibi, Ampatzoglou and Stamelos [98] regard upgrading a system version to another one as a confounding factor. As shown, the same variables can be regarded as productivity factors or as confounders depending on the study setting.
It is interesting to note that studies adopting contemporary data collection and analysis techniques or addressing emerging topics in SE introduce novel potential confounding factors. For example, Krein et al. [94] consider that months without submissions to software repositories represent a confounding factor in developer productivity studies. Rafique and Misic [17] mention that the simultaneous adoption of other agile practices might confound the effects of TDD. Mantyla et al. [60] identify that, as a consequence of removing or disguising emotions in developer communications, comments collected in chats may become a cause of confusion. In connection to self-assessed productivity, Kuutila et al. [102] point out that experiences and events not related to work can have a confounding effect on mixed-effects models.

Lesson 6.
Authors of software productivity studies should clarify and analyze the software engineering dimensions that may be confounded with software productivity and the factors that may confound software productivity analysis.

Conduction of Studies on Software Productivity
Finally, the lessons learned in conducting the present study are presented. They are related to the formulation of research questions, the adoption of taxonomies, the writing of search strings and the avoidance of conflicts of interest in studies, particularly in literature reviews and systematic mappings.
Regarding the formulation of research questions in empirical studies on software productivity, at least two different approaches have been adopted. Here, the Goal Question Metric (GQM) methodology [19] is used since it was defined within the context of SE and appeared to be more aligned to the present study's specific subject and general objectives. A similar approach was adopted in [18]. In other SE studies, such as in [25,27,29], and different fields, such as Medicine, the Patient Intervention Comparison and Outcome (PICO) approach has been preferred. Both approaches help formulate research questions and facilitate the search for precise answers.
Although the GRADE system [16] explicitly prescribes PICO adoption, some difficulties in framing software productivity problems according to this approach exist. In particular, studies with observation, analysis and description goals would not be entirely compatible with the requirement of formulating intervention and outcome elements. In turn, research questions for measurement and prediction studies would have to be formulated in specific ways to comply with the PICO framework. From the methodological perspective, stricto sensu, interventions and outcomes would only be addressed in studies with action goals.

Lesson 7.
Authors of software productivity studies should prefer GQM over PICO. The adoption of PICO should always be justified in terms of the study goals and characteristics.
Another aspect that deserves attention is the use of SE taxonomies in software productivity studies. Even though adopting the SWEBOK KAs [33] here was the basis for study and paper classification, identifying specific KAs in papers was very time-consuming. This situation mainly happened because contemporary SE subjects are marginally addressed in the SWEBOK, such as those elicited in Sections 3.3 and 3.4: agile/lean practices, web development techniques, service-oriented architectures (e.g., SaaS), global development and others. For facilitating study and paper classification, it would be paramount to update the SWEBOK contents and provide therein more practice-oriented guidance, which would facilitate practitioner adoption.

Lesson 8.
The IEEE SWEBOK should be updated to cover emergent software engineering subjects and should contain more practice-oriented guidance.
A word of caution is required regarding the formulation of search strings for systematic mappings and literature reviews. While adopting multiple alternative search keys may return a nearly intractable number of references, extremely narrow search criteria (or even mistakes in formulating queries) may miss important references that should be analyzed. In retrospect, it is also possible that a search string that produced a reasonable result on one occasion will fail to have the same results in the replication of a study, as reported in Section 2.4. Consequently, search strategies should be formulated considering variations in the search string and the adopted bibliographic reference databases, apart from adopting alternative methods of reference discovery.

Lesson 9.
Authors of systematic literature reviews and mappings on software productivity should formulate strategies of paper screening considering variation in the adopted search string and bibliographic reference databases, apart from using alternative methods of reference discovery.
It is also important to highlight the importance of evaluating conflicts of interest for determining the findings of the present study, as they may have impacted included study design, conduct and reporting [111]. Conflicts of interest were identified as the second most frequent cause of perceived risks of bias in included papers, behind risks of research biases and ahead of risks of publication bias, as reported in Section 6.1. The most frequent justifications for these perceptions of conflict were author affiliations and sources of data, technology or funding. Despite this, it is crucial to recognize that perceived conflict identification was possible only due to the reporting transparency of the included papers.
In order to manage or avoid conflicting situations, it would be necessary for authors of empirical studies on software productivity to adopt specific guidelines for ensuring research quality and transparent reporting. These comprise clear and explicit statements of author affiliations, sources not only of funding but also of technology and data, as well as of conflicts of interests in papers. The incentives for study participation and disclosure limitations on research data and findings should also be reported.

Lesson 10.
Authors of software productivity studies should ensure research quality and transparent reporting by including in their papers clear and explicit statements of author affiliations, sources of funding, technology and data, and conflicts of interest, in addition to transparently reporting incentives for study participation and disclosure limitations on research data and findings.
These lessons learned and recommendations for industrial practice are not complete and should be used in conjunction with others formulated from different perspectives (cf. [110,114]).

Construct and Internal Validity
Systematic mappings and literature reviews are susceptible to subjectiveness and inaccuracy in the chosen notions and the lack of rigor and precision in the formulated definitions, leading to construct validity threats. In order to mitigate the former kind of threat, the adopted notions were discussed with experienced researchers in meetings and conferences. The latter type of threat was mitigated by selecting authoritative taxonomies whenever available. That is why the definitions in the SWEBOK [33] and the study types and productivity approaches defined in [18] were adopted here.
The extent to which the design and conduct of each systematic mapping and literature review are likely to prevent systematic error, that is, their internal validity, is threatened by reviewer biases. Here, the main threat to internal validity is that a single researcher conducted the entire study. In order to mitigate the associated threats, the procedures and findings reported here were manually checked and later rechecked using the ROBIS tool (www.robis-tool.info, accessed on 24 April 2022) [115]. First, the tool identifies concerns with the review process by assessing study eligibility criteria, study identification and selection procedures, data collection and study appraisal, and synthesis and findings. Next, a judgment is reached concerning the possibility of review bias. The self-application of the tool questionnaire in the present case resulted in a judgment of low risk of bias. The dissemination of the review data and protocol in the way described in the Supplementary Materials item ensures additional confidence and transparency of the reported findings.
The possibility of bias in including publications for review also threatens internal validity. The standard way to avoid paper selection threats is to follow a definite research methodology and review protocol. In the present study, the recommendations suggested by Kitchenham and Charters in [14] and the PRISMA guidelines [15] were simultaneously adopted, together with a pre-established review protocol. However, the choice of the year 1987 to define the beginning of the reference search period could represent a threat to the reported research. This choice was arbitrary, although it considered the existence of a series of impactful publications from that year onward. Earlier publications, even if they were available online for analysis, most likely would not comply with the inclusion and exclusion criteria. Moreover, including the few remaining publications in the present study would not significantly affect the systematic mapping findings.
An additional source of internal validity threats in systematic mappings and literature reviews is the existence of relevant undetected papers. Indeed, [14] warns that no single search can find all relevant studies. In the present case, DBLP and Scopus were adopted as sources of bibliographic references. DBLP is an open and curated tool covering the most relevant sources of SE research and Scopus is one of the main expertly curated sources of scientific research. Recursive backward snowballing was also adopted to identify additional bibliographic references. The paper screening process returned 495 references, from which 99 papers were selected for inclusion in this study. The number of included papers is larger than the number reported in each line of Table 16. Despite these mitigators, it is essential to recognize that many other studies on software productivity based on empirical methods exist, particularly those published in the proceedings of regional events and journals. Still, in practice, it is almost impossible to cover all regional sources of publication concerning SE while ensuring fairness of treatment, due to constraints such as paper availability and knowledge of many different foreign languages. For the same reasons, and because it is difficult to identify in SE [29], the grey literature was not reviewed here. Furthermore, it is important to point out the lack of online paper availability as another source of similar threats. However, this is a usual limitation in systematic mappings and literature reviews, as recognized in most of the related work analyzed in Table 16.
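The recursive backward snowballing mentioned above can be illustrated as a breadth-first traversal of reference lists, in which only the references of eligible papers are followed further. The following Python sketch rests on stated assumptions: the function name, the toy citation data and the eligibility predicate are hypothetical and do not reproduce the screening procedure or the data of this study.

```python
from collections import deque


def backward_snowball(seeds, references_of, is_eligible):
    """Return (included, screened) paper sets found by following reference lists."""
    included, screened = set(), set()
    queue = deque(seeds)
    while queue:
        paper = queue.popleft()
        if paper in screened:
            continue  # each paper is screened only once
        screened.add(paper)
        if not is_eligible(paper):
            continue  # references of excluded papers are not followed
        included.add(paper)
        queue.extend(references_of(paper))
    return included, screened


# Toy citation data (hypothetical): paper -> papers in its reference list.
CITES = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": ["A"]}
ELIGIBLE = {"A", "B", "D"}

included, screened = backward_snowball(
    seeds=["A"],
    references_of=lambda p: CITES.get(p, []),
    is_eligible=lambda p: p in ELIGIBLE,
)
print(sorted(included), sorted(screened))
```

In this toy example, paper E is never screened because it is cited only by the excluded paper C. This shows how relevant papers can remain undetected even when snowballing complements database searches.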
The validity of included studies may be threatened by confounding factors, which make it impossible to distinguish the effects of two interventions from each other. The SWEBOK [33] lists economic friction (everything keeping markets from having perfect competition), ecosystems and outsourcing/offshoring as confounding factors related to software engineering economics in general and software productivity in particular. However, confounding depends on the context of each study and requires specific knowledge to be identified. Still, confounding factors are rarely studied in the reviewed literature [18], providing evidence that the risk of bias due to confounding is regarded as low in included papers. This situation suggests that additional attention is needed to confounding factors in studies related to software productivity.

External Validity
External validity corresponds to the extent to which the reported results are reliable and can be generalized to other populations and settings [14]. It is challenging to perform external validity analyses of literature reviews and systematic mappings because they analyze and synthesize the diverse findings of other studies. Nevertheless, the analyzed papers correspond to a representative sample of the industrial practice of software productivity in the studied period and the adopted methodology is sufficiently transparent to be replicated considering other time frames and studies.

Concluding Remarks
This paper provides evidence of different empirical perceptions of software productivity within the distinct business sectors and KAs covered in the industrial practice of SE. There are also many commonalities in approaching software productivity in these KAs and sectors, primarily due to the adopted analysis methods and their respective measures. The research findings in included studies were analyzed and synthesized and a list of recommendations for industrial research and practice was derived based on lessons learned in the respective papers and in conducting the present study.
The main contributions of the reported research have practical and methodological significance. From the methodological perspective, applying the PRISMA guidelines in SE, as outlined here, is innovative and demonstrates the feasibility of borrowing empirical study analysis methods from other fields, particularly the GRADE system from the healthcare sector. In addition, it is expected that the set of methodological recommendations derived here will help industry practitioners in addressing the software productivity subject and developing further research. From the practical perspective, factors that affect software productivity were elicited from included studies and classified according to organizational/managerial and technical categories. In particular, the reported research demonstrates that the impacts of agile development practices on software productivity have great variability and that their positive and significant contributions still need to be confirmed.
The strengths/trends and weaknesses/gaps in analyzed studies suggest directions for future research. A more holistic approach in software productivity studies is needed [11], covering more or unabridged KAs (SR and SEMM), sectors (noticeably industry, retail and health care) and environmental factors (economic friction and ecosystems), while treating the lack of standardization and sufficient reporting in studies. The development of more confirmatory, replication [50] and multi-company studies is required [32], together with studies with analysis, description and action goals. Over the years, the historical evolution of this field, with a noticeable increase in the number of published studies, provides continued evidence that software productivity is still considered an important subject within SE. Only with more authoritative practice-oriented industrial-scale studies will the quality and certainty in the body of evidence increase.

Appendix B. Risk of Bias Assessment Tables
Table A1 presents the assessment of the overall risk of bias in each included study.

Key Reference D1 D2 D3 D4
[26] Unclear Unclear Low Unclear
(HernandezLopezPG13) [8] Unclear Unclear Low Unclear
(MohagheghiC07) [18] High Low Low Unclear
(OliveiraVCC17) [27] Unclear Unclear Low Unclear
(OliveiraCCV18) [28] Unclear Unclear Low Unclear
(Peter11) [29] Unclear Low Low Low
(RafiqueM13) [17] Low Low Low Low
(ShahPN15) [30] Unclear Unclear Low Unclear
(WagnerR08) [2] High Low Unclear Unclear

D1 = risk of research bias, D2 = risk of reporting bias, D3 = other sources of risk of bias, and D4 = overall risk of bias.

Table A2 presents the justifications for increasing the risk of bias levels perceived in some included studies.

Key
Explanation for Downgrading