Genetic Algorithm-Based Fuzzy Inference System for Describing Execution Tracing Quality

Execution tracing is a tool used in the course of software development and software maintenance to identify the internal routes of execution and state changes while the software operates. Its quality has a high influence on the duration of the analysis required to locate software faults. Nevertheless, execution tracing quality has not been described by a quality model, which is an impediment while measuring software product quality. In addition, such a model needs to consider uncertainty, as the underlying factors involve human analysis and assessment. The goal of this study is to address both issues and to fill the gap by defining a quality model for execution tracing. The data collection was conducted on a defined study population with the inclusion of software professionals to consider their accumulated experiences; moreover, the data were processed by genetic algorithms to identify the linguistic rules of a fuzzy inference system. The linguistic rules constitute a human-interpretable rule set that offers further insights into the problem domain. The study found that the quality properties accuracy, design and implementation have the strongest impact on the quality of execution tracing, while the property legibility is necessary but not completely inevitable. Furthermore, the quality property security shows adverse effects on the quality of execution tracing, but its presence is required to some extent to avoid leaking information and to satisfy legal expectations. The created model is able to describe execution tracing quality appropriately. In future work, the researchers plan to link the constructed quality model to overall software product quality frameworks to consider execution tracing quality with regard to software product quality as a whole. In addition, the simplification of the mathematically complex model is also planned to ensure an easy-to-tailor approach to specific application domains.


Introduction
Execution tracing is a tool used in the course of software development and software maintenance to identify the internal routes of execution and state changes while the software operates. As a term, it is used interchangeably with logging in the scope of this study. Execution tracing quality has a high influence on the duration of the analysis required to locate software faults, especially in cases when reproducing the issue with a debugger is not a feasible option [1]. Nevertheless, execution tracing quality has not been described by a quality model, nor is it included in software product quality frameworks in an appropriate manner, which is an impediment while measuring software product quality and formulating quality targets for software products [1,2].
Moreover, the ability to describe and assess a software product from any point of view that involves human evaluation results in uncertainty to some extent [3]. Nevertheless, the evaluation of how the software operates and relates to its environment and the extent to which the end user is satisfied with it are aspects that cannot be covered by static code analysis, only by human assessment. The input variables of logging quality elicited in [1]-accuracy, design and implementation, legibility, and security-involve human assessment. Thus, modelling the problem requires the implicit uncertainty to be handled.
Consequently, there is a significant gap in the field, which the present study addresses and attempts to solve by defining a quality model for execution tracing. Such a quality model makes it possible to consider each previous individual endeavour for logging improvement and to see their relations to each other and to the quality as a whole, as well as to articulate general guidelines or requirements with which good execution tracing quality can be achieved. Furthermore, the model would support the measurement and assessment of logging quality, allowing researchers to compare the same project at different milestones of its development and maintenance or to compare different projects.
To describe the uncertainty associated with the quality modelling process, fuzzy logic [16,17] was used in this study. A basic and concise summary of the main concepts of fuzzy set theory and fuzzy logic is presented in [18]. The current research preferred fuzzy logic to evidential reasoning [19,20] due to its efficient modelling language implemented by linguistic rules and its capability to build adaptive systems. Linguistic rules are if-then rules that contain linguistic variables in the form of fuzzy sets [21,22]. As fuzzy sets make the description of a partial membership in a given set possible, the antecedents of the linguistic rules do not require a full match with the given conditions. In such cases, the consequent part is also partially applied according to the degrees of memberships. The effects of the different consequent parts of the linguistic rules are combined by the inference mechanism implemented, which is Takagi-Sugeno-Kang inference [23] in the present study. The basic steps of how to define a fuzzy inference system can be summarised in the following manner: (1) determining the inputs and input partitions-i.e., the regions to describe with fuzzy sets, (2) determining the outputs and output partitions, (3) determining the type of inference, (4) defining the membership functions for the inputs and for the output, (5) defining the linguistic rules, and (6) testing the system. The parameters of the membership functions and the number of the linguistic rules can be tuned adaptively [24][25][26][27].

Research Question
The goal of the study can be formalised in the form of a research question in the following manner: RQ: How can execution tracing quality be modelled with regard to the uncertainty inherently present in the quality measurement process?

Research Protocol
Conducting the research encompassed the following steps: (1) investigating the current state of the research in the field, (2) determining the goal of the study, (3) designing the data collection including (a) the identification of the study population and (b) the determination of the sampling method, (4) collecting the data with an online questionnaire [28], (5) performing exploratory data analysis and preprocessing the data, (6) constructing the model by the extraction of the linguistic rules of a fuzzy inference system by means of genetic algorithms from the collected data, (7) verifying the constructed model's performance through a different machine learning approach (ANFIS) [24,25], (8) carrying out pre-validation and model adjustments in a mini-focus group, (9) validating the constructed model by engaging international experts through another online questionnaire [28] and tuning the model based on the feedback, and (10) writing the research report.

Contributions
The study contributes to the body of knowledge in the following fields: (1) data collection based on human experiences regarding execution tracing quality, (2) using intelligence systems to extract knowledge from the accumulated human experiences on logging, and (3) constructing a quality model to describe execution tracing quality with the consideration of the inherent uncertainty present in quality measurements. The constructed quality model can appropriately describe execution tracing quality.

Structure of the Study
The structure of the research report follows the steps of the research protocol. Section 2 gives an account of the works related to the current study; Section 3 presents the research methods, including the data collection, the sampling strategy and modelling of human experiences; Section 4 highlights the results with regard to the collected data, the exploratory data analysis, and the model construction; Section 5 is devoted to the validation of the study; Section 6 outlines the limitations; Section 7 presents the discussions, and the research closes with the concluding remarks and future work plans in Section 8. Appendixes A-F highlight important steps of the model construction stages, including verification and model validation.

Related Works
The studies related to the current research lie in two different fields: (1) logging quality, which forms the problem domain; and (2) quality modelling with adaptive methods, including the consideration of uncertainty, which constitute tools to deal with the problem domain from the point of view of modelling. The major points in both fields relevant for the research conducted are briefly summarised below.

Logging
Overall guidelines on how to construct logging do not exist but would be desirable as stated in [4,5,8,10,29]. Present studies mainly focus on three distinct areas: (1) where to insert log statements in the source code of the application [6][7][8][9]11,[13][14][15], (2) what data to log to make sufficient information available for error analysis [10,[12][13][14] and (3) how to log [5,13,14], as summarised in Table 1. Moreover, tool support is provided to offer assistance for each of the three facets listed above [6,[10][11][12][13][14][15]. The Apache Common Best Practices formulates recommendations for creating consistent severity levels in the logging code statements, but the description does not contain overall guidelines for execution tracing [30]. Zeng et al. explicitly claim that the usefulness of logging depends on the quality of logs, and support is needed for logging decisions with respect to (1) the severity level, (2) logging code location, (3) text in the log message and (4) when to update the log statements [31]. x [11] x [8] x x x x [13] x x x x [14] x x [6] x x [10] x x [12] x [5] Yuan et al. conducted a study on the efficiency of logging practices, in which they included 5 large software systems, 250 randomly chosen error reports, and the analysis of the related source code [13]. They found that the error analysis based on the existing log data caused a problem in 57% of cases [13]. Furthermore, log messages can reduce the median time of analysis required to locate errors and accelerate the diagnosis between 1.4 and 3 times in terms of median values [13,32]. In addition, 39% of the program faults were not logged in the investigated cases [13].
Moreover, Yuan et al.'s measurement in [32] including four large C/C++ software projects concluded that logging is pervasive, as 30 lines of source code contains a log statement on average. The study was replicated by Chen at al. investigating Java projects, who found on average a log statement for each 51 lines of source code [33], which supported the claim of tje pervasiveness of logging.
Logging mechanisms can be constructed by a proprietary log implementation or by the use of log libraries. Kabbina et al. investigated 223 Apache Software Fundation projects and found that log library migration took place in 33 of them [34]. More than 70% of the migrated projects had post-migration errors [34]. In addition, the median log library migration time amounted to 26 days for the 33 projects [34], which indicates that log library migration is not a straightforward and trivial task.
Zheng et al., based on the analysis of 1.444 open-source Android applications, found that logging practices deviated from the practices applied in desktop and server applications; moreover, log statements in the mobile applications are less pervasive and far less maintained than in their desktop and server counterparts [31].
Galli et al. investigated the complete software product quality models and their tailored variants published in the past 20 years [2]. The term complete software product quality model refers to quality models that aim to cover and describe each presently known aspect of software product quality. The study identified 23 software product quality model families, as listed in Table 2, and found that none of them considers logging quality in an appropriate manner [1,2]. Ulan et al. [92] In spite of the extensive research in the field, the basic issue remains: there are no overall guidelines available for logging, and no model exists to measure the quality of logging as a whole. Furthermore, it is not feasible at present to quantitatively compare logging quality in the same project at different milestones or to compare the logging quality of different projects. Galli et al. explored the variables on which logging quality depends [1]. This step opened the way to the creation of a quality model for execution tracing to consider each identified aspect of its quality and to perform quantitative measurements.

Adaptive Methods in Quality Modelling
As human experiences and human qualities need to be considered while modelling execution tracing quality, artificial intelligence (AI) methods were used in the present study. AI and machine learning have found several application areas in the field of software quality modelling, including software defect prediction, bug report classification, uncertainty modelling, and multi-criteria decision-making problems as summarised below. Nevertheless, none of the listed studies focused on logging quality.
Lai et al. lay down key performance metrics for computer network quality and predict customer satisfaction by means of recurrent neural networks [93]. Software defect prediction became an intensively researched field with different machine learning methods, including AI, used to forecast errors or system breakdowns [94]. Pradhan et al. use machine learning to predict software defects in large systems [95]. Madera and Tomon apply machine learning to identify source code artefacts that are probably endangered by software defects [96]. Software defect prediction is highly influenced by the amount of availability of data for training the machine learning models, including neural networks, SVM, KNN, K-Means Clustering, Naive Bayes, decision trees, logistic and linear regression models, as well as their combinations with ensemble learning [97]. Khan et al. use transfer learning to utilise the data of different projects and to overcome the data availability barrier [97]. Blas predicts the quality of the software by means of simulation and modelling based on the architecture defined in [98]. Xing et al. predict software quality based on complexity-related quality metrics by means of SVM [99]. Lafi et al. apply classification for bug reports to reduce the maintenance efforts [100]; their method assigns an action to the fault report with the following labels: (1) faults to repair, (2) new functionality desired, (3) existing functionality needs change, and (4) the software needs adaptation to new environments. Ubayashi et al. give an account of the uncertainty that developers face in the software development [101]. Their publication reveals the potential for the use of fuzzy logic. Singh et al. determine four quality metrics-(1) separation of concerns, (2) coupling, (3) cohesion, and (4) size-to describe the reliability of aspect-oriented software systems while they use fuzzy modelling [102]. The linguistic rules were constructed on the basis of the opinions of experts. Li et al. defined a quality model for software products involved in digitalising antiquarian resources with the use of fuzzy logic [62]. Li et al. took over concepts from the ISO/IEC 9126 standard [64], but they defined a separate quality model [62]. Fuzzy mathematics is applied to evaluate the membership of the defined quality properties in the fuzzy sets: excellent, good, general, unqualified [62]. Liang and Lien extend the quality model of the ISO/IEC 9126 standard [64] to encompass enterprise resource planning (ERP) software-related properties [65]; moreover, they describe the selection of the optimal ERP software as a multi-criteria decision-making problem and use a fuzzy analytic hierarchy process to produce a ranking with regard to the criteria considered [65]. In addition, Aggarwal et al., Nerurkar et al., Canfora et al., and Mittal and Bhatia applied fuzzy modelling to consider qualitative and quantitative data to assess the reusability and maintainability of software components in [103][104][105][106]. Galli et al. carried out a pilot study in [18,107] to test the performance of the different fuzzy inference methods with different membership functions in the context of modelling execution tracing quality. The study resulted in the findings that (1) Takagi-Sugeno-Kang inference with overlapping Gaussian-shaped membership functions achieved the best performance, and (2) the manual linguistic rule creation is error-prone, and contradictions can easily be introduced in the rule-base. For these reasons, the study concludes that Takagi-Sugeno-Kang inference with overlapping  Gaussian-shaped membership functions should be used to describe execution tracing quality with an adaptive fuzzy system, where the linguistic rules are constructed automatically  from the collected data. Malhotra and Lata outline the machine learning approaches used for software maintainability in [108].

Methods
In this section, the research methods are briefly explained, including the data collection, modelling, pre-validation, and validation stages. As the study is empirical, it collects and processes experiences of software professionals accumulated in the IT domain, as illustrated in Figure 1.
Uncertainty and vagueness are inherently present in the software product quality measurement and assessment process as explained in Sections 1 and 2. In the scope of a pilot study [107], it was tested how the uncertainty and vagueness in the context of execution tracing quality can be approximated in the most appropriate manner. The findings of the pilot study [107] were used in the current research; therefore, an adaptive fuzzy system with overlapping, Gaussian-shaped membership functions at the inputs and with Takagi-Sugeno-Kang inference [23] was selected for modelling.

Data Collection
The study population was defined as an approachable and geographically localised group of software professionals in Hungary including software developers, software maintainers, and testers, from software companies of the competitive sector, the main activity of which is in the IT service delivery domain (TEAOR ID 620× at the Hungarian Central Statistical Office); furthermore, the employee count of the companies exceeded 1000 people. Many companies may have the TEAOR ID 620× or its sub-IDs beside having a different main TEAOR ID, but only those were included in the study population for whom the main activity belonged to the TEAOR ID 620× or to one of its sub-IDs. Such software houses are usually international software companies with coding guidelines and a company culture that spans across countries, including premises also in Hungary. For this reason, the geographical localisation does not impose an impediment for the generalisability of the study. However, extra measures were taken, as described at the validation stage, to mitigate this risk.
The size of the study population is estimated to be N = 12.107 (data originate from August 2017) full-time employees. Multistage random sampling [109,110] was applied to achieve the desired target; i.e., the companies of the study population were sampled, and then the employees of the selected companies were sampled. Random sampling possesses several advantages over stratified sampling [110], but the major advantage in the research context can be characterised by the fact that the different strata do not need to be known in the study population in advance to achieve a trustworthy sample.
Information and consent letters were sent to the management of the chosen companies, and in the case of approval, employees were also selected. Before involvement, each selected employee was informed about the research to be able to give informed and voluntary consent for their participation in the study.
The data were collected through an online questionnaire [28], which was tested by two software professionals before sending it out. Furthermore, the ω coefficient and Cronbach's α were computed for the collected data. The online questionnaire [28] collected the input variable values-i.e., values for (1) accuracy, (2) legibility, (3) design and implementation, and (4) security-and the corresponding values for the output variable-i.e., for execution tracing quality to produce the data for supervised machine learning. The online survey contained three parts.
Part 1 of the questionnaire [28] introduced eight use cases with a textual description of each use case to provide background information and sample log data. The demonstrated use cases showed different execution tracing mechanisms and trace outputs ranging from poor quality to good quality, where the respondents had to assign a score value in the range of [0; 100] to each use case for the inputs (1) accuracy, (2) legibility, (3) design and implementation, and (4) security; moreover, they assigned a score for the output: execution tracing quality. Necessary variable definitions were provided for the respondents. Altogether, 40 questions were contained in this part of the survey.
Part 2 of the questionnaire [28] collected data about the quality of logging in two real projects to which the respondents were assigned in the past. In addition to the input variables and the output variable given above, the software professionals had to answer a question about the type of the application: (1) server application, (2) desktop application, (3) web UI, (4) mobile application, or (5) embedded application. Altogether, 12 questions were contained in this part.
Part 3 of the questionnaire [28] introduced extreme input variable value combinations for (1) accuracy, (2) legibility, (3) design and implementation, and (4) security, which might not be frequent in real-life but which need also to be considered by the model construction. The respondents had to assign an output variable value-i.e., execution tracing quality-to each listed combination of inputs. This part of the survey contained 23 questions altogether.
Two further questions were also included in the questionnaire [28] about the professional background of participants and about the opinion of the respondents on whether the type of the application-(1) server application, (2) desktop application, (3) web UI, (4) mobile application, or (5) embedded application-had an influence on the responses to the questions in the survey.

Data Processing, Exploratory Data Analysis, and Modelling
After concluding the data collection stage, the data were cleaned-i.e., checked for valid responses-and the distribution of the data was analysed, exploratory data analysis was carried out, outliers were removed, and the confidence intervals were determined. The variable with the worst-i.e., the largest-confidence interval was considered; moreover, with regard to the confidence interval and to the statistical reliability of the sample, the model validation was designed. The collected data were reshaped to achieve the format necessary for machine learning, and the data were randomly split into a training and a checking set. The training set was used to fit the model to the data, while the checking set was used to test the training process and to avoid overfitting. Constructing the quality model was accomplished by an adaptive fuzzy system, in which genetic algorithms provided the learning capability [27]. The performance of the constructed model was verified by a different machine learning approach: ANFIS [24,25].

Model Adjustment and Pre-Validation
The created fuzzy systems with different numbers of linguistic rules were reviewed by a mini focus group containing three software professionals with many years of experience in industrial and academic settings. The optimal model was selected, adjusted, and checked for validity.

Model Validation
The data used for model validation were collected by a second online questionnaire [28]. The responses stemmed from selected international experts located in different countries, including Austria, Germany, and Hungary (countries are listed in alphabetic order). The questionnaire [28] contained the linguistic variables and the model in the form of linguistic rules; moreover, the impacts of the model's pairwise inputs in the output were also depicted in charts. The experts were asked to check the model and its fit and the model's responses to the changes in the inputs through an online meeting if necessary. In addition, the authors also validated the model prepared by face validity. Thus, the UK was also represented in the list of countries of origin of the experts to perform the validation. The model was tuned according to the responses of the experts. Consequently, the risk of geographical localisation in the course of the data collection from international companies was mitigated.

Data Collection
The employee count of the software companies which declared themselves to be ready to participate in the study amounted to approximately 6000 people. Thus, the first stage of sampling covered nearly 50% of the study population. The goal was to achieve a completely random sample from the companies involved, possibly selecting individuals randomly from the register and then contacting them, but this was practically infeasible. Thus, all the employees of selected areas at the selected companies received an email notification with information and with the link to the online questionnaire [28] to fill in. The individuals of the selected areas of the selected companies showed low response rates, which might have been caused by the long and complex questionnaire [28]. Overall, 41 software professionals filled in the survey, which had three parts with 77 questions in total.
The reliability indicators, ω coefficient, and Cronbach's α in Table 3 were computed for each separate part of the questionnaire [28] after removing the rows with NA values and removing the outliers as discussed in Section 4.2. The extreme high reliability indicator values for the use cases might have been caused by the repeating questions, which are valid in the context of the present research as each use case represents a different setting, where the respondents rated the corresponding input and output variable values.

Data Processing and Exploratory Data Analysis
The data collected were examined for each question answered, and the responses were depicted in box plots to see the dispersion and the outliers. The questions in part 2 of the survey constituted an exception from the outlier detection as each respondent could think of a different project in the past, leading to valid input-output data pairs from the point of view of the data collection, but they did not necessarily come from the same probability distribution.
The box plots depict the Q1 and Q3 quartiles with a box, which is called the interquartile range; i.e., 50% of the data fall into this range. The median is shown by the middle bold line in each box, while the outliers are highlighted with dots over the whiskers. The distance of the whiskers (W) from the Q1 and Q3 quartiles is computed by 1.5 times the interquartile range. If the exact W value computed is not present in the data set, then the whiskers are placed to the closest value towards the median.
The chart on Figure 2 illustrates the dispersion of the assigned quality scores-i.e., the output values-of the defined use cases. An upwards trend in the median value can be observed, which mirrors the description of the use cases and the quality of the log excerpts rated by the respondents. The box plots of the input variable values are given in Appendix A and depicted in the charts of Figures A1-A4. In addition, Figure 3 highlights the Q1, Q3, and median values of the quality scores assigned to the combination of the extreme input values. The data show an asymmetric dispersion and skewness regarding the majority of the collected variables. Nevertheless, quantile-quantile plots were computed to check how the collected data approximate the theoretical normal distribution. The charts, presented in Appendix A.2, show the 0.95% confidence intervals and the computed Sharpio-Wilk normality tests. Before computing the quantile-quantile plots, the outliers and the rows with NA values were removed. The use case data are illustrated in Figures A5-A12. The following variables do not approximate the theoretical normal distribution on p Sharpio−Wilk < 0.05: use case 1: all variables; use case 2: accuracy, security; use case 3: accuracy; use case 4: none of the variables; use case 5: design and implementation; use case 6: design and implementation, quality; use case 7: accuracy, quality; use case 8: accuracy, design and implementation. The collected data about real projects are not normally distributed, as shown in Figure A13, but they may represent different projects. The following variables do not approximate the theoretical normal distribution on p Sharpio−Wilk < 0.05 regarding the data assigned to the extreme input variable values Q2, Q4, Q8, Q10, Q12, Q13, Q14, Q15, Q16, Q17, Q18, Q19, and Q21 as illustrated in the charts of Figures A14-A17. The variables Q20 and Q22 were excluded from the normality test as they only possessed identical values. As many of the variables collected were not normally distributed, as illustrated above, the p = 90% confidence interval was computed by the non-parametric Wilcoxon test [111]. To highlight the uncertainty related to all questions in a meaningful way, the lower and upper values of the confidence intervals are not reported below but the range they encompass.
The ranges of confidence intervals for the input and out variable values without the outliers, related to the described eight use cases, possess the following characteristics: min value: 9.99, max value: 24.99, median: 12.49. The 24.99 range of the confidence interval emerged for the values of the input variable security for use case 1 and use case 2, while the ranges of the other variables approximated the median.
The assigned execution tracing quality-i.e., the output variable-to the extreme input variable value combinations showed a lower dispersion than the responses to the use cases. The most definite answers were found when all the four input variables had the maximum value of 100 for question Q20 and when they all had the value of 0 for question Q22. In these cases, 100 was assigned for Q20 and 0 was assigned for Q22 by all respondents, not counting the outliers. The ranges of the confidence intervals show the following characteristics after removing the outliers: min value: 1, max value: 15, median: 10.
In conclusion, the collected data after removing the outliers show a median value of 12.5 for uncertainty related to the constructed use cases and median value of 10 for uncertainty related to the extreme input variable value combinations on a scale from 0-100 when p = 90%. Thus, the uncertainty on the scale 0-100 at this specified probability value lies between 10 % and 12.5 % with regard to the medians of the confidence intervals. An approximately 10% uncertainty while rating software quality on a ratio scale is an acceptable value; however, the small sample size (n = 41) makes a stronger validation necessary, as usual.

Modelling: Machine Learning
The data collection and exploratory data analysis resulted in a tabular representation of the data in which the questions of the survey form the columns; however, another data format is necessary for machine learning that contains the input and output pairs. To achieve this, the data were transformed into a matrix representation, where the columns represent (a) the four input variables-(1) accuracy, (2) legibility, (3) design and implementation, and (4) security-and (b) the corresponding output: execution tracing quality. The rows of the matrix represent input and output pairs. Where an outlier was marked in the rows of the matrix, the complete row was removed. After reshaping the data this way and removing the outliers, 1185 pairs of inputs and the corresponding output were created; i.e., the matrix possessed 1185 rows suitable for machine learning. The data set was randomly split into proportions of 70% and 30% for training data and checking data, as introduced in Section 3.
Modelling was performed in the Matlab R2021a environment, where a fuzzy inference system was created with four inputs and one output, according to the data set above. The input membership functions and their partitions are shown in Figure 4. The output membership functions were defined as five constants; i.e., zero-order membership functions were used in the output at the values {0, 25, 50, 75, 100} for the singleton fuzzy sets {very poor, poor, medium, good, very good}. The fuzzy inference system was constructed with the inference type Takagi-Sugeno-Kang [23] with no initial linguistic rules. In the scope of machine learning, the system was adapting to minimise the root mean squared error of the input-output data pairs, while the learning was carried out with genetic algorithms [112], and the numbers of possible linguistic rules were limited to {1, 3,5,8,12,16,20,40,80, 160} per run. Genetic algorithms have a non-deterministic nature; i.e., several runs of the same algorithm with the same parameters in the same problem domain may achieve different but similar results. Therefore, each run with the same parameters and with the same upper bound settings for the linguistic rules was executed five times, and the results were recorded. In addition, the whole run-suite was executed six times.
Thus, six iterations of five models were created for each rule limit in the set {1, 3, 5, 8, 12, 16, 20, 40, 80, 160}. In addition, the best results of the five models were collected for each rule limit in each of the six run suites. The predictions of the models with different numbers of maximal rules were analysed with regard to the mean absolute error (MAE), root mean squared error (RMSE), and minimal and maximal deviation from the desired target in the course of each run. The runs with the best performance, which come from run suite 2, are depicted in Figure 5, and the corresponding data are illustrated in Table 4 with regard to the training data set; moreover, the same runs are also shown in Figure 6 with regard to the checking data set, and the corresponding data are highlighted in Table 5. The RMSE indicators for the training data and the checking data are shown in Appendix B in Figures A18 and A19; furthermore, the corresponding data are highlighted in Tables A1 and A2. The machine learning process was ceased when the new generations produced by the genetic algorithm reached a saturation point with regard to the mean RMSE value of the individuals in the population.
It is important to consider that when the maximal number of linguistic rules would have allowed 80 or 160 rules, then the number of rules learnt by the systems was usually below 80 linguistic rules.    The data shown in Figures 5 and 6, as well as the data in Appendix B in Figures A18 and A19, indicate that increasing the number of linguistic rules results in a better accuracy up to the approximation with 12 rules. Exceeding this value does not lead to better performance. Consequently, the extracted rule set with 12 rules became the target for further processing. The extracted rule sets of the selected run suite 2 are listed in Appendix C with their evaluation charts for the checking data set.

Modelling: Verification of the Constructed Model by a Different Machine Learning Approach
As introduced in Section 4.3, the quality model for execution tracing was prepared by machine learning through genetic algorithms, adapting a fuzzy inference system to the collected data set; i.e., the model was trained on the training data and tested on both the training and on the checking data, while the input and output membership functions were kept unchanged and the linguistic rules were identified. To test whether the model created at the end of the machine learning stage can be verified by a different machine learning approach, the authors applied ANFIS [24,25] and compared the same error indicators.
Thus, a new model was created by ANFIS [24,25]; i.e., the linguistic rules of the fuzzy inference system were specified with grid-partitioning at the start, and each linguistic rule was associated with a distinct output membership function, while the same initial input membership functions were assigned as in the case of the previous model trained by genetic algorithms. ANFIS [24,25] adjusts the parameters of the input and output membership functions. Grid partitioning, selected due to the homogeneous coverage of the whole problem domain, resulted in as many linguistic rules as the possible variations of the input partitions; i.e., four inputs with three partitions (3 4 = 81). Consequently, 81 output membership functions were required. Thus, the efficient human interpretation of the rule set to understand the problem domain, after training the model, is not feasible, in contrast to the model created in Section 4.3.
Furthermore, ANFIS training was implemented in two different ways: with (1) backpropagation and (2) hybrid approaches. For each learning approach, a separate model was trained with one of the initial step sizes from the set {0.1, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5}. The best-performing models with regard to the maximal errors and RMSE errors were selected in the course of 100 learning epochs. Table 6 presents the best error indicators with regard to the initial step sizes. The adaptation of the initial step sizes during the learning process can be observed in Figures A77, A79, A81 and A83.
The change of the RMSE error for all the best four models in Table 6 as a function of learning epochs for the training and checking data are depicted in Appendix F in Figures A76, A78, A80 and A82. In addition, the evaluation charts of the four best models in Table 6 with the ANFIS approach [24,25] The evaluation charts and the best error indicators in Table 6 show only minor differences in comparison to those achieved with genetic learning in Section 4.3. Thus, the results of Section 4.3 could be verified by a different machine learning approach on the same training and checking data sets.

Modelling: Adjustments and Pre-Validation
The rule set with the 12 upper-bound value of the linguistic rules became, for the above reasons, the starting point for the investigation in which direct human experience was also incorporated into the results gained by computational intelligence. A mini focus group with three software professionals-a software maintainer, a software tester, and a system administrator-whose job roles require intensive log analysis, with many years of experience in industry and in academia, investigated the established rules, their plausibility, and the evaluation charts of Figures A36 and A37. In addition, the pairwise effect of the inputs on the output was also analysed as shown in Appendix D in Figures A60-A65.
The goal of the discussion in the mini focus group was to incorporate direct human experiences in the rule base through investigating the existing linguistic rules, (1) leaving the correct and plausible linguistic rules in the rule set, (2) removing or changing the incorrect ones, and (3) adding the missing ones. Furthermore, the adjustments did not attempt to improve the error indicators of RMSE, MAE, maximal error, and minimal error for the training and checking data sets as the genetic algorithm used had already approximated the optimal fit to the data sets with the extracted rules. The adjustments also involved a pre-validation by the three participants of the mini focus group.
The following problems were identified: the initial evaluation charts in Figures Figure A61; (P5) while increasing the input variable accuracy for all values of the input variable legibility, the output in Figure A60 in the {good} domain of accuracy should not produce a decreasing effect; moreover, (P6) while increasing the input variable legibility from the {medium} domain and keeping the variable accuracy in the {poor} domain, the output in Figure A60 should not drop.
The rules R1, R2, R4, R6, R7, and R10 were taken over from the original rule set, without any change. The rules R3, R5, R9, R11, and R12 underwent changes to gain approval by the mini focus group, while the rules from R13 to R19 were added to resolve the issues from P0 to P6 listed above. resolves P2, as illustrated in Figure A70; R15.
While identifying the necessary changes and applying the adjustments, the fuzzy model was tested with the combination of vectors [0, 50, 100]; i.e., each input received a value from the vector, and the output of the model was analysed. Furthermore, the response of the model was also investigated while three inputs of the model were fixed and one input underwent a continuous increase and decrease. The final set of the linguistic rules sent for validation is summarised below:

R1. If (Accuracy is poor) and (Legibility is poor) and (DesignAndImplementation is poor)
and (Security is poor) then (Quality is very poor); R2. If (Accuracy is medium) and (Legibility is medium) and (DesignAndImplementation is good) and (Security is poor) then (Quality is medium); R3. If (Accuracy is poor) and (Legibility is poor) and (Security is medium) then (Quality is very poor); R4. If (Accuracy is medium) and (Legibility is medium) and (DesignAndImplementation is medium) and (Security is medium) then (Quality is medium); R5. If (Accuracy is medium) and (Legibility is poor) and (DesignAndImplementation is good) and (Security is poor) then (Quality is medium); R6. If (Accuracy is poor) and (Legibility is good) and (DesignAndImplementation is good) and (Security is good) then (Quality is poor); R7. If (Legibility is poor) and (DesignAndImplementation is medium) and (Security is medium) then (Quality is poor); R8. If (Accuracy is medium) and (Legibility is poor) and (DesignAndImplementation is poor) and (Security is medium) then (Quality is very poor); R9. If (Accuracy is medium) and (Legibility is medium) and (DesignAndImplementation is good) and (Security is medium) then (Quality is medium); R10. If (Accuracy is medium) and (Legibility is medium) and (DesignAndImplementation is poor) and (Security is medium) then (Quality is poor); R11. If (Accuracy is good) and (Legibility is medium) and (DesignAndImplementation is poor) and (Security is medium) then (Quality is poor); R12. If (Accuracy is medium) and (Legibility is poor) and (Security is medium) then (Quality is medium);

R13. If (Accuracy is good) and (Legibility is good) and (DesignAndImplementation is good)
and (Security is good) then (Quality is very good); R14. If (Accuracy is medium) and (Legibility is poor) and (DesignAndImplementation is medium) and (Security is poor) then (Quality is very poor); R15. If (Accuracy is good) and (Legibility is medium) and (DesignAndImplementation is medium) and (Security is not good) then (Quality is medium); R16. If (Accuracy is good) and (Legibility is poor) and (DesignAndImplementation is medium) and (Security is medium) then (Quality is medium); R17. If (Accuracy is poor) and (Legibility is medium) and (DesignAndImplementation is medium) and (Security is medium) then (Quality is poor); R18. If (Accuracy is good) and (Legibility is good) and (DesignAndImplementation is medium) and (Security is medium) then (Quality is good); R19. If (Accuracy is good) and (Legibility is good) and (DesignAndImplementation is good) and (Security is medium) then (Quality is good).

Modelling: Validity
The final step of modelling was conducted within the validation stage, which was implemented (1) by sending out a second questionnaire [28] to selected international experts located in Austria, Germany, and Hungary, and (2) considering their feedbacks and tuning the created model. After tuning the model, the authors also validated the results by face validity.

Validation by International Experts
The online questionnaire [28] contained the input variable definitions, the input membership function, their partitions, and the brief description of the technique used for modelling; moreover, the linguistic rules adjusted by the mini focus group, as introduced in Section 4.5. Furthermore, the survey ensured the possibility of giving textual feedback. It also encompassed the description of the effects of the pairwise inputs on the output as shown in Appendix E in Figures A66-A71. An online model demonstration was offered to the experts, the possibility of which was accepted by one expert.
Based on the feedback from the experts, rule R19 was changed, as presented below: R19. If (Accuracy is good) and (Legibility is good) and (DesignAndImplementation is good) and (Security is medium) then (Quality is very good).
The experts did not propose or request any further change to the constructed model. In addition, two insights from two different experts were summarised to highlight the nature of the quality property security and its possible impacts on execution tracing quality: (1) increasing security has a negative impact on execution tracing quality, as leaving out sensitive or GDPR-related data, including login names, reduces analysability; (2) security requirements are domain-specific-security is a must for medical software, and without a certain base-level quality for this property, the whole application is inadequate. These opinions illustrate that, even with regard to one individual quality property, contrasting requirements may exist in connection with the output, and a possible optimum needs to be found. This also supports the use of fuzzy logic in the problem domain, which can adequately describe contradicting information. Table 7 highlights the different errors of the created model after each modelling stage. The error indicators were measured on the training and checking data sets and are rounded to four decimal places in Table 7. The measurements indicate an error increase in the scope of model adjustments and tuning, which can be explained (1) by the approximated optimal fit of the linguistic rules, after the machine learning had taken place with genetic algorithms, to the collected data in the training and checking data sets; (2) by the addition of new information to the model, which was not present in the initial data sets, in the scope of the mini focus group; and (3) by the incorporation of the opinions of experts into the created model to tune it, which further increased the deviation from the collected data. This deviation and error increase between the above stages could probably have been diminished or eliminated if the sample size of the initial data set could have been increased.

Face Validity by the Authors
The authors investigated the outcome of the modelling stage after incorporating the feedback of the international experts. They found the formalised linguistic rules reasonable and plausible; moreover, they also examined the pairwise impacts of the inputs on the output, which is reported below the evaluation charts in The chart illustrates that the effect of accuracy is far stronger than the effect of security, which is similar to the case with the variables security and design and implementation. • Legibility-design and implementation in Figure  Design and implementation-security in Figure 12: In the {poor} range of design and implementation, the increase of security produces a slightly increasing effect in the output, but this effect saturates at low security values. In the {medium} domain of design and implementation, the rise of security has an effect only in the {poor} security domain, which is increasing with regard to the output. In the {good} range of design and implementation, the increase of security contributes no effect to the output in the relation of these two variables. In the {poor} domain of security, the increase of design and implementation produces an increasing effect in the output. This effect is stronger in the {medium} and {good} ranges of security if design and implementation increases, but then the contribution to the output saturates at lower design and implementation values.
The chart is very similar to the chart of accuracy-security; i.e., the variable design and implementation shows a far stronger effect towards the output than the variable security. In the {good} range of design and implementation, the increase of security contributes no effect to the output in the relation of these two variables, which does not mean that the increase of security, in the relation to all inputs, contributes no effect to the output.      The above comparisons describe solely three dimensions: two inputs and one output. All input variable combinations are depicted to highlight their effects in the output. Consequently, the described effects refer to the output with regard to the presented input combinations.
All the input combinations above, depicted in the three-dimensional charts and described in the listing, have a reasonable effect on the output, which is in accordance with the experience of the authors. In conclusion, the face validity by the authors is satisfied.

Validation
This section is devoted to the validity of the study. In complex modelling studies, validity is a scale with different degrees that can describe the distinct validity facets of the conducted research. Internal, external, content, and construct validities are investigated and documented below. Every reasonable effort was made in the context of the study to satisfy the documented validity aspects.
Internal validity establishes that the research conducted represents the truth with regard to the phenomenon studied [113]. In the context of the conducted study, this is based on the (1) representativeness of the sample drawn from the defined study population, on the (2) human experiences incorporated in the rule base, and on the (3) internal consistency of the study. The online questionnaire [28] was tested before sending it out; moreover, the reliability indicators ω total and Cronbach's α were computed for the collected data, and they showed high-reliability values as presented in Section 4.1. The sample size (n = 41) in the research is low; however, the median value of the ranges of the p = 90% confidence intervals are acceptable with regard to the problem domain, as discussed in Section 4.1. Nevertheless, 1185 input-output data pairs were collected to accomplish machine learning; furthermore, additional data collection was carried out in the mini focus group while investigating the performance of the created model, as described in Section 4.5. The internal consistency of the study was achieved (a) in the scope of the machine learning by training the model on the training data set, which contained 70% of the data randomly selected from the whole data set, and checking the performance of the created models on both the training and on the checking data; moreover, (b) the linguistic rules generated in the scope of machine learning underwent adjustment in the mini focus group; (c) finally, the model created was investigated by experts and tuned based on their feedback, as explained in Section 4.6. In addition, machine learning was implemented in two different manners and the errors of the two approaches were compared, as reported in Section 4.4; furthermore, the non-deterministic algorithms were executed several times, and the models with the best performance runs were selected, as described in Sections 4.3 and 4.4.
External validity establishes the generalisability of the findings identified in the study [113]. This kind of validity was satisfied by including a broad range of software professionals in the study: software architects, software developers, software maintainers, software testers, and system administrators. On the other hand, the study population was defined with a localisation to Hungary; however, international companies with employee counts over 1000 were involved in the study, with international coding habits and guidelines extending across geographical boundaries. In addition, the model underwent final tuning based on the opinions of international experts located in Austria, Germany, and Hungary, also counting authors in the UK.
Content validity [114] in the context of this study refers to the extent to which all possible opinions of software professionals were considered. A higher sample size could have helped to better satisfy this kind of validity, but the p = 90% confidence intervals and the variance of the data presented in Section 4.1 show acceptable ranges for the context of the research. Nevertheless, the error indicators in Table 7 increase with regard to the collected data set as direct, valid human experiences are incorporated in the rule base in the scope of the mini focus group and in the course of tuning the model based on the feedback of the international experts. This increase could probably have been diminished or eliminated if the sample size could have been increased.
Construct validity represents the extent to which the findings reflect the content of the constructs in the phenomenon investigated [114]; i.e., whether the created model really reflects execution tracing quality, including the input variables and their impact on the output. On the one hand, the content of the constructs with regard to the input variables was elicited in the authors' previous publication [1] while defining the quality properties of execution tracing. On the other hand, the experiences collected from the software professionals describe the content of the phenomenon with regard to the effect of the input variables in the output as (1) the p = 90% confidence intervals show acceptable ranges for the context of the research in Section 4.1, (2) an adjustment of the created model was performed by a mini-focus group in the scope of an adjustment and pre-validation stage, as reported in Section 4.5, and (3) international experts also validated the model, as reported in Section 4.6. In addition, the method applied for modelling can capture and describe vagueness and uncertainty; therefore, the deviations among the opinions collected from the software professionals did not cause a problem.
In summary, the internal validity, external validity, content validity, and construct validity of the conducted research are satisfactory. Nevertheless, increasing the sample size of the data collected for machine learning could improve the above validity indicators. However, the low sample size does not cause a serious impediment in the present study, as the technique used for modelling, fuzzy logic, allows the direct incorporation of human experiences accumulated in the problem domain, by which further data were incorporated in the model during the mini focus group discussion and in the course of tuning the model based on the opinions of international experts.

Limitations
The research was planned with great care, including the study population and sampling. However, the desired number of responses could not be achieved in the scope of sampling, which resulted in a low sample size. Nevertheless, the investigated confidence intervals of the data in Section 4.1 were satisfactory with regard to the problem domain. In addition, the low sample size was complemented by incorporating data collected at two further stages: (1) a mini focus group and (2) validation by international experts located in Austria, Germany, Hungary, and counting also authors in the UK.
The initial study population was restricted to Hungary, but the population included international companies with premises in different countries, with the same company-wide coding guidelines and coding habits; therefore, the initial restriction to a geographically localised population does not represent an impediment for the generalisability of the study, but it imposes a risk. This risk was mitigated by the stronger validation in the scope of the model tuning, in which international experts participated from different countries; i.e., the model has been validated by these international experts.
Furthermore, the constructed and validated model is a general model for describing execution tracing quality. The application domain used-i.e., (1) server application, (2) desktop application, (3) web UI, (4) mobile application, or (5) embedded application-can also have an influencing effect on the importance of the single linguistic rules identified. Thus, tuning the model with regard to the application domain can improve the performance.

Discussions
The application of fuzzy logic in the context of the present research shows the following main advantages: (1) uncertainty related to execution tracing quality and its inputs can be captured and described, (2) linguistic rules that describe the problem domain in an human-understandable manner are available, (3) human experience can directly be incorporated in the rule base to tune the model, and (4) adaptive methods exist to extract the linguistic rules from a data set containing the experiences of software professionals. The properties of fuzzy systems mentioned in point 2 and 3 outperform those of the neural networks (NN). NNs neither produce a rule base in a human-understandable manner nor make the direct incorporation of human experiences in the system possible. In addition, the linguistic rules can also be regarded as guidelines for software professionals on how to implement a good-quality execution tracing mechanism, as they explicitly formulate which input variable combinations result in a {good, very good} output, and they also describe the poor results in linguistic terms; however, the rules also impact each other's output.
In the scope of a pilot study [107], the authors tested with which approach the uncertainty and vagueness in the context of execution tracing quality can be captured in the most appropriate manner. The findings of the pilot study [107] were used in the current research; therefore, an adaptive fuzzy system with overlapping, Gaussian-shaped membership functions at the inputs with the Takagi-Sugeno-Kang inference mechanism [23] was selected for modelling. In contrast to the pilot study, the present research is based (1) on the elicited quality properties of execution tracing [1], which formulate the input variables of the created quality model; (2) on the experiences of many software professionals from a well-defined study population; (3) on the linguistic rules extracted from the collected data by machine learning; (4) on the incorporation of direct human experiences into the model created by computational intelligence from software professionals with many years of experience in academia and in the industry; and (5) on the feedback from international experts, who also validated the created model.
While establishing the knowledge base of the fuzzy system in the form of linguistic rules, it is easy to introduce contradictions based on the different opinions of individual experts in the problem domain [107]. To avoid this issue, the linguistic rules have been identified in an adaptive manner by fixing the input membership functions and the output membership functions and letting the system find the optimum of the root mean squared error (RMSE) by fitting the rules to the collected data by genetic algorithms [26,27,112,115]. The same training procedure was repeated in a manner by which all possible variations of the input partitions were combined into linguistic rules with separate output membership functions. In this second case, the linguistic rules were fixed, and the adaptive process looked for the minimum of the RMSE while changing the parameters of the membership functions by an adaptive-network-based fuzzy inference system approach (ANFIS) [24,25]. Both the GA and the ANFIS approach achieved the same accuracy; i.e., they supported the outcome of each other with regard to the data. The adaptive process was implemented as supervised learning; i.e., the optimal parameters of the membership functions and the best matching rules were identified in the scope of the learning process, where input and output data pairs were provided for the system. After achieving the desired accuracy and performance, the trained system became available for carrying out the specified tasks. In the context of the present research, the output of the GA approach was used for further processing as it defined the knowledge base in the form of linguistic rules with an upper bound, which was further tuned and validated by experts.
The application domains of (1) server applications, (2) desktop applications, (3) web UIs, (4) mobile applications, and (5) embedded applications have different characteristics. These differences might influence the created model, and the formalised linguistic rules can have a different importance in each of the listed domains, which opens further tuning possibilities with regard to the specificities of these domains.

Conclusions and Future Work
The previous studies introduced in Section 2 focus on individual aspects of logging, such as what to log, how to log, and where to embed log statements in the source code of the applications. Overall guidelines on how to construct logging do not exist but would be desirable, as stated in [4,5,8,10,29]. In addition, measuring logging quality quantitatively, setting quality targets for it, and comparing the log quality of different projects or the same project at different points in time were not possible. For these reasons, it was necessary to create a quality model for execution tracing, which was done in the scope of the conducted research.
Uncertainty and vagueness are inherently present in the software product quality measurement and assessment process. In addition, the dispersion of the collected data in comparison to the range of the possible answers the respondents could give was high, which indicates vagueness associated with the rating process. Consequently, the application of fuzzy logic was an ideal candidate to capture and describe the phenomenon; on the other hand, more detailed guidelines can be elaborated regarding how the respondents should judge the variables in question to decrease the uncertainty.
In addition to the constructed quality model, the study identified the following findings: (1) the inputs accuracy and design and implementation have the most influence on the quality of execution tracing, and these quality properties approximately determine the quality with the minimum-relationship; i.e., the lower value determines the quality of logging. (2) Legibility is a benefit, as it helps to reduce the effort while localising errors; moreover, it reduces the psychological load on the software development and software maintenance professionals while performing the analysis. The absence of this quality property causes an unnecessary load on the staff and deteriorates analysis performance, but it does not block the analysis itself, which might be the case with the quality properties of accuracy and design and implementation if they are missing. (3) The quality property security is a feature that does not lie in the primary focus while localising errors, but this feature also needs to be satisfied to some extent to avoid leaking sensitive information and to observe legal regulations. Tracing highly sensitive information might block the deployment of the application in certain domains due to legal regulations, including the medical and financial fields. The quality property security has a reverse relationship with execution tracing quality as far as the amount of information is concerned. On the other hand, tracing sensitive information can also cause additional work to be incurred to remove it if the logs need to be passed on to other teams for further analysis.
In future work, the authors plan to link the constructed quality model to overall software product quality frameworks to consider execution tracing quality with regard to software product quality as a whole. In addition, the simplification of the mathematically complex model is also planned to ensure an easy-to-tailor approach to specific application domains.

Acknowledgments:
The authors are grateful to all the professionals who participated in the study; moreover, they also express special thanks to the management of the companies involved for making the organisation of the survey possible.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Data Processing and Exploratory Data Analysis
Appendix A. 1

. Rating the Input Variables of the Use Cases
The charts below depict the responses for each input variable of the defined use cases.    The quantile-quantile plots below depict the distribution of each collected variable. The 0.95% confidence interval is also shown; moreover, the computed Sharpio-Wilk normality test values are added to the charts. The extreme input variable value combinations Q20 and Q22 contained solely identical values; therefore, they were excluded from the normality test.

Appendix C. Extracted Rule Sets and Model Performance
This section contains the extracted rule sets in the course of run suite 2 according to the upper bounds for the maximal number of the linguistic rules specified for the algorithm. Each extracted rule set represents a distinct model. In addition, the charts are also presented that show the performance of each model on the checking and the corresponding training data. The original data are sorted to ensure that the deviations from the desired targets are easy to observe.

If (Legibility is poor) and (DesignAndImplementation is medium) then (Quality is medium) 2. If (Accuracy is good) and (Legibility is medium) and (DesignAndImplementation is
good) and (Security is medium) then (Quality is good) 3. If (Accuracy is poor) and (Security is poor) then (Quality is poor) Figure A24. Evaluation of the Best-Performing Three-Rule Model with Regard to Maximum Error for Checking Data.

If (Legibility is poor) and (DesignAndImplementation is good) and (Security is medium) then (Quality is medium) 2. If (Accuracy is poor) and (Legibility is good) and (DesignAndImplementation is good)
and (Security is good) then (Quality is poor) 3. If (Accuracy is medium) and (Legibility is good) and (DesignAndImplementation is medium) then (Quality is medium) 4. If (Accuracy is medium) and (Legibility is good) and (DesignAndImplementation is good) and (Security is good) then (Quality is very good) 5. If (Accuracy is poor) and (Legibility is medium) and (DesignAndImplementation is poor) then (Quality is poor)    1. If (Accuracy is medium) and (Legibility is good) and (DesignAndImplementation is good) and (Security is poor) then (Quality is medium) 2. If (Accuracy is good) and (Legibility is good) and (DesignAndImplementation is medium) and (Security is medium) then (Quality is very good) 3. If (Accuracy is poor) and (Legibility is medium) then (Quality is poor) 4. If (Accuracy is good) and (Legibility is good) and (DesignAndImplementation is poor) then (Quality is medium) 5. If (Accuracy is medium) and (Legibility is good) and (DesignAndImplementation is medium) and (Security is poor) then (Quality is poor) 6. If (Legibility is poor) and (DesignAndImplementation is poor) and (Security is poor) then (Quality is very poor) 7. If (Accuracy is medium) and (Legibility is good) and (DesignAndImplementation is medium) and (Security is medium) then (Quality is good) 8. If (Accuracy is medium) and (Legibility is medium) and (DesignAndImplementation is medium) then (Quality is medium) Figure A32. Evaluation of the Best-Performing Eight-Rule Model with Regard to Maximum Error for Checking Data.   1. If (Accuracy is poor) and (Legibility is poor) and (DesignAndImplementation is poor) and (Security is poor) then (Quality is very poor) 2. If (Accuracy is medium) and (Legibility is medium) and (DesignAndImplementation is good) and (Security is poor) then (Quality is medium) 3. If (Accuracy is poor) and (Legibility is poor) and (Security is medium) then (Quality is poor) 4. If (Accuracy is medium) and (Legibility is medium) and (DesignAndImplementation is medium) and (Security is medium) then (Quality is medium) 5. If (Accuracy is medium) and (Legibility is poor) and (DesignAndImplementation is good) and (Security is poor) then (Quality is very poor) 6. If (Accuracy is poor) and (Legibility is good) and (DesignAndImplementation is good) and (Security is good) then (Quality is poor) 7. If (Legibility is poor) and (DesignAndImplementation is medium) and (Security is medium) then (Quality is poor) 8. If (Accuracy is medium) and (Legibility is poor) and (DesignAndImplementation is poor) and (Security is medium) then (Quality is poor) 9. If (Accuracy is medium) and (Legibility is medium) and (DesignAndImplementation is good) and (Security is medium) then (Quality is very good) 10. If (Accuracy is medium) and (Legibility is medium) and (DesignAndImplementation is poor) and (Security is medium) then (Quality is poor) 11. If (Accuracy is good) and (Legibility is medium) and (DesignAndImplementation is poor) and (Security is medium) then (Quality is medium) 12. If (Accuracy is medium) and (Legibility is poor) and (Security is medium) then (Quality is good)    Appendix C.6. Maximal Number of Linguistic Rules: 16

If (Legibility is good) and (DesignAndImplementation is medium) and (Security is poor) then (Quality is good) 2. If (Accuracy is poor) and (Legibility is good) and (DesignAndImplementation is poor)
and (Security is good) then (Quality is good) 3. If (Accuracy is medium) and (Legibility is good) and (DesignAndImplementation is poor) and (Security is poor) then (Quality is very poor) 4. If (Accuracy is poor) and (Legibility is medium) and (DesignAndImplementation is medium) and (Security is medium) then (Quality is poor) 5. If (Accuracy is medium) and (Legibility is medium) and (DesignAndImplementation is medium) and (Security is medium) then (Quality is medium) 6. If (Accuracy is good) and (Legibility is medium) and (DesignAndImplementation is good) and (Security is medium) then (Quality is very good) 7. If (Accuracy is good) and (Legibility is medium) and (DesignAndImplementation is good) and (Security is poor) then (Quality is poor) 8. If (Legibility is poor) and (DesignAndImplementation is good) and (Security is medium) then (Quality is medium) 9. If (Legibility is poor) and (DesignAndImplementation is medium) and (Security is good) then (Quality is medium) 10. If (Accuracy is good) and (Legibility is good) and (DesignAndImplementation is poor) and (Security is medium) then (Quality is good) 11. If (Accuracy is good) and (Legibility is medium) and (Security is medium) then (Quality is good) 12. If (Accuracy is poor) and (DesignAndImplementation is poor) and (Security is poor) then (Quality is very poor) 13. If (Accuracy is medium) and (Legibility is good) and (DesignAndImplementation is poor) and (Security is medium) then (Quality is very poor) 14. If (Accuracy is poor) and (Legibility is poor) and (Security is good) then (Quality is very poor) 15. If (Accuracy is poor) and (Legibility is poor) and (DesignAndImplementation is good) and (Security is medium) then (Quality is medium) 16. If (Accuracy is good) and (Legibility is medium) and (DesignAndImplementation is poor) then (Quality is poor)        19. If (Accuracy is poor) and (Legibility is medium) and (DesignAndImplementation is medium) and (Security is medium) then (Quality is good) 20. If (Accuracy is medium) and (Legibility is good) and (Security is good) then (Quality is very good) 21. If (Accuracy is medium) and (Legibility is poor) then (Quality is medium) 22. If (Legibility is poor) and (DesignAndImplementation is poor) and (Security is good) then (Quality is poor) 23. If (Accuracy is good) and (Legibility is good) and (DesignAndImplementation is poor) and (Security is medium) then (Quality is very poor) 24. If (Accuracy is medium) and (Legibility is poor) and (DesignAndImplementation is medium) and (Security is poor) then (Quality is very good) 25. If (Accuracy is medium) and (Legibility is good) then (Quality is good) 26. If (Accuracy is medium) and (DesignAndImplementation is medium) and (Security is poor) then (Quality is poor) 27. If (Accuracy is poor) and (Legibility is good) and (Security is good) then (Quality is very poor) 28. If (Accuracy is poor) and (Legibility is medium) and (DesignAndImplementation is poor) then (Quality is poor) 29. If (Legibility is poor) and (Security is poor) then (Quality is very poor) 30. If (Accuracy is good) and (DesignAndImplementation is poor) and (Security is poor) then (Quality is medium) 31. If (Accuracy is medium) and (Legibility is good) and (DesignAndImplementation is poor) and (Security is good) then (Quality is good) 32. If (Accuracy is poor) and (Legibility is poor) and (Security is good) then (Quality is poor) 33. If (Accuracy is poor) and (Legibility is medium) and (DesignAndImplementation is poor) and (Security is poor) then (Quality is poor) 34. If (Accuracy is good) and (Legibility is medium) and (DesignAndImplementation is poor) and (Security is medium) then (Quality is medium) 35. If (DesignAndImplementation is poor) and (Security is medium) then (Quality is very poor) 36. If (Accuracy is good) and (DesignAndImplementation is poor) and (Security is medium) then (Quality is poor)

If (Accuracy is medium) and (Legibility is medium) and (DesignAndImplementation is poor) and (Security is medium) then (Quality is very poor) 18. If (Accuracy is poor) and (Legibility is poor) and (DesignAndImplementation is poor)
and (Security is medium) then (Quality is good) 19. If (Legibility is poor) and (DesignAndImplementation is poor) and (Security is medium) then (Quality is very poor) 20. If (Accuracy is medium) and (Legibility is medium) and (DesignAndImplementation is good) and (Security is medium) then (Quality is medium) 21. If (Accuracy is poor) and (Legibility is poor) and (DesignAndImplementation is medium) and (Security is poor) then (Quality is good) 22. If (Accuracy is good) and (Legibility is medium) and (DesignAndImplementation is good) and (Security is poor) then (Quality is medium) 23. If (Accuracy is poor) and (Legibility is medium) and (DesignAndImplementation is medium) and (Security is poor) then (Quality is poor) 24. If (Accuracy is poor) and (DesignAndImplementation is good) and (Security is medium) then (Quality is medium) 25. If (Accuracy is medium) and (Legibility is good) and (DesignAndImplementation is medium) and (Security is medium) then (Quality is very good) 26. If (Accuracy is poor) and (Legibility is poor) and (DesignAndImplementation is good) and (Security is poor) then (Quality is good) 27. If (Accuracy is medium) and (Legibility is medium) and (DesignAndImplementation is good) and (Security is poor) then (Quality is very poor) 28. If (Accuracy is poor) and (Legibility is good) and (Security is good) then (Quality is poor) 29. If (Accuracy is medium) and (Legibility is medium) and (DesignAndImplementation is medium) then (Quality is medium) 30. If (Legibility is medium) and (DesignAndImplementation is good) and (Security is medium) then (Quality is very good) 31. If (Accuracy is good) and (Legibility is good) and (DesignAndImplementation is poor) and (Security is good) then (Quality is medium) 32. If (Accuracy is good) and (Legibility is poor) then (Quality is medium) 33. If (Accuracy is medium) and (Legibility is medium) and (DesignAndImplementation is poor) and (Security is poor) then (Quality is poor) 34. If (Accuracy is medium) and (Legibility is poor) and (DesignAndImplementation is medium) and (Security is medium) then (Quality is poor) 35. If (Accuracy is poor) and (Legibility is poor) and (DesignAndImplementation is good) and (Security is medium) then (Quality is very poor) 36. If (Accuracy is poor) and (Legibility is poor) and (DesignAndImplementation is poor) and (Security is good) then (Quality is very poor) 37. If (Accuracy is good) and (Legibility is medium) and (DesignAndImplementation is medium) and (Security is medium) then (Quality is good) 38. If (Accuracy is poor) and (Legibility is good) and (DesignAndImplementation is good) and (Security is poor) then (Quality is medium)

Appendix E. Evaluation Charts II
The evaluation charts show the pairwise effect of the input variables on the output after carrying out the adjustments in the mini focus group, as reported in Section 4.5.

Appendix F. The ANFIS Approach
This section presents the training results of the ANFIS approach performed to verify the results of the genetic learning illustrated in Section 4.3. In the scope of both the (1) backpropagation and (2) hybrid approaches, a separate model was trained with one of the initial step sizes from the set {0.1, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5}. The best-performing models with regard to the maximal errors and RMSE errors were selected in the course of 100 learning epochs and are demonstrated below in Figures A72-A75.     The changes of the RMSE error indicator are depicted in Figures A76, A78, A80 and A82 to highlight the convergence of the error on the training and checking data. In addition, the adaptation of the initial step sizes during the learning process is depicted in Figures A77, A79, A81 and A83.