The Influence of Criteria Selection Method on Consistency of Pairwise Comparison

The more criteria a human decision involves, the more inconsistent the decision. This study experimentally examines how the (im)possibility of selecting the criteria for the evaluation and the size of the decision-making problem affect the degree of pairwise comparison inconsistency. A total of 358 participants completed objective and subjective tasks. While the former was associated with one possible correct solution, there was no single correct solution for the latter. The design of the experiment yielded eight groups in which the degree of inconsistency was quantified using three inconsistency indices (the Consistency Index, the Consistency Ratio and the Euclidean distance), and these were analysed by repeated measures ANOVA. The results show a significant dependence of the degree of inconsistency on the method of determining the criteria for pairwise evaluation. If participants are randomly given the criteria, then with more criteria, the overall inconsistency of the comparison decreases. If the participants can themselves choose the criteria for the comparison, then with more criteria, the overall inconsistency of the comparison increases. This statistical dependence exists only for males. For females, the dependence is the opposite, but it is not statistically significant.


Introduction
In the realm of multi-criteria decision making, the process of selecting from options involves the ranking of a finite set of available alternatives. As a method for coping with this relatively simple task, pairwise comparisons have been the primary approach for several decades. Comparing alternatives has been a significant topic in fields of study such as cognitive science, decision sciences, psychology and computer science [1][2][3], and has enabled the establishment of modern multi-criteria decision-making methods such as multi-attribute value theory and the analytic hierarchy process [4]. The method involves two steps. First, pairwise comparisons of the alternatives are conducted. Second, the overall ranking is synthesised by using an appropriate algorithm [5].
The issue is that this procedure is connected with inconsistencies [6]. Consistency is "a cardinal transitivity condition of preferences on triplets of decision elements and represents the full decision-maker's coherence" [7]. Simply put, inconsistency as a concept associated with the pairwise comparisons technique is based on the idea of transitivity, i.e., establishment of two pairwise comparisons among three alternatives determines the last comparison between as yet unpaired alternatives [8]. If preferences are presented as ratios, then their consistency is based on the following idea: if an alternative $a_i$ is preferred to an alternative $a_j$ $x$ times and the alternative $a_j$ is preferred to an alternative $a_k$ $y$ times, then the alternative $a_i$ should be preferred to the alternative $a_k$ $x \times y$ times [9]. When pairwise comparisons are executed, a priority vector of alternatives can be determined and utilised for the final ranking [5]. One of the first presentations of the inconsistency concept in pairwise comparisons was provided by Kendall [10] decades ago.
When working with n priorities, a decision-maker has to conduct a set of (n − 1) basic comparisons. Nonetheless, the issue is that this approach pushes an individual to make a direct choice of one object over another during the comparison rather than comparing all objects simultaneously [11]. A decision-maker is forced both to select one alternative over another, without the possibility of comparing all alternatives at once, and to assign a value to each pair of alternatives. Under these conditions, it is almost infeasible to reach an absolutely consistent priority matrix. Not surprisingly, the inconsistency in comparison of priorities also arises due to mistakes made by decision-makers [8]. There are many mistakes made by individuals who try to achieve good value judgements. They range from incomprehension of the decision context and doubts about their judgements, to the vagueness of the judgements and the omission of checking the comparisons of priority for their consistency [5,12].
In the realm of inconsistency research, there is an acknowledged anticipation that the growing number of required comparisons is related to increasing inconsistency [13][14][15]. The main motivation for this study is associated with the endeavour to develop more realistic assumptions. Factors influencing inconsistency are neither well nor completely understood. When inconsistency values are generated (e.g., for simulation purposes), the aforementioned anticipation presupposes that they are only linearly dependent on the problem size. Our research investigates factors that can contradict this presumption. The identified research gap reflects situations in two separate research domains in which experts cope with inconsistency. Both types of studies, theoretical and empirical, exploring properties of the inconsistency quantification methods and comparison matrices have already been published in the fields of mathematics and computer science [16][17][18][19]. Empirical comparison matrices are seldom analysed. In practice, randomly generated (simulated) comparison matrices are used almost exclusively. As far as our knowledge extends, there are only a few studies dealing with alternative comparison matrices based on empirically collected data. For a few exceptions from this practice, see for instance [20][21][22][23][24][25][26]. These studies are grounded either in a demonstrative experiment [27], or a regular experimental study [21]. In addition to this perspective, there are also research works based on empirical studies from the fields of cognitive science or psychology [28][29][30][31][32]. Nevertheless, published manuscripts are neither directed at individuals and their multi-criteria decision making peculiarities, nor do they utilise available numerical measures enabling if not cardinal, then at least ordinal comparison.
The aim of the present paper is to analyse the effects of subjective and objective problem types on the inconsistency of the decisions, as measured by selected inconsistency coefficients. The results of an experimental investigation clarify whether the size of the problem, the choice of the alternatives, or gender, moderate different problem types with regard to the level of inconsistency.

Inconsistency of Pairwise Comparison Matrices
The analytic hierarchy process (AHP) developed by Saaty [7] is a well-known method for pairwise comparison. It has been modified or extended in various ways from the very beginning in order to avoid its weak points [33,34]. In specific situations, such as group decision making, it is recommended to replace AHP with more appropriate methods such as step-wise weight assessment ratio analysis [35] or Eckenrode's rating technique [36]. In order to enable the practical usability of pairwise comparison, a threshold inconsistency was determined. Comparison matrices with inconsistency values below the threshold are considered usable for further analysis and the decision model developed by a decision-maker is considered acceptable. Inconsistency values above the threshold indicate the need to reconstruct the model. Liang et al. [37] state that without any threshold, the decision-maker is left with the significant problem of deciding whether judgements need to be revised or can be accepted. Furthermore, the number of criteria and the scale of evaluation have to be considered, which makes the situation even more tangled. The generally accepted 10% rule of thumb associated with AHP [38] has long been criticised [14,39]. Hence, several amendments have been developed, such as values of 5 and 8% for three and four criteria, respectively [40]. Thresholds were also determined based on various statistical studies [41][42][43][44]. Although some other methods have been proposed to determine consistency thresholds [45,46], the majority of them are associated with complete pairwise comparison matrices. This feature prevents them from being used directly for incomplete pairwise comparison matrices.
The purpose of the pairwise comparison matrix is to capture partial information about all pairs of alternatives that the decision-maker compares with each other. In each comparison, the decision-maker assigns weights to alternatives expressing his/her preferences for the alternatives. However, the weights are not given directly. Instead, the decision-maker enters (estimates of) weight ratios corresponding to the alternatives being compared (if a multiplicative scale is used). Formally, a matrix of pairwise comparisons is a square matrix $A = (a_{ij})_{n \times n}$, where $a_{ij} > 0$ is an estimate of the weight ratio $w_i / w_j$; $w_i$, $w_j$ being the weights for alternatives $i$, $j$, respectively. The matrix $A$ is said to be consistent if, and only if, $a_{ik} = a_{ij} a_{jk}$ for all $i, j, k$. It can immediately be seen that for $a_{ij} = w_i / w_j$, $A$ is consistent. Since in general the elements of $A$ are estimates of weight ratios, it is easy for this matrix to be inconsistent. There are several methods for quantifying the inconsistency contained in a pairwise comparison matrix (see e.g., [9,38]).
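The consistency condition can be illustrated with a short sketch. The function names and the example weights below are hypothetical and chosen only for illustration; the code simply tests the multiplicative condition on every triplet of indices.

```python
from itertools import product
from math import isclose


def matrix_from_weights(weights):
    """Build the matrix of exact weight ratios w_i / w_j (consistent by construction)."""
    return [[wi / wj for wj in weights] for wi in weights]


def is_consistent(a, rel_tol=1e-9):
    """Check a[i][k] == a[i][j] * a[j][k] for all triplets (i, j, k)."""
    n = len(a)
    return all(
        isclose(a[i][k], a[i][j] * a[j][k], rel_tol=rel_tol)
        for i, j, k in product(range(n), repeat=3)
    )


# Hypothetical weights for three alternatives.
consistent = matrix_from_weights([4.0, 2.0, 1.0])

# Replace one judgement (and its reciprocal) with an estimate that breaks transitivity:
# a_13 becomes 3 although a_12 * a_23 = 2 * 2 = 4.
perturbed = [row[:] for row in consistent]
perturbed[0][2] = 3.0
perturbed[2][0] = 1.0 / 3.0

print(is_consistent(consistent))
print(is_consistent(perturbed))
```

A single revised judgement is enough to violate the condition, which is why empirically collected matrices are rarely fully consistent.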
Quantification of inconsistency can be associated with either ordinal or cardinal comparisons. Some measures determine parameters with the help of a large set of comparison matrices generated on a random basis [15,47,48]. Methods and techniques applied to the quantification of inconsistency vary from the perspective of ease of calculation, degree of similarity to other indices, behaviour [9], or their focus on either means or extreme values [49]. Basic informally defined characteristics of inconsistency indices are provided by Brunelli [38] and are focused on (1) the most inconsistent part of the matrix (e.g., generalized K index [50]); (2) index formula as a reference to or function of the w vector (e.g., Consistency Index [7]); and (3) the existence of analytic solution (e.g., Euclidean distance). There are studies using different sets of matrices with values determined by selected criteria in order to compare and evaluate inconsistency quantification techniques [14,51,52]. A complete illustration and explanation of these indices goes beyond the focus of this study. Readers are invited to find relevant details in the original sources. It is important to note that some indices were originally tied to the consistency concept while others were associated with the measurement of inconsistency. Despite this terminological issue, the main rationale of all such indices is the same: a greater inconsistency is connected with a greater value of an index.

Participants and Materials
A call for participation was issued at the authors' university in order to acquire subjects for the data gathering process. Potential applicants were motivated by a reward of CZK 3000 (equivalent to EUR 120) for five randomly selected participants. Only participants who provided an email address from the university domain at the end of the trial could participate in the draw. The email addresses were stored separately without the possibility of connecting them with experimental outcomes retrospectively. Altogether, 358 subjects enrolled in the experiment, consisting of students of various study programs ranging from soft disciplines such as tourism management or financial management to hard disciplines such as applied computer science.
The acquired set of subjects represented a heterogeneous group of individuals. Therefore, the domain used for inconsistency measurement had to be thoroughly selected, as the testing domain and related task had to be comprehensible to all subjects. Eventually, a simulated decision-making task of selecting a mobile phone (subjective problem type) and an area-based ordering of geometrical shapes (objective problem type) were presented to the subjects.
For the former, a set of 15 criteria associated with the properties of mobile phones was prepared; namely, the manufacturer, display size, resolution of the front camera, resolution of the back camera, battery capacity, the possibility of changing the battery, memory size, type of external memory card, operating system, weight, processor type, cordless battery charging, availability of original accessories, dual SIM and the resistance of the mobile phone to environmental forces.
For the other task, 7 shapes with known unequal areas were drawn: a circle, ellipse, rectangle, square, trapezoid, triangle and rhombus.

Procedure
The experiment was conducted in a dedicated computer lab located in the Faculty of Informatics and Management, University of Hradec Králové. A proprietary web-based application was developed based on the Hypertext Markup Language, Cascading Style Sheets, PHP: Hypertext Preprocessor, MySQL and JavaScript. This application assisted with the gathering, checking and saving of the data. The core of this application was focused on acquiring the input and formatting the output. A third-party public library, a product of Codeproject, was used for calculations of the eigenvalues. Time measurement was another function, used to identify unreliable data segments.
At the beginning of the experiment, the subjects were informed about the main objective and purpose of the project and an introductory explanation of the pairwise comparison was presented. The evaluation process was not associated with any time limitation. All subjects were allowed to participate in the study only once. A task assigned to a participant was based on a random selection of a combination of testing modes.
There were three tested properties: the number of evaluated criteria (a matrix size of either five or seven), objective or subjective problem type (i.e., unique correct alternative ordering exists or does not), and the possibility of the subjects selecting the criteria for the comparison from a predefined set (free criteria choice method), or a random assignment of the criteria.
Each participant evaluated two comparison matrices only: one for the objective problem type and another one for the subjective problem. Because the two problems are completely different, all findings came from a "between-subject" experiment (see [53] for details), thereby avoiding the anchoring effect [54].
Despite its weak points, AHP was considered sufficient for the purpose of this study, as it is among the most popular tools implemented in practical situations [55,56]. It was therefore implemented with a multiplicative preference evaluation model (entry values being the integers 1 to 9 and their reciprocals). The subjects were given all the cells of the comparison matrix at once (as opposed to cell-by-cell), with permission to revise previously selected pairwise comparison values before the final submission of the matrix.
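As an illustration of the multiplicative model described above, the following sketch builds a full reciprocal matrix from the judgements above the diagonal. It is not the application used in the experiment; the function names and the example judgements are hypothetical. A full n × n reciprocal matrix has n(n − 1)/2 entries above the diagonal, i.e., 10 for size five and 21 for size seven.

```python
from fractions import Fraction

# Admissible entries: the integers 1 to 9 and their reciprocals.
SCALE = {Fraction(k) for k in range(1, 10)} | {Fraction(1, k) for k in range(1, 10)}


def n_comparisons(n):
    """Number of judgements above the diagonal of an n x n reciprocal matrix."""
    return n * (n - 1) // 2


def reciprocal_matrix(n, upper):
    """Build a reciprocal matrix from a dict {(i, j): value} with i < j."""
    if len(upper) != n_comparisons(n):
        raise ValueError("one judgement per pair (i, j) with i < j is required")
    a = [[Fraction(1)] * n for _ in range(n)]
    for (i, j), value in upper.items():
        value = Fraction(value)
        if not 0 <= i < j < n:
            raise ValueError(f"invalid pair {(i, j)}")
        if value not in SCALE:
            raise ValueError(f"value {value} is not on the 1 to 9 scale")
        a[i][j] = value
        a[j][i] = 1 / value  # the reciprocal entry below the diagonal
    return a


# Hypothetical judgements for three criteria.
example = reciprocal_matrix(3, {(0, 1): 3, (0, 2): Fraction(1, 2), (1, 2): 5})
for row in example:
    print([str(x) for x in row])
```

Exact fractions are used here only to keep the reciprocals free of rounding; floating-point values would serve equally well for computing the indices.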

Applied Measures
As they have long been dominant in the field and are most widely used for measuring the inconsistency degree [38,56,57], we decided to apply three fundamental inconsistency quantification methods for the purpose of our study. First, Saaty's Consistency Index and the Consistency Ratio were calculated. Second, the Euclidean distance was calculated.
Let $A = (a_{ij})$ denote a multiplicative pairwise comparison matrix of dimension $n$, and let $B$ be a matrix defined as $B = (b_{ij}) = (\ln a_{ij})$.
The Consistency Index (CIndex) was defined by Saaty [7] as

$$\mathrm{CIndex}(A) = \frac{\lambda_{\max} - n}{n - 1},$$

where $\lambda_{\max}$ is the principal eigenvalue of $A$; CIndex ≥ 0. The Consistency Ratio (CRatio) represents a standardised version of CIndex. It is expressed as a ratio with CIndex in the numerator and RI in the denominator:

$$\mathrm{CRatio}(A) = \frac{\mathrm{CIndex}(A)}{RI}.$$

RI is a real number determined as the average CIndex of a large number of randomly generated matrices of size $n$.
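The two indices can be sketched in a few lines. The code below is an illustration, not the routine used in the experiment: the principal eigenvalue is estimated by plain power iteration rather than by the third-party library mentioned in the Procedure section, the RI values are the commonly cited ones and are an assumption (the study may have used different values), and the example matrix is hypothetical.

```python
# Commonly cited random-index values (assumption: the study may have used others).
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32}


def principal_eigenvalue(a, max_iter=1000, tol=1e-12):
    """Estimate the principal eigenvalue of a positive square matrix by power iteration."""
    n = len(a)
    v = [1.0 / n] * n  # uniform start vector whose entries sum to 1
    lam = 0.0
    for _ in range(max_iter):
        av = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
        new_lam = sum(av)  # since sum(v) == 1, sum(A v) estimates lambda_max
        v = [x / new_lam for x in av]
        converged = abs(new_lam - lam) < tol
        lam = new_lam
        if converged:
            break
    return lam


def consistency_index(a):
    """CIndex = (lambda_max - n) / (n - 1)."""
    n = len(a)
    return (principal_eigenvalue(a) - n) / (n - 1)


def consistency_ratio(a, random_index=RANDOM_INDEX):
    """CRatio = CIndex / RI for the matrix size n."""
    return consistency_index(a) / random_index[len(a)]


# Hypothetical judgements: a_13 = 3 although a_12 * a_23 = 4.
judgements = [
    [1.0, 2.0, 3.0],
    [1 / 2, 1.0, 2.0],
    [1 / 3, 1 / 2, 1.0],
]

print(consistency_index(judgements))
print(consistency_ratio(judgements))
```

Power iteration is sufficient here because all entries of a pairwise comparison matrix are positive, so the principal eigenvalue is real and strictly dominant.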
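The Euclidean distance ($ED_A$) is defined as

$$ED_A = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{n} \left(b_{ij} - v_{ij}\right)^2},$$

where $V$ is a priority matrix $V = (v_{ij}) = (w_i - w_j)$ such that $w_i$ is an arithmetic mean weight vector $w_i = \frac{1}{n} \sum_{j \in N} b_{ij}$, with $N = \{1, \ldots, n\}$. The Euclidean distance can be normalised, yielding the Euclidean normalised distance.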
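A minimal sketch of the Euclidean distance follows, assuming that the weights are the row means of the log-transformed matrix B and that the sum runs over all cells, as in the definition above. The function name and the example matrices are hypothetical.

```python
from math import log, sqrt


def euclidean_distance(a):
    """Euclidean distance between B = (ln a_ij) and V = (w_i - w_j)."""
    n = len(a)
    b = [[log(x) for x in row] for row in a]
    w = [sum(row) / n for row in b]  # arithmetic mean of each row of B
    return sqrt(
        sum((b[i][j] - (w[i] - w[j])) ** 2 for i in range(n) for j in range(n))
    )


# A consistent matrix built from the weights 4 : 2 : 1.
consistent = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]

# Hypothetical inconsistent judgements: a_13 = 3 although a_12 * a_23 = 4.
inconsistent = [
    [1.0, 2.0, 3.0],
    [1 / 2, 1.0, 2.0],
    [1 / 3, 1 / 2, 1.0],
]

print(euclidean_distance(consistent))
print(euclidean_distance(inconsistent))
```

For a consistent matrix every entry of B equals the corresponding difference of weights, so the distance is zero up to rounding; any inconsistency yields a positive value.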

Statistical Analysis
The data were pre-processed using the R statistical package. The main analyses were carried out in IBM SPSS. First, the incomplete and corrupted records with inconsistency zero were removed and a dataset with 358 subjects was obtained (the dataset is available at https://doi.org/10.6084/m9.figshare.13317458). Further, the data were cleaned with respect to the time the subjects spent filling in the form. An exceedingly long time for filling the form implied that the subject declined to complete the task. The detection of outliers was based on Tukey's interquartile range, since Tukey's technique is less sensitive to extreme values [58]. The filtered dataset contained inconsistency measurements of 276 subjects. The associations between the studied factors were tested with the Pearson chi-squared test. Repeated measures ANOVA was conducted to study the inconsistency of decisions with respect to the given conditions. Repeated measures ANOVA accounts for the same subjects participating in more than one condition. The design of the analyses included the type of the problem as the within-subject factor, and the size of the problem, choice of criteria and gender as between-subject factors. The analyses were executed incrementally by adding the above stated factors. The key assumption for the repeated measures ANOVA is so-called sphericity; that is, the equality of variances of the differences between factor levels. However, in the present study, each factor consists of only two levels and thus sphericity was not an issue. The post hoc tests involved pairwise assessments of estimated marginal means between experimental conditions and the mean differences (M) and corresponding 95% confidence intervals (CI). In the present study, the results of the statistical tests were considered significant if the p-value < 0.05. The partial eta squared statistic was used to measure the size of the effect.
Partial eta squared is interpreted as the proportion of variance explained by the independent variable. Cohen et al. [59] state that the indicative effect sizes can be small, medium or large with values 0.01, 0.06 or 0.14, respectively.
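The time-based outlier filter described in this section can be sketched as follows. The multiplier of 1.5 for the interquartile range, the use of both a lower and an upper fence, and the quartile method are assumptions, as the text does not specify them; the completion times are invented for illustration and are not taken from the dataset.

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Return the lower and upper Tukey fences, Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(values, k=1.5):
    """Keep only the values lying within the Tukey fences."""
    low, high = tukey_fences(values, k)
    return [v for v in values if low <= v <= high]


# Invented completion times in seconds; the last one is exceedingly long.
times = [120, 135, 150, 160, 170, 180, 200, 1500]

print(tukey_fences(times))
print(drop_outliers(times))
```

Because the fences are derived from quartiles rather than from the mean and standard deviation, a single extreme value such as the last one barely shifts them.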

Results
The inconsistency of the decisions resulting from the comparison tasks of 276 subjects was analysed. The basic characteristics of the subjects are presented in Table 1. The frequencies across studied factor levels are shown in Table 2. The Pearson chi-squared tests showed no significant associations between size and choice, χ²(1) = 0.012, p = 0.912, size and gender, χ²(1) = 0.021, p = 0.884, or choice and gender, χ²(1) = 1.542, p = 0.214. That means that there are no significant differences between frequencies across factor levels.

The Effect of Problem Type
Repeated measures ANOVA with problem type as within-subject factor revealed that there is a significant effect of problem type on the inconsistency of comparison in all inconsistency coefficients (see Table 3). The table shows that the significance is p < 0.001 for all coefficients and also the effect sizes measured by the partial eta squared point to large effects. The pairwise comparisons with estimated marginal means show that students exhibit significantly higher levels of inconsistency in assessing alternatives of the subjective problem as opposed to the objective one (see Table 4). Figure 1 shows the estimated marginal means of inconsistency indices scaled on the max-min range. The error bars are based on the 95% confidence interval.

The Effect of Size and Choice of Criteria
Repeated measures ANOVA reapplied with between-subject factors revealed a significant interaction between the problem size and the choice of criteria. The interaction is significant in all indices; however, only the results of CIndex are reported to avoid redundancy. The results indicate that inconsistency differs depending on the choice of criteria and the size. The pairwise comparison of estimated marginal means with Bonferroni adjustment (see Figure 2) revealed that in the case of problem size 7, the subjects achieved significantly lower levels of inconsistency when employing randomly assigned criteria as opposed to subjects who could select their own criteria (mean difference M = 0.07 ± 0.03, 95% CI [0.01, 0.13], p = 0.032). This contrasts with problem size 5, where the opposite pattern is exhibited; however, this difference does not reach the 0.05 significance level (M = 0.06 ± 0.03, 95% CI [0, 0.12], p = 0.062). Viewed across sizes, the subjects who could select their own criteria achieved significantly lower inconsistency in problem size 5 compared to problem size 7. The subjects with randomly given criteria demonstrated a reversed trend that is, however, not significant (M = 0.05 ± 0.03, 95% CI [−0.02, 0.11], p = 0.138). Tables 5 and 6 summarise the within- and between-subject effects for size and choice.

The Effects of Size and Choice of Criteria including Gender
In order to further study the interaction between size and choice of criteria, the analyses have been extended to include gender as another between-subject factor. The results indicate that there is a significant interaction between size, choice of criteria and gender, F(1, 268) = 5.955, p = 0.015, partial eta squared = 0.022 (small effect). That means that the interaction between size and choice of criteria varies with gender. The subsequent pairwise comparison with Bonferroni adjustment revealed that in the case of a random choice of criteria, males achieved significantly lower inconsistency in problem size 7 compared to problem size 5 (mean difference M = 0.103 ± 0.037, 95% CI [0.03, 0.177], p = 0.006). The subjects with selected criteria have higher inconsistency in size 7 compared to size 5 (mean difference M = −0.096 ± 0.039, 95% CI [−0.172, −0.019]). Females, on the other hand, have an increasing pattern between sizes 5 and 7 in both random and selected choice of criteria that was, however, not significant (see Figure 3). Tables 7 and 8 summarise the within- and between-subject effects for size and choice.
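The effect size reported for the three-way interaction can be cross-checked from the F statistic and its degrees of freedom, using the standard identity that partial eta squared equals F · df_effect / (F · df_effect + df_error). In the sketch below, the function names are hypothetical, and treating the indicative values 0.01, 0.06 and 0.14 as lower bounds of the small, medium and large categories is an assumption.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return f_value * df_effect / (f_value * df_effect + df_error)


def effect_size_label(eta_sq):
    """Map partial eta squared to an indicative category (thresholds as lower bounds)."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"


# Values taken from the interaction reported above: F(1, 268) = 5.955.
eta_sq = partial_eta_squared(5.955, 1, 268)
print(round(eta_sq, 3), effect_size_label(eta_sq))  # 0.022 small
```

The recomputed value agrees with the reported partial eta squared of 0.022 and with its classification as a small effect.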

Discussion
In this study, we investigated inconsistency from the perspective of decision science in general and multi-criteria decision making in particular. We consider this a topic that needs to be studied and dealt with as a decision-making task with multiple attributes requiring quantitative analysis and formal representation. In the controlled experiment, inconsistency in decision making of participants was derived from empirical data.

Contribution
The experiment provides several findings. The problem type (subjective vs. objective) has a significant effect on the inconsistency of the decision making. This result (higher inconsistency level for subjective decision-making problems) confirms the findings reported in [21]. This effect is independent of the number of criteria as well as gender. Interpretation can be based on an argument that there exists one perfectly consistent objective solution that can be reached or one can get close to. Moreover, finding this single solution has a cognitive base. In contrast to the objective problem, there may not exist a perfectly consistent subjective solution, which is probably based mostly on preferences and attitudes. As humans are capable of bearing mismatching preferences and attitudes, it may be more challenging to decrease the overall inconsistency of decision making, which involves preferences and attitudes. Thus, in case of the subjective task, reaching a good enough solution can be considered as sufficient and acceptable.
A novel element of this work is the exploration of the influence of free/unfree criteria choice on the inconsistency of the decision making. This impact has not been previously detected.
The method of choosing the criteria (free selection vs. random assignment) was shown to have an impact on the inconsistency. Although not significantly, female decisions were more inconsistent for larger criteria sets, regardless of whether the criteria choice method was free selection or random assignment. At this time, we can only make suppositions as to why the free criteria selection leads to higher inconsistency for a bigger set of criteria. Probably the respondents preferred to select criteria that were (positively) important to them and subsequently it was difficult to order them on the scale available. If that was the case, the ordering task became more difficult as the size of the set of criteria increased.
For males, the inconsistency in decisions increased with the size of the problem when the criteria were freely selected, and it decreased with the size of the problem when the criteria were randomly assigned. This increasing/decreasing difference is statistically significant. Hence, the mode of choosing the criteria has been identified as another factor that has an impact on the inconsistency of the decisions (other factors include, e.g., explanation of inconsistency [60] and information sharing in a group [61]).
The data set attached to this paper can be considered as an additional contribution. Since few experimental studies in inconsistency research exist, real decisions of nearly 280 participants from a controlled experiment give an open opportunity for further real decision analysis.

Limitations of Our Work
The results of our work are valid under the settings of the experiment. From the point of view of basic descriptive indicators (gender, age), the analysed sample of subjects is considered representative with respect to the defined population. However, experimental studies of university students are difficult to generalise. For generalisation purposes, experiments with distinct target groups have to be designed and performed.
It is also necessary to mention that although the participants were asked to make decisions in a domain presumably familiar to them, the level of their expertise in this particular domain remained unknown.
A final limitation is the entry type used in comparison matrices. As was recently shown [22], the resulting inconsistency is biased by the preference evaluation model.

Future Work
The experimental procedure was defined in such a way that it could be replicated by other authors. Further research may focus on different dimensions of matrices, a different type of subjective task or a larger sample of females.
Furthermore, this study was based on the calculation and comparison of three selected measures. Although the acquired results are consistent across all of the studied coefficients, other available measures can be applied in order to verify achieved results. As Brunelli [38] found out, there is a kernel of measures within the set of indices with a high degree of similarity. Thus, this verification would be interesting with calculation of outliers (e.g., Singular Value Decomposition [62], Kulakowski's E index [63] or Barzilai's RE index [64]) as consistent comparisons of kernel measures can be anticipated.
Inconsistency in female decisions does not allow statistically sound conclusions. It is worth noting that for females, the free criteria selection seems to have a different effect on the inconsistency of the decisions, when comparing smaller and larger criteria sets. An experiment performed with a larger female cohort is needed to confirm or to disprove this effect.
As inconsistency is influenced by the preference evaluation model [22], a combined effect of the preference evaluation model and criteria selection method may be worth exploring.

Conflicts of Interest:
The authors declare no conflict of interest.