Formal Inconsistencies of Expertise Aggregation Techniques Commonly Employed in Engineering Teams

Abstract: Engineering managers leverage the expertise of engineers in their teams to inform decisions. Engineers may convey their expertise in the form of opinions and/or judgements. Given a decision, it is common to elicit and aggregate the expertise from various engineers to capture a broader set of experiences and knowledge. Establishing an internally and externally consistent aggregation framework is therefore paramount to yield a meaningful aggregation, that is, to make sure that the expertise of each engineer is accounted for reasonably. However, we contend that most de facto aggregation techniques lack such consistency and lead to the inadequate use and aggregation of engineering expertise. In this paper, we investigate the consistency or lack thereof of various expertise aggregation techniques. We derive implications of such inconsistencies and provide recommendations about how they may be overcome. We illustrate our discussion using safety decisions in engineering as a notional case.


Introduction
Engineers contribute to project decisions in two main ways. First, they generate and characterize alternatives. Second, they identify and characterize uncertain events associated with the potential execution of those activities [1]. When multiple engineers work together on a given decision, the alternatives generated by the various engineers in the team can be simply accumulated. That is, the alternatives can be joined as a larger set of alternatives to be evaluated. The same approach can be used for the events that the different engineers identify (even though, in this case, the process is a bit more sophisticated, as relevance relationships need to be accounted for [2]). However, when various experts provide different characterizations of a specific uncertain event, these characterizations must be aggregated. That is, the set of characterizations needs to be consolidated into a single characterization for that event so that it can be used in the decision model [3]. These characterizations may be provided in the form of an opinion (a belief distribution on an event) or a judgement (a bet on a specific outcome). Such an aggregation process is the focus of this paper.
In the past, several methods have been proposed to aggregate the expertise of subject matter experts (SMEs). These can be broadly classified as behavioral and mathematical [4]. Behavioral methods involve an information exchange and the negotiation of opinions among the SMEs to arrive at a consensus. Mathematical aggregation involves obtaining quantitative values (for the information sought) from the SMEs and applying some statistical averaging technique to combine the values [5]. These expertise aggregation methods have been applied in several domains and contexts [6], and their characteristics, such as robustness, traceability, prediction capabilities, and accuracy, have been studied [4]. However, the literature scarcely provides concrete guidelines in expertise aggregation specific

Literature Review
In this section, we review the main expert aggregation techniques found in the literature, together with some of their potential drawbacks and limitations. We start by presenting the discursive dilemma and the impossibility result, together with some possible solutions to the dilemma, and then present mathematical aggregation methods.

Discursive Dilemma and the Impossibility Result
The impossibility result [10] has important implications in domains such as political theory, social epistemology, metaphysics [12], economics, logic, and computer science [13]. A common understanding of the impossibility statement is that, when a group of judges hold a rational set of judgements, it is not always possible for them to aggregate their judgements into a collective one in conformity to all possible constraints. When all the judgements of group members are required to be aggregated, it might be logically impossible to rationalize outcomes, which implies that perfect integrity is unattainable. Perfect integrity is the state of constancy and coherence [14]. Coherence is the quality of being logical and consistent, and constancy is the quality of being dependable [15]. Therefore, being unable to achieve perfect integrity implies that logical relationships cannot always be deduced from the aggregated outcome in any aggregation scheme.
A first step in aggregating judgements is to distinguish the task from "aggregating people's sets of credences in respect of certain propositions" [10]. That is, aggregating judgements should be distinguished from the task of belief or opinion elicitation. According to List and Pettit [10], a judgement is "an on or off affair", meaning that the verdict of an individual concerning a proposition is binary (yes or no, true or false) and should not be conflated with the belief of the individual, which can be modelled as a probability distribution. Eliciting expert judgement from engineers is, in our experience, common. Examples include, among others, asking experts to assess if the system will meet the performance or not, if the test will be successful or not, if the design will be finished in time or not, or if the material will provide the necessary attenuation or not. Other examples are similar, but ask for a specific value, such as when an analysis may be finished or what power consumption a system will have.
A well-known problem in judgement aggregation is the discursive dilemma [10,13,16]. A simple illustration of the discursive dilemma is shown in Table 1. Let p, q, and r be three logically interconnected propositions, such that (p ∧ q) ↔ r. In a system of three people (1, 2, and 3) making a judgement, each judge expresses his/her judgement on p, q, and r abiding by the judgement rule (p ∧ q) ↔ r. When we observe the aggregated majority of the individual propositions, the group's collective judgement violates the logic (p ∧ q) ↔ r. The discursive dilemma indicates that a procedure such as systematic majority voting or simple majority voting on each premise cannot guarantee a rational set of collective judgements [10]. However, this type of aggregation is, in our experience, not uncommon in engineering contexts. The impossibility result also shows that such problems of deductive cogency and consistency are not confined to a majority procedure but persist in other aggregation procedures [13,16]. The other aggregation procedures that might be affected by the impossibility result include unanimitarianism (requiring unanimity), minoritarianism, and other procedures that use conditional majoritarianism, such as a two-thirds majority [10]. This further indicates that if the group follows some behavioral aggregation or even a structured aggregation process [3] to come to a group consensus, then it might also fall into the problems of the discursive dilemma.
In fact, according to List and Pettit [10], an aggregation method does not exist that jointly satisfies the following set of requirements:
• There must be a universal domain. The aggregation must accept any logical judgement set (personal profile: an individual expert's judgement set).
• The judgements must be collected anonymously. The collective judgement must provide the same result irrespective of the order in which the individual judgement sets are collected.
• The rule of aggregation must be systematic. All individual propositions are treated equally (no special weighting for any individual judgement set).
• The set of judgements must be consistent and deductively closed.
If all these criteria are fulfilled, then the aggregated judgement set can be termed perfect. However, it is impossible for any aggregation method to jointly satisfy all these constraints and yet be collectively rational [10,12]. Such aggregation problems are applicable to most situations that require combining binary evaluations of individual voters.
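The discursive dilemma can be reproduced in a few lines. The following is a minimal sketch, assuming three judges whose votes are hypothetical (they are chosen for illustration and are not taken from Table 1); the helper names `is_consistent` and `majority` are likewise our own. Each judge is individually consistent with (p ∧ q) ↔ r, yet proposition-wise majority voting is not.

```python
from typing import Dict, List

# Hypothetical judgement sets for three judges. Each one individually
# satisfies the rule (p and q) <-> r. Illustrative values only.
judges: List[Dict[str, bool]] = [
    {"p": True,  "q": True,  "r": True},
    {"p": True,  "q": False, "r": False},
    {"p": False, "q": True,  "r": False},
]

def is_consistent(judgement: Dict[str, bool]) -> bool:
    """Return True if the judgement set obeys (p and q) <-> r."""
    return (judgement["p"] and judgement["q"]) == judgement["r"]

def majority(judgement_sets: List[Dict[str, bool]], prop: str) -> bool:
    """Strict majority vote on a single proposition."""
    yes_votes = sum(j[prop] for j in judgement_sets)
    return yes_votes > len(judgement_sets) / 2

# Vote on every proposition separately (proposition-wise majority voting).
collective = {prop: majority(judges, prop) for prop in ("p", "q", "r")}

individually_consistent = all(is_consistent(j) for j in judges)
collectively_consistent = is_consistent(collective)

print(collective)
print(individually_consistent, collectively_consistent)
```

In this sketch the majority accepts p and accepts q but rejects r, so the collective judgement set violates the rule that every individual judge respected.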

Potential Solutions to the Dilemma
The literature states that any aggregation procedure that satisfies the conditions of universal domain, anonymity, and systematicity comes at the cost of collective rationality, consistency, or deductive closure. In addition, there can only be a procedure that roughly approximates adherence to all three principles [10]. However, two ways have been proposed to avoid the paradox: the premise-based procedure (PBP) and the conclusion-based procedure (CBP) [13,17].
In the PBP, the individual judges express their judgement on the set of propositions p and q (that is, the premises in Table 1), the majority view on each premise is then taken, and a conclusion is drawn from the aggregated premise set. In the PBP, the view taken by the majority on the conclusion is rejected, thereby ignoring individual responsiveness and adhering to collective rationality [10].
In the CBP, the judges express their final verdict on conclusion r, which abides by the logic (p ∧ q) ↔ r. Then, the majority view on the conclusion is taken as the final judgement. Thus, in the CBP, significance is given to the group's conclusion, that is, maintaining individual responsiveness but violating collective rationality [10].
In the literature, there has been considerable support for utilizing the PBP over the CBP. An important result of Hartmann et al. [13] is that, in a voting procedure, the reasons should carry more weight than the conclusion. Mosleh et al. [18] also supported the use of decomposition to utilize experts' information. They stated that decomposition is particularly useful when different experts have more information about different aspects of the problem. Raiffa [19] and Armstrong [20] support decomposition, which indicates indirect support of premise-based aggregation. Decomposition is the technique in which the expert is asked to respond to a series of questions on parts of the problem rather than the composite final question. Then, the analyst synthesizes the responses to construct the forecast. While this is a preferred approach in the literature, we have not found any evidence that it provides better results. Note that, regardless of the approach taken, there is still a latent problem in aggregating judgement, since both approaches achieve consistency by ignoring information from the experts.
A potential way to abide by both precepts of individual responsiveness and collective rationality is to practice modus tollens instead of modus ponens [10]. Modus tollens is the rule of propositional logic stating that if a conditional statement ("if p then q") is accepted, and the consequent is not true (not-q), then the negation of the antecedent (not-p) can be inferred. Modus ponens is the rule of propositional logic stating that if a conditional statement ("if p then q") is accepted, and the antecedent (p) holds, then the consequent (q) may be inferred [21]. As in the previous case, there is still a latent problem in aggregating a judgement that is resolved by assuming certain precepts on the experts' judgement procedure.
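The two procedures can be sketched as follows. This is a minimal illustration under stated assumptions: each judge is represented only by a hypothetical pair of premise judgements (p, q), the conclusion is derived as r = p ∧ q, and the function names `premise_based` and `conclusion_based` are our own.

```python
from typing import List, Tuple

# Each judge's judgement on the premises (p, q). The conclusion r is
# derived from the rule r = p and q. Values are hypothetical.
Profile = List[Tuple[bool, bool]]

def majority(votes: List[bool]) -> bool:
    """Strict majority of yes votes."""
    return sum(votes) > len(votes) / 2

def premise_based(profile: Profile) -> bool:
    """PBP: vote on p and on q separately, then derive r from the result."""
    p = majority([pj for pj, _ in profile])
    q = majority([qj for _, qj in profile])
    return p and q

def conclusion_based(profile: Profile) -> bool:
    """CBP: each judge derives r privately; only r is put to a vote."""
    return majority([pj and qj for pj, qj in profile])

profile: Profile = [(True, True), (True, False), (False, True)]
pbp_result = premise_based(profile)
cbp_result = conclusion_based(profile)
print(pbp_result, cbp_result)
```

For this profile the PBP accepts r while the CBP rejects it, which shows that the outcome depends on which questions are put to the vote rather than on the judges' underlying views.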

Endorsements for Mathematical Aggregation Techniques over Aggregation of Judgement or Consensus
Mathematical aggregation is the integration of independent opinion assessments into a single opinion. Opinion here refers to a belief characterization of an uncertain event. The level of complexity in mathematical aggregation procedures varies from simple statistical averaging to approaches based on axiomatic information gathering methods that incorporate the 'expertise' of the expert into the aggregation [5,22]. Mosleh et al. [18] classified 'mathematical aggregation' as potentially beneficial compared to group consensus methods. In fact, there is evidence indicating that mathematical methods for aggregation generally provide better results than the behavioral methods [23][24][25][26][27], yet mathematical aggregation techniques are rarely used in practice [18]. Mosleh et al. [18] presented the concepts of substantive goodness and normative goodness as measures of quality for expert elicitation. Substantive goodness refers to the expert's knowledge relative to the problem at hand. Normative goodness refers to the expert's ability to accurately express that knowledge in accordance with the calculus of probabilities. The latter is important to mathematically aggregate opinions. However, choosing appropriate 'weights' for the experts, defining proper methods for elicitation, and then combining them might be overly computationally intensive [28] and require expert guidance for the process itself.
Mathematical aggregation can be generalized into linear and logarithmic opinion pools [5,28,29]. In the context of belief aggregation, 'opinion' has been used extensively and often refers to numerical statements that represent the experts' degrees of belief on a subject of concern [29]. The central idea in belief aggregation is to find a consensus distribution that satisfies a set of reasonable axioms [22,28,30]. In the following sections, linear, logarithmic, and Bayesian information pooling are explained.
Mathematical and statistical aggregation procedures are highly prevalent, yet most of the literature is focused on aggregating probability distributions. That is, when one must aggregate beliefs, a statistical aggregation procedure has been deemed useful. For example, when we want to consider the individual team member's competence to arrive at the (factually) right conclusion or when the conclusion's (prior) probability is known, we can rely on Bayesian analysis to aggregate judgements [13,28]. Bayesian analysis is also recommended when experts are exposed to incomplete or even misleading information that shapes their beliefs. Rufo et al. [28] provided a Bayesian procedure to aggregate experts' information in group decision making. Here, the belief of each expert is elicited as a multivariate prior distribution, followed by a linear or logarithmic combination method to represent a consensus distribution. The choice of the strategy depends on the decision maker.

Linear and Logarithmic Aggregation (Opinion Pooling)
One common averaging technique for aggregating expertise in the form of opinion is classified as linear opinion pooling. Here, the combined final value that captures the expertise of the group of experts is the weighted linear combination of the probability distribution of the belief of each expert [5,22,29]. The linear opinion pooling was introduced by Stone [31] and can be expressed as
$$\pi(\theta) = \sum_{i=1}^{n} w_i \, \pi_i(\theta),$$
where π(θ) is the aggregated probability distribution, θ is the quantity of interest on which the experts express their belief, n represents the number of experts who are involved in the assessment process, π_i(θ) is the ith expert's individual probability distribution, w_i is the weight allocated to the ith expert, with weights adding up to 1, and π represents a mass function when θ is discrete and a density function when θ is continuous [22].
The other common form of opinion pooling is logarithmic pooling, where the combined probability distribution can be expressed as
$$\pi_{\log}(\theta) = k \prod_{i=1}^{n} \pi_i(\theta)^{w_i},$$
where k is the normalizing constant.
In logarithmic opinion pooling, the individual belief distributions are multiplied and renormalized [28]. When the weights are equal to (1/n), then π_log(θ) is proportional to the geometric mean of the individual distributions. A generalization of the linear and logarithmic opinion pools is presented in [32].
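Both pools are straightforward to compute for a discrete quantity of interest. The sketch below assumes two experts, three possible outcomes, and equal weights; all numbers and the function names `linear_pool` and `log_pool` are hypothetical and serve only to illustrate the weighted arithmetic mean and the renormalized weighted geometric mean described above.

```python
import math
from typing import List, Sequence

def linear_pool(dists: Sequence[Sequence[float]],
                weights: Sequence[float]) -> List[float]:
    """Weighted arithmetic mean of the experts' probability mass functions."""
    n_outcomes = len(dists[0])
    return [sum(w * d[k] for w, d in zip(weights, dists))
            for k in range(n_outcomes)]

def log_pool(dists: Sequence[Sequence[float]],
             weights: Sequence[float]) -> List[float]:
    """Weighted geometric mean, renormalized so that the result sums to 1.

    Assumes at least one outcome receives non-zero probability from
    every expert; otherwise the normalizing constant is undefined.
    """
    n_outcomes = len(dists[0])
    unnormalized = [math.prod(d[k] ** w for w, d in zip(weights, dists))
                    for k in range(n_outcomes)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypothetical beliefs of two experts over three outcomes.
expert_beliefs = [[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3]]
weights = [0.5, 0.5]  # equal weights that add up to 1

linear = linear_pool(expert_beliefs, weights)
logarithmic = log_pool(expert_beliefs, weights)
print(linear)
print(logarithmic)
```

The division by `total` plays the role of the normalizing constant k in the logarithmic pool.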
Linear and logarithmic opinion pooling have been used in a wide variety of domains such as medical consultation [33], marketing, banking, weather forecasting [22,34], and for candidate selection of football games [35]. In most of these applications, the weights of the experts were determined using empirical studies and were usually based on the subjective trust or confidence that the decision maker has in the reliability of the experts [5]. Linear and logarithmic pooling often lead to distinctively different distributions [28]. However, there seems to be no exceptional advantage of one method over the other [36]. One seeming advantage of logarithmic pooling over linear pooling is that, when combining density functions, the results are typically unimodal, whereas linear pooling of density functions can produce combined distributions that are multimodal in nature and may cause a bias for the analyst [5]. Unimodality can be interpreted here as a more accurate representation of the collective beliefs of the experts, as the logarithmic effects dampen strong differences in the assessments.
Other probability combination techniques based on other statistical methods, such as frequency theory, have also been studied for opinion pooling [29]. However, in circumstances such as catastrophic disasters, which are common in engineering decisions and where the expert's interpretation of frequencies is stretched to the limits of plausibility, these techniques make it harder to represent the expert beliefs accurately in quantitative terms. Furthermore, it has been established that frequency theory fails when data are sparse, unavailable, or subjected to non-sampling errors, which are common scenarios in engineering. Such problems motivated the development of Bayesian aggregation methods [29].

Advocacy for Bayesian Methods
There is extensive research in using a Bayesian aggregation scheme to support risk analysis that requires a group of experts to provide information to a decision maker [29,37,38]. When a prior probability distribution is available for the parameter of interest, then a Bayesian technique can be applied to update the prior distribution with a combined opinion pool [28]. The main advantage of Bayesian methods is that they allow for the incorporation of the 'expertise' level of the expert and dependencies among the experts into the aggregation model [22]. With n experts providing information regarding a parameter of interest θ, and with a prior probability distribution π(θ) available for θ, the analyst/decision maker can make use of Bayes' Theorem to update π(θ) [22] using the following relationship:
$$\pi(\theta \mid x_1, \ldots, x_n) \propto \pi(\theta) \, L(x_1, \ldots, x_n \mid \theta),$$
where x_1, ..., x_n denote the information provided by the n experts and L is the likelihood function associated with the experts' information. However, Bayesian-based methods have been shown to be difficult to apply in practice [22] due to their computational complexity [5] and the difficulty of estimating the likelihood function L, since it must account for the prediction accuracy and bias of the individual expert.
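For a discrete parameter, the Bayesian update reduces to multiplying the prior by the likelihood and renormalizing. The sketch below is illustrative only: the prior and the likelihood values are hypothetical, and in practice specifying the likelihood L is precisely the difficult step noted above. The function name `bayes_update` is our own.

```python
from typing import List, Sequence

def bayes_update(prior: Sequence[float],
                 likelihood: Sequence[float]) -> List[float]:
    """Posterior over discrete values of theta: prior times likelihood,
    renormalized so that the result sums to 1."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("the likelihood rules out every value that has prior support")
    return [u / total for u in unnormalized]

# Hypothetical prior over three candidate values of theta.
prior = [0.5, 0.3, 0.2]
# Hypothetical likelihood of the experts' reported information under each
# candidate value of theta. These numbers are assumed, not elicited.
likelihood = [0.2, 0.5, 0.9]

posterior = bayes_update(prior, likelihood)
print(posterior)
```

Here the experts' information shifts probability mass toward the third value of θ, which the prior considered least likely.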

Application of Expert Aggregation to Safety Decisions
In this section, we apply some of the behavioral and mathematical aggregation schemes discussed in the previous section to notional cases that involve a panel of engineers assessing the criticality of safety hazards. Safety is used because safety-related decisions are common to most engineering endeavors. The different aggregation schemes are used to show the different effects that such schemes have on the resulting decisions. The application of the chosen case and the different methods are representative of their application in practice as per the experience of the authors.

Case Description
We used the SRMPs of the FAA and the U.S. Navy [7,8] as a reference for the safety assessment process. Specifically, we explored cases in which the experts were tasked to collectively categorize the criticality of a safety hazard as a function of the severity of its consequences and its likelihood of occurrence (in this case through a risk matrix, as shown in Figure 1, or a version of it). A conceptual argument was presented to verify if those solutions were applicable to the safety risk assessment. We studied the conditions under which the different aggregation methods preserved or failed to preserve the premise and consistency. The following aggregation methods were studied:


a. Judgement aggregation;
b. Belief aggregation for likelihood of hazard occurrence;
c. Belief aggregation using linear pooling;
d. Belief aggregation using logarithmic pooling.
For the assessments of likelihood, we used the definitions provided by the FAA: likelihood is "the estimated probability or frequency, in quantitative or qualitative terms, of a hazard's effect or outcome" [39]. Table 2 provides the likelihood definitions followed by the FAA, on which the experts base their assessments when determining the safety risk of a given hazard. All data used in the studies are synthetic yet reasonable.

Judgement Aggregation
A notional case study involving a panel of three experts was presented to demonstrate the possibility of the discursive dilemma (a problem giving rise to premise and judgement inconsistencies) in a safety decision-making scenario. Their judgements on severity and likelihood of an assumed hazard were synthetically established and the safety risk was determined through a logical relationship between safety risk, severity, and likelihood. The logical relationship was derived with reference to the safety risk matrix presented in [7]. The solutions to overcome the dilemma presented earlier were applied to this case and a 'what if' scenario analysis analogous to the practices in project management [40] was performed. The critical impacts of the potential 'solution' methodology were discussed.

Base Case
Consider a safety risk assessment required to be performed by a team of three engineers. The judgement of all three engineers is assumed to be equally trustworthy and relevant for the decision. Using the risk matrix in Figure 1, the hazard is currently considered High Risk because its severity is judged as level 3 (Major) and its likelihood of occurrence is judged as A (Frequent). The experts are asked to judge this assessment.
To avoid issues of internal consistency, we assumed without loss of generality that the judgement set of each engineer (that is, the severity, likelihood, and safety risk) satisfied the three conditions of completeness, consistency, and deductive closure [10]. A personal judgement set is said to be complete, consistent, and deductively closed if, for all propositions proposed by the expert, the final judgement is available in the universal domain set (high, medium, and low risk), it conforms to a logical deduction, and the logic remains true for the combinations of the propositions. In this context, the goal for aggregating the engineers' expertise is to find an aggregation technique that conforms to the minimum requirements of completeness, consistency, and deductive cogency [10].
The engineers' individual judgements, prescribed by the SRMPs coded in Figure 1, follow the logic given below (note that the symbol '∧' represents the AND operation):
(Severity (Major 3) ∧ Likelihood (Frequent A)) → Safety Risk (High Risk)
Assume now, without loss of generality, that the three engineers provide assessments as given in Table 3, and that the decision maker uses majority voting to arrive at a conclusion (shown in the last row of the table). In such a case, as shown in Table 3, the requirements of deductive cogency and consistency are violated. The majority asserts that the severity of the hazard is Major, and that the likelihood of occurrence is Frequent. However, the majority rules out the proposition that the safety risk of the hazard is High. Therefore, the aggregated assessment of the engineers is illogical. We discuss next the implications of aggregating only the conclusions or aggregating only the conditions as potential mechanisms to avoid illogical reasoning.
If the CBP is being practiced, then the individual engineers perform their personal assessment of severity and likelihood to assess the criticality of the hazard, but only communicate to the decision maker their assessment of criticality, not those of severity or likelihood. The decision maker then aggregates their criticality assessment (i.e., the engineers' conclusions) to determine a criticality level for the hazard that reflects the engineers' assessments. The result of the CBP with respect to the premises of the engineers presented in Table 3 is shown in Table 4. The result shows that the majority would decide that the hazard is NOT High Risk.
When the PBP is practiced, the group's assessments of the severity and likelihood are first aggregated and then a conclusion is drawn from the aggregation. In other words, engineers are asked in this case to assess the causes (i.e., hazard severity and likelihood), not the conclusions (i.e., hazard criticality). Table 5 shows the application of the PBP for the case shown initially in Table 3. In this case, the (aggregated) hazard is assessed as High Risk based on the logical relationship shown earlier. From this, it can be seen that the same experts having the same judgement can potentially end up with different assessments depending on the questions asked. This indicates inconsistency and retraces back to the problems in expert elicitation [41]. A normative aggregation method, however, should yield the same assessment, regardless of the method of elicitation, and should abide by the logic used to derive the judgement.
In these cases, Hartmann et al. [13] suggest that, in a voting procedure, the reasons should carry more weight than the conclusion. This implies that collective rationality should be maintained and carries more significance. Therefore, we can choose to forego adherence to individual responsiveness. If this reasoning is adopted to make safety decisions in engineering, following the PBP leads to taking the safer route, since the relationship between the conclusion and the premises is essentially a logical AND operation. Hence, in a worst-case scenario, the PBP leads to judging a safety risk as higher than the individual engineers intended.
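The claim that the PBP errs on the safe side under an AND relationship can be checked by enumeration. The sketch below assumes three engineers, binary premises (severity is Major or not, likelihood is Frequent or not), and strict majority voting; it is an illustrative check under those assumptions rather than a general proof, and the function names are our own.

```python
from itertools import product

def majority(votes):
    """Strict majority of yes votes."""
    return sum(votes) > len(votes) / 2

def pbp_high_risk(profile):
    """PBP: aggregate severity and likelihood first, then apply AND."""
    severity_major = majority([s for s, _ in profile])
    likelihood_frequent = majority([l for _, l in profile])
    return severity_major and likelihood_frequent

def cbp_high_risk(profile):
    """CBP: each engineer applies AND privately; vote on the conclusion."""
    return majority([s and l for s, l in profile])

# Every (severity is Major, likelihood is Frequent) judgement pair.
judgement_pairs = list(product([False, True], repeat=2))
n_engineers = 3  # assumed panel size
profiles = list(product(judgement_pairs, repeat=n_engineers))

# Profiles where the CBP says High Risk but the PBP does not.
cbp_only = [pr for pr in profiles if cbp_high_risk(pr) and not pbp_high_risk(pr)]
# Profiles where the PBP says High Risk but the CBP does not.
pbp_only = [pr for pr in profiles if pbp_high_risk(pr) and not cbp_high_risk(pr)]

print(len(profiles), len(cbp_only), len(pbp_only))
```

Under these assumptions, no profile exists in which the CBP yields High Risk while the PBP does not, whereas the reverse does occur. This is consistent with the observation that the PBP can only overstate, never understate, the risk relative to the CBP.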

Practicing modus tollens instead of modus ponens
Although practicing modus tollens could be a possible technique to avoid the dilemma, it could have negative implications in safety risk assessments. If modus tollens is practiced, then engineers are asked to assess either the severity or the likelihood, and their opinion on the criticality of the hazard. Then, the unquestioned variable (either the severity or the likelihood) is inferred from the relation. This method maintains deductive cogency and collective reasoning, evading the dilemma.
Following the earlier example, consider now that the engineers are asked about their agreement with the criticality assessment of the hazard. If their answer is affirmative, then the severity of the hazard is Major (3) and the frequency of the hazard is Frequent (A). If one parameter is fixed, then the modus tollens method can be used to identify an inconsistency in the judgement of an SME. Consider the judgement set of SME 3 in Figure 2. The SME thinks that the safety risk posed by the hazard is severe but does not seem to accept that the frequency of the hazard is Frequent. Such problems are attributed to inconsistencies in expert elicitation and behavioral aggregation, where the judges may violate principles of deductive cogency without realizing the implications.

Strategies that yield collectivized reason [10]
Using a convergence strategy. By practicing interpersonal deliberations or by other methods, if the views of the engineers are made to converge, then the conclusion of the collective majority will be complete, consistent, and deductively closed. By doing so, we relax the universal domain. This suggests adopting a behavioral technique where the group comes to a consensus on the engineering assessment. However, while this strategy can be useful to avoid the paradox, it may do so at the expense of becoming unreliable. When interactions occur among the engineers, it might be difficult to distinguish if convergence occurs because the knowledge base of the engineers has unified or because the group is biased towards the same opinion. It is also difficult to acknowledge group-level consequences that emerge as a result of individual interaction systems [42]. Finally, the group interactions could also give rise to a number of cognitive biases [43] that might go undetected.
Using an authoritative strategy.In the example, where majority voting was used, each engineer was considered to have equal weight and all their assessments were treated as equally important.However, if modifying the group structure is permitted so that only the input of some members is taken into consideration, then it is easy to avoid the paradox.This is achieved by relaxing the rule of anonymity.This, however, is still subject to the problems of assigning proper weights to the experts.Furthermore, considering the assessments of only a limited number of engineers will place into question why the others were included in the assessment in the first place and ignored later.Ignoring those engineers may lead to selection biases [44], where only information that the decision maker wants to hear is used to make the decision.Essentially, using an authoritative strategy may easily result in not utilizing the expertise of the engineers involved in the decision.
Using a priority strategy.In this strategy, the team must decide on a set of propositions to be given higher priority over the others.The team's decision process of the other set of propositions differs from the prioritized set and is determined on the overall decision of the prioritized set of positions.If a conclusive judgement set is contradictory and/or does not belong to the prioritized judgement set, then the team eliminates the judgement set that has a lower priority, thus eliminating its impact on the final judgment.By following a priority strategy, the rule of systematicity is relaxed.The implication of relaxing the rule of systematicity can be that the logic used to arrive at the aggregated judgement might be inconsistent.Although the dilemma might be evaded, the rationale behind the aggregated judgement might be incorrect.
Using a special-support strategy. In this strategy, a proposition set proposed by each expert must be endorsed or receive special support by the majority of the other experts. This strategy can be thought of as a majority of the majorities. Each expert presents their judgement set to the remainder of the group and asks for endorsements. The judgement set that has the highest number of endorsements or amount of support from the supermajority is concluded as the final collective judgement. By using this strategy, the rule of completeness is relaxed. However, the collective assessment may still be flawed if the engineer receiving the special support does not have a logical judgement. Furthermore, this strategy also suffers from a potential groupthink bias.
A summary of the implications of using each of these four strategies in safety risk assessment is given in Table 6.

Convergence: Unreliable method; gives rise to potential group biases that might go undetected.
Authoritative: Assigning weights or determining the authority might lead to selection bias.
Priority: Logic between premise and conclusion might become inconsistent.
Special support: Rationale behind the safety risk assessment might be illogical; leads to potential group bias.

Implications of the Dilemma
The FAA and the U.S. Navy prescribe that, once a safety risk has been assessed and documented, the decision process is solely based on the documented value of the safety risk [7,8]. If the hazard has been determined as High Risk, the decision process is assigned to members of higher authority in the organization. If the safety risk has been assessed as Medium or Low, then the decision authority or advisors responsible for providing instruction for the mitigation and control activities of the hazard are members of lower rank. Hence, the dilemma poses a risk that the rightful authorities might not be called to act on a hazard that has been incorrectly judged as Medium, Low, or High.
Furthermore, when undesirable outcomes emerge as a consequence of the decision based on the safety risk assessment, the aggregation procedure might not be questioned; rather, the SMEs' judgement (prediction accuracy) might be deemed unreliable. This hinders the ability to improve decision-making processes.

Arithmetic and Geometric Averaging of Severity Rankings
Consider a notional case where an engineering manager requests five engineers to assess the severity of three hazards. Assume that the engineering manager aggregates the assessment using arithmetic (AM) and geometric (GM) averaging, as shown in Figure 3. First, the results show that the choice of averaging technique yields different results for some of the hazards. This is due to the calculation procedure behind the averaging method. The AM is the sum of the assessments divided by the number of assessments and the GM is the nth root of the product of the assessments, where n is the number of assessments. In geometric averaging, the effect of 'outliers' is dampened. However, since the possible set of assessments here lies only in the range of 1 to 5, the characteristic of dampening outliers by the GM might not make a critical impact.
In aggregating severity rankings, when the assessments of the experts vary between extreme points, then applying either the AM or the GM discounts the assessment of the expert. For instance, in the assessments of Hazard X and Hazard Z (ref. Figure 3), Expert 3 has assessed the severity of that hazard to be catastrophic. However, when the assessments of all the experts were averaged, the aggregated assessment indicated that the hazard was not as 'severe' as anticipated by Expert 3. In a situation where Expert 3 has the highest prediction accuracy, the averaged aggregated assessment has discounted the input of Expert 3 and has yielded a lower severity level, thus leading to an improper safety risk assessment and thereby to inadequate mitigation and control strategies required for the hazard.
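The discounting effect described above can be reproduced with a minimal sketch. The scores below are hypothetical (they are not the data of Figure 3), and the sketch assumes that 5 denotes the most severe (catastrophic) level on the 1 to 5 scale.

```python
import math

# Hypothetical severity scores from five experts for one hazard
# (assumption: 1 = least severe, 5 = catastrophic; not the data of Figure 3).
scores = [2, 2, 5, 1, 2]


def arithmetic_mean(values):
    """Sum of the assessments divided by the number of assessments."""
    return sum(values) / len(values)


def geometric_mean(values):
    """nth root of the product of the assessments, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


am = arithmetic_mean(scores)
gm = geometric_mean(scores)
print(f"AM = {am:.2f}, GM = {gm:.2f}, highest individual score = {max(scores)}")
```

In this sketch both averages fall well below the catastrophic score given by the third expert, and the GM falls below the AM, which is the dampening of 'outliers' mentioned in the text.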
Furthermore, it is worth noting that assessing rankings is a form of judgement. Therefore, this kind of ranking inherits the problems associated with judgement presented earlier, regardless of the sophistication of the mathematical approach used to aggregate the judgements.
Finally, there are mathematical issues when averaging rank-ordered severity assessments where the scale between the values is not linear. The notion of (Low + High)/2 = Medium does not hold when the scale is not linear, which is generally the case, and particularly true for the definitions from the literature used in this paper.
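A small sketch can make the non-linearity issue concrete. The loss magnitudes attached to each category are purely hypothetical, as is the choice to map an averaged magnitude back to the nearest category on a logarithmic scale; neither comes from the definitions used in this paper.

```python
import math

# Hypothetical loss magnitudes per category (assumption: one order of
# magnitude between adjacent categories, i.e., a non-linear scale).
magnitude = {"Low": 1e4, "Medium": 1e5, "High": 1e6}
rank = {"Low": 1, "Medium": 2, "High": 3}


def nearest_category(value):
    """Map a magnitude to the closest category on a log10 scale (assumed rule)."""
    return min(
        magnitude,
        key=lambda c: abs(math.log10(magnitude[c]) - math.log10(value)),
    )


# Averaging the ranks: (1 + 3) / 2 = 2, which is read back as Medium.
rank_average = (rank["Low"] + rank["High"]) / 2
category_from_ranks = next(c for c, r in rank.items() if r == rank_average)

# Averaging the underlying magnitudes instead.
magnitude_average = (magnitude["Low"] + magnitude["High"]) / 2
category_from_magnitudes = nearest_category(magnitude_average)

print(category_from_ranks, category_from_magnitudes)
```

Under these assumptions, averaging the ranks and averaging the quantities that the ranks stand for lead to different categories, so the rank average cannot be read as the 'middle' of the two assessments.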

Arithmetic and Geometric Averaging of Likelihood Rankings
Consider a similar case to the previous one, but this time the engineering manager asks their team of engineers to assess the likelihood of different options and obtains the judgements shown in Figure 4. First, the problems indicated earlier when using the AM and the GM to aggregate severity assessments also apply to aggregating likelihood assessments. Furthermore, there are some mathematical effects due to the underlying mathematical structures of beliefs and likelihoods that can easily introduce significant errors in the assessment of likelihoods.
For example, when two experts assert the likelihood of a hazard being Probable, their probability (belief) distributions could be quite different from each other. (Note that objective probabilities do not exist, so it is sensible that each expert will have their own belief distribution [2].) For instance, both belief distributions could follow a Poisson process yet have different mean rates of occurrence. Similarly, both engineers could have belief distributions that are different in shape but with the same mean rates. Or different engineers may even use a different target to define the meaning of each likelihood category (e.g., different engineers assign different confidence targets to the occurrence rate of the event of a given likelihood category). Therefore, even though their assessments may be identical in the risk table, their internal interpretations are different.
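The Poisson case can be sketched as follows. The rate band that defines Probable and the two experts' mean rates are hypothetical values chosen only for illustration.

```python
import math

# Hypothetical definition (assumption): "Probable" corresponds to a mean
# occurrence rate between 0.1 and 1.0 events per year.
PROBABLE_RATE_BAND = (0.1, 1.0)


def is_probable(rate_per_year):
    """True if the mean rate falls inside the assumed Probable band."""
    low, high = PROBABLE_RATE_BAND
    return low <= rate_per_year < high


def prob_at_least_one(rate_per_year, years=1.0):
    """Poisson process: P(N >= 1) = 1 - exp(-rate * time)."""
    return 1.0 - math.exp(-rate_per_year * years)


# Hypothetical mean rates held by two experts who both report "Probable".
expert_rates = {"Expert A": 0.2, "Expert B": 0.8}

for name, rate in expert_rates.items():
    print(name, is_probable(rate), round(prob_at_least_one(rate), 3))
```

Both hypothetical experts land in the same cell of the risk table, yet the probability of at least one occurrence in a year that follows from their beliefs differs by roughly a factor of three.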

Belief Aggregation Using Linear Pooling
Consider the same example but this time, instead of assessments, the experts' beliefs on the likelihood of the hazards are elicited as belief distributions. For simplicity and without loss of generality, let us assume these are defined as dichotomous, with three experts assessing the probability of a hazard existing to be 36%, 21%, and 45%, respectively.
With linear pooling, the beliefs of the three experts are aggregated using the following formula:
$$p = \sum_{i=1}^{n} w_i \, p_i,$$
where $n = 3$, $p_i$ is the probability assessed by expert $i$, and the weights satisfy $\sum_{i=1}^{n} w_i = 1$. Let us assume, as a start, that the engineering manager trusts each expert equally and therefore assigns $w_1 = w_2 = w_3 = 1/3$. This leads to an aggregated probability assessment for the hazard of 34%.
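The equal-weight calculation can be checked with a short sketch. It assumes the standard form of the linear opinion pool, a weighted sum of the individual probabilities with weights that sum to one; the function name `linear_pool` is introduced here for illustration only.

```python
def linear_pool(probabilities, weights):
    """Weighted arithmetic combination of the experts' probabilities."""
    if len(probabilities) != len(weights):
        raise ValueError("one weight is required per expert")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * p for w, p in zip(weights, probabilities))


beliefs = [0.36, 0.21, 0.45]   # the three experts' probabilities from the text
equal_weights = [1 / 3] * 3    # the manager trusts each expert equally

pooled = linear_pool(beliefs, equal_weights)
print(f"{pooled:.2f}")  # prints 0.34
```

Changing the weights changes the result, for example `linear_pool(beliefs, [0.25, 0.25, 0.5])` gives more influence to the third expert, which is why the choice of weights needs a justification.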
Mathematically, linear pooling can be considered to be well constructed. However, its use must meet several conditions for it to be meaningful. In particular, it is key to understand both the purpose of the aggregation of beliefs and the way in which each expert's individual belief is formed. Let us start with one scenario that showcases these aspects.
Assume that the three experts have the same education, and their backgrounds only differ in the projects they have worked on. Experts 1 and 2 worked together on the same project (Project A), where the hazard under assessment did not occur. Expert 3, on the contrary, has worked on three projects (Projects B, C, and D), and the hazard occurred in two of them. Given that Experts 1 and 2 are basing their beliefs on the "same" information, the engineering manager is effectively overcounting the particular experiences of Project A over Projects B, C, and D, which results in skewing the belief aggregation (even if the individual beliefs of the experts are accurate).
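The overcounting in this scenario can be illustrated with a simple frequency count. This is only a sketch: it treats each project as one observation, and which two of Projects B, C, and D experienced the hazard is an arbitrary assumption. It is not a model of how the experts actually form their beliefs.

```python
# Whether the hazard occurred in each project. The text only states that it
# occurred in two of B, C, and D; picking B and C is an assumption.
project_outcomes = {"A": False, "B": True, "C": True, "D": False}

# Projects that each expert's experience is based on, as in the scenario.
expert_projects = {
    "Expert 1": ["A"],
    "Expert 2": ["A"],
    "Expert 3": ["B", "C", "D"],
}


def occurrence_frequency(projects):
    """Fraction of the listed project observations in which the hazard occurred."""
    return sum(project_outcomes[p] for p in projects) / len(projects)


# Pooling experience expert by expert counts Project A twice.
observations_by_expert = [p for ps in expert_projects.values() for p in ps]
# Pooling by distinct project counts each experience once.
distinct_projects = sorted(set(observations_by_expert))

overcounted = occurrence_frequency(observations_by_expert)  # 2 of 5
deduplicated = occurrence_frequency(distinct_projects)      # 2 of 4
print(overcounted, deduplicated)
```

Counting Project A twice pulls the pooled frequency towards the outcome of that single project, which is the skew described above.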
Linear pooling makes use of weights to calibrate or adjust aggregation to take these aspects into account. While probabilities are subjective, they must not be arbitrary. It is therefore important to understand how engineers form their beliefs, so that the engineering manager can account for individual biases and for biases injected as part of the process of aggregation. Individual biases are discussed at length in the literature (e.g., [2]), and will not be addressed in this paper. For biases in the aggregation of beliefs, we focus on the alignment between the purpose of the aggregation and the bases on which experts form their beliefs.
An engineering manager may desire to aggregate beliefs for two reasons. First, to calibrate the assessments of the experts. This means that instead of relying on the assessment of one expert, aggregating the beliefs of several experts can be used to filter the errors in their assessments. Here, error does not refer to a delta between the expert's assessment and an objective reality, but the delta between the elicited, tangible belief and their real, inner belief. Second, to comprehensively account for disparate experiences. This means that the engineering manager attempts to avoid overcounting or skewing the belief assessment on a particular set of experiences, reaching instead a wide variety of data points.
When attempting to calibrate experts' assessments, it is then essential that experts use the same set of information when their beliefs are elicited. Otherwise, one gets into the problems shown in the previous example. It is the engineering manager's responsibility then to guarantee that experts have access to the same information set. For example, an engineering manager could ask all experts to explain how they arrived at their belief (without sharing what their belief is, to avoid injecting individual biases), so that the information is levelled across all experts. Without doing so, the validity of using linear pooling to calibrate experts' assessments is jeopardized.
On the contrary, when attempting to account for different experiences, experts should use non-overlapping information. That is, the information sets of the experts should be mutually exclusive, in order to avoid the overcounting of some experiences. For example, consider three experts providing their expertise to a decision. Two of the experts have had the same experience on the same project. Aggregating their expertise would be akin to double counting what happened in just that one project. Avoiding overcounting, in practice, is virtually impossible given that experts will likely share some educational or professional baseline. As in the previous case, it is the engineering manager's responsibility to assess the baseline information that each expert is using in order to determine whether the aggregation is valid. Again, in this case, information sharing may be a valuable technique to make the elicitation comprehensive and calibrate it at the same time.

Belief Aggregation Using Logarithmic Pooling
Logarithmic pooling further emphasizes differences in which expert's opinion to account for. Therefore, the same discussion as for linear pooling is applicable here.
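For reference, a minimal sketch of logarithmic pooling for a dichotomous event is given below. It assumes the standard form of the logarithmic opinion pool, in which the pooled probability is proportional to the weighted geometric combination of the individual probabilities and is then renormalized; the function name and the reuse of the three probabilities from the linear pooling example are choices made here for illustration.

```python
import math


def log_pool_binary(probabilities, weights):
    """Logarithmic pool for a yes/no event.

    The pooled probability is proportional to the product of p_i ** w_i and
    is renormalized against the same product for the complementary event.
    """
    if len(probabilities) != len(weights):
        raise ValueError("one weight is required per expert")
    yes = math.prod(p ** w for p, w in zip(probabilities, weights))
    no = math.prod((1.0 - p) ** w for p, w in zip(probabilities, weights))
    return yes / (yes + no)


beliefs = [0.36, 0.21, 0.45]   # same hypothetical inputs as the linear example
equal_weights = [1 / 3] * 3

log_pooled = log_pool_binary(beliefs, equal_weights)
linear_pooled = sum(w * p for w, p in zip(equal_weights, beliefs))
print(f"logarithmic: {log_pooled:.3f}, linear: {linear_pooled:.3f}")
```

With these inputs the two pools give similar but not identical values. The questions of which information each expert used and how the weights are justified apply to both in the same way.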

Conclusions
We have shown that common techniques to aggregate expertise from a team of engineers lack internal consistency and lead to the inadequate use and aggregation of engineering expertise. Particularly in safety risk assessment, the plausibility of the discursive dilemma and its implications were demonstrated. We showed that, if a judgement aggregation must take place, then a 'better' process to evade the dilemma would be the PBP (premise-based procedure). However, all solutions explored had drawbacks, since all solutions operate by ignoring information from experts, which includes either ignoring individual responsiveness or violating collective rationality.
In general, we showed how using the current definitions of severity, which are abstract and qualitative, and using them to perform a mathematical aggregation of the severity judgements leads to problems associated with the proper use of each expert. Similarly, methods to aggregate likelihood assessments that are based on judgement led to an inaccurate

Figure 3. Severity assessments for Hazards X, Y and Z by five experts.

Figure 4. Likelihood assessments for Hazards X, Y and Z by five experts.

Table 1. Illustration of the discursive dilemma.

Table 3. Discursive dilemma in safety risk assessment.

Table 6. Aggregation strategies that yield collective reason.