Flexible Risk Evidence Combination Rules in Breast Cancer Precision Therapy

Evidence theory by Dempster-Shafer for determination of hormone receptor status in breast cancer samples was introduced in our previous paper. One major topic pointed out here is the link between pieces of evidence found from different origins. In this paper the challenge of selecting appropriate ways of fusing evidence, depending on the type and quality of data involved is addressed. A parameterized family of evidence combination rules, covering the full range of potential needs, from emphasizing discrepancies in the measurements to aspiring accordance, is covered. The consequences for real patient samples are shown by modeling different decision strategies.


Introduction
The Dempster-Shafer theory of evidence (DST) is a generalized framework in probability theory. First introduced by Dempster between 1966 [1] and 1968 [2] in the context of Bayesian inference [3], Shafer perpetuated his ideas into a comprehensive theory in a book in 1976 [4]. A short summary of DST with an illustrative example of how to create and combine pieces of evidence was given in 1986 by Zadeh [5].
In 1988 Smets [6] (Chapter 9) framed the concept of credibility in terms of mathematical logic. In contrary to Shafer [4] he propagated the "open-world assumption", thus the possibility of outcomes beyond the "frame of discernment" (e.g., example of broken coin). At the same time, in 1988, Dubois and Prade [7] gave an axiomatic description of how to define and combine pieces of evidence mathematically.
One common approach to DST is via the "transferable belief model" (TBM), which Smets introduced in 1990 [8]. In the TBM evidence is fully described by "basic belief masses" (BBM). Sometimes the BBM is alternatively called "basic belief assignment" (BBA) [9]. The open world-assumption is achieved by assuming a positive value for the BBM of the empty set, as discussed in 1992 [10]. Conditioned belief and plausibility were embedded into a generalized Bayesian theorem in 1993 [11]. A procedure for a two-step decision making process within the TBM was outlined in 1994 [12]. In the first step evidence is based on belief functions as defined in DST and is called "credal" level. The following is a reduction to general probability functions, which are then used for decision making. This step is called "pignistic" level.
Among others, an important elaboration of DST is given by the "Theory of Hints", which was outlined by Kohlas in 1995 [13]. In 1991, Gebhardt [14] introduced the "context model" to distinguish between vagueness and uncertainty which also covers topics such as refinement and coarsening. The "Dezert-Smarandache theory" (DSmT) [15] specifically targets the problem of imprecise, uncertain, and highly conflicting sources of data for information fusion.
While in probability theory the calculus with probabilities is inherent, in DST the exertion of influence between pieces of evidence opens a wide field of facilities. Combining two pieces of evidence to improved evidence is, in general, accomplished by evidence combination rules (ECR). There is certainly a large variety of meaningful ECRs. Dempster's original suggestion, which distributes inconsistent BBMs equally among others, was the most obvious Dempster ECR [4]. The rule is commutative and associative, but fails when sources of evidence become incompatible or conflicting. To overcome the problem of combining strongly contradicting pieces of evidence, Yager 1987 [16] suggested assigning inconsistent BBMs to the BBM of total ignorance. A good summary of about ten popular ECRs is given therein [17,18]. More sophisticated are "Proportional Conflict Redistribution" rules (PCR) [19,20] within DSmT or some of the improvements made [21].
As an alternative to Dempster's ECR and to overcoming the problem of conflicting evidence, Shafer suggested [18] to manipulate mass functions by weighting and discounting ("belief in belief") rather than to diversify the ECR itself. In DSmT [15] evidence from several origins can additionally be weighted by importance.
The TBM gives a procedure of how to convert evidence into probabilities. However, there is a push for decision making based on evidence. An axiomatic approach was given in 1990 [22]. The inclusion of loss functions for classification were discussed in 1997 [9]. A review of decision-making strategies based on the theory of Neumann and Morgenstern from 1943 [23] (60th anniversary reprint [24]) was given in 2019 in Denoeux [25].
There are many use cases for DST. An obvious one can be found in robotics for "Simultaneous Localization and Mapping" (SLAM) by combining data from different sensors [26]. Every sensor serves as an agent and is source of a piece of evidence. Decisions are based on the linkage of these pieces of evidence by ECR.
Another important application is in combining classifiers as outlined in 2002 by Al-Ani [27]. Elements of a mostly high-dimensional feature space are to be classified into a number of labeled categories. In general, this will require random forest classification or the like. A possible approach via DST will consider the set of labeled categories to represent a frame of discernment. Each classifier (for each feature vector individually) is then transformed into a single piece of evidence by assigning a BBM to all subsets of the appropriate labels. The conjunction of classifiers is again accomplished by linkage of these pieces of evidence using customized ECRs [28].
The described procedure is very close to our approach with the crucial difference, that in our model not the full feature space is mapped to categories, but leaves the option of a feature vector being mapped to an additional category labeled as "undecidable".

Dempster-Shafer Theory
Evidence theory by Dempster-Shafer (DST) is based on combining pieces of evidence rather than dealing with probabilities. An evidence can be seen as a generalization of a probability function. The essential difference is, that while in the former the sample space Ω is mapped to probabilities Pr(a), a ∈ Ω, in DST the power set of the sample space, now called "frame of discernment" (FOD), P (Ω) = 2 Ω is mapped to masses m(A), A ⊆ Ω. The mass function m(A) assigns basic belief masses (BBM) to the elements A ∈ P (Ω) and can be interpreted as degrees of trust in some proposition A. Figure 1 shows this for a sample space respectively FOD with the three possible outcomes Blue, Red and Green, thus Ω = {B, R, G}.  In an open world a basic belief mass is given to the empty set allowing the ability to consider completely unexpected events to the model (e.g., broken coin) or to deal with data of low quality.
We currently restrict ourselves to normalized evidence, but we will discuss the origin and opportunities of open world models later in the context of evidence combination rules (ECR) and vague FOD.
For every set S ∈ P (Ω) the mass function m(A) intrinsically defines two essential quantities of DST, the "Belief" and the "Plausibility" of the set S.
This is why DST is also called the theory of belief functions. For a normalized evidence we have Bel(Ω) = Pl(Ω) = 1 and Bel(∅) = Pl(∅) = 0. This implies that we are sure that the correct answer lies somewhere within Ω (closed world). For better understanding, Figure 3 shows Believe Bel(S) and Plausibility Pl(S) for two elements of P (Ω), namely {R} and {B, G}.

Evidence Combination Rules
One strength of DST is the flexibility in combining pieces of evidence with various ECRs in adjustment of necessities. We will show how to take advantage of this by customizing ECRs, depending on the origin of the data. Put simply, an ECR is a binary operator ⊕ that combines two mass functions m 1 (A) and m 2 (A) associated with two pieces of evidence with a third mass function m(A), representing a fused evidence.
Most ECRs are commutative, but only in very rare cases are they associative, not even pseudo-associative in terms of [17]. There is a neutral element m e (A), called the "vacuous mass function", satisfying m(A) = m e (A) ⊕ m(A) = m(A) ⊕ m e (A) with m e (Ω) = 1 and m e (A) = 0 for A = Ω. The evidence associated with m e (A) is also called "total ignorance", representing the lack of knowledge. Note that m(A) = m(A) ⊕ m * (A) does not necessarily imply that m * (A) is a vacuous mass function. A counterexample is given in [29].
Obviously, an inverse function for every m(A) does not necessarily exist. This is easy to understand when considering that for a given evidence represented by a mass function m(A) it is unlikely to find more evidence which results in total ignorance. In general, some knowledge brought together with some other knowledge cannot end in knowing nothing. In algebraic terms, the set of all possible m(A) therefore has the structure of an unital magma [30].
An easy way to combine pieces of evidence is by simply multiplying the intersecting mass functions [8].
This ECR is called the "conjunctive" rule [26] and is fully compatible with the open world assumption in the TBM framework. Unfortunately, the resulting mass function Figure 4 illustrates the Formula (3) for two different cases in mosaic plots, the left one with rather consistent evidence, the right one with rather contradictory evidence (for details see Appendix A).
The 49 rectangles within the two squares are colored in the color of the corresponding intersect, where the white areas are masses for contradicting evidence. Areas of the same color are added. Note that the mosaic plots in Figure 4 can be seen as an operation table for the operator ∩, thus e.g., {R} ∩ {B, R} = {R}, and so on. For a closed world, the white areas must be redistributed among all others. How to distribute the mass of the empty set m(∅) among all other masses m(A) depends on the needs of the model. If there is a high chance of contradiction between m 1 (A) and m 2 (A), most of m(∅) will be allocated to m(Ω). Contrarily, if there is a low chance of contradiction, m(∅) will be distributed equally along the singletons m({a}), a ∈ Ω. As discussed in the introduction, there is a wide range of possible allocations to do so.
For a model to distinguish between hormone receptor statuses it is convenient to use a parameterized family E = {⊕ λ } λ∈ [0,1] of ECRs, which is very similar to the one introduced [31], but uses a parameter λ ∈ [0, 1] to customize local requirements. Given two mass functions m 1 (A) and m 2 (A) we define The parameter λ in Formula (4) provides flexibility to adapt to circumstances. The restriction to λ ≤ 1 is motivated by restricting ourselves to an interpolation type ECR. The value λ > 1 would yield an extrapolation type ECR as described in [31].
Dempster's original ECR [4] is equivalent to setting λ = 1. This ECR is associative and commutative. Unfortunately, it turns out that this particular ECR causes significant problems when given pieces of evidence that are rather contradictory [4]. The reason for this is that only the non-contradictory intersect between the two concatenated pieces of evidence is used for the evaluation of the new masses. If this intersect is small, the re-scaling due to normalization blurs out information.
In contrast, Yager [16] distributes all contradicting mass to m(Ω) which is equivalent to setting λ = 0. For most applications this approach is too conservative and hinders merging similar evidence to a stronger evidence. However, if pieces of evidence originate from different types of sources this ECR could be very helpful.
Depending on the relation between the agents, different values of λ will be adequate. For pieces of evidence tending to contradict one another, such as combining gene expression with immunohistochemical measurements (IHC), a small value of λ will be favored. For pieces of evidence with low probability of being contradictory, such as combining gene expression from a receptor gene with the co-gene, a greater value of λ will better allow consolidation evidence gained by gene expression. In any case, we should always avoid giving too much weight to any element of P (Ω), especially singletons .
Another major benefit of introducing λ is when combining different receptor statuses to one hormone receptor status. We found that this operation can also be represented by some adequate elements ⊕ λ ∈ E . Our data suggests a value of λ ≈ 0.5 as optimum for this task. An illustrated example of evidence linkage and the implications of λ can be found in Appendix A.

Model Adaptation
For hormone receptor determination the FOD is restricted to two outcomes, hormone receptor positive and hormone receptor negative. We will assume a closed world, thus there are no other possible outcomes than the two elements of Ω.
The simplicity of this model allows us to describe all BBMs by using only two parameters, α and β.
The current model involves 6 independent data sources to generate evidence. Four of them originate in gene expression, two in IHC measurements. The gene expression data consists of normalized values for the abundance of estrogen, co-estrogen, progesterone and co-progesterone, where the co-genes are genes closely related to the receptor genes themselves. How to transform gene expression data into BBM given by α expr and β expr is the subject of our previous papers [32,33].
IHC data originates in the IHC-measurements of estrogen and progesterone receptors. These measurements can be either continuous or discrete (or even missing). How to transform this data into appropriate α ihc and β ihc is also previously discussed [32,33].
Putting these together into our existing model, the BBM m horm describing the evidence of the hormone receptor status is calculated as Missing data are represented by the vacuous mass function. On the basis of the Formula (4) ⊕ 1 stands for Dempster's ECR and ⊕ 0 stands for Yager's ECR respectively. The operator ⊗ does not represent a typical ECR, but a formal procedure reflecting common clinical decision making as given in the below Formula (8).
However, this model suffers from a couple of shortcomings. The following list of improvements addresses the problems and provides credible results.

•
The operator ⊗, as defined in Formula (8), is not fully compatible with DST. There is always a dependence between the two receptor status. However, DST in its original form requires independent BBMs. This is obviously not the case for estrogen and progesterone receptors. Our suggestion to absorb this correlation is to replace ⊗ by ⊕ 0.5 giving estrogen and progesterone a balanced contribution to both BBM. • The operator ⊕ 1 for combining pieces of evidence coming from gene expression and co-gene expression might be problematic in case of conflicting expression values. In a previous paper [32] we introduced mass limitsα andβ for the BBMs to tackle this issue. We retain these mass limits, but replace ⊕ 1 by ⊕ 0.9 as an additional reinsurance. • Combining gene expression evidence with IHC evidence, the operator ⊕ 0 will in case of conflict put too much weight into the mass of ignorance, m(Ω). Therefore we suggest slightly increasing λ and replacing ⊕ 0 by ⊕ 0.1 . On the lower end of the λrange, the influence of λ on the ECR is significantly less than on the upper end. As long as there is a profound confidence in the data, particularly in the IHC measurements, replacing ⊕ 0 by e.g., ⊕ 0.3 is therefore also an option. • In the past it turned out that the optimal choice for the co-gene of progesterone is mostly estrogen itself. If so, although m esr expr and m pgr co are calculated differently and so vary numerically, they are basically generated from the same gene expression data. A preferable assumption in DST is the independence of input data to generate evidence. In contrary to estrogen, progesterone expression data is often diffuse and it might be impossible to find a decent co-gene. This issue can be easily resolved by replacing m pgr co with the vacuous mass function. Currently, for the sake of consistency, we stick to the current configuration which uses estrogen as co-gene for progesterone.
Respecting all these issues above we suggest an improved model such as The operators ⊕ 0.1 and ⊕ 0.9 are small derivations from to the original model and mainly serve to increase prediction stability in the case as described in [5]. Graphic examples are shown in Figures 5 and 6. A further detailed explanation is required for the shift from ⊗ to ⊕ 0.5 . For a sample to be receptor positive, only one of the two receptors (estrogen OR progesterone) needs to be positive while for being negative both receptors (estrogen AND progesterone) have to be negative. Therefore, the operator ⊗ as given in Formula (8) will fail and produce a misleading shift towards hormone receptor positive. If one of the two receptors has medium evidence for being positive and the other receptor has strong evidence for being negative, the operator ⊗ will still result in an evidence favoring a positive outcome hormone receptor.
Moreover, there is a strong connection between the two receptor genes. A progesterone positive sample will almost always be estrogen positive while an estrogen negative sample is very likely to be progesterone negative. However, the approach given in Formula (8) is based on the assumption of almost independent receptors.
Combining receptor evidence with ⊕ 0.5 will, on the other hand, fix the above issues. In both cases it is still very likely that the hormone receptor status concluded from the evidence will be classified as "undecidable", but in case of misclassification the probability of erroneously positive classified samples will be reduced by a large amount. This is in line with clinical demands.

Examples
The data set for the following results consists of 2559 freely available breast cancer samples from the Gene Expression Omnibus [34]. For each sample, at least one IHC measurement of a hormone receptor was performed as part of the respective study. Details can be found in Appendix B.
In the first example (sample id 881 from the data set), gene expression data and IHC measurements are contradicting each other. In addition, gene expression of progesterone is not very accurate, and can be seen from the differing measurements. This leads to a final very diffuse evidence, and therefore no decision can reliably be made. Figure 5a shows the evolution of evidence for this particular sample. Figure 5b shows the importance of choosing the right λ for the ECRs. The example shows that merging gene expression evidence with unclear IHC evidence can result in a dubious prognosis when choosing a too large λ.
In the second example (sample id 1980 from the data set), there is strong conformity in the data. Although one IHC measurement is missing, pieces of evidence accumulate to a strong belief in hormone receptor positive, see left panel in Figure 6. The right panel in Figure 6 demonstrates that in case of consistent evidence the influence of the parameter λ can be neglected.
The last example (sample id 2365 from the data set) is a very contradictory example concerning the data at hand. Large amounts of the final mass are distributed to lack of knowledge, which can be seen from the large central circles in Figure 7. The influence of λ can change from case to case.

Analysis
As can be seen in Table 1a and Figure 8, the switch from our previous approach (Formula (7)) to an improved linkage between the two hormone receptors (Formula (9)) entails a shift towards receptor negative. This is reflected by 86 samples clinically classified as "uncertain", now being classified as "receptor negative" and 78 samples clinically classified as "receptor positive", now being classified as "uncertain". This shift can be quantified by a Cohen's κ = 0.877. Using a constant λ for ECRs instead of (9) only has an influence on numerically problematic samples, as can be seen in Table 1b.
The left panel of Figure 8 shows the change in the α (red dots) and β (blue dots), while the right panel illustrates Table 1 in an alluvial diagram.

Decision Making
We will not change our strategy for decision making as proposed in our previous work [32,33]. This means, we consider an outcome A as "true" if the belief in it has more mass than the plausibility of its complement A . Let T ⊆ P (Ω) be the subset of all "true" elements of P(Ω).
In this very simplified case with Ω = {+, −} it reduces to Note that Ω ∈ T will clearly always hold under the closed world assumption.

Quality of Data
In DST, the dogma "a (machine learning) model is only as good as the data it is fed" can be understood from a different perspective. This guiding principle is still valid, but lack of data quality can be coped with in the BBMs and ECRs by adequate parametrization. Here DST offers additional flexibility.
In our model with only two possible outcomes this is simple. The best example is the modeling of the BBM for the IHC status. The less confidence there is in the data, the more mass is assigned to subsets of Ω with more than one element, i.e., blurred decisions. With increasing confidence in the IHC measurements, the corresponding singletons (i.e., crisp decisions) are more highly valued.

From Data to Evidence
This issue was the subject of our earlier papers [32,33] and we will therefore only briefly discuss it. Gene expression values are converted into BBM using logistic regression. In addition, two mass limitsα andβ are introduced for the following purposes.
The most important is to consider the possibility of erroneous gene expression values by keeping masses significantly smaller than 1. A welcome side effect is to avoid some rare numerically problematic cases.
The conversion of IHC measurements into BBM is again realized as described in [32,33]. We assume that about 85% of the IHC measurements are correct.

The Functionality of λ in ⊕ λ
We introduced the parameter λ to specifically adapt decision strategy to the properties of data and its origin from which evidence is to be generated. The more data sources differ in nature, the smaller λ should be chosen. This prevents too much mass accumulating in the singletons when mixing conflicting evidence. We call this strategy a conservative ECR.
On the other hand, if the data sources are homogeneous, a large λ can be chosen. We call this case a risky ECR. In the case of a risky ECR, care must be taken to ensure that contradictory singletons do not enter simultaneously with masses close to 1. In our model, this case is prevented by the mass limitsα andβ.
Theoretically, it would also be possible to choose λ > 1 as suggested [31]. That would correspond to an extrapolation in the sense that two consistent pieces of evidence not only increase certainty but also amplify each other to something stronger than the sum of them. However, our focus is in finding possible contradictions in data and therefore we see no point in merging evidence for hormone receptor status determination with λ > 1.
There is even more potential in the variation of λ when combining the two hormone receptors, estrogen and progesterone. Our suggestion (Formula (9)) is choosing λ = 0.5. By varying λ in the linkage between the two hormone receptors, the amount of unclassified samples can be regulated conveniently. This is illustrated in Figure 9.

Training of λ in ⊕ λ on Real Data
The parameter λ is currently set according to intuitive arguments rather than strict mathematical rules. It would be interesting to investigate the existence of an algorithm to calculate λ depending on arbitrary training data and to develop a mechanism that suggests an optimal choice.
Due to a lack of clean training data of sufficient high quality we have done simulations to train λ appropriately. It turned out that this task is far from trivial and needs further investigation.

Enhanced Evidence Combination Rules
There are many ECRs available and existing ones are constantly being developed further. However, none of these developments could state with certainty or explain conclusively which ECR is beneficial for which application. Therefore we proposed to set the parameter λ according to expert knowledge on the nature of the data.
In any case, the parameterized ECR introduced in this way covers a very wide range of possible combinations of evidence. Unfortunately, it is difficult to assess whether certain additional mathematical requirements for an ECR, such as commutativity, (pseudo-) associativity, idem-potency, invertibility or other characteristics of binary operators could provide additional value.

An Evidence Combination Rule with Constant Ignorance
The larger the expected contradiction between the two BBMs, the smaller one will usually choose λ. On the other hand, if expected contradictions are small, increasing risk may be taken by choosing a larger λ. The question arises as to why one must choose λ at all and not take a λ adapted according to some formula. For example, the reciprocal value of the total mass could be used as the setting point: The net effect of this strategy would be to reduce the variability of m(Ω ) within the samples.
This approach can be pursued particularly elegantly if one assumes the resulting mass of total ignorance as an a priori constant, i.e., m(Ω) = ω. The corresponding ECR would then read For the special case Ω = {+, −}, as valid for hormone receptor determination, this is directly leading to α + β = 1 − ω, with ω representing the basic belief mass of total ignorance.

Modified Frame of Discernment
Our FOD consists of only two outcomes, "positive" and "negative", i.e., Ω = {+, −}. In practice, however, the hormone receptor status is not solely responsible for the therapeutic decisions on therapy. There are different types of receptor-positive patients, and not all will respond equally well to hormone therapy. Therefore expanding the presented model at a later time is inevitable. There are two possible approaches to do so: the first is a refined model. Here, the refinement lies in subdividing "+" further into {+} = {+ 1 , + 0 }, i.e., Ω = {+ 1 , + 0 , −}, which can easily be arranged with part of the clinical data, since the ESR receptor status is often given as a (quasi-)continuous parameter.
The second is to adapt the "open world assumption". There are patients for whom it is basically impossible to make a serious choice for the most suitable treatment method based on the receptor status -even if it is measured precisely and reliably. The outcome for such patients is therefore not covered by the FOD. Such a model can be implemented by allowing a strictly positive BBM of the empty set, ergo m(∅) > 0.

Risk Function for Decision Making
Finally, another open point is the need for a risk function. Wrong decisions regarding therapy are not symmetrical. Adjuvant chemotherapy is often vital, even if only hormone therapy is applied. Some preliminary investigations have already been carried out [25], but considering the specific case, it is still an open field of research. Creating such risk functions is a heavily investigated topic and we will come back to it in a succeeding paper.

Conclusions
Dempster-Shafer theory of belief functions represents a generalization of Bayes' probabilities. It provides a powerful framework to proactively involve the outcome "uncertain" in case of insufficient data availability to make confident decisions. Instead of probabili-ties pieces of evidence, represented by basic belief masses, are concatenated by evidence combination rules.
In this paper, we presented a manner to parameterize evidence combination rules to adjust models to the nature of the incoming measurements. Data with high potential to be contradictory (like gene expression and immunohistochemical measurements) is linked in a more conservative manner than data which is more likely expected to be in consent. Thus, evidence theory avoids several well-known problems with decisions based on conventional statistics.
As a major advancement, our work introduces flexible evidence-combination-rules offering the potential of adaptable risk. Changing the parameters for concatenating pieces of evidence (respectively data) alters the probability for a sample to be classified as "welldefined" or as "uncertain". This is especially helpful to adapt the Dempster-Shafer algorithm to possible different types of risks and directly related possibilities of a treatment decision. Examples are possible over-and under-treatment of particular patients.
To illustrate the strength of evidence theory we used a case study of hormone receptor status determined for breast cancer samples. As a key outcome we estimate that slightly too many patients have been classified as hormone receptor positive by conventional clinicaldecision-making in comparison to our approach. We do not advocate overruling clinical decisions, but rather flagging questionable samples as "uncertain" and suggesting further investigations for these particular patients.
Drawing on flexible evidence combination rules in our approach we see great potential for the advancement in personalized medicine.  Data Availability Statement: All data were downloaded from Gene Expression Omnibus [34].

Acknowledgments:
We thank Michael Cibena, Center for Medical Data Science, Medical University of Vienna, for technical support.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: