Future Internet
  • Article
  • Open Access

27 October 2025

Towards Fair Medical Risk Prediction Software

1 Department of Computer Science and Applied Cognitive Science, University of Duisburg-Essen, 47057 Duisburg, Germany
2 Department of Electrical Engineering and Computer Science, University of Applied Sciences Wismar, Philipp-Mueller-Straße 14, 23966 Wismar, Germany
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue IoT Architecture Supported by Digital Twin: Challenges and Solutions

Abstract

This article examines the role of fairness in software across diverse application contexts, with a particular emphasis on healthcare, and introduces the concept of algorithmic (individual) meta-fairness. We argue that attaining a high degree of fairness—under any interpretation of its meaning—necessitates higher-level consideration. We analyze the factors that may guide the choice of a fairness definition or bias metric depending on the context, and we propose a framework that additionally highlights quality criteria such as accountability, accuracy, and explainability, as these play a crucial role from the perspective of individual fairness. A detailed analysis of requirements and applications in healthcare forms the basis for the development of this framework. The framework is illustrated through two examples: (i) a specific application to a predictive model for reliable lower bounds of BRCA1/2 mutation probabilities using Dempster–Shafer theory, and (ii) a more conceptual application to digital, feature-oriented healthcare twins, with the focus on bias in communication and collaboration. Throughout the article, we present a curated selection of the relevant literature at the intersection of ethics, medicine, and modern digital society.

1. Introduction

Fairness in software is an interdisciplinary and rapidly evolving field with far-reaching implications across many domains of modern life, including law (e.g., assessing the risk of criminal re-offence), finance (evaluating creditworthiness), engineering (autonomous driving decisions), medicine (predicting cancer risk), and even social media (mitigating filter bubbles). There are many definitions of algorithmic fairness, some of which can be shown to be contradictory [1,2], particularly when established ethical principles or standards of social responsibility are in conflict. In non-polar prediction settings, where the interests of individuals and decision-makers coincide (as in many medical applications), fairness may be interpreted as the absence of bias.
When algorithmic fairness is considered, the primary focus is on evaluating the decisions, assignments, or allocations generated by an algorithm or model in a broad sense. In practice, fairness is rarely the sole consideration; other quality criteria such as accuracy, efficiency, or usability are also assumed or explicitly required from the algorithm. A well-designed fairness metric must therefore reflect these performance aspects while remaining sensitive to the specific context and the viewpoints of relevant stakeholders. Moreover, contradictions between different fairness metrics raise the question of which understanding of fairness is itself fair. This question cannot be answered without considering the situation and its context [3].
To address these demands and potential conflicts among existing fairness definitions, we propose a meta-level approach, which we term meta-fairness (MF). Meta-fairness should provide a principled and explainable framework for selecting or combining fairness metrics based on the social and application context, prevailing conceptions of justice, and the utilities of both decision-makers and decision recipients. The foundation for this definition is given by an extensive literature overview of the subject of fairness from [4].
One of the first mentions of the term in the context of algorithms appears in the 2004 paper by Naeve-Steinweg [5]. From a game-theoretic perspective, she highlighted the need to address how agents should be treated when they disagreed on what constituted a fair solution, introducing the demand for “a new kind of properties” capturing “meta-fairness and meta-equity.” In the same year, Hyman [6] explored parallel challenges in mediation, asking, first, how parties and mediators addressed their own sense of justice and fairness; and second, whether and why mediators brought such notions into the process. He stated that “no meta-ethics tells mediators which measures of fairness are appropriate;” “they must choose.” The task of MF is to support this choice.
Recently, the concept of MF has gained increased attention in the literature, with studies such as [7,8] and others exploring its implications. This paper provides a comprehensive review of existing methodologies that, although employing different vocabulary, aim to achieve the goals we understand as the task of MF. Note that the majority of publications do not explicitly use the term ‘meta-fairness’, highlighting the need for a unified terminology and understanding of this concept.
In this paper, we propose a definition of (individual) MF designed to serve as a guiding principle for achieving fairness, even when potential conflicts between criteria arise. Drawing on a literature review on meta-fairness, our goal is to integrate existing approaches into a more comprehensive perspective rather than introducing yet another framework tailored to a specific case. Furthermore, our approach allows for the combination of not only different fairness metrics but also additional quality criteria (e.g., explainability) depending on the application context. We employ a score-based questionnaire system and the Dempster-Shafer theory (DST) [9,10] for this combination.
The MF definition, including its contextual component, is intentionally broad to accommodate diverse application domains and levels of decision-making risk. Its general components, collectively organized under the MF framework, are illustrated here using healthcare as a concrete example. While the framework itself is broadly applicable, the specific questionnaires presented are primarily tailored to healthcare, particularly medical risk assessment tools. We illustrate the application of the framework using a case study on predicting mutation probabilities in the BRCA1/2 tumor-suppressor genes for a low level of decision-making risk, focusing on the challenge of ensuring correct and fair assignment of patients to the low-risk class. An example of applying the MF framework in a broader, conceptual context is its use for digital twins (DTs) in healthcare, where we propose applying meta-fairness to overarching DT principles, allowing fairness and bias to be quantified across each DT category.
The paper is structured as follows. Section 2 reviews related work on the transition from fairness to MF, highlighting that the pursuit of fairness under any interpretation ultimately calls for higher-level reasoning. This part also introduces a working definition of MF that serves as the basis for our analysis. Section 3 provides an outline of the methods used and details the proposed MF framework, including possible questionnaires and DST employment. The concepts are then illustrated through the two mentioned applications, including discussion, in Section 4. The paper concludes with a summary and perspectives for future work.
Note that our intention, as in [7], is to make the MF approach as broadly applicable as possible. Initially, it is independent of any specific application domain and can be employed in finance, healthcare, or other areas, accommodating both polar and non-polar decision settings. Only in Section 3.3, which discusses specific questionnaires, is the approach directly linked to healthcare; even there, several questions and bias lists are applicable beyond this domain. Section 3.4 can again be interpreted in a general context. Section 4.1 illustrates the application of MF to a specific model, while Section 4.2 explores its application to another concept (DT), further highlighting the generality of the proposed approach.

3. Materials and Methods

In addition to the extensive literature review in Section 2, we present a concise overview of the relevant methods in the context of DST and introduce our notation in Section 3.1. In the next subsection, we revisit and improve the metric for assessing medical risk tools from [4], since it plays an important role in the proposed methodology for assessing (individual) fairness using MF, which we describe in the last subsection.

3.1. DST Enhanced by Interval Analysis

Interval analysis (IA) [64] is a widely used technique for result verification. By applying suitable fixed-point theorems, whose conditions can be reliably checked by a computer, IA allows for formal proofs that the outcomes of simulations are correct, assuming the underlying code is sound. IA accounts for errors due to rounding, conversion, discretization, and truncation. The results are expressed as intervals with floating-point bounds guaranteed to enclose the exact solution of the coded computer-based model. Since IA operates on sets, it enables deterministic propagation of bounded uncertainties from inputs to outputs. Some methods also support inverse uncertainty propagation. However, a well-known limitation of basic IA is the excessive widening of interval bounds, due to overestimation (the dependency problem and the wrapping effect). More advanced techniques, such as those based on affine arithmetic or Taylor models, have been developed to mitigate these effects and improve result tightness.
A real interval $[\underline{x}, \overline{x}]$, with $\underline{x}, \overline{x} \in \mathbb{R}$ and, normally, $\underline{x} \le \overline{x}$, is defined as
$[\underline{x}, \overline{x}] = \{ x \in \mathbb{R} \mid \underline{x} \le x \le \overline{x} \}$.
A real number $x \in \mathbb{R}$ can be represented as a point interval with $\underline{x} = \overline{x} = x$. For $\circ \in \{ +, -, \cdot, / \}$, the interval operation $\circ$ between two intervals $[\underline{x}, \overline{x}]$ and $[\underline{y}, \overline{y}]$ is defined as
$[\underline{x}, \overline{x}] \circ [\underline{y}, \overline{y}] = \{ x \circ y \mid x \in [\underline{x}, \overline{x}],\ y \in [\underline{y}, \overline{y}] \}$,
which results in another interval:
$[\min S, \max S]$, where $S = \{ \underline{x} \circ \underline{y},\ \underline{x} \circ \overline{y},\ \overline{x} \circ \underline{y},\ \overline{x} \circ \overline{y} \}$.
For division, it is usually required that $0 \notin [\underline{y}, \overline{y}]$, though extended interval arithmetic can handle zero in the denominator. For some operations, simplified formulas exist (e.g., $[\underline{x}, \overline{x}] - [\underline{y}, \overline{y}] = [\underline{x} - \overline{y}, \overline{x} - \underline{y}]$). Using outward rounding, floating-point bounds can be computed to enclose real intervals reliably. Based on this arithmetic, interval methods can be extended to verified function evaluations and automatic error bounding in solving algebraic or differential equation systems.
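As a minimal illustration of these operations, the following Python sketch implements the min/max formula above for a toy interval type. It deliberately omits the outward rounding of floating-point bounds that any verified implementation must add, and it is not the library code used elsewhere in this paper.

```python
from itertools import product

class Interval:
    """Toy real interval [lo, hi]; no outward rounding (illustration only)."""
    def __init__(self, lo, hi=None):
        self.lo, self.hi = (lo, lo if hi is None else hi)
        assert self.lo <= self.hi

    def _combine(self, other, op):
        # Evaluate op on all four endpoint pairs and take the enclosing hull.
        s = [op(x, y) for x, y in product((self.lo, self.hi), (other.lo, other.hi))]
        return Interval(min(s), max(s))

    def __add__(self, other): return self._combine(other, lambda x, y: x + y)
    def __sub__(self, other): return self._combine(other, lambda x, y: x - y)
    def __mul__(self, other): return self._combine(other, lambda x, y: x * y)
    def __truediv__(self, other):
        if other.lo <= 0 <= other.hi:
            raise ZeroDivisionError("0 in denominator; extended arithmetic needed")
        return self._combine(other, lambda x, y: x / y)

    def __repr__(self): return f"[{self.lo}, {self.hi}]"

# Example: [1, 2] - [0.5, 1] = [0, 1.5], matching the simplified
# subtraction formula [x_lo - y_hi, x_hi - y_lo].
print(Interval(1, 2) - Interval(0.5, 1))
```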
DST [65] combines evidence from different sources to measure confidence that a certain event occurs. In the finite case, DST assigns a (crisp) probability to whether a realization of a random variable $X$ lies in a set $A_i$. The result is expressed through lower and upper bounds (belief and plausibility) on the probability of a subset of the frame of discernment $\Omega$. A random DST variable is characterized by its basic probability assignment (BPA) $M$. If $A_1, \dots, A_n$ are the sets of interest, where each $A_i \in 2^{\Omega}$, then $M$ is defined by
$M : 2^{\Omega} \to [0, 1], \quad M(A_i) = m_i, \ i = 1, \dots, n, \quad M(\emptyset) = 0, \quad \sum_{i=1}^{n} m_i = 1$.
The mass of the impossible event $\emptyset$ is equal to zero. Every $A_i$ with $m_i \neq 0$ is called a focal element (FE). The plausibility and belief functions can be defined with the help of the BPAs for all $i = 1, \dots, n$ and any $Y \subseteq \Omega$ as
$Pl(Y) := \sum_{A_i \cap Y \neq \emptyset} m(A_i), \qquad Bel(Y) := \sum_{A_i \subseteq Y} m(A_i)$.
These two functions define upper and lower non-additive monotone measures [65] on the true probability. The FE masses must sum to one. If the expert-provided masses sum to more than one, normalization is applied as
$\tilde{m}_i := m_i / \sum_{i=1}^{n} m_i$.
If the sum is below one, either the same normalization is used, or a new FE $A_{n+1} = \Omega$ is added to absorb the deficit. The latter is meaningful only for computing the lower limit $Bel(Y)$, while the former may overly inflate the belief function. If there is evidence for the same issue from two or more sources (e.g., given as BPAs $M_1$, $M_2$), the BPAs have to be aggregated. A common method is Dempster's rule:
$K := \sum_{A_j \cap A_k = \emptyset} M_1(A_j) M_2(A_k), \qquad M_{12}(A_i) = \frac{1}{1 - K} \sum_{A_j \cap A_k = A_i} M_1(A_j) M_2(A_k)$
with $A_i \neq \emptyset$ and $M_{12}(\emptyset) = 0$.
Interval BPAs (IBPAs) generalize crisp BPAs by allowing the $m_i$ to be intervals, reflecting uncertainty about the probability that $X$ belongs to a given set. Computations of $Pl(Y)$, $Bel(Y)$, and aggregations proceed as in the crisp case, but with interval arithmetic. Since interval arithmetic lacks additive inverses [64], the condition $\sum_{i=1}^{n} m_i = 1$ is relaxed to $1 \in \sum_{i=1}^{n} m_i$.
Adopting the open-world interpretation of DST may be necessary when the number of FEs n cannot be fixed in advance [66,67]. Furthermore, several approaches have been proposed to address the limitation that Dempster’s rule emphasizes similarities in evidence while disregarding potential conflicts [68].
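To fix ideas, the following minimal Python sketch implements crisp BPAs, the $Bel$ and $Pl$ functions, and Dempster's rule exactly as defined above. The frame and masses in the example are purely hypothetical; the paper's actual implementation is the Julia code in the repositories referenced in Section 4.1, and this toy supports neither IBPAs nor alternative combination rules.

```python
from itertools import product

def bel(bpa, Y):
    """Belief: total mass of focal elements fully contained in Y."""
    return sum(m for A, m in bpa.items() if A <= Y)

def pl(bpa, Y):
    """Plausibility: total mass of focal elements intersecting Y."""
    return sum(m for A, m in bpa.items() if A & Y)

def dempster(m1, m2):
    """Dempster's rule: conjunctive combination with renormalization by 1 - K."""
    K = sum(a * b for (A, a), (B, b) in product(m1.items(), m2.items()) if not (A & B))
    out = {}
    for (A, a), (B, b) in product(m1.items(), m2.items()):
        C = A & B
        if C:
            out[C] = out.get(C, 0.0) + a * b / (1.0 - K)
    return out, K

# Hypothetical frame {mutation 'm', no mutation 'n'} with two evidence sources
M, N, OMEGA = frozenset('m'), frozenset('n'), frozenset('mn')
m1 = {M: 0.5, OMEGA: 0.5}            # source 1: partial evidence for a mutation
m2 = {M: 0.3, N: 0.2, OMEGA: 0.5}    # source 2: weaker, slightly conflicting evidence
m12, K = dempster(m1, m2)
print("conflict K =", round(K, 3))                  # 0.1
print("Bel(m) =", round(bel(m12, M), 3),            # ~0.611
      "Pl(m) =", round(pl(m12, M), 3))              # ~0.889
```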

3.2. A Metric for Medical Risk Tools’ Assessment

A score metric for healthcare software is presented in [4] and extended in the next subsection by integrating the concept of MF from Definition 1. It appears necessary to assign weights to the individual quality criteria and other relevant factors shown in Figure 2, based on constraints such as personal and material effort, availability of on-site resources over the required period, retrievability, and other considerations. These weights should reflect the relative importance of each factor in achieving healthcare success and well-being, and should be determined by experts from the relevant disciplines. If there are different approaches for the evaluations, two basic probability assignments can be used and combined using Dempster’s rule of combination (or any other variant of a combination rule, e.g., [68]) from the DST.
Three popular fairness conditions in risk assessment are formulated in [2]:
  • Calibration within groups (if the algorithm assigns probability x to a group of individuals for being positive with respect to the property of interest, then approximately an x fraction of them should actually be positive);
  • Balance for the positive class (in the positive class, there should be equal average scores across groups);
  • Balance for the negative class (the same as above for the negative class).
In the same article, it is demonstrated that “except in highly constrained special cases, there is no method that can satisfy these three conditions simultaneously.” In other words, determining which fairness condition (or combination of conditions) should be used in a given situation must rely on factors beyond purely mathematical or philosophical definitions of fairness, leading to the proposed concept of MF. Our framework is intended to support the choice of metrics, which is a goal similar to that of the framework introduced in [7]. However, it is important to note that the framework in [7] does not explicitly consider QCs as an aspect of MF. Our approach does not seek to identify the one true combination of fairness conditions for a given situation, but rather aims to find a compromise among multiple possibilities that may be equally suitable (or quantifiably unequal) in their applicability to that situation.
Luther and Harutyunyan [4] formulate the following three requirements for fairness, of which the first one may be assessed by any of the three conditions from [2] mentioned above:
R1 
Subgroups within the cohort that possess unique or additional characteristics, especially those leading to higher uncertainty or requiring specialized treatment, should not be disadvantaged.
R2 
Validated lower and upper bounds are specified for the risk classes.
R3 
The employment of suitable or newly developed technologies and methods is continuously monitored by a panel of experts; the patient’s treatment is adjusted accordingly if new insights become available.
Although originally formulated for healthcare, these requirements are generalizable to any domain employing risk-score–based decision-making software. The latter two requirements extend beyond fairness, reflecting broader criteria for accountable risk assessment systems. Within the proposed MF framework, they can thus be interpreted as steps toward achieving meta-fairness. In the following, we reproduce the metric from Luther and Harutyunyan [4], with enhancements and explanations.
The score fairness metric is defined as a minimal set of requirements and yields a value between 0 and 15, with the higher score indicating a higher fairness degree. Its key advantage lies in its flexibility: it can be readily adapted to incorporate new developments, medical insights, and the increasing relevance of individual genetic markers while maintaining comparability with earlier versions. As a prerequisite, the following should be determined:
  • If the risk is sufficiently specified;
  • If the model for computing the risk factors and calculating the overall risk is valid/accurate;
  • If assignment to a risk class is valid/accurate.
The scoring procedure described below is not applicable unless the conditions specified above are met. This covers the quality criterion of accuracy described in Section 2.1. (All considered QCs are visualized in Figure 3 for a better overview.) In general, these prerequisites imply that the algorithm or software system in use has been developed in accordance with verification and validation (V&V) principles [69,70,71,72,73]. For example, it may mean that the software must produce consistent, accurate, and reproducible risk scores. A detailed discussion of how to achieve that lies beyond the scope of this paper, although this does not mean that some aspects of the QC Accuracy cannot be considered within MF.
Figure 3. A selection of quality criteria considered in Section 2.1, along with a selection of possible interconnections between them (non-exhaustive). Aside from fairness, our literature analysis delivered 49 further keywords for possible QCs (alphabetically): ability, accessibility, accountability, accuracy, applicability, auditability, authority, availability, awareness, causality, comparability, compatibility, completeness, complexity, effectiveness, efficiency, equality, equity, explainability, findability, flexibility, functionality, innovativity, interoperability, interpretability, loyalty, missingness, opportunity, parity, performance, plausibility, privacy, reliability, representativeness, reproducibility, responsibility, retrievability, security, severity, similarity, sustainability, transparency, trustworthiness, uncertainty, usability, utility, validity, willingness, and worthiness.
The overall ‘fairness’ score is calculated by summing the points assigned to responses for the questions below that evaluate various aspects of the algorithm or software system under review. To make later integration into the MF process easier, we specify the QC(s) the questions are aimed at.
  • General information, 1 point: Does the algorithm/system offer adequate information about its purpose, its target groups, patients and their diseases, disease-related genetic variants, doctors, medical staff, experts, their roles, and their tasks? Are the output results appropriately handled? Can FAQs, knowledge bases, and similar information be easily found? (QC Explainability, Auditability)
  • Risk factors:
    • 2 pts Is there accessible and fair information specifying what types of data are expected regarding an individual’s demographics, lifestyle, health status, previous examination results, family medical history, and genetic predisposition, and over what time period this information should be collected? (QC Auditability, Data Lineage, Explainability)
    • 1 pt Does the risk model include risk factors for protected or relevant/eligible groups? (QC Fairness)
  • Assignment to risk classes:
    • 0.5 pts Depending on the disease pattern, examination outcomes, and patients’ own medical samples (e.g., biomarkers), and using transparent risk metrics, are patients assigned to a risk class that is clearly described? (QC Explainability)
    • 0.5 pts If terms such as high risk, moderate risk, or low risk are used, are transition classes provided to avoid assigning similar individuals to dissimilar classes and to include the impact of epistemic uncertainty or missing data? (QC Fairness)
    • 1 pt Are the assignments to (transitional) risk classes made with the help of the risk model validated for eligible patient groups over a longer period of time in accordance with international quality standards? (QC Fairness, Accuracy, Validity)
  • Assistance:
    • 1 pt Can questionnaires be completed in a collaborative manner by patients and doctors together? (QC Auditability, Collaboration)
    • 1 pt Can the treating doctor be involved in decision-making and risk interpretation beyond questionnaire completion; are experts given references to relevant literature on data, models, algorithms, validation, and follow-up? (QC Auditability)
  • Data handling, 2 pts: Is the data complete and of high quality, and was it collected and stored according to relevant standards? Are data and results at disposal over a longer period of time? Are cross-cutting requirements such as data protection, privacy, and security respected? (QC Accuracy, Data Lineage)
  • Result consequences:
    • 1 pt Are the effects of various sources of uncertainty made clear to the patient and/or doctor? (QC Accuracy, Explainability, Auditability)
    • 2 pts Does the output information also include counseling possibilities and help services over an appropriate period of time depending on the allocated risk class? (QC Responsibility, Sustainability)
    • 1 pt Are arbitration boards and mediation procedures in the case of disputes available? (QC Responsibility, Sustainability)
As can be seen from the above, this individual-focused metric emphasizes accountability (particularly auditability) in response to the growing shift in the research from ‘formal fairness’ toward participatory approaches. However, it does not make explicit the context, the goals of decision-makers and decision recipients as well as the applicable patterns of justice and legal norms, as suggested by Hertweck et al. [7]. Therefore, at least for the individual case, it is reasonable to combine both views with each other, since the framework in [7] lacks explicit consideration of context and QCs. Moreover, the questions listed above are organized according to general-language concepts rather than MF categories. What becomes apparent is that, if fairness is understood merely as a set of metrics, then the proposal from [4] extends beyond it.

3.3. The MF Framework: (Individual) Fairness Enhanced Through Meta-Fairness

Discrimination arises when groups or individuals are treated dissimilarly. If a compromise can be found between different metrics reflecting varying degrees of fairness, it is essential to clarify what this implies for the affected person(s). A philosophical question remains as to whether the task of MF is to identify the optimal set of procedures that maximizes fairness for the individual or group, or to pursue a principled compromise among the available alternatives. Our approach is aimed at the latter.
As discussed in Section 2, the concept of algorithmic fairness, both group and individual, has been the subject of extensive literature, numerous implementations, and a wide range of perspectives, philosophies, and debates. Virtually every proposed approach has faced criticism at some point, with some critiques later retracted or revised. In particular, formal mathematical definitions of fairness have been strongly criticized for neglecting crucial aspects such as social context, social dynamics, and historical injustice [52,63]. We do not aim to introduce yet another formalization. Instead, our goal is to build upon and integrate existing approaches in a principled and flexible manner that should accommodate evolving understandings and easily allow the substitution or inclusion of different fairness metrics. Score-based questionnaires, together with DST methods that offer various mechanisms for combining expert opinions under uncertainty, seem particularly well-suited for this purpose.
In the case of non-polar predictions, the utilities of decision-makers and recipients can be assumed to align. When they do not, the DST allows the construction of two BPAs representing the viewpoints of the decision-makers and the decision subjects, which can then be combined using an appropriate DST combination rule (Section 3.1, [68]).
We assume that the same prerequisites as for the metric in Section 3.2 apply. This includes the requirement for the measurement from [52] that “developers are well advised to ensure that the measured properties are meaningful and predictively useful.” Grote [52] formulates relevant adequacy conditions for social objectives, QC Accuracy (referred to as ‘measurement’), social dynamics, and the utility of decision-makers, which we incorporate into the questionnaires presented in this subsection alongside the already mentioned ideas from [4,7].
In our MF framework, the aspects of meta-fairness illustrated in Figure 1 and Figure 2 are initially defined using a checklist and a score-based questionnaire, and, if necessary, subsequently combined with the aid of DST. A key component of each questionnaire is the specification of biases considered by each QC. Since the same bias may appear under slightly different names in the literature, and conversely, biases with identical names may have differing definitions across publications, we provide a list of the considered biases along with their definitions and references in Table 2 to establish a common ground. For each bias, we provide a single reference deemed most relevant, although other references could also be applicable. Examples of biases in the healthcare context, sometimes expressed using alternative terminology, are discussed in [13,74]; in such cases, the definition in the table, rather than the name in Column 2, should be considered authoritative. In the following, we refer to biases by their corresponding numbers in the table. The final column of the table indicates how each bias is associated with the relevant QCs.
Table 2. Biases (‘b’) considered in Section 3.3 and Section 4, with definitions and references, in alphabetical order. ‘Ref’ means ‘reference’; ‘C’ stands for ‘category’ according to [11]: ‘H’ human, ‘SY’ systemic, ‘ST’ statistical. The abbreviations for QCs are: ‘A’ accuracy, ‘Au’ auditability, ‘C’ communication/collaboration, ‘D’ data, ‘E’ explainability, ‘F’ fairness, ‘RS’ responsibility/sustainability, ‘S’ social dynamics.
Definition 2.
The MF Framework is the set of instruments specifying the MF components Context (including Utilities), QCs, Legal/Ethical norms and values, and Social dynamics. In particular, a checklist for Context helps determine the relevance of each subsequent questionnaire about QCs, Legal/Ethical norms and values, and Social dynamics, which in turn give structure to the BPAs used to derive the final score if different opinions or interpretations are possible.
Note that Definitions 1 and 2 have, up to this point, been independent of the healthcare domain. At the stage of the context checklist, however, the domain must be specified, and the subsequent QC questionnaires need to be relevant to this domain—in our case, healthcare. Nevertheless, the bias lists for the QCs, as well as some of the other questions, can be readily generalized to other domains. In the following, we detail the context checklist and the questionnaires.

3.3.1. A Checklist for Context, Utilities from Figure 2

Based on the aspects shown in Figure 1 and Figure 2 and Definition 2, it is necessary to establish the context first, which takes the form of a checklist. The context aspect should include at least the components listed below, but can be extended if required. The actual utilities of decision-makers and recipients are considered to be parts of the context.
  • Domain: healthcare/finance/banking/…
  • Stakeholders: Decision-makers and -recipients
  • Kind of decisions: Classification, ranking, allocation, recommendation
  • Type of decisions: Polar/non-polar
  • Relevant groups: Sets of individuals representing potential sources of inequality (e.g., rich/poor, male/female; cf. [7])
  • Eligible groups: Subgroups in relevant groups having a moral claim to being treated fairly (cf. the concept of a claim differentiator from [7]; e.g., rich/poor at over 50 years of age)
  • Notion of fairness: What is the goal of justice depending on the context (cf. the concept of patterns of justice from [7]; e.g., demographic parity, equal opportunity, calibration, individual fairness)
  • Legal and ethical constraints: Relevant regulatory requirements, industry standards, or organizational policies
  • Time: Dynamics in the population
  • Resources: What is available?
  • Location: Is the problem location-specific? How?
  • Scope: Model’s purpose; short-term or long-term effects? Groups or individuals? Real-time or batch processing? High-stakes or low-stakes?
  • Utilities: What are they for decision-makers, decision-recipients?
  • QCs: What is relevant?
  • Social objectives: What are social goals?
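One possible machine-readable encoding of this checklist is sketched below in Python; the field names and the example instantiation (loosely based on the hypothetical BRCA1/2 context discussed in Section 4.1) are our own illustration rather than part of the framework.

```python
from dataclasses import dataclass, field

@dataclass
class MFContext:
    """Machine-readable context checklist (Section 3.3.1); field names are illustrative."""
    domain: str
    stakeholders: list
    kind_of_decision: str        # classification / ranking / allocation / recommendation
    polar: bool                  # True for polar, False for non-polar decisions
    relevant_groups: list
    eligible_groups: list
    notion_of_fairness: str
    legal_ethical_constraints: list
    time_dynamics: str
    resources: str
    location_specific: bool
    scope: str
    utilities: dict
    relevant_qcs: list
    social_objectives: list = field(default_factory=list)

# Hypothetical instantiation for the BRCA1/2 low-risk assignment case (Section 4.1)
ctx = MFContext(
    domain="healthcare",
    stakeholders=["doctors", "patients"],
    kind_of_decision="classification",
    polar=False,
    relevant_groups=[],                  # not decisive in this context
    eligible_groups=[],
    notion_of_fairness="individual fairness",
    legal_ethical_constraints=[],
    time_dynamics="population dynamics not reflected yet",
    resources="assumed unlimited",
    location_specific=False,
    scope="medium-high stakes, long-term, individual-level",
    utilities={"decision-makers": "correct referral", "recipients": "correct referral"},
    relevant_qcs=["fairness", "accuracy", "explainability"],
)
print(ctx.relevant_qcs)
```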

3.3.2. Questionnaires for Quality Criteria from Figure 2

As discussed earlier, QCs play a crucial role in the MF framework. Our literature review identified 50 candidate QCs used by researchers (cf. Figure 3). Below, we present questionnaires for a subset of QCs that are particularly relevant from the perspective of meta-fairness and that domain experts may refine further. Such refinements could alter the resulting scores, which can then be normalized within the proposed framework. We provide scoring possibilities only for the QCs fairness, accuracy, explainability, auditability, responsibility/sustainability, and communication/collaboration.

3.3.3. QC Fairness ($v_f \in [0, 15]$)

The first component of the score for QC fairness is any actual optimal value $v_{f,1} \in [0, 15]$ of the fairness scores obtained by a formal fairness procedure, for example, using the tools described in Section 2.4. If the result is not directly given on this scale, it needs to be rescaled accordingly; the simplest way is to use a linear (min–max) transformation. For an original range $[a, b]$ and a target range $[c, d]$, the transformation is given by $x_{\mathrm{new}} = \frac{x - a}{b - a} (d - c) + c$. If more than one value is available (there are several interesting points on the Pareto front), then an interval containing them can be chosen (weight $w_1 = 0.5$). Then, the following assessment questions should be answered and the total scores summed up ($v_{f,2} \in [0, 15]$, weight $w_2 = 0.5$); a minimal computational sketch follows the list below. The score $v_f \in [0, 15]$ is computed as the weighted sum $v_f = \sum_{i=1}^{2} w_i \cdot v_{f,i}$.
  • Utilities: Are the correct functions used? (1 pt for the decision-maker’s and -recipient’s each)
  • Patterns of justice: Are they correctly chosen and made explicit? (2 pt)
  • Model: Does the risk model include risk factors to reflect relevant groups? (1 pt)
  • Risk classes: If terms such as high risk, moderate risk, or low risk are used, are transition classes provided to avoid assigning similar individuals to dissimilar classes? (2 pt)
  • Type: Is group (0 pt) or individual (1 pt) fairness assessed? (Individual fairness prioritized.)
  • Bias: What biases out of the following list are tested for and are not exhibited by the model (here and in the following: 1 pt each if both true; 0 otherwise): 1/13/22/26/31/35/39
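A minimal Python sketch of the rescaling and weighting described above follows; the function names and the example inputs are illustrative assumptions, not part of the framework itself.

```python
def rescale(x, a, b, c=0.0, d=15.0):
    """Linear (min-max) transformation of x from [a, b] to [c, d]."""
    return (x - a) / (b - a) * (d - c) + c

def qc_fairness_score(formal_score, formal_range, questionnaire_points,
                      w1=0.5, w2=0.5):
    """v_f = w1 * v_f1 + w2 * v_f2, with both components on the 0-15 scale."""
    v_f1 = rescale(formal_score, *formal_range)
    v_f2 = questionnaire_points          # already expressed in [0, 15]
    return w1 * v_f1 + w2 * v_f2

# Hypothetical example: a formal fairness procedure reports 0.8 on a [0, 1] scale,
# and the questionnaire above yields 10 of 15 points.
print(qc_fairness_score(0.8, (0.0, 1.0), 10.0))   # 0.5*12 + 0.5*10 = 11.0
```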

3.3.4. QC Accuracy ($v_a \in [0, 30]$)

  • General: Does the variable used by the model accurately represent the construct it intends to measure, and is it suitable for the model’s purpose (2 pt)?
  • Uncertainty: Are the effects of various sources of uncertainty made clear to the patient and/or doctor (2 pt)?
  • Data
    • quality, representativeness: are FAIR principles upheld (1 pt each)?
    • lineage: Are data and results at disposal over a longer period of time (1 pt)? Are cross-cutting requirements, e.g., data protection/privacy/security, respected (1 pt)?
    • bias: 20/27/29/37 (1 pt each)
  • Reliability: Is the model verified? Formal/code/result/uncertainty quantification (1 pt each)
  • Validity: Is the variable validated through appropriate (psychometric) testing (1 pt)? Are the assignments to (transitional) risk classes made with the help of the risk model validated for eligible patient groups over a longer period of time in accordance with international quality standards (1 pt)?
  • Society: Does the model align with and effectively contribute to the intended social objectives? I.e., is the model’s purpose clearly defined (1 pt)? Does its deployment support the broader social goals it aims to achieve (1 pt)?
  • Bias: Pertaining to accuracy: 8/11/14/18/24/33/42/43 (1 pt each)

3.3.5. QC Explainability ($v_e \in [0, 15]$)

  • Data: Are FAIR principles upheld with the focus on explainability (1 pt)?
  • Information: Does the algorithm/system offer adequate information about its purpose, its target groups, patients and their diseases, disease-related genetic variants, doctors, medical staff, experts, their roles, and their tasks (4 pt)?
  • Risk classes: Depending on the disease pattern, examination outcomes, and patients’ own medical samples (e.g., biomarkers), and using transparent risk metrics, are patients assigned to a risk class that is clearly described (4 pt)?
  • Effects: Are the effects of various sources of uncertainty made clear to the patient and/or doctor (1 pt)?
  • Bias: 7/15/21/28/30 (1 pt each)

3.3.6. QC Auditability ($v_{au} \in [0, 10]$)

  • Information: Are the output results appropriately handled (1 pt)? Can FAQs, knowledge bases, and similar information be easily found (1 pt)?
  • Data types: Is there accessible and fair information specifying what types of data are expected regarding an individual’s demographics, lifestyle, health status, previous examination results, family medical history, and genetic predisposition, and over what time period this information should be collected (2 pt)?
  • Expert involvement: Are experts given references to relevant literature on data, models, algorithms, validation, and follow-up (2 pt)?
  • Bias: 3/4/5/25 (1 pt each)

3.3.7. QCs Responsibility/Sustainability

  • Counseling: Does the output information also include counseling possibilities and help services over an appropriate period of time depending on the allocated risk class?
  • Mediation: Are arbitration boards and mediation procedures in the case of disputes available?
  • Trade-offs: Is the optimization carried out with respect to sustainability, respect, and fairness goals?
  • Sustainability: Are environmental, economic, and social aspects of sustainability all taken into account?
  • Bias: 2/10/17/36/38

3.3.8. QCs Communication/Collaboration

  • Type of interaction: Are humans ‘out of the loop,’ ‘in the loop,’ ‘over the loop’?
  • Adequacy: Are the model’s outputs used appropriately by human decision-makers? I.e., how are the model’s recommendations or predictions interpreted and acted upon by users? Are they integrated into decision-making processes fairly and responsibly?
  • Collaboration: Can questionnaires be completed in a collaborative manner by patients and doctors together?
  • Integrity: Is third-party interference excluded?
  • Bias: 6/9/12/16/19/23/24/32/40

3.3.9. Legal/Ethical Norms and Values from Figure 2

Key legal norms for algorithms in healthcare are still evolving and aim to balance innovation with patient well-being. While specific requirements vary across countries and jurisdictions, they generally cover patient safety and treatment efficacy, data protection and privacy, transparency and explainability, fairness, accountability and liability, informed consent and autonomy, and cybersecurity. Algorithms’ ethics similarly encompasses values, principles, and practices that apply shared standards of right and wrong to their design and use. As noted by Hanna et al. [74], recurrent ethical topics—respect for autonomy, beneficence and non-maleficence, justice, and accountability—are particularly relevant in healthcare. This again highlights the importance of incorporating QCs within an MF framework.
At least the following two questions are relevant here:
  • Adequacy: Are appropriate legal norms taken into account?
  • Relevance: Are relevant and eligible groups correctly identified and made explicit?
Further definition of this component of the MF framework should be grounded in a multi-stakeholder process. Binding requirements can be provided by legal and regulatory standards, such as anti-discrimination law or data protection regulation. Professional standards and ethical guidelines issued by domain-specific associations (e.g., in medicine, finance, or computer science) can offer further orientation. In addition, ethics committees and review boards play a central role in operationalizing these norms in practice. Finally, the inclusion of perspectives from affected stakeholders and civil society ensures that broader societal values are represented.

3.3.10. Social Dynamics from Figure 2

Considering the final component of MF, social dynamics, means uncovering structural biases, recognizing feedback effects, and preventing ‘fairwashing.’ Structural bias arises from the systemic rules, norms, and practices that shape how data is collected, how decisions are made, and how opportunities are distributed. For instance, in healthcare, clinical studies have historically underrepresented women, resulting in diagnostic models that perform worse for them. At the same time, algorithms influence the very societies that employ them. For example, predictive models often equate past healthcare costs with patient needs. Because marginalized groups historically receive less care, the model underestimates their health risks, leading to fewer resources being allocated. This, in turn, worsens outcomes and feeds back into future training data, reinforcing inequities in a self-perpetuating loop. By uncovering such structural and dynamic effects, we can reduce the risk of fairwashing, that is, presenting an algorithm as fair when it is not. Therefore, considering multiple metrics (or combinations of them), applying explainability tools responsibly, and maintaining a strong focus on accountability, as emphasized in the MF framework proposed here, also helps to take into account social dynamics. Further considerations may be as follows:
  • Time influence: Is the model designed to maintain stable performance across varying demographic groups over time?
  • Interaction: Is the influence of the model on user behavior characterized? To what extent are these effects beneficial or appropriate?
  • Utility: Is the social utility formulated?
  • Purpose fulfillment: Is it assessed whether the purpose can be aligned with the functionality of the software under consideration?
  • Feedback effects: Are there any studies characterizing changes induced by the model in the society?
  • Bias: 34/41
The questionnaires presented above, which specify the components of the MF framework according to Definition 2, indicate vital MF areas that may require further refinement by experts and are not intended to be exhaustive.

3.4. DST for Meta-Fairness

According to Definition 2, there should be a possibility to combine different views on what is fair. DST, especially with imprecise probabilities, provides a suitable means for combining diverse views, as it can represent uncertainty and partial belief without requiring fully specified probability distributions. Its ability to aggregate evidence from multiple sources makes it particularly useful for integrating notions arising from different stakeholders or ethical perspectives, while its explicit treatment of conflict helps to uncover tensions between competing objectives. At the same time, the computational complexity grows quickly with the size of the hypothesis space, the resulting belief and plausibility intervals may be difficult to interpret for non-technical audiences, and Dempster’s rule of combination can yield counterintuitive results under high conflict. Moreover, the DST obviously cannot resolve any normative questions about which fairness notion ought to take precedence.
There are several ways to apply the DST within the MF framework.

3.4.1. Augmented Fairness Score

One approach is to obtain v f , 1 scores (cf. Section 3.3, QC Fairness) in a more sophisticated manner. Metaphorically, unfairness can be viewed as a disease, allowing us to apply the same DST methodology as described in Section 4.1 for breast cancer. In this analogy, the RFs correspond to different types of bias present in an algorithm. We can then assign one BPA to individual unfairness and another to group unfairness, and combine them using Dempster’s rule. Finally, the fairness score can be calculated as one minus the unfairness mass, which includes ambiguous situations, as illustrated in the example below. If ambiguity is not to count as fairness, the fairness mass itself can be used directly as the score.
Let us consider a simple, hypothetical example of this approach in healthcare. In this example, we draw inspiration from Wang et al. [81], who quantified racial disparities in mortality prediction among chronically ill patients, but we focus instead on a hypothetical diabetes screening use case. Note that Wang et al. [81] did not apply DST methods, relying instead on logistic regression and other, less interpretable approaches.
Suppose we have a predictive model that recommends patients for early diabetes screening. Unfairness can arise at both the individual and group levels: at the individual level, two patients with nearly identical medical profiles (e.g., age, BMI, glucose levels nearly the same for a Black and a White patient) may receive substantially different screening recommendations, while at the group level, patients sharing a particular SA may be systematically under-recommended for screening compared to others. For illustration, we examine disparities with respect to ethnicity (Black vs. White).
We define the FEs for the first BPA $M_i$ based on the following three individual unfairness metrics (or RFs in the analogy).
  • Feature similarity inconsistency (FS): A patient with a nearly identical profile to another patient receives a substantially different risk score;
  • Counterfactual bias (CB): If a single attribute is changed while all other features are held fixed (e.g., recorded ethnicities flipped from Black to White), the model recommendation changes, even though it should not medically matter; and
  • Noise sensitivity (NS): Small, clinically irrelevant perturbations (e.g., rounding a glucose level from 125.6 to 126) lead to disproportionately different predictions.
For the group view $M_g$, the metrics/RFs may be selected as follows (cf. the three fairness conditions from [2] listed in Section 3.2).
  • Demographic parity gap: One group is recommended for screening at a substantially lower rate than another, despite having similar risk profiles;
  • Equal opportunity difference: The model fails to identify true positive cases more frequently in one group than in others; and
  • Calibration gap: Predicted risk scores are systematically overestimated for one group and underestimated for another.
In general, the choice of metrics can be motivated by the lists of biases described in Section 3.3. As shown above, the FEs of individual and group BPAs are generally not identical. A transparent way to address this is to construct a mapping $r : 2^{\Omega} \to 2^{\{U, F\}}$, where $\{U, F\}$ (unfair/fair) represents the hypothesis frame of discernment. An example mapping for $M_i$ can be defined as follows: $r(\{FS\}) = \{U\}$ (feature-similarity inconsistency is considered unfair); $r(\{CB\}) = \{U\}$ (this is also regarded as unfair); $r(\{NS\}) = \{U, F\}$ (noise sensitivity is ambiguous); $r(\{FS, CB\}) = \{U\}$ (strong evidence of unfairness); $r(\Omega) = \{U, F\}$ (total ignorance). For brevity, we omit this step for $M_g$.
Suppose the masses for the individual BPA are defined as $M_i(\{FS\}) = 0.30$, $M_i(\{CB\}) = 0.40$, $M_i(\{NS\}) = 0.10$, $M_i(\{FS, CB\}) = 0.15$. We take the mass of $\Omega$ as the remainder, $M_i(\Omega) = 1 - (0.30 + 0.40 + 0.10 + 0.15) = 0.05$. In the simplest case, the transferred masses can be defined as the sum of the masses of all FEs that map exactly to each hypothesis set. That is, $\hat{M}_i(\{U\}) = M_i(\{FS\}) + M_i(\{CB\}) + M_i(\{FS, CB\}) = 0.85$, $\hat{M}_i(\{F\}) = 0$, $\hat{M}_i(\{U, F\}) = 0.15$.
If the group BPA is defined by $M_g(\{U\}) = 0.40$, $M_g(\{F\}) = 0.40$, $M_g(\{U, F\}) = 0.20$, we can proceed to combine the evidence. To select an appropriate combination rule, we first compute the conflict $K$ as in Equation (4). A high value of $K$ indicates strong disagreement among the sources, in which case Dempster’s rule may produce extreme belief values, and an alternative combination rule should be considered. In this example, the conflict is relatively low: $K = \hat{M}_i(\{U\}) M_g(\{F\}) + M_g(\{U\}) \hat{M}_i(\{F\}) = 0.34$. After applying Dempster’s rule, the combined BPA is given by $M_{i,g}(\{U\}) = [0.86, 0.87]$, $M_{i,g}(\{F\}) = [0.09, 0.10]$, $M_{i,g}(\{U, F\}) = [0.04, 0.05]$ (rounding outwards).
The final belief in fairness can be interpreted as excluding ambiguity, lying in the interval $Bel_{i,g}(F) = [0.09, 0.10]$, or, if ambiguity is included, as $Pl_{i,g}(F) = [0.13, 0.14]$ (both values are low in this example). The actual fairness score $v_{f,1}$ must be rescaled for the MF framework to lie in the range from 0 to 15: $v_{f,1} = [0.09, 0.10] \cdot 15 = [1.35, 1.50]$ when the interpretation excluding ambiguity is used.
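The figures above can be checked with a short script. The following Python sketch hard-codes the hypothetical masses, uses crisp arithmetic (so its results fall inside the outward-rounded intervals reported above), and encodes hypothesis sets as simple strings for brevity.

```python
# Transferred individual BPA (after applying the mapping r) and group BPA
m_i = {"U": 0.85, "F": 0.00, "UF": 0.15}
m_g = {"U": 0.40, "F": 0.40, "UF": 0.20}

def meet(a, b):
    """Intersection of hypothesis sets encoded as 'U', 'F', 'UF'."""
    s = set(a) & set(b)
    return "".join(sorted(s, reverse=True))   # '' means empty, 'UF' means {U, F}

# Conflict K and Dempster combination as in Equation (4)
K = sum(m_i[A] * m_g[B] for A in m_i for B in m_g if not meet(A, B))
comb = {}
for A in m_i:
    for B in m_g:
        C = meet(A, B)
        if C:
            comb[C] = comb.get(C, 0.0) + m_i[A] * m_g[B] / (1.0 - K)

print("K =", round(K, 2))                               # 0.34
print({k: round(v, 3) for k, v in comb.items()})        # U ~0.864, F ~0.091, UF ~0.045
bel_F, pl_F = comb["F"], comb["F"] + comb["UF"]
print("Bel(F) ~", round(bel_F, 3), "Pl(F) ~", round(pl_F, 3))   # ~0.091, ~0.136
print("v_f1 ~", round(bel_F * 15, 2))                   # ~1.36, inside [1.35, 1.50]
```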
This approach offers the advantage of high explainability by encoding both the information about the considered evidence types (BPAs) and the underlying reasoning (r). On the other hand, the choice of the mapping r is subjective and must be justified. All masses can be represented as intervals, and the mapping r itself can be defined probabilistically. Dynamical developments can be taken into account by creating cumulative evidence curves, as described in Section 4.1 for prediction of gene mutation probabilities.

3.4.2. General MF Score

Another way to incorporate DST into the proposed MF framework is to use its components from Definition 2 as FEs of an (I)BPA. This can involve, in a first step, using questionnaires to obtain scores for those components as described in Section 3.3. Each questionnaire can be associated with a probability $p_i \in [0, 1]$, computed as the ratio of the achieved score to the total possible score. This probability may be expressed as an interval to reflect the (un)certainty of the expert completing the questionnaire. Alternatively, these probabilities can be provided directly by an expert evaluating the system, without using a questionnaire. Note that an $A_i$ consisting of more than one MF-related aspect can also be considered if an isolated assessment is not feasible.
In the second step, the weights $w_i$ associated with each $A_i$ are determined by the expert based on the context. Here, the weights can be either crisp numbers or intervals. For crisp weights, the condition $\sum_{i=1}^{n} w_i = 1$ must hold; for interval weights, $1 \in \sum_{i=1}^{n} w_i$. A weight can be set to zero if a particular aspect is not to be considered in the current assessment. The mass $m_i$ associated with $A_i$ is then computed as $m_i = p_i \cdot w_i$ (in either the crisp or interval version). To avoid inflating the resulting belief-based score, the mass of the frame of discernment should be assigned as the remainder, without applying any redistribution function.
Finally, we use the BPAs/IBPAs defined in this manner to produce the final assessment score or interval. First, we combine all defined $M_j$, $j = 1, \dots, N$, which may reflect $N$ expert opinions, different interpretations of fairness (if not already captured by the QC Fairness score), or group versus individual perspectives, using an appropriate DST combination rule depending on the degree of conflict $K$. Afterward, we can compute the $Bel$ (or $Pl$) functions for the relevant aspects of interest, or combinations thereof, which can be considered as the general MF score. An example of the approach is given in the discussion of Section 4.1.
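The following Python sketch illustrates the two-step construction with crisp numbers. The questionnaire ratios are borrowed from the worked scores in Section 4.1.3, while the weights and the restriction to a single BPA (so that no combination step is needed) are hypothetical simplifications of our own.

```python
# Step 1: ratios p_i of achieved to maximum questionnaire scores
# (values taken from the worked example in Section 4.1.3)
p = {"fairness": 12.5 / 15, "accuracy": 19.5 / 30, "explainability": 9 / 15}

# Step 2: expert-chosen context weights w_i (crisp case: must sum to 1; hypothetical)
w = {"fairness": 0.5, "accuracy": 0.3, "explainability": 0.2}
assert abs(sum(w.values()) - 1.0) < 1e-9

# Masses m_i = p_i * w_i; the remainder is assigned to the frame of discernment Omega
m = {k: p[k] * w[k] for k in p}
m["Omega"] = 1.0 - sum(m.values())

# Belief-based general MF score: mass supporting the named MF aspects (excluding Omega);
# with a single BPA, no combination rule is needed
bel = sum(v for k, v in m.items() if k != "Omega")
print({k: round(v, 3) for k, v in m.items()})
print("general MF score (belief) =", round(bel, 3))   # ~0.732
```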
More sophisticated approaches to computing the general score are conceivable, drawing inspiration from methods in data fusion. For example, one could incorporate additional assessment criteria for MF expressed in the BPAs dynamically, following the open-world interpretation of DST, or account for differences in their FEs differently from the example in this subsection. We leave these directions for future work.

4. Results: Meta-Fairness, Applied

Alongside advancing the concept of MF, we aim to examine its application in healthcare and risk prevention. First, we revisit our earlier DST-based risk prediction models, asking whether these methods ensure the fair and appropriate assignment of patient groups—as well as individuals and their families—into risk classes. We also consider whether DST provides a better solution compared to logistic regression, or whether it requires further generalization, for example, through the use of interval bounds for probabilities and/or additional evidence combination rules.
Second, building on our experience with DTs in virtual museums [82], we outline an application of the MF concept to DTs in healthcare. Although a substantial body of literature exists on DT fairness, there seems to be limited knowledge of overarching DT principles. Modern DTs rely heavily on AI, including generation and modeling, communication, and collaboration, which introduces multiple levels of consideration and competing metrics, as emphasized in [82]. Each DT feature can now be analyzed additionally from the point of view of MF.

4.1. Predicting BRCA1/2 Mutation Probabilities

Pathogenic variants, or mutations, are a major cause of human disease, particularly cancer. When present in germline cells, such variants are heritable and increase cancer risk both for individuals and within families. Mutations in BRCA1 and BRCA2 are the most well-established causes of hereditary breast cancer (BC) and also confer an elevated risk of ovarian cancer (OC). This condition, formerly known as hereditary breast and ovarian cancer syndrome (HBOC), is increasingly referred to as King syndrome (KS) in recognition of Mary-Claire King, who first identified these genes, and to avoid the misleading implication that the syndrome affects only women or is limited to breast and ovarian cancers.
In [27,28], we introduced a two-stage (interval) DST model that estimates the likelihood of pathogenic variants in BRCA1/2 genes under epistemic uncertainty, with particular focus on the RF of age of cancer onset. In addition, we developed a decision-tree-based approach for classifying individuals into appropriate risk categories [27]. This work was based on a literature review and informed through established online tools such as BOADICEA (via CanRisk https://www.canrisk.org/ (accessed on 21 October 2025)) or Penn II (https://pennmodel2.pmacs.upenn.edu/penn2/ (accessed on 21 October 2025)) in the manner detailed in [27,28]. Implementations of models and methods, as well as all data, are to be found at https://github.com/lorenzgillner/UncertainEvidence.jl (accessed on 21 October 2025), https://github.com/lorenzgillner/BRCA-DST (accessed on 21 October 2025). In this subsection, we illustrate the application of the MF framework from Section 3.3 and Section 3.4 using this example. Here, the question of interest is if reliable assignment to the low-risk class is ensured.

4.1.1. Model Outline

The age at first cancer diagnosis is one of the most important indicators of a BRCA1/2 mutation. It must be accounted for carefully, even when the exact age is unknown, as is often the case for information regarding family members. The model in [28] proposes handling this using a two-stage procedure.
In Stage 1, we define cumulative curves by age for each RF, including BC, OC, BC/OC occurring in the same individual (sp), and additive factors for bilateral BC (bBC) and male BC (mBC). The curves are calculated in 5-year age increments and connected by straight lines. Separate sets of curves are provided for Ashkenazi and non-Ashkenazi Jewish ancestry (cf. Figure 4). In the figure, example probabilities from the literature are indicated for potential genetic test referral, illustrating that, at these thresholds, referrals would almost always occur even without considering multiple RFs. These thresholds appear to be too low, making it especially important to ensure reliable assignment to the low-risk class, as misclassification into a higher-risk category can lead to unnecessary emotional, medical, financial, and social burdens.
Figure 4. Cumulative curves for RF of BC, OC, sp as well as additive factors mBC, bBC for non-Ashkenazi ancestry.
In Stage 2, the lower risk bound $Bel$ is calculated in two steps. First, individual and familial BPAs, $M_p$ and $M_f$, are derived from the cumulative curves. Next, $M_p$ and $M_f$ are combined using an appropriate combination rule (e.g., Dempster’s) to compute the final $Bel$. If the age of onset is uncertain, the resulting value is expressed as an interval. Since we relied on manually collected data from open-access publications, supplemented with predictions from Penn II where necessary, we consider this model a proof of concept for the proposed approach, although it shows good agreement with the literature [28].

4.1.2. Example

Consider a non-Ashkenazi Jewish patient diagnosed with breast and ovarian cancer at age 22, whose mother was diagnosed with bilateral breast cancer at age 50. To compute the lower-risk probability of a BRCA1/2 mutation, we generate $M_p$ for the patient and $M_f$ for her mother from the curves in Figure 4 (see Table 3). For the patient, the exact age is known, resulting in a crisp BPA (Column 2). For the mother, however, only an approximate age is available. Instead of providing only the worst-case estimate by assuming the age to be 51, we construct the BPA over an interval of $[50, 60]$ years, which yields the intervals shown in Column 3. Note that in our model, the curves at 60 also cover ages above 60.
Table 3. Basic probability assignments for the patient, her mother, and combined.
Using Dempster’s rule in Equation (4), we combine the patient’s history with her family history (Column 4). From this, we compute the final belief value as lying in the interval $[0.446, 0.549]$ according to Equation (2), choosing the set of RFs {BC, OC, sp, bBC}. Note that interval arithmetic is used in Equations (2) and (4). The interval corresponds to a best-case lower bound on the BRCA1/2 mutation probability of approximately 45% and a worst-case lower bound of approximately 55%. For comparison, the crisp estimate from Penn II for the same case is 54%. The key advantages of our approach are that both best-case and worst-case bounds are explicitly visible, and the model remains explainable.

4.1.3. Discussion

The model described above does not entirely satisfy the criteria R1–R3 outlined in Section 3.2. While the model corresponds reasonably well to the literature, it remains a proof of concept and can be further refined. The primary reason is that establishing a reliable ground truth for the considered RFs is inherently challenging [28], for instance, because ethnicity is both an important RF and an SA with limited availability in most databases. This discussion is not intended to demonstrate the fairness of our model; rather, it serves to illustrate how the MF framework can be applied.
Specifically, with respect to R1, the model currently distinguishes only two subgroups: the general population and the Ashkenazi Jewish population (with unspecified location). Further work is needed to assess whether this limitation disproportionately disadvantages other groups, such as Black or Latin American populations, and to examine whether geographic context (e.g., the USA vs. Mexico for Latin Americans) plays a significant role. Nonetheless, extending the model to include cumulative curves tailored to these populations should be straightforward once the relevant data become available. Regarding R2, the literature is not fully consistent about the thresholds defining risk classes. For the low-risk class, the probability of a BRCA1/2 mutation is often set below 10%, though in some cases the threshold is 7.5%. The advantage of our model is that this uncertainty can be expressed as the interval $[7.5, 10]$%. Criterion R3 likewise cannot be fully ensured in a simple academic context. New RFs continue to be identified. For example, triple-negative breast cancer (i.e., cancer cells lacking all three common receptors: estrogen, progesterone, and HER2) has been found to be strongly associated with BRCA1 mutations, but this factor is not yet included in the model. (Receptors are proteins that “detect” molecules such as hormones.) Furthermore, the possibility of multiple, distinct BCs in the same patient is not currently captured, since each FE is considered only once within the DST framework. This, too, could be addressed by extending the model in the same manner as for other RFs, provided suitable data are available to construct the cumulative curves in Figure 4.
Suppose that the hypothetical context under consideration is as follows: doctors and patients (stakeholders) act together (non-polar decisions; same utilities) to check whether similar individuals are appropriately assigned to the low-risk class (notion of fairness: individual; kind of decision: classification). In this context, relevant/eligible groups do not play a role; further, we assume that resources are not limited, that there are no further legal norms or ethical values to consider, and that the location is not relevant. Dynamics in the population cannot be sufficiently reflected at the current state of the art.
The model’s purpose is to assess the BRCA1/2 mutation probability in order to help doctors decide whether to send the patient for genetic testing. In the considered context, the stakes are medium-high but long-term: if a person is erroneously assigned to a higher risk class than necessary, no life-threatening developments will be missed. However, this still has negative consequences for both the treating clinic and the patient. Psychologically, it can create unnecessary anxiety, stress, or lifestyle changes. Medically, it increases the risk of overtreatment and the strain on resources. The utilities of all stakeholders are therefore aligned. The overarching social goal is to minimize BC in the population. The relevant QCs are fairness, accuracy, and explainability. In this context, the next step is to fill out the appropriate questionnaires from Section 3.3 for the QCs given above.
QC Fairness. Individual fairness requires a good similarity measure for individuals. With the DST model, similarity is easy to establish: patients with the same RF combinations can be considered similar and are treated the same, so that v_1 can be assumed to be 15. From the questionnaire, the value v_2 is 10, with the model scoring 7 in its non-bias part. The score 0 is given for the following biases:
1. Aggregation bias, since it cannot currently be excluded that false conclusions about individuals are drawn from the population;
22. Inductive bias, since assumptions are built into the model structure, namely that we consider only the general population and the Ashkenazi Jewish population;
35. Representation bias, since we cannot test for this at the moment; and
39. Simplification bias, for a reason similar to that for bias 1.
The overall score is therefore v_f = 12.5, that is, approximately 83%. Note that the model systematically gives higher risk class estimates for the Ashkenazi Jewish population, which is not a sign of discrimination in this case because this group genuinely has higher base rates. That is, any definition of fairness based on equal base rates is not relevant in this context.
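The arithmetic behind the reported score can be reproduced as follows; note that the aggregation formula, the equal weighting of the two parts, and the maximum score of 15 are assumptions made here for illustration, since the exact scheme is defined by the questionnaire in Section 3.3.

```python
# Hypothetical reconstruction of the QC Fairness aggregation (assumed weighting/maximum).
v1 = 15          # similarity part: same RF combinations -> identical treatment
v2 = 7 + 3       # questionnaire part: 7 (non-bias questions) plus 3 (bias questions;
                 # biases 1, 22, 35 and 39 scored 0)
v_max = 15       # assumed maximum attainable score

v_f = (v1 + v2) / 2                   # assumed equal weighting of both parts
print(v_f, f"{v_f / v_max:.0%}")      # 12.5  83%
```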
QC Accuracy. Combining various risk prevention approaches and using decision trees [27], we can offer a reliable lower bound for the transition between low risk and average risk. It is easier to determine intermediate risk classes if interval information on the probability bounds is available. Therefore, we can give a score of 4 for the first two questions; the group of questions about the data scores 6. Note that the score for the general question about the FAIR principles can be assumed to be 4 because the model is available at https://github.com/lorenzgillner/BRCA-DST (accessed on 21 October 2025). The reliability score is 1.5; validity, 0; societal relevance, 2. The score 0 is given to the following biases: 20/29/37 for data and 14/42 for accuracy, which sums overall to v_a = 19.5, or 65%.
QC Explainability. DST methods offer a clear advantage in explainability compared to, for example, a logistic-regression–based model, where the underlying formula is not transparent, potentially introducing implicit dependencies among variables and making the results harder to interpret. The explainability score for the model is therefore relatively high, namely v_e = 9, or 60%, with biases 21 and 30 scored 0.
The first use case illustrates the possible employment of transition classes. Consider two patients, A and B: both are women of the same age, with similar health profiles and no personal history of cancer. Both have a family history of breast cancer, as their mothers were diagnosed at ages 48 and 49, respectively. Estimating the probability of a BRCA1/2 mutation with the Penn II model for each family yields 10% for A and 9% for B. Consequently, A would be referred for genetic testing while B would not, despite their striking similarity. The situation would be even worse for A if the exact age at which her mother was diagnosed were uncertain, for example, if it were only known that the diagnosis occurred at age 45 or later. In such cases, the youngest possible age (45) is used in the calculation, which results in a probability of 11%. This approach is conservative, as using the earliest plausible age ensures the risk is not underestimated.
In our model, the upper bound of the belief-based risk is 7.5% at age 49 and 7.9% at age 48, which is somewhat lower than the estimates produced by the logistic-regression–based model Penn II. If the exact age is unknown, the result can be expressed as an interval, [0.039, 0.10], which reflects both the best-case and worst-case scenarios. Furthermore, if a transition class [7.5%, 10.5%] is introduced between low risk and average risk, all patients with similar profiles would be classified into this category, ensuring fair treatment. Note that the psychological impact of being classified into a transition class is less severe than that of being placed directly into the ‘worse’ class, since the latter implies that one was never fully part of the ‘better’ category. Additionally, the transition class allows us to take into account differences in thresholds reported in the literature.
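The following minimal Python sketch illustrates how such a transition class could be applied to interval-valued risk estimates; the thresholds are taken from the example above, while the lower bounds used for the crisp-age cases are placeholders, as only the upper bounds are stated in the text.

```python
# Minimal sketch of interval-based risk classification with a transition class
# between "low" and "average" risk; thresholds follow the example above (7.5%, 10.5%).
def risk_class(risk_lo, risk_hi, low=0.075, high=0.105):
    """Classify an interval-valued mutation-probability estimate [risk_lo, risk_hi]."""
    if risk_hi < low:
        return "low risk"
    if risk_lo > high:
        return "average (or higher) risk"
    return "transition class"           # the interval touches [low, high]

# Lower bounds below are placeholders; only the upper bounds are taken from the text.
print(risk_class(0.039, 0.075))   # mother diagnosed at 49  -> transition class
print(risk_class(0.039, 0.079))   # mother diagnosed at 48  -> transition class
print(risk_class(0.039, 0.100))   # diagnosis age uncertain -> transition class
```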
As an illustration of the second approach outlined in Section 3.4, consider the following example. In the discussion above, we evaluated our model from the perspective of an individual patient. For this purpose, we define a BPA with the FEs QCF (QC Fairness), QCA (QC Accuracy), QCE (QC Explainability), and C (Context). The corresponding probabilities p_i are derived from the previously obtained scores, and an expert assigns them equal weights w_i (see Table 4, Columns 2 and 3); we assign probability 1 to C, as it has been thoroughly described. A different expert, however, considers context to be central to the assessment and therefore distributes the weights differently, while remaining uncertain about the relative importance of fairness in this setting (see Columns 4 and 5 of the table). Finally, the model can also be evaluated from the perspective of the patient’s family. Since, formally, the framework does not distinguish between the patient and their family, an expert may assign different probabilities to the FEs when viewed from this perspective (see Columns 6 and 7).
Table 4. (I)BPAs for the example for the second use of DST from Section 3.4.
From these assignments, we see that Expert 1 is fairly confident the model is fair to individuals—about 21% certain, with plausibility reaching up to 45%. Expert 2 is less confident, estimating the lower bound of fairness between 8% and 17%. Group fairness (from the family perspective, using the lower bound) is assessed between 12% and 15%. Note that this scale differs from that used in the questionnaires above because of weighting. We combine these opinions using Dempster’s rule, since the overall conflict is average. The individual views are combined first, followed by the group perspective. The resulting belief function for the model’s fairness (whether for the patient or for the family) lies between 12% and 28%. The belief that the model is both fair and accurate ranges from 22% in the worst case to 47% in the best case (all values rounded outward). Overall, we see that a medium level of uncertainty in the inputs approximately doubles the uncertainty present in the final score.
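As a self-contained illustration of this second use of DST, the sketch below combines two expert BPAs over the frame {QCF, QCA, QCE, C} with Dempster’s rule and reads off belief and plausibility for the fairness-related focal element. The masses are purely illustrative and are not the entries of Table 4, so the printed values do not reproduce the percentages reported above.

```python
from itertools import product

FRAME = frozenset({"QCF", "QCA", "QCE", "C"})

def bel(m, h):
    """Belief: total mass of focal elements contained in the hypothesis h."""
    return sum(v for a, v in m.items() if a <= h)

def pl(m, h):
    """Plausibility: total mass of focal elements intersecting the hypothesis h."""
    return sum(v for a, v in m.items() if a & h)

def dempster(m1, m2):
    """Dempster's rule for two crisp BPAs over subsets of FRAME."""
    out, conflict = {}, 0.0
    for (b, mb), (c, mc) in product(m1.items(), m2.items()):
        a = b & c
        if a:
            out[a] = out.get(a, 0.0) + mb * mc
        else:
            conflict += mb * mc
    return {a: v / (1.0 - conflict) for a, v in out.items()}

# Purely illustrative expert BPAs (NOT the entries of Table 4):
expert1 = {frozenset({"QCF"}): 0.21, frozenset({"QCA"}): 0.20,
           frozenset({"QCF", "QCA"}): 0.24, FRAME: 0.35}
expert2 = {frozenset({"QCF"}): 0.10, frozenset({"C"}): 0.30, FRAME: 0.60}

m12 = dempster(expert1, expert2)
h = frozenset({"QCF"})
print(round(bel(m12, h), 3), round(pl(m12, h), 3))   # belief/plausibility of fairness
```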

4.2. Meta-Fairness Applied to Communication in Digital Healthcare Twins

The review by Katsoulakis et al. [83] examines the applications, challenges, and future directions of DT technology in healthcare. Although it does not explore the topic of fairness in depth, the authors emphasize that “ensuring that the digital twin models are free from biases and do not discriminate against individuals or groups is vital.” Key considerations include transparency and fairness in data usage, equitable access to the necessary technology and data, and ethical concerns such as privacy and informed consent. Similarly, the overview publication by Bibri et al. [84] highlights that, in addition to privacy and security, ethical and social aspects of DTs in healthcare should address fairness, bias mitigation, transparency, and accountability in decision-making processes. From a broader perspective, as human DTs and other augmented DTs populate and enhance the digital parallel universe, the Metaverse, Islam et al. [85] emphasizes that algorithmic bias, limited system transparency, and persistent data privacy concerns remain central obstacles to achieving an inclusive and ethical application of AI in this context. These concerns can be conceptually addressed through the MF framework introduced in Section 3.3, as outlined in this subsection.
In recent papers [82,86], we considered feature-oriented DTs of various kinds of virtual museums and formulated an approach for assessing them from the viewpoint of appropriate quality criteria. The features fall into three broad categories (content, communication, and collaboration), each with further subordinate features. A risk-informed virtual museum supports the communication and perception of different types of risks from a variety of threat categories, as well as collaborative risk management. The approach can also be applied to healthcare DTs, since communication and collaboration are important features of such DTs as well. In the following, we take a closer look at these two categories and their subordinate features (cf. Figure 5). Each individual feature can now be examined in terms of MF and related ethical principles.
Figure 5. Possible healthcare DT categories (in red), subordinate features and their quality criteria.

4.2.1. DTs and MF

Figure 5 presents a layered model of healthcare DTs within the Metaverse, illustrating how MF can guide the evolution from core digital twin concepts to accountable, augmented implementations. Three focus categories (content, communication, and collaboration) link the central DT to subordinate components such as software operation, media presentation, participation, learning, and workflow management, highlighting where quality criteria such as accuracy, efficiency, usability, sustainability, and accountability are addressed within this framework. Furthermore, digital twins utilize overarching technologies, including the Internet of Things (IoT) and AI. AI, for instance, plays a central role in DTs, supporting content generation and modeling as well as communication and collaboration. The remaining meta-fairness aspects (beyond the quality criteria) are not explicitly depicted in the figure; Table 2 and the dedicated questionnaire in Section 3.3 summarize the potential biases associated with the communication and collaboration categories to address the QC Fairness. In the following, we discuss these biases in greater detail.
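A feature-oriented description of this kind can be captured in a simple data structure, which then serves as a scaffold for attaching MF-related checks to individual features. The concrete feature-to-criteria assignment below is a hypothetical illustration loosely following Figure 5, not a normative mapping.

```python
# Hypothetical feature-oriented healthcare DT description loosely following Figure 5;
# the feature-to-quality-criteria assignment is illustrative, not normative.
healthcare_dt = {
    "content":       {"software operation":  ["accuracy", "efficiency"],
                      "media presentation":  ["usability"]},
    "communication": {"participation":       ["usability", "accountability"],
                      "learning":            ["accuracy", "sustainability"]},
    "collaboration": {"workflow management": ["efficiency", "accountability"]},
}

# Each (category, feature) pair is a unit to which MF aspects, e.g. the bias items of
# the Section 3.3 questionnaire, can be attached and evaluated.
for category, features in healthcare_dt.items():
    for feature, criteria in features.items():
        print(f"{category} / {feature}: {', '.join(criteria)}")
```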

4.2.2. Communication, Collaboration, and Bias

Bias in communication can arise from differing assessments of the situation, cultural or social differences, prejudice, aversion, incomplete or distorted messages, missing information, unreliable channels, or content that is unintentionally or deliberately manipulated. Bias in communication is therefore largely cognitive in nature. Meier [76] describes eighteen different communication biases that affect how communication partners perceive and interpret each other’s information.
As pointed out by Paulus et al. [87], biases and confirmation errors can reinforce each other. This is particularly harmful when path dependencies arise, whereby an initial data bias not only influences the first decisions but also leads to erroneous decision paths due to confirmation errors. The authors argue that the interplay of data bias and confirmation bias threatens the digital resilience of crisis response organizations. The risk of confirmation bias is high if, for example, data collection is based on undisclosed criteria, resulting in distorted data sets that do not properly reflect the population or cohort under investigation. Further, unjustified beliefs about one’s own abilities, opinions, talents, or values, along with erroneous judgments or inappropriate generalizations about others, can lead to misjudgment of oneself or one’s position in relation to others. In addition, communication may also be distorted by third-party interference, which can exploit or obstruct it. To counter this risk, coding theory techniques such as redundancy and encryption can make communication more secure and error-tolerant.
For the category of collaboration, the top ten biases are identified in [77]. Some of these, such as confirmation and overconfidence biases, are similar to those highlighted by Meier [76] for communication. The key distinction is that communication biases primarily influence how messages are exchanged and interpreted (e.g., stereotyping or framing), whereas collaboration biases affect how groups organize, make decisions, and act collectively (e.g., groupthink or authority bias).
Note that although the biases cataloged in [76,77] are primarily human-oriented, they can be directly translated into checks for algorithmic fairness, since algorithmic bias often originates from human bias. For example, confirmation bias, that is, favoring information that supports pre-existing beliefs, can cause developers to unintentionally overlook evidence that a model is discriminatory. This connection has been widely relied on in the literature. For instance, Lin et al. [88] applies two fairness metrics to collaborative learning and model personalization, demonstrating that both fairness and accuracy of the resulting models can be improved. Similarly, Chen et al. [89] examines the problem of caching and request routing in DT services, incorporating both resource constraints and fairness-awareness in the context of collaboration between multi-access edge computing servers.
Crowdsourcing is a further concern for collaboration-oriented DTs, since companies and cooperating public institutions increasingly rely on it to address staff shortages and to develop innovative products and services at lower cost. Fairness can be associated with various aspects of crowdsourcing, such as remuneration or recognition [90]. The authors of [90] observe that participants’ perceptions of fairness are significantly related to their interest in the products, their perception of innovativeness, and their loyalty intentions. The influence of two different fairness understandings is found to be asymmetric: distributive justice is a fundamental prerequisite to avoid negative behavioral consequences, whereas procedural justice motivates participants and positively affects their commitment.
Since avoiding bias is a prerequisite for fairness, various bias metrics have been developed that differ in how they weight different types of bias. The overview above demonstrates that, by using the MF framework described in Section 3.3, and in particular the DST approach for combining different conceptions of unfairness illustrated in Section 3.4, it is possible to achieve a more comprehensive understanding of potential discrimination and bias-related risks within the DTs.

4.2.3. Discussion

The concept of augmented DTs provides a highly flexible and illustrative framework for selecting specific features across the three DT categories and for evaluating associated biases or constructing bias metrics for these features. Realizing this potential in the sense of the methods proposed in Section 3.4, however, requires further empirical studies, as well as access to their findings and underlying data.
In practical applications, the MF framework can be complemented by augmented (healthcare) DTs. The overarching technologies, such as IoT, AI, and ML, should also be seen through the lens of MF before they are used to augment healthcare DTs. These technologies, along with big data, wearable sensors, and telemedicine, function as cross-cutting elements that support medical professionals in collecting and monitoring health data, identifying patient risks, and communicating treatment plans effectively between patients and clinicians.
By integrating these technological and methodological layers, augmented healthcare DTs offer a more comprehensive, data-informed, and ethically aware approach to patient care, while also providing a structured means to detect, quantify, and mitigate potential biases in the system. We argue that it may be more useful to consider fairness and related concepts not in terms of application domains or contexts and their associated technologies, but rather based on the MF framework and the features of augmented digital twins extending into the Metaverse.

5. Conclusions

The concept of fairness is situated at the intersection of ethical norms, government regulations, political perspectives, and social challenges within a heterogeneous society whose members vary widely in age, background, education, opinions, and purchasing power. These differences often give rise to conflicts, distributional struggles, and neglect of others’ interests. The debate over fairness is further influenced by heterogeneous structures in academia, professional associations with their scientific committees, and lobbyists representing business, employers, and employees. These actors participate with differing objectives and sometimes controversial positions, further complicating the discourse. In this respect, it becomes necessary to examine and discuss fairness and potential discrimination against individuals or groups at a meta-level across various contexts, in order to capture the complex interplay of social, political, and institutional factors.
To evaluate these issues, we proposed a meta-fairness framework that incorporates multiple quality criteria, including accuracy, explainability, responsibility, and auditability, alongside the often competing metrics for assessing fairness and avoiding discrimination. To demonstrate the framework in practice, we provided detailed questionnaires for medical risk assessment tools and outlined strategies for weighting and combining the criteria using DST. While the questionnaire-based evaluation is comprehensive, it is designed to be modular and adaptable, allowing domain experts to prioritize the MF components most relevant to their context. Evaluating all criteria simultaneously may not always be feasible, but missing information can be accommodated through DST and appropriate weighting. Importantly, the subjective expert assessments required to create basic probability assignments in DST are justified not by eliminating subjectivity, but by making it explicit, structured, and open to validation. We took a first step toward this by employing standardized questionnaires and well-defined procedures for processing expert input related to meta-fairness.
We illustrated potential applications of the MF framework in two ways. First, we examined a specific case: risk assessment for genetic mutations influencing early-onset breast cancer. Second, we explored broader, conceptual applications of MF to modern technologies, particularly digital twins (DTs), and assessed their significance in healthcare, where DTs with diverse features are starting to partially replace physical models, systems, and processes.
First and foremost, it is essential to incorporate expert insights from the relevant scientific disciplines into the discussion and to subsequently refine the MF framework and illustrate the concept of meta-fairness. At the moment, our research is based on an extensive literature review described in Section 2. Addressing the fairness-related challenges requires an interdisciplinary scientific discourse, aimed at developing practical concepts for identifying and combating discrimination across its various forms, as well as providing points of contact and effective access to arbitration procedures within institutional settings.
Despite the complexity of this task, our MF framework makes a meaningful contribution by providing a structured view of the various factors influencing fairness in different contexts. A promising direction for future research is the application of the MF framework within reinforcement learning. Real-world RL-enabled systems are highly complex, as agents operate in dynamic environments over extended periods. Ensuring the responsible development and deployment of such systems will therefore require a deeper understanding of fairness in RL, which the MF framework could help to structure and guide. Conversely, it would also be valuable to investigate how principles from RL can inform and enrich the concept of MF. Our long-term objective is to enable the automatic implementation of transparent fairness principles and bias mitigation in AI- and ML-based decision-making processes.

Author Contributions

W.L. and E.A. contributed equally to the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

https://github.com/lorenzgillner (accessed on 21 October 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following common abbreviations are used in this manuscript. This list is not exhaustive.
AI        Artificial Intelligence
BC        Breast Cancer
BPA       Basic Probability Assignment
BRCA1/2   BReast CAncer Gene 1 and 2
COVID-19  COronaVIrus Disease 2019
DST       Dempster–Shafer Theory
DT        Digital Twin
FAIR      Findable, Accessible, Interoperable, Reusable
FAQ       Frequently Asked Questions
FE        Focal Element
IA        Interval Analysis
IoT       Internet of Things
KS        King Syndrome
MF        Meta-Fairness
ML        Machine Learning
OC        Ovarian Cancer
PT        Physical Twin
QC        Quality Criteria
RF        Risk Factor
RL        Reinforcement Learning
SA        Sensitive Attribute
V&V       Verification and Validation

References

  1. Paulus, J.; Kent, D. Predictably unequal: Understanding and addressing concerns that algorithmic clinical prediction may increase health disparities. npj Digit. Med. 2020, 3, 99. [Google Scholar] [CrossRef]
  2. Kleinberg, J.; Mullainathan, S.; Raghavan, M. Inherent Trade-Offs in the Fair Determination of Risk Scores. In Proceedings of the 8th Innovations in Theoretical Computer Science Conference (ITCS 2017), Berkeley, CA, USA, 9–11 January 2017; Papadimitriou, C.H., Ed.; Schloss Dagstuhl–Leibniz-Zentrum für Informatik: Wadern, Germany, 2017; Volume 67, pp. 43:1–43:23. [Google Scholar] [CrossRef]
  3. Castelnovo, A.; Crupi, R.; Greco, G.; Regoli, D.; Penco, I.; Cosentini, A. A clarification of the nuances in the fairness metrics landscape. Sci. Rep. 2022, 12, 4209. [Google Scholar] [CrossRef]
  4. Luther, W.; Harutyunyan, A. Fairness in Healthcare and Beyond—A Survey. JUCS, 2025; to appear. [Google Scholar]
  5. Naeve-Steinweg, E. The averaging mechanism. Games Econ. Behav. 2004, 46, 410–424. [Google Scholar] [CrossRef]
  6. Hyman, J.M. Swimming in the Deep End: Dealing with Justice in Mediation. Cardozo J. Confl. Resolut. 2004, 6, 19–56. [Google Scholar]
  7. Hertweck, C.; Baumann, J.; Loi, M.; Vigano, E.; Heitz, C. A Justice-Based Framework for the Analysis of Algorithmic Fairness-Utility Trade-Offs. arXiv 2022. [Google Scholar] [CrossRef]
  8. Zehlike, M.; Loosley, A.; Jonsson, H.; Wiedemann, E.; Hacker, P. Beyond incompatibility: Trade-offs between mutually exclusive fairness criteria in machine learning and law. Artif. Intell. 2025, 340, 104280. [Google Scholar] [CrossRef]
  9. Shafer, G. A Mathematical Theory of Evidence; Princeton University Press: Princeton, NJ, USA, 1976. [Google Scholar] [CrossRef]
  10. Ferson, S.; Kreinovich, V.; Ginzburg, L.; Myers, D.S.; Sentz, K. Constructing Probability Boxes and Dempster-Shafer Structures; Sandia National Laboratories: Albuquerque, NM, USA, 2003. [Google Scholar] [CrossRef]
  11. Russo, M.; Vidal, M.E. Leveraging Ontologies to Document Bias in Data. arXiv 2024. [Google Scholar] [CrossRef]
  12. Newman, D.T.; Fast, N.J.; Harmon, D.J. When eliminating bias isn’t fair: Algorithmic reductionism and procedural justice in human resource decisions. Organ. Behav. Hum. Decis. Process. 2020, 160, 149–167. [Google Scholar] [CrossRef]
  13. Anderson, J.W.; Visweswaran, S. Algorithmic individual fairness and healthcare: A scoping review. JAMIA Open 2024, 8, ooae149. [Google Scholar] [CrossRef] [PubMed]
  14. AnIML. Bias and Fairness—AnIML: Another Introduction to Machine Learning. Available online: https://animlbook.com/classification/bias_fairness/index.html (accessed on 21 October 2025).
  15. Zliobaite, I. Measuring discrimination in algorithmic decision making. Data Min. Knowl. Discov. 2017, 31, 1060–1089. [Google Scholar] [CrossRef]
  16. Arnold, D.; Dobbie, W.; Hull, P. Measuring Racial Discrimination in Algorithms; Working Paper 2020-184; University of Chicago, Becker Friedman Institute for Economics: Chicago, IL, USA, 2020. [Google Scholar] [CrossRef]
  17. Mosley, R.; Wenman, R. Methods for Quantifying Discriminatory Effects on Protected Classes in Insurance; Research paper; Casualty Actuarial Society: Arlington, VA, USA, 2022. [Google Scholar]
  18. Sanna, L.J.; Schwarz, N. Integrating Temporal Biases: The Interplay of Focal Thoughts and Accessibility Experiences. Psychol. Sci. 2004, 15, 474–481. [Google Scholar] [CrossRef]
  19. Mozannar, H.; Ohannessian, M.I.; Srebro, N. From Fair Decision Making to Social Equality. arXiv 2020. [Google Scholar] [CrossRef]
  20. Ladin, K.; Cuddeback, J.; Duru, O.K.; Goel, S.; Harvey, W.; Park, J.G.; Paulus, J.K.; Sackey, J.; Sharp, R.; Steyerberg, E.; et al. Guidance for unbiased predictive information for healthcare decision-making and equity (GUIDE): Considerations when race may be a prognostic factor. npj Digit. Med. 2024, 7, 290. [Google Scholar] [CrossRef]
  21. Dwork, C.; Hardt, M.; Pitassi, T.; Reingold, O.; Zemel, R. Fairness Through Awareness. arXiv 2011, arXiv:1104.3913. [Google Scholar] [CrossRef]
22. Baloian, N.; Luther, W.; Peñafiel, S.; Zurita, G. Evaluation of Cancer and Stroke Risk Scoring Online Tools. In Proceedings of the 3rd CODASSCA Workshop on Collaborative Technologies and Data Science in Smart City Applications, Yerevan, Armenia, 23–25 August 2022; Hajian, A., Baloian, N., Inoue, T., Luther, W., Eds.; Logos Verlag: Berlin, Germany, 2022; pp. 106–111. [Google Scholar]
  23. Baumann, J.; Hertweck, C.; Loi, M.; Heitz, C. Distributive Justice as the Foundational Premise of Fair ML: Unification, Extension, and Interpretation of Group Fairness Metrics. arXiv 2023, arXiv:2206.02897. [Google Scholar] [CrossRef]
  24. Petersen, E.; Ganz, M.; Holm, S.H.; Feragen, A. On (assessing) the fairness of risk score models. arXiv 2023, arXiv:2302.08851. [Google Scholar] [CrossRef]
  25. Diakopoulos, N.; Friedler, S. Principles for Accountable Algorithms and a Social Impact Statement for Algorithms. Available online: https://www.fatml.org/resources/principles-for-accountable-algorithms (accessed on 21 October 2025).
  26. Alizadehsani, R.; Roshanzamir, M.; Hussain, S.; Khosravi, A.; Koohestani, A.; Zangooei, M.H.; Abdar, M.; Beykikhoshk, A.; Shoeibi, A.; Zare, A.; et al. Handling of uncertainty in medical data using machine learning and probability theory techniques: A review of 30 years (1991–2020). arXiv 2020. [Google Scholar] [CrossRef]
  27. Auer, E.; Luther, W. Uncertainty Handling in Genetic Risk Assessment and Counseling. JUCS J. Univers. Comput. Sci. 2021, 27, 1347–1370. [Google Scholar] [CrossRef]
  28. Gillner, L.; Auer, E. Towards a Traceable Data Model Accommodating Bounded Uncertainty for DST Based Computation of BRCA1/2 Mutation Probability with Age. JUCS J. Univers. Comput. Sci. 2023, 29, 1361–1384. [Google Scholar] [CrossRef]
  29. Pfohl, S.R.; Foryciarz, A.; Shah, N.H. An empirical characterization of fair machine learning for clinical risk prediction. J. Biomed. Inform. 2021, 113, 103621. [Google Scholar] [CrossRef]
  30. Penafiel, S.; Baloian, N.; Sanson, H.; Pino, J. Predicting Stroke Risk with an Interpretable Classifier. IEEE Access 2020, 9, 1154–1166. [Google Scholar] [CrossRef]
  31. Baniasadi, A.; Salehi, K.; Khodaie, E.; Bagheri Noaparast, K.; Izanloo, B. Fairness in Classroom Assessment: A Systematic Review. Asia-Pac. Educ. Res. 2023, 32, 91–109. [Google Scholar] [CrossRef]
  32. University of Minnesota Duluth. Reliability, Validity, and Fairness. 2025. Available online: https://assessment.d.umn.edu/about/assessment-resources/using-assessment-results/reliability-validity-and-fairness (accessed on 8 September 2025).
  33. Moreau, L.; Ludäscher, B.; Altintas, I.; Barga, R.S.; Bowers, S.; Callahan, S.; Chin, G., Jr.; Clifford, B.; Cohen, S.; Cohen-Boulakia, S.; et al. Special Issue: The First Provenance Challenge. Concurr. Comput. Pract. Exp. 2008, 20, 409–418. [Google Scholar] [CrossRef]
  34. Pasquier, T.; Lau, M.K.; Trisovic, A.; Boose, E.R.; Couturier, B.; Crosas, M.; Ellison, A.M.; Gibson, V.; Jones, C.R.; Seltzer, M. If these data could talk. Sci. Data 2017, 4, 170114. [Google Scholar] [CrossRef] [PubMed]
  35. Jacobsen, A.; de Miranda Azevedo, R.; Juty, N.; Batista, D.; Coles, S.; Cornet, R.; Courtot, M.; Crosas, M.; Dumontier, M.; Evelo, C.T.; et al. FAIR Principles: Interpretations and Implementation Considerations. Data Intell. 2020, 2, 10–29. [Google Scholar] [CrossRef]
  36. Wilkinson, M.D.; Dumontier, M.; Jan Aalbersberg, I.J.; Appleton, G.; Axton, M.; Baak, A.; Blomberg, N.; Boiten, J.-W.; da Silva Santos, L.; Bourne, P.E.; et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 2016, 3, 160018. [Google Scholar] [CrossRef]
  37. Sculley, D.; Holt, G.; Golovin, D.; Davydov, E.; Phillips, T.; Ebner, D.; Chaudhary, V.; Young, M.; Dennison, D. Hidden Technical Debt in Machine Learning Systems. In Proceedings of the NIPS’15: 29th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 2494–2502. [Google Scholar]
  38. Qi, Q.; Tao, F.; Hu, T.; Anwer, N.; Liu, A.; Wei, Y.; Wang, L.; Nee, A. Enabling technologies and tools for digital twin. J. Manuf. Syst. 2021, 58, 3–21. [Google Scholar] [CrossRef]
  39. Neto, A.; Souza Neto, J. Metamodels of Information Technology Best Practices Frameworks. J. Inf. Syst. Technol. Manag. 2011, 8, 619. [Google Scholar] [CrossRef]
  40. Waytz, A.; Dungan, J.; Young, L. The whistleblower’s dilemma and the fairness–loyalty tradeoff. J. Exp. Soc. Psychol. 2013, 49, 1027–1033. [Google Scholar] [CrossRef]
  41. Zhang, Y.; Sang, J. Towards Accuracy-Fairness Paradox: Adversarial Example-based Data Augmentation for Visual Debiasing. In Proceedings of the 28th ACM International Conference on Multimedia, New York, NY, USA, 12–16 October 2020; MM ’20. pp. 4346–4354. [Google Scholar] [CrossRef]
  42. Carroll, A.; McGovern, C.; Nolan, M.; O’Brien, A.; Aldasoro, E.; O’Sullivan, L. Ethical Values and Principles to Guide the Fair Allocation of Resources in Response to a Pandemic: A Rapid Systematic Review. BMC Med. Ethics 2022, 23, 1–11. [Google Scholar] [CrossRef]
  43. Emanuel, E.; Persad, G. The shared ethical framework to allocate scarce medical resources: A lesson from COVID-19. Lancet 2023, 401, 1892–1902. [Google Scholar] [CrossRef]
  44. Kirat, T.; Tambou, O.; Do, V.; Tsoukiàs, A. Fairness and Explainability in Automatic Decision-Making Systems. A challenge for computer science and law. arXiv 2022, arXiv:2206.03226. [Google Scholar] [CrossRef]
  45. Modén, M.U.; Lundin, J.; Tallvid, M.; Ponti, M. Involving teachers in meta-design of AI to ensure situated fairness. In Proceedings of the Sixth International Workshop on Cultures of Participation in the Digital Age: AI for Humans or Humans for AI? Co-Located with the International Conference on Advanced Visual Interfaces (CoPDA@AVI 2022), Frascati, Italy, 7 June 2022; Volume 3136, pp. 36–42. [Google Scholar]
  46. Padh, K.; Antognini, D.; Lejal Glaude, E.; Faltings, B.; Musat, C. Addressing Fairness in Classification with a Model-Agnostic Multi-Objective Algorithm. arXiv 2021, arXiv:2009.04441. [Google Scholar]
  47. Jabbari, S.; Joseph, M.; Kearns, M.; Morgenstern, J.; Roth, A. Fairness in Reinforcement Learning. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Precup, D., Teh, Y.W., Eds.; JMLR.org: New York, NY, USA, 2017; Volume 70, pp. 1617–1626. [Google Scholar] [CrossRef]
  48. Reuel, A.; Ma, D. Fairness in Reinforcement Learning: A Survey. arXiv 2024. [Google Scholar] [CrossRef]
  49. Petrović, A.; Nikolić, M.; M, J.; Bijanić, M.; Delibašić, B. Fair Classification via Monte Carlo Policy Gradient Method. Eng. Appl. Artif. Intell. 2021, 104, 104398. [Google Scholar] [CrossRef]
  50. Eshuijs, L.; Wang, S.; Fokkens, A. Balancing the Scales: Reinforcement Learning for Fair Classification. arXiv 2024. [Google Scholar] [CrossRef]
  51. Kim, W.; Lee, J.; Lee, J.; Lee, B.J. FairDICE: Fairness-Driven Offline Multi-Objective Reinforcement Learning. arXiv 2025. [Google Scholar] [CrossRef]
  52. Grote, T. Fairness as adequacy: A sociotechnical view on model evaluation in machine learning. AI Ethics 2024, 4, 427–440. [Google Scholar] [CrossRef]
  53. Kamiran, F.; Calders, T. Data preprocessing techniques for classification without discrimination. Knowl. Inf. Syst. 2012, 33, 1–33. [Google Scholar] [CrossRef]
  54. Menon, A.K.; Williamson, R.C. The cost of fairness in binary classification. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency, New York, NY, USA, 23–24 February 2018; Friedler, S.A., Wilson, C., Eds.; PMLR: New York, NY, USA, 2018; Volume 81, pp. 107–118. [Google Scholar]
  55. Han, X.; Chi, J.; Chen, Y.; Wang, Q.; Zhao, H.; Zou, N.; Hu, X. FFB: A Fair Fairness Benchmark for In-Processing Group Fairness Methods. arXiv 2024, arXiv:2306.09468. [Google Scholar]
  56. Hardt, M.; Price, E.; Srebro, N. Equality of Opportunity in Supervised Learning. arXiv 2016, arXiv:1610.02413. [Google Scholar] [CrossRef]
  57. Baumann, J.; Hannák, A.; Heitz, C. Enforcing Group Fairness in Algorithmic Decision Making: Utility Maximization Under Sufficiency. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, Seoul, Republic of Korea, 21–24 June 2022; FAccT ’22. pp. 2315–2326. [Google Scholar] [CrossRef]
  58. Duong, M.K.; Conrad, S. Towards Fairness and Privacy: A Novel Data Pre-processing Optimization Framework for Non-binary Protected Attributes. In Data Science and Machine Learning; Springer Nature: Singapore, 2023; pp. 105–120. [Google Scholar] [CrossRef]
  59. Bellamy, R.K.E.; Dey, K.; Hind, M.; Hoffman, S.C.; Houde, S.; Kannan, K.; Lohia, P.; Martino, J.; Mehta, S.; Mojsilovic, A.; et al. AI Fairness 360: An Extensible Toolkit for Detecting, Understanding, and Mitigating Unwanted Algorithmic Bias. arXiv 2018, arXiv:1810.01943. [Google Scholar] [CrossRef]
  60. Wang, S.; Wang, P.; Zhou, T.; Dong, Y.; Tan, Z.; Li, J. CEB: Compositional Evaluation Benchmark for Fairness in Large Language Models. arXiv 2025, arXiv:2407.02408. [Google Scholar]
  61. Fan, Z.; Chen, R.; Hu, T.; Liu, Z. FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs. arXiv 2025, arXiv:2410.19317. [Google Scholar]
  62. Jin, R.; Xu, Z.; Zhong, Y.; Yao, Q.; Dou, Q.; Zhou, S.K.; Li, X. FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models. arXiv 2024, arXiv:2407.00983. [Google Scholar]
  63. Weinberg, L. Rethinking Fairness: An Interdisciplinary Survey of Critiques of Hegemonic ML Fairness Approaches. J. Artif. Intell. Res. 2022, 74, 75–109. [Google Scholar] [CrossRef]
  64. Moore, R.E.; Kearfott, R.B.; Cloud, M.J. Introduction to Interval Analysis; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2009. [Google Scholar] [CrossRef]
  65. Ayyub, B.M.; Klir, G.J. Uncertainty Modeling and Analysis in Engineering and the Sciences; Chapman & Hall/CRC: Boca Raton, FL, USA, 2006. [Google Scholar] [CrossRef]
  66. Smets, P. The Transferable Belief Model and Other Interpretations of Dempster-Shafer’s Model. arXiv 2013. [Google Scholar] [CrossRef]
  67. Skau, E.; Armstrong, C.; Truong, D.P.; Gerts, D.; Sentz, K. Open World Dempster-Shafer Using Complementary Sets. In Proceedings of the Thirteenth International Symposium on Imprecise Probability: Theories and Applications, Oviedo, Spain, 11–14 July 2023; de Cooman, G., Destercke, S., Quaeghebeur, E., Eds.; PMLR: New York, NY, USA, 2023; Volume 215, pp. 438–449. [Google Scholar]
  68. Xiao, F.; Qin, B. A Weighted Combination Method for Conflicting Evidence in Multi-Sensor Data Fusion. Sensors 2018, 18, 1487. [Google Scholar] [CrossRef]
  69. IEEE Computer Society. IEEE Standard for System, Software, and Hardware Verification and Validation; IEEE: Piscataway, NJ, USA, 2016. [Google Scholar] [CrossRef]
  70. Auer, E.; Luther, W. Towards Human-Centered Paradigms in Verification and Validation Assessment. In Collaborative Technologies and Data Science in Smart City Applications; Hajian, A., Luther, W., Han Vinck, A.J., Eds.; Logos Verlag: Berlin, Germany, 2018; pp. 68–81. [Google Scholar]
  71. Barnes, J.J.I.; Konia, M.R. Exploring Validation and Verification: How they Differ and What They Mean to Healthcare Simulation. Simul. Heal. J. Soc. Simul. Healthc. 2018, 13, 356–362. [Google Scholar] [CrossRef]
  72. Riedmaier, S.; Danquah, B.; Schick, B.; Diermeyer, F. Unified Framework and Survey for Model Verification, Validation and Uncertainty Quantification. Arch. Comput. Methods Eng. 2020, 28, 1–26. [Google Scholar] [CrossRef]
  73. Kannan, H.; Salado, A. A Theory-driven Interpretation and Elaboration of Verification and Validation. arXiv 2025, arXiv:2506.10997. [Google Scholar]
  74. Hanna, M.G.; Pantanowitz, L.; Jackson, B.; Palmer, O.; Visweswaran, S.; Pantanowitz, J.; Deebajah, M.; Rashidi, H.H. Ethical and Bias Considerations in Artificial Intelligence/Machine Learning. Mod. Pathol. 2025, 38, 100686. [Google Scholar] [CrossRef] [PubMed]
  75. Mehrabi, N.; Morstatter, F.; Saxena, N.; Lerman, K.; Galstyan, A. A Survey on Bias and Fairness in Machine Learning. ACM Comput. Surv. (CSUR) 2021, 54, 1–35. [Google Scholar] [CrossRef]
  76. Meier, J.D. Communication Biases. Sources of Insight. 2025. Available online: https://sourcesofinsight.com/communication-biases/ (accessed on 21 October 2025).
  77. Sokolovski, K. Top Ten Biases Affecting Constructive Collaboration. 2018. Available online: https://innodirect.com/top-ten-biases-in-collaboration/ (accessed on 21 October 2025).
  78. Balagopalan, A.; Zhang, H.; Hamidieh, K.; Hartvigsen, T.; Rudzicz, F.; Ghassemi, M. The Road to Explainability is Paved with Bias: Measuring the Fairness of Explanations. In Proceedings of the FAccT ’22: 2022 ACM Conference on Fairness, Accountability, and Transparency, Seoul, Republic of Korea, 21–24 June 2022; pp. 1194–1206. [Google Scholar] [CrossRef]
  79. Spranca, M.; Minsk, E.; Baron, J. Omission and Commission in Judgment and Choice. J. Exp. Soc. Psychol. 1991, 27, 76–105. [Google Scholar] [CrossRef]
  80. Caton, S.; Haas, C. Fairness in Machine Learning: A Survey. ACM Comput. Surv. 2024, 56, 1–38. [Google Scholar] [CrossRef]
  81. Wang, Y.; Wang, L.; Zhou, Z.; Laurentiev, J.; Lakin, J.R.; Zhou, L.; Hong, P. Assessing fairness in machine learning models: A study of racial bias using matched counterparts in mortality prediction for patients with chronic diseases. J. Biomed. Inform. 2024, 156, 104677. [Google Scholar] [CrossRef]
  82. Luther, W.; Baloian, N.; Biella, D.; Sacher, D. Digital Twins and Enabling Technologies in Museums and Cultural Heritage: An Overview. Sensors 2023, 23, 1583. [Google Scholar] [CrossRef]
  83. Katsoulakis, E.; Wang, Q.; Wu, H.L.; Shahriyari, L.; Fletcher, R.; Liu, J.; Achenie, L.; Liu, H.; Jackson, P.; Xiao, Y.; et al. Digital twins for health: A scoping review. npj Digit. Med. 2024, 7, 77. [Google Scholar] [CrossRef]
  84. Bibri, S.; Huang, J.; Jagatheesaperumal, S.; Krogstie, J. The Synergistic Interplay of Artificial Intelligence and Digital Twin in Environmentally Planning Sustainable Smart Cities: A Comprehensive Systematic Review. Environ. Sci. Ecotechnol. 2024, 20, 100433. [Google Scholar] [CrossRef]
  85. Islam, K.M.A.; Khan, W.; Bari, M.; Mostafa, R.; Anonthi, F.; Monira, N. Challenges of Artificial Intelligence for the Metaverse: A Scoping Review. Int. Res. J. Multidiscip. Scope 2025, 6, 1094–1101. [Google Scholar] [CrossRef]
  86. Luther, W.; Auer, E.; Sacher, D.; Baloian, N. Feature-oriented Digital Twins for Life Cycle Phases Using the Example of Reliable Museum Analytics. In Proceedings of the 8th International Symposium on Reliability Engineering and Risk Management (ISRERM 2022), Hannover, Germany, 4–7 September 2022; Beer, M., Zio, E., Phoon, K.K., Ayyub, B.M., Eds.; Research Publishing: Singapore, 2022; Volume 9, pp. 654–661. [Google Scholar]
  87. Paulus, D.; Fathi, R.; Fiedrich, F.; Walle, B.; Comes, T. On the Interplay of Data and Cognitive Bias in Crisis Information Management. Inf. Syst. Front. 2024, 26, 391–415. [Google Scholar] [CrossRef] [PubMed]
  88. Lin, F.; Zhao, C.; Vehik, K.; Huang, S. Fair Collaborative Learning (FairCL): A Method to Improve Fairness amid Personalization. INFORMS J. Data Sci. 2024, 4, 67–84. [Google Scholar] [CrossRef]
  89. Chen, L.; Zheng, S.; Wu, Y.; Dai, H.N.; Wu, J. Resource and Fairness-Aware Digital Twin Service Caching and Request Routing with Edge Collaboration. IEEE Wirel. Commun. Lett. 2023, 12, 1881–1885. [Google Scholar] [CrossRef]
  90. Faullant, R.; Füller, J.; Hutter, K. Fair play: Perceived fairness in crowdsourcing competitions and the customer relationship-related consequences. Manag. Decis. 2017, 55, 1924–1941. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
