Quantile Multi-Attribute Disparity (QMAD): An Adaptable Fairness Metric Framework for Dynamic Environments
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The proposed study provides an outstanding analysis of the fairness situation in the era of artificial intelligence. In domains such as forensic investigation, machine learning algorithms have sparked worries about possible biases being incorporated into decision-making procedures. However, if the arrangement of fundamental features across several groups and machine learning outcomes varies, it is still difficult to determine unfairness. A unique fairness measure framework that takes into account a variety of factors, such as machine learning outcomes and characteristic changes for bias identification, is proposed in this research. The idea and methodology can play a vital role in identifying unfairness in the field of machine learning applications. The presentation of the paper is very good for readers. Thus, it is recommended to accept this article for publication as it is, with one minor suggestion for the authors: create a pictorial representation of the Fairness Metric Framework for Dynamic Environments and add it to the introduction section.
Author Response
Thank you for your valuable suggestion and for taking the time to review our manuscript. We agree that a visual representation can enhance clarity and reader understanding. In response, we have created a pictorial representation of the Fairness Metric Framework for Dynamic Environments and added it to the introduction section (Figure 1). We believe these changes significantly improve the clarity and overall impact of the manuscript, and we appreciate your feedback in helping us strengthen it.
Reviewer 2 Report
Comments and Suggestions for Authors
The authors propose a fairness evaluation framework, aiming to overcome the limitations of existing fairness metrics under dynamic environments and feature distribution shifts. The paper generally reads well. The authors do a thorough empirical evaluation on both synthetic and real-world data.
A few comments intended to improve the paper:
- The current implementation requires manual selection of comparison and aggregation functions. This limits the use of the authors' approach in automated pipelines, and limits the target audience to domain experts.
- When aggregating multiple attributes, the final score's interpretability is questionable, especially when multiple metrics differ in scale or direction.
- QMAD is evaluated against classic fairness metrics such as Statistical Parity Difference and Disparate Impact, but it is not compared to more recent dynamic or context-aware fairness frameworks. The authors very briefly mention that their method is adaptable to context-aware approaches, but this topic is not given enough attention in the manuscript.
- Some results show that QMAD is flagging non-injected attributes as biased (for instance, marital status and race on Day 2), possibly due to statistical test sensitivity. This raises the question of whether false positives could occur that may lead to unnecessary model modifications.
Author Response
Comment: The current implementation requires manual selection of comparison and aggregation functions. This limits the use of the authors' approach in automated pipelines, and limits the target audience to domain experts.
Response: Thank you for your thoughtful observation and for reviewing our manuscript. We acknowledge the limitations of manual configuration in automated pipelines; however, we argue that this is an intentional and necessary design choice given the complexity and contextual sensitivity of fairness evaluation.
In the revised manuscript, we have elaborated on this rationale in the conclusion section. Specifically, we emphasize that fairness evaluation, particularly in high-stakes and sensitive domains (e.g., healthcare, finance, criminal justice), should not be fully automated without substantial risk. Different domains require distinct fairness definitions, assumptions, and interpretive norms — considerations that cannot be universally captured by an automated system.
QMAD is purposefully designed with modular flexibility, enabling domain experts to tailor metric selection according to:
- The structure and distribution of the dataset (e.g., balanced vs. skewed),
- The type of protected attributes (categorical vs. continuous),
- The evaluation objective (bias detection vs. quantification).
This manual pairing of comparison and aggregation functions fosters transparency, interpretability, and domain alignment — aspects we believe are essential for trustworthy fairness auditing. Attempts to fully automate this process risk oversimplifying nuanced statistical or ethical contexts, potentially leading to misleading results or false confidence in fairness assessments.
To support users in this process, we have included Table 12, which provides empirically grounded recommendations for selecting appropriate function pairs across common use cases. This further illustrates that fairness metric design must remain context-sensitive, and that our approach, while not fully automated, enables more responsible and accurate assessments.
We appreciate the opportunity to clarify this design decision and have revised the manuscript accordingly to better highlight this point.
We added to the conclusion section
Fairness evaluation, especially across sensitive societal contexts, cannot be reliably automated without risking misleading or inappropriate conclusions. Different applications (e.g., health vs. finance vs. criminal justice) require different fairness definitions, statistical assumptions, and interpretive norms, which cannot be universally encoded in an automated pipeline. QMAD is designed to offer modular flexibility, allowing practitioners and domain experts to tailor the metric according to:
- The nature of the dataset (e.g., balanced vs. skewed),
- The protected attribute types (continuous vs. categorical),
- The evaluation goal (bias detection vs. bias quantification).
Attempts to fully automate fairness evaluation without domain understanding risk promoting a false sense of fairness. In contrast, QMAD promotes transparent decision-making and context-sensitive fairness auditing.
While QMAD is modular by design, the framework favors manual selection of comparison-aggregation pairs to ensure alignment with the ethical, contextual, and statistical considerations of the target domain. This approach supports more nuanced and trustworthy fairness auditing than fully automated solutions, which risk masking or misrepresenting bias.
Moreover, automation in fairness metrics can foster false confidence and misrepresentation of bias, especially if underlying assumptions (e.g., distribution shape, feature importance, or causal structure) are violated. QMAD intentionally avoids this by enabling transparent, interpretable, and domain-informed configuration. As our empirical framework and use-case mapping show (in the table below), different contexts demand different statistical treatments.
| Use Case | Recommended Comparison Function | Recommended Aggregation Function | Notes |
|---|---|---|---|
| Binary Classification (balanced groups) | Ratio of Means (ROM) | Arithmetic Mean | Standard pair for stable class distributions |
| Binary Classification (imbalanced groups) | Kolmogorov-Smirnov Test (KSTest) | Harmonic Mean | KSTest detects subtle shifts in imbalanced classes |
| Regression (normal distributions) | Ratio of Means (ROM) | Arithmetic Mean | Fast and interpretable metric for regression |
| Regression (skewed distributions) | Anderson-Darling Test (ADTest) | Harmonic Mean | ADTest is more sensitive to tail differences |
Table 12 provides an evidence-based guideline for choosing appropriate function pairs per use case. This further supports that fairness metrics must be context-sensitive and cannot be rigidly automated without compromising interpretability or accuracy.
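To make the pairing idea concrete, here is a minimal Python sketch. The function names and the `PAIRS` mapping below are hypothetical stand-ins for the Table 12 pairs (built on SciPy's `ks_2samp` and `anderson_ksamp`), not the published QMAD API.

```python
# Illustrative sketch only: hypothetical stand-ins for the Table 12 pairs.
from statistics import fmean, harmonic_mean
from scipy import stats

def ratio_of_means(group_a, group_b):
    """ROM: ratio of group means, folded so that 1.0 indicates parity.
    Assumes positive-valued outcomes (e.g., positive-prediction rates)."""
    ratio = fmean(group_a) / fmean(group_b)
    return min(ratio, 1.0 / ratio)  # symmetric in group order

def ks_score(group_a, group_b):
    """KS statistic in [0, 1]; 0 means identical empirical distributions."""
    return stats.ks_2samp(group_a, group_b).statistic

def ad_score(group_a, group_b):
    """Anderson-Darling k-sample statistic, clamped at 0 so it can feed a
    harmonic mean; larger values indicate stronger (tail) differences."""
    return max(stats.anderson_ksamp([group_a, group_b]).statistic, 0.0)

# Use case -> (comparison, aggregation), mirroring Table 12.
PAIRS = {
    "binary_balanced":   (ratio_of_means, fmean),
    "binary_imbalanced": (ks_score, harmonic_mean),
    "regression_normal": (ratio_of_means, fmean),
    "regression_skewed": (ad_score, harmonic_mean),
}

# Example: per-attribute scores aggregated into one summary (toy data).
compare, aggregate = PAIRS["binary_imbalanced"]
scores = [compare([0.10, 0.40, 0.35, 0.20], [0.60, 0.70, 0.55, 0.80]),
          compare([0.30, 0.50, 0.45, 0.40], [0.35, 0.50, 0.40, 0.45])]
print(aggregate(scores))
```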
Comment: When aggregating multiple attributes, the final score's interpretability is questionable, especially when multiple metrics differ in scale or direction.
Response: Thank you for this thoughtful observation. This is an important point, particularly for practitioners aiming to understand the source and magnitude of bias across multiple attributes.
In our current QMAD implementation, we assume that users select comparison functions with consistent scaling and directionality across attributes. For instance, our predefined function pairs (e.g., ROM-Arithmetic, ADTest-Harmonic) were chosen to ensure compatibility in aggregation.
More broadly, the QMAD framework is explicitly designed to preserve interpretability across multiple attributes by requiring coherent, well-matched function pairs.
Each comparison–aggregation pair (e.g., ROM–Arithmetic, ADTest–Harmonic) was chosen based on statistical compatibility, ensuring that:
- All attribute-level scores (Ma) are on a consistent scale,
- The aggregated final score (M) is semantically aligned across attributes,
- Users can interpret “higher” or “lower” values in a unified direction depending on the selected pair.
Furthermore, QMAD’s modularity allows practitioners to group or normalize attributes where cross-scale aggregation may pose a challenge — though in our experiments, such normalization was not necessary due to the careful design of function pairs.
In conclusion, QMAD maintains interpretability precisely because it does not mix incompatible scoring schemes. It gives users control, transparency, and consistency, which are critical for high-stakes fairness auditing.
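A minimal sketch of this point follows, using toy per-attribute scores (the Ma values shown are hypothetical, assumed to share one [0, 1] scale where higher means more disparate):

```python
# Sketch: aggregating per-attribute scores M_a into a final score M.
from statistics import fmean, harmonic_mean

m_a = {"gender": 0.42, "race": 0.10, "education": 0.35}  # toy M_a values

m_arithmetic = fmean(m_a.values())        # weighs every attribute equally
m_harmonic = harmonic_mean(m_a.values())  # pulled toward the smallest score
print(m_arithmetic, m_harmonic)

# Mixing schemes (e.g., a ratio where 1.0 means parity next to a KS
# statistic where 0.0 means parity) would leave either aggregate without a
# single "higher = more disparate" reading; the pairing rule prevents this.
```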
Comment: Some results show that QMAD is flagging non-injected attributes as biased (for instance, marital status and race on Day 2), possibly due to statistical test sensitivity. This raises the question of whether false positives could occur that may lead to unnecessary model modifications.
Response: Thank you for this important observation and for carefully reviewing our results. We appreciate the opportunity to clarify the interpretation of these flagged attributes.
Upon further analysis, we recognize that attributes such as marital status and race (e.g., on Day 2) may not necessarily reflect false positives. Instead, we propose that QMAD’s sensitivity may be surfacing signals of latent or proxy bias — subtle forms of distributional shifts or entanglements with other features that are not explicitly injected but nonetheless meaningful in real-world contexts.
To address this concern directly, we have expanded the discussion in the manuscript and added the following explanation:
We added the following in Section 6.2 (Real-World Data Evaluation Results), under Section 6.2.1 (Bias Detection):
In real-world settings, not all biases are injected deliberately or known a priori. QMAD’s ability to flag these attributes highlights its strength in detecting potential distributional shifts, which may otherwise go unnoticed. For example:
- Marital status may be entangled with education or income levels.
- Race distribution may shift subtly due to sampling variability across time frames.
We now view them as early warnings, as shown in the table below, which can guide further causal or statistical investigation.
| Attribute | Known Bias Injected? | Flagged by QMAD? | Likely False Positive? | Interpretation |
|---|---|---|---|---|
| Education | ✅ | ✅ | ❌ | Intended bias correctly detected |
| Marital Status (Day 2) | ❌ | ✅ | ✅ | Possible proxy or latent bias surfaced |
| Race (Day 3) | ❌ | ✅ | ✅ | Potential sampling shift or entangled bias |
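As one illustration of the follow-up investigation such an early warning could trigger, the sketch below estimates the association between a flagged attribute and a potentially entangled feature via Cramér's V. The CSV path and column names are assumptions based on the standard UCI Adult schema, not code from the paper.

```python
# Illustrative follow-up for a flagged-but-not-injected attribute: check
# whether it is entangled with another feature (e.g., marital status vs.
# education), which would support the proxy reading over a false positive.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V between two categorical series (0 = independent)."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, k) - 1))))

# df = pd.read_csv("adult.csv")  # hypothetical local copy of UCI Adult
# print(cramers_v(df["marital-status"], df["education"]))
```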
We believe these changes significantly improve the clarity and overall impact of the manuscript, and we appreciate your feedback in helping us strengthen it.
Reviewer 3 Report
Comments and Suggestions for Authors
The paper introduces a significant advancement in fairness metrics for Machine Learning (ML) systems, particularly addressing dynamic environments and changing feature distributions. The Quantile Multi-Attribute Disparity (QMAD) metric is well-structured, clearly described, and presents a relevant contribution to the field. However, certain areas could be strengthened:
- Introduction & Background: The background is sufficient, but a more thorough integration and critical review of recent literature on fairness metrics, specifically addressing their limitations in dynamic environments, could further contextualize the necessity of the proposed QMAD framework.
- Research Design: The research design is robust and well-justified, yet clarifying the rationale behind selecting specific synthetic and real-world datasets would improve comprehensibility.
- Methods Description: The methods are adequately described. However, additional clarity could be provided on the specific scenarios or applications where each comparison-aggregation pair might be most effective or limited.
- Results Presentation: Results are clearly presented, but improving visualizations (e.g., figures summarizing bias detection clearly) could enhance readability and quick understanding of key findings.
- Conclusion Support: Conclusions are well-supported by the results. Nevertheless, explicitly stating the limitations of the study and potential directions for future research would make the discussion more balanced.
Comments and Suggestions for Authors:
Your manuscript proposes a compelling fairness metric, QMAD, effectively addressing dynamic feature distributions in ML fairness. To strengthen the paper:
- Enhance the literature review by critically comparing your metric's novelty and contribution to more recent studies in fairness monitoring and bias detection.
- Consider improving the clarity and justification of dataset and scenario selection in the experiments.
- Provide clearer graphical representations of the results to improve reader comprehension.
- Explicitly discuss study limitations and future research opportunities, especially regarding automating the selection of comparison-aggregation function pairs, as indicated in your supplementary materials.
Author Response
Comment: The background is sufficient, but a more thorough integration and critical review of recent literature on fairness metrics, specifically addressing their limitations in dynamic environments, could further contextualize the necessity of the proposed QMAD framework.
Response: Thank you for this important observation and for carefully reviewing our manuscript. In Section 2.1 (Fairness Monitoring), we added the following comparison:
| Framework | Dynamic Fairness | Handles Feature Drift | Multi-Attribute | Time-Based Evaluation | Customizable | Scalable |
|---|---|---|---|---|---|---|
| Statistical Parity Diff. | ❌ No | ❌ No | ❌ Single attribute | ❌ Static only | ❌ Fixed formula | ✅ Yes |
| Equalized Odds | ❌ No | ❌ No | ❌ Single attribute | ❌ Static only | ❌ Fixed formula | ✅ Yes |
| QDD (Ghosh et al., 2022) | ⚠️ Limited | ❌ No | ❌ Only prediction | ❌ Static only | ❌ Fixed structure | ✅ Yes |
| QMAD (Ours) | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Fully modular | ✅ Yes |
- Research Design:
Comment: The research design is robust and well-justified, yet clarifying the rationale behind selecting specific synthetic and real-world datasets would improve comprehensibility.
Response: Thank you for your valuable suggestion and for taking the time to review our manuscript. Regarding the datasets chosen, we would like to emphasize that our selection was intentional and methodologically grounded to ensure comprehensive validation of QMAD.
- The synthetic dataset adapted from Ghosh et al. (2022) was chosen because it enables controlled bias injection across multiple attributes and time frames. This allowed us to isolate and evaluate QMAD’s behavior under precisely designed distributional shifts — something not possible with real-world datasets.
- The UCI Adult Dataset was selected due to its widespread use in fairness benchmarking, enabling comparability with existing literature. More importantly, it contains demographically sensitive attributes like gender, race, education, and marital status — which are highly relevant for fairness analysis in classification tasks.
This dual setup — combining controlled synthetic testing with real-world validation — was designed to rigorously evaluate both the sensitivity and generalizability of QMAD. The synthetic dataset stresses the framework under known conditions, while the real-world dataset demonstrates its applicability in practical, uncontrolled scenarios.
We added the following to Section 5.2.1 (Synthetic Dataset) and Section 5.3.1 (UCI Adult Dataset):
“The synthetic dataset provides a controlled environment to inject measurable biases into features and predictions, allowing for fine-grained validation of QMAD’s detection sensitivity. Meanwhile, the UCI Adult Dataset offers a real-world testbed with demographic richness and relevance, enabling us to evaluate QMAD’s robustness in high-stakes fairness contexts aligned with existing benchmarks.”
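As a rough illustration of what controlled bias injection across time frames can look like, consider the simplified sketch below. This is not the generator adapted from Ghosh et al. (2022); the group structure and shift magnitudes are hypothetical.

```python
# Minimal sketch of controlled bias injection across time frames.
import numpy as np

rng = np.random.default_rng(0)

def make_day(n: int, shift: float) -> dict:
    """One time frame: a binary protected-group flag plus a feature whose
    distribution is shifted for the protected group by a known amount."""
    protected = rng.integers(0, 2, size=n)
    feature = rng.normal(0.0, 1.0, size=n) + shift * protected
    return {"protected": protected, "feature": feature}

day1 = make_day(5_000, shift=0.0)  # no injected bias: ground-truth "fair"
day2 = make_day(5_000, shift=0.5)  # known drift the metric should flag
```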
- Methods Description:
Comment: The methods are adequately described. However, additional clarity could be provided on the specific scenarios or applications where each comparison-aggregation pair might be most effective or limited.
Response: Thank you for this observation. We would like to clarify that the QMAD framework was designed precisely to enable application-aware pairing of comparison and aggregation functions. In fact, we provide a detailed mapping (Table 12) that outlines specific real-world scenarios where each function pair is most suitable.
For instance:
- The ROM–Arithmetic pair is ideal for classification and regression tasks with balanced and well-behaved distributions, where mean differences across bins are informative and interpretable.
- The KSTest–Harmonic pair excels in non-parametric or imbalanced settings, where subtle distributional shifts or outlier-resilient detection is needed.
- The ADTest–Harmonic combination is effective for noisy or skewed datasets, as it provides stronger tail sensitivity and robustness.
These pairings were chosen based on the statistical properties of the functions and their suitability for specific fairness auditing conditions. We have clarified this in the revised manuscript by expanding the explanations in Section 4 (Adaptable Fairness Metric Framework) and incorporating Table 12 directly into the Conclusion.
“Each comparison–aggregation pair evaluated was selected with a specific application scenario in mind. Table 12 presents a mapping between common machine learning use cases (e.g., binary classification, skewed regression, fairness monitoring) and the function pair most appropriate for robust bias detection in that setting. This helps practitioners select the most suitable configuration based on the statistical characteristics of their problem.”
| Use Case | Recommended Comparison Function | Recommended Aggregation Function | Notes |
|---|---|---|---|
| Binary Classification (balanced groups) | Ratio of Means (ROM) | Arithmetic Mean | Standard pair for stable class distributions |
| Binary Classification (imbalanced groups) | Kolmogorov-Smirnov Test (KSTest) | Harmonic Mean | KSTest detects subtle shifts in imbalanced classes |
| Regression (normal distributions) | Ratio of Means (ROM) | Arithmetic Mean | Fast and interpretable metric for regression |
| Regression (skewed distributions) | Anderson-Darling Test (ADTest) | Harmonic Mean | ADTest is more sensitive to tail differences |
- Results Presentation:
Comment: Results are clearly presented, but improving visualizations (e.g., figures summarizing bias detection clearly) could enhance readability and quick understanding of key findings.
Response: Thank you for your valuable suggestion and for taking the time to review our manuscript. We agree that figures can enhance clarity and reader understanding. In response, we have added figures to pages 14 and 16.
- Conclusion Support:
Comment: Conclusions are well-supported by the results. Nevertheless, explicitly stating the limitations of the study and potential directions for future research would make the discussion more balanced.
Response: We appreciate the reviewer’s recognition of the strength of our conclusions. While we agree that highlighting future extensions can enrich the discussion, we emphasize that these are not limitations of QMAD’s validity, but rather intentional boundaries set to keep the framework lightweight, interpretable, and adaptable to diverse settings.
Nevertheless, we have now explicitly outlined the following areas as promising directions for future work, as added to the Conclusion:
- False Positive Control – As QMAD is designed to detect even subtle disparities, we aim to integrate bootstrapped stability checks or multiple testing corrections to prevent over-alerting in noisy datasets.
- Causal & Longitudinal Extensions – Future versions may incorporate causal structure awareness and support fairness tracking over continuous, non-discrete timeframes or policies.
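As a non-authoritative sketch of the first direction, a multiple-testing correction over per-attribute p-values might look like the following; the attribute names and p-values are hypothetical.

```python
# Illustrative false-positive control: Benjamini-Hochberg step-up over
# per-attribute p-values (e.g., from KSTest/ADTest comparisons), so an
# attribute is flagged only if it survives the adjustment.
def benjamini_hochberg(p_values: dict, alpha: float = 0.05) -> set:
    """Return the attributes whose p-values pass the BH procedure."""
    ranked = sorted(p_values.items(), key=lambda kv: kv[1])
    m = len(ranked)
    # Largest rank k with p_(k) <= alpha * k / m; reject everything up to k.
    k = max((i for i, (_, p) in enumerate(ranked, 1) if p <= alpha * i / m),
            default=0)
    return {attr for attr, _ in ranked[:k]}

print(benjamini_hochberg(
    {"education": 0.001, "marital-status": 0.03, "race": 0.04, "age": 0.60}))
# -> {'education'}: only the strongest signal survives in this toy run.
```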
Comments and Suggestions for Authors:
Comment: Enhance the literature review by critically comparing your metric's novelty and contribution to more recent studies in fairness monitoring and bias detection.
Response: We added Table 1 in Section 2.1 (Fairness Monitoring).
Comment: Consider improving the clarity and justification of datasets and scenarios selection in the experiments.
Response: We appreciate the reviewer’s suggestion and would like to clarify that the selection of datasets and experimental scenarios was strategically designed to balance control and generalizability.
Specifically:
The synthetic dataset was adapted from Ghosh et al. (2022) and constructed to allow precise injection of controlled bias across features and time steps. This enables rigorous validation of QMAD's detection sensitivity under quantifiable, known ground-truth conditions — something that real-world datasets cannot provide.
The UCI Adult dataset, widely used in fairness research, was chosen to validate QMAD’s practical utility on real-world demographic attributes like gender, race, and education. This dataset allows comparison with other published fairness metrics and exposes QMAD to authentic distributional and societal biases.
By combining these two datasets, we demonstrate QMAD’s strengths across both controlled, interpretable environments and unstructured fairness settings. This dual evaluation strategy is essential for assessing both the diagnostic precision and real-world applicability of a fairness auditing framework like QMAD.
Comment: Provide clearer graphical representations of the results to improve reader comprehension.
Response: Thank you for your helpful suggestion and for your careful review of the manuscript. In response, we have revised and enhanced the visual presentation of key results by adding clearer and more informative graphical representations. These figures have been incorporated on pages 14 and 16 of the revised manuscript:
- Experimental Results for Synthetic Dataset
- Experimental Results for UCI Adult Dataset
Comment: Explicitly discuss study limitations and future research opportunities, especially regarding automating the selection of comparison-aggregation function pairs, as indicated in your supplementary materials.
Response: We have outlined the following areas as promising directions for future work, as added to the Conclusion:
- False Positive Control – As QMAD is designed to detect even subtle disparities, we aim to integrate bootstrapped stability checks or multiple testing corrections to prevent over-alerting in noisy datasets.
- Causal & Longitudinal Extensions – Future versions may incorporate causal structure awareness and support fairness tracking over continuous, non-discrete timeframes or policies.
We believe these changes significantly improve the clarity and overall impact of the manuscript, and we appreciate your feedback in helping us strengthen it.