Research Brief

When Fairness Metrics Disagree

This research investigates how different fairness metrics can produce conflicting conclusions when evaluating the same machine learning system. The work introduces the Fairness Disagreement Index (FDI), a measure designed to quantify inconsistency across fairness metrics.

Why This Matters

In high-stakes AI systems, fairness evaluation cannot rely on a single metric. False positive rate disparity, false negative rate disparity, accuracy disparity, and worst-group performance may each tell a different story about the same model.

This creates deployment uncertainty: a system may appear acceptable under one fairness measure while showing substantial risk under another.
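To make the disagreement concrete, the following is a minimal sketch that computes the four measures named above for two groups. The toy labels, the group names, and the helper names (`group_rates`, `fairness_report`) are illustrative assumptions, not data or code from the paper. Disparity is taken here as the gap between the highest and lowest group value, which is one common convention and may differ from the one the paper uses.

```python
from dataclasses import dataclass


@dataclass
class GroupRates:
    fpr: float       # false positive rate
    fnr: float       # false negative rate
    accuracy: float


def group_rates(y_true, y_pred):
    """Compute FPR, FNR and accuracy for one group from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return GroupRates(fpr, fnr, (tp + tn) / len(pairs))


def fairness_report(rates_by_group):
    """Max-min gap per metric, plus the worst group's accuracy."""
    fprs = [r.fpr for r in rates_by_group.values()]
    fnrs = [r.fnr for r in rates_by_group.values()]
    accs = [r.accuracy for r in rates_by_group.values()]
    return {
        "fpr_disparity": max(fprs) - min(fprs),
        "fnr_disparity": max(fnrs) - min(fnrs),
        "accuracy_disparity": max(accs) - min(accs),
        "worst_group_accuracy": min(accs),
    }


# Hypothetical toy data: both groups share the same ground truth.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
rates = {
    # group_a: every error is a false positive
    "group_a": group_rates(y_true, [1, 1, 1, 1, 1, 1, 0, 0]),
    # group_b: every error is a false negative
    "group_b": group_rates(y_true, [1, 1, 0, 0, 0, 0, 0, 0]),
}
report = fairness_report(rates)
print(report)
```

In this toy case both groups reach the same accuracy, so accuracy disparity reports no gap, while the false positive and false negative rate disparities are both large. A report based on accuracy alone would miss that the two groups experience different kinds of error.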

Key Findings

Fairness conclusions can change depending on the metric used.
Metric disagreement persists across decision thresholds.
Single-metric fairness reporting is insufficient for reliable deployment assessment.
FDI provides a structured way to quantify fairness disagreement.
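The threshold finding can be illustrated with a small sweep. The sketch below is not the FDI, whose definition is not given in this brief. It only records which per-metric disparities exceed a tolerance at each decision threshold. The scores, the group names, the tolerance of 0.2, and the helper names (`rates`, `flagged_metrics`) are all hypothetical choices made for illustration.

```python
def rates(scores, labels, threshold):
    """FPR, FNR and accuracy for one group at a given decision threshold.

    Assumes the group contains at least one positive and one negative label.
    """
    preds = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(preds, labels))
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    correct = sum(1 for p, y in pairs if p == y)
    return {
        "fpr": fp / labels.count(0),
        "fnr": fn / labels.count(1),
        "accuracy": correct / len(labels),
    }


def flagged_metrics(groups, threshold, tolerance):
    """Names of metrics whose max-min gap across groups exceeds the tolerance."""
    per_group = [rates(s, y, threshold) for s, y in groups.values()]
    flagged = []
    for name in ("fpr", "fnr", "accuracy"):
        values = [r[name] for r in per_group]
        if max(values) - min(values) > tolerance:
            flagged.append(name)
    return flagged


# Hypothetical similarity scores with ground-truth labels (1 = genuine match).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
groups = {
    "group_a": ([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2], labels),
    "group_b": ([0.9, 0.7, 0.5, 0.3, 0.6, 0.4, 0.2, 0.1], labels),
}

sweep = {t: flagged_metrics(groups, t, tolerance=0.2) for t in (0.35, 0.55, 0.75)}
for threshold, flagged in sweep.items():
    print(threshold, flagged)
```

On this toy data only the false negative rate gap is flagged at the lowest and highest thresholds, while all three gaps are flagged at the middle one. Which metrics raise a concern therefore depends on the operating point, which is the kind of inconsistency a disagreement measure is meant to summarise.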

Figure: Comparison of Fairness Disagreement Index (FDI) behaviour across FaceNet and ArcFace models under varying decision thresholds.

The results demonstrate that fairness disagreement persists across different model architectures and threshold settings, reinforcing the need for multi-metric fairness evaluation in deployment-sensitive environments.

Figure: FDI variation across decision thresholds, illustrating how fairness disagreement changes under different operating conditions.

Fairness assessment can vary significantly depending on the metric, threshold, or evaluation perspective used, highlighting the need for structured multi-metric assurance rather than isolated fairness reporting.

Deployment Relevance

For AI assurance, fairness disagreement is not just a reporting issue. It can create uncertainty around whether a system is suitable for real-world deployment, especially in sensitive domains such as biometric recognition, healthcare, law enforcement, and automated risk assessment.

The paper is available on arXiv and the code on GitHub.