Background: Machine learning (ML) has been widely adopted in decision-making, making fairness a central ethical and scientific priority. We developed the Themis chatbot, a Large Language Model (LLM) system designed to explain concepts of ML fairness in an accessible, conversational format.
Methods: Development followed four stages: (1) curating a document corpus of 286 peer-reviewed publications on ML fairness; (2) building Themis by combining a modern LLM (OpenAI’s GPT-4o) with Retrieval-Augmented Generation (RAG); (3) creating FairnessQA, a 340-item benchmark dataset; and (4) evaluating performance against state-of-the-art non-augmented LLMs (DeepSeek R1, GPT-4o, GPT-5, and Grok 3).
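The retrieval-augmented setup in stage (2) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the three-passage `CORPUS`, the bag-of-words cosine scoring, and the helper names `retrieve` and `build_prompt` are hypothetical and are not the Themis implementation, which the abstract does not describe in detail. The sketch only shows the general RAG pattern of ranking passages against a question and placing the top matches in the prompt sent to the LLM.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus standing in for the 286 curated publications.
CORPUS = [
    "Demographic parity requires equal positive prediction rates across groups.",
    "Equalized odds requires equal true positive and false positive rates across groups.",
    "Gradient descent updates parameters along the negative gradient of the loss.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, corpus, k=2):
    """Return the k passages most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        corpus,
        key=lambda passage: cosine(q, Counter(tokenize(passage))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, passages):
    """Assemble the augmented prompt that would be sent to the LLM."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


question = "What does equalized odds require?"
top_passages = retrieve(question, CORPUS, k=1)
prompt = build_prompt(question, top_passages)
```

A production system would typically replace the word-count scoring with dense embeddings and a vector index; the word-overlap version is used here only so the sketch runs with the standard library.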
Results: For the multiple-choice questions, Themis achieved an accuracy of 96.7%, outperforming DeepSeek R1 (90.0%), GPT-4o (89.3%), GPT-5 (92.0%), and Grok 3 (86.7%), and the overall difference was statistically significant (χ²(4) = 10.1, p = 0.038). In the closed-ended questions, Themis achieved the highest accuracy (96.7%), while competing models ranged from 78.0% to 84.0%, and the overall difference was significant (χ²(4) = 23.9, p < 0.001). In the open-ended questions, Themis achieved the highest mean scores for correctness (M = 4.62), completeness (M = 4.59), and usefulness (M = 4.56), and differences were statistically significant (correctness: F(4, 195) = 20.91, p < 0.001; completeness: F(4, 195) = 7.76, p < 0.001; usefulness: F(4, 195) = 2.90, p < 0.001). By consolidating scattered research into an interactive assistant, Themis makes fairness concepts more accessible to educators, researchers, and policymakers. This work demonstrates that retrieval-augmented systems can enhance the public understanding of machine learning fairness at scale.
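The multiple-choice comparison can be reproduced approximately from the reported accuracies with a Pearson chi-square test of homogeneity. The sketch below assumes 150 multiple-choice items per model; that number is not stated in the abstract and is inferred from the 340-item total, the ANOVA degrees of freedom, and the reported percentages. The function names are illustrative, and the p-value uses the closed-form survival function that holds for exactly 4 degrees of freedom.

```python
import math

# ASSUMPTION: 150 multiple-choice items per model (not stated in the abstract).
N_ITEMS = 150

# Accuracies as reported for the multiple-choice questions.
ACCURACY = {
    "Themis": 0.967,
    "DeepSeek R1": 0.900,
    "GPT-4o": 0.893,
    "GPT-5": 0.920,
    "Grok 3": 0.867,
}


def chi_square_homogeneity(correct, n_items):
    """Pearson chi-square for a (models x correct/incorrect) table
    in which every model answered the same number of items."""
    total_correct = sum(correct)
    total_items = n_items * len(correct)
    expected_correct = n_items * total_correct / total_items
    expected_wrong = n_items - expected_correct
    stat = 0.0
    for c in correct:
        stat += (c - expected_correct) ** 2 / expected_correct
        stat += ((n_items - c) - expected_wrong) ** 2 / expected_wrong
    return stat


def chi2_sf_df4(x):
    """Upper-tail probability of a chi-square variable with 4 degrees
    of freedom: exp(-x/2) * (1 + x/2)."""
    return math.exp(-x / 2) * (1 + x / 2)


# Convert reported percentages back to whole-number correct counts.
correct_counts = [round(acc * N_ITEMS) for acc in ACCURACY.values()]
chi2 = chi_square_homogeneity(correct_counts, N_ITEMS)
p_value = chi2_sf_df4(chi2)

print(f"chi2(4) = {chi2:.1f}, p = {p_value:.3f}")
```

Under the 150-item assumption, the statistic and p-value agree with the reported χ²(4) = 10.1, p = 0.038, which suggests the assumption is at least consistent with the abstract.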