Bias Discovery in Machine Learning Models for Mental Health

Fairness and bias are crucial concepts in artificial intelligence, yet they are relatively ignored in machine learning applications in clinical psychiatry. We computed fairness metrics and present bias mitigation strategies using a model trained on clinical mental health data. We collected structured data related to the admission, diagnosis, and treatment of patients in the psychiatry department of the University Medical Center Utrecht. We trained a machine learning model to predict future administrations of benzodiazepines on the basis of past data. We found that gender plays an unexpected role in the predictions; this constitutes bias. Using the AI Fairness 360 package, we implemented reweighing and discrimination-aware regularization as bias mitigation strategies, and we explored their implications for model performance. This is the first application of bias exploration and mitigation in a machine learning model trained on real clinical psychiatry data.


Introduction
For over ten years there has been increasing interest in the psychiatry domain for using Machine Learning (ML) to aid psychiatrists and nurses [36]. Recently, multiple approaches have been tested for Violence Risk Assessment (VRA) [31,26,45], suicidal behaviour prediction [48], and the prediction of involuntary admissions [19], among others.
Using ML for clinical psychiatry is appealing both as a time-saving instrument and as a way to provide insights to clinicians that might otherwise remain unexploited. Clinical ML models are usually trained on patient data, which includes some protected attributes, such as gender or ethnicity. We want models to give equivalent outputs for equivalent patients who differ only in the value of a protected attribute [9]. Yet, a systematic assessment of the fairness of ML models used for clinical psychiatry is lacking in the literature.
As a case study, we focus on the task of predicting future administrations of benzodiazepines. Benzodiazepines are prescription drugs used in the treatment of, for example, anxiety and insomnia. Long-term use of benzodiazepines is associated with increased medical risks, such as cancer [22]. In addition, benzodiazepines in high doses are addictive, with complicated withdrawal [39]. From a clinical perspective, gender should not play a role in the prescription of benzodiazepines [12,49]. Yet biases in the prescription of benzodiazepines have been explored extensively in the literature; some protected attributes that contributed to bias were prescriber gender [5], patient ethnicity [37,7], and patient gender [34], as well as interaction effects between some of these protected attributes [30,28]. There is no conclusive consensus regarding these correlations, with some studies finding no correlations between socio-demographic factors and benzodiazepine prescriptions [29].
We explore the effects of gender fairness bias on a model trained to predict the future administration of benzodiazepines to psychiatric patients based on past data, including past doses of benzodiazepines. A possible use case of this model is to identify patients that are at risk of taking benzodiazepines for too long. We hypothesize that our model is likely to unfairly use the patient's gender in making predictions. If that is the case, then mitigation strategies must be put in place to reduce this bias. We expect that there will be a cost to predictive performance.
Our research questions are:
1. For a model trained to predict future administrations of benzodiazepines based on past data, does gender unfairly influence the decisions of the model?
2. If gender does influence the decisions of said model, how much model performance is sacrificed when applying mitigation strategies to avoid the bias?
To answer these questions, we employ a patient dataset from the University Medical Center (UMC) Utrecht and train a model to predict future administrations of benzodiazepines. We apply the bias discovery and mitigation toolbox AI Fairness 360 [4]. Whenever we find that gender bias is present in our model, we present an appropriate way to mitigate this bias. Our main contribution is a first implementation of a fairness evaluation and mitigation framework on real-world clinical data from the psychiatry domain. We present a way to mitigate a real and well known bias in benzodiazepine prescriptions, without loss of performance. In Section 2 we describe our materials and methods, including a review of previous work in the field. In Section 3 we present our results, which we discuss in Section 4. We present our conclusions in Section 5.

Related Work
The study of bias in machine learning has garnered attention for several years [2]. [11] outlined the dangers of selection bias. Even when researchers attempt to be unbiased, problems might arise, such as bias from an earlier work trickling down into a new model [3] or implicit bias from variables correlated with protected attributes [8,25]. [6] reviewed bias in machine learning, noting also that there is no industry standard for the definition of fairness. [10] evaluated bias in a machine learning model used for university admissions; they also point out the difference between individual and group fairness, as do [52]. [17] and [14] provided theoretical frameworks for the study of fairness. Along the same lines, [41] and [13] provided metrics for the evaluation of fairness. [20] and [21] recommend methods for mitigating bias.
As for particular applications, [42], [50] and [51] studied race and gender bias in facial analysis systems. [27] evaluated fairness in dialogue systems. While they did not actually evaluate ML models, [23] highlighted the importance of bias mitigation in AI for education.
In the medical domain, [15] pointed out the importance of bias mitigation. Indeed, [47] uncovered bias in post-operative complication predictions. [43] found that disparities metrics change when transferring models across hospitals. Finally, [1] explored the impact of random seeds on the fairness of classifiers using clinical data from MIMIC-III, and found that small sample sizes can also introduce bias.
No previous study on ML fairness or bias focuses on the psychiatry domain. This domain is interesting because bias seems to be present in the daily practice. We have already discussed in the introduction how bias is present in the prescription of benzodiazepines. There are also gender disparities in the prescription of zolpidem [16] and in the act of seeking psychological help [32]. [44] also found racial disparities in clinical diagnoses with mania. Furthermore, psychiatry is a domain where a large amount of data is in the form of unstructured text, which is starting to be exploited for ML solutions [40,46]. Previous work has also focused on the explainability of text-based computational support systems in the psychiatry domain [18]. It will be crucial as these text-based models begin to be applied in the clinical practice to ensure that they too are unbiased towards protected attributes.

Data
We employ de-identified patient data from Electronic Health Records (EHRs) from the psychiatry department at the UMC Utrecht. Patients in the dataset were admitted to the psychiatry department between June 2011 and May 2021. The five database tables included are: admissions, patient information, medication administered, diagnoses, and violence incidents. Table 1 shows the variables present in each of the tables.
We construct a dataset where each data point corresponds to one admission, observed 14 days after the patient was admitted. We select only completed admissions (admission […]). 3192 admissions (i.e., data points) are included in our dataset. These are coupled with data from the other four tables mentioned above. The nursing ward ID is converted to four binary variables; some rows do not belong to any nursing ward ID (because, for example, the patient was admitted outside of psychiatry and then transferred to psychiatry); these rows have zeros for all four nursing ward ID columns. For diagnoses, the diagnosis date was not always present in the dataset. In that case, we used the end date of the treatment trajectory. If that was also not present, we used the start date of the treatment trajectory. One of the entries in the administered medication table had no date of administration; this entry was removed. We only consider administered medication (administered = True). Doses of various tranquilizers are converted to an equivalent dose of diazepam, according to Table 2 [33].
For each admission, we obtain the age of the patient at the start of the dossier from the patient table. The gender is reported in the admissions table; only the gender assigned at birth is included in this dataset. We count the number of violence incidents before admission and the number of violence incidents during the first 14 days of admission. The main diagnosis groups were converted to binary values, where 1 means that this diagnosis was present for that admission, and that it took place during the first 14 days of admission.
Table 1: Datasets retrieved from the psychiatry department of the UMC Utrecht, with the variables present in each dataset that are used for this study. Psychiatry is divided into four nursing wards. For the "medication" dataset, the "Administered" and "Not administered" variables contain in principle the same information; however, sometimes only one of them is filled.
Other binary variables derived from the diagnoses table are "Multiple problem" and "Personality disorder". For all diagnoses present for a given admission, we computed the maximum and minimum "levels of care demand", and saved them as two new variables. Matching the administered medication to the admissions by patient ID and date, we compute the total amount of diazepam-equivalent benzodiazepines administered in the first 14 days of admission, and the total administered in the remainder of the admission. The former is one of the predictor variables. The target variable is binary, i.e., whether benzodiazepines are administered during the remainder of the admission or not.
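As an illustration of how the dose predictor and the binary target described above could be derived, the following is a minimal sketch in plain Python. The `Administration` record, its field names, and the convention that day 0 is the admission day are assumptions made for this example; they are not taken from the study's code, and doses are assumed to be already converted to diazepam equivalents.

```python
from dataclasses import dataclass


@dataclass
class Administration:
    # Hypothetical record layout, for illustration only.
    day: int            # days since admission (0 = day of admission; an assumption)
    diazepam_mg: float  # dose already expressed in diazepam equivalents


def build_features(administrations, window_days=14):
    """Return (total dose in the first window, binary target for the remainder)."""
    early = sum(a.diazepam_mg for a in administrations if a.day < window_days)
    late = sum(a.diazepam_mg for a in administrations if a.day >= window_days)
    # The target is 1 if any benzodiazepine is administered after the window.
    return early, int(late > 0)


records = [Administration(2, 10.0), Administration(5, 5.0), Administration(20, 2.5)]
early_dose, target = build_features(records)
print(early_dose, target)  # 15.0 1
```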
The final dataset consists of 3192 admissions. Of these, 1724 correspond to men and 1468 to women. In total, 2035 admissions have some benzodiazepines administered during the first 14 days of admission, while 1980 admissions have some benzodiazepines administered during the remainder of the admission. Table 3 shows the final list of variables included in the dataset.

Evaluation Metrics
The performance of the model is evaluated using the balanced accuracy (the average of the true positive rate and the true negative rate) and the F1 score. To quantify bias, we use four metrics:
• Statistical Parity Difference, discussed in [10], is the difference between the rates of favourable outcomes received by the unprivileged group and the privileged group. If the statistical parity difference is 0, privileged and unprivileged groups receive the same percentage of positive classifications. Statistical parity is an indicator for representation and therefore a group fairness metric. If the value is negative, the privileged group has an advantage.
• Disparate Impact is computed as the ratio of the rate of favourable outcomes for the unprivileged group to that of the privileged group [13]. This value should be as close to 1 as possible for a fair result; a value lower than 1 implies a benefit for the privileged group.
• Equal Opportunity Difference is the difference between the true positive rates of the unprivileged group and the privileged group. It evaluates the ability of the model to classify the unprivileged group compared to the privileged group. The value should be as close to 0 as possible for a fair result. If the value is negative, the privileged group has an advantage.
• Average Odds Difference is the average of the differences in false positive rates and true positive rates between the unprivileged group and the privileged group. The value should be as close to 0 as possible for a fair result. If the value is negative, the privileged group has an advantage.
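The four fairness metrics described above can be sketched directly from their definitions. The snippet below is a plain-Python illustration, not the AI Fairness 360 implementation; the function names, the encoding of groups as 1 (privileged) and 0 (unprivileged), and the treatment of a prediction of 1 as the favourable outcome are assumptions made for this example.

```python
def rate(values):
    """Fraction of 1s in a list of 0/1 values (0.0 for an empty list)."""
    return sum(values) / len(values) if values else 0.0


def group_rates(y_true, y_pred, group, g):
    """Selection rate, true positive rate and false positive rate within group g."""
    idx = [i for i, s in enumerate(group) if s == g]
    selection = rate([y_pred[i] for i in idx])
    tpr = rate([y_pred[i] for i in idx if y_true[i] == 1])
    fpr = rate([y_pred[i] for i in idx if y_true[i] == 0])
    return selection, tpr, fpr


def fairness_metrics(y_true, y_pred, group, privileged=1, unprivileged=0):
    """All differences are unprivileged minus privileged; negative favours the privileged."""
    sel_u, tpr_u, fpr_u = group_rates(y_true, y_pred, group, unprivileged)
    sel_p, tpr_p, fpr_p = group_rates(y_true, y_pred, group, privileged)
    return {
        "statistical_parity_difference": sel_u - sel_p,
        # Undefined when the privileged group receives no favourable outcomes.
        "disparate_impact": sel_u / sel_p,
        "equal_opportunity_difference": tpr_u - tpr_p,
        "average_odds_difference": 0.5 * ((fpr_u - fpr_p) + (tpr_u - tpr_p)),
    }


# Toy data: four privileged (1) and four unprivileged (0) individuals.
group = [1, 1, 1, 1, 0, 0, 0, 0]
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = fairness_metrics(y_true, y_pred, group)
```

In this toy example the privileged group is selected three times out of four and the unprivileged group once out of four, so all three difference metrics are negative and the disparate impact is below 1.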

Machine Learning Methods
We use AI Fairness 360, a package for the discovery and mitigation of bias in machine learning models. The protected attribute in our dataset is gender, while the favourable class is "man". We employ two classification algorithms implemented in scikit-learn [35]: logistic regression and random forest. For logistic regression, we use the "liblinear" solver. For the random forest classifier, we use 500 estimators, with min_samples_leaf equal to 25.
There are three types of bias mitigation techniques: pre-processing, in-processing, and post-processing [8]. Pre-processing techniques mitigate bias by removing the underlying discrimination from the dataset. In-processing techniques are modifications to the machine learning algorithms to mitigate bias during model training. Post-processing techniques seek to mitigate bias by equalizing the odds post-training. We use two methods for bias mitigation. As a pre-processing method, we use the reweighing technique of [20], and re-train our classifiers on the reweighed dataset. As an in-processing method, we add a discrimination-aware regularization term to the learning objective of the logistic regression model. This is called a prejudice remover. We set the fairness penalty parameter eta to 25, which is high enough that prejudice is removed aggressively, but not so high that accuracy is significantly compromised [21]. Both of these techniques are implemented in AI Fairness 360. To apply post-processing techniques in practice, one needs a training set and a test set; once the model is trained, the test set is used to determine how outputs should be modified in order to limit bias. However, in clinical applications datasets tend to be small, so we envision a realistic scenario in which the entire dataset is used for development, making the use of post-processing methods impossible. For this reason, we do not study these methods further. The workflow of data, models and bias mitigation techniques is shown in Figure 1.
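To make the idea behind reweighing concrete, the sketch below computes instance weights of the form w(s, y) = P(s) P(y) / P(s, y), where s is the protected attribute and y the label, so that the two become statistically independent in the weighted data. This is a plain-Python illustration of the technique, not the AI Fairness 360 code; the function name and the toy data are assumptions made for this example.

```python
from collections import Counter


def reweighing_weights(group, label):
    """Weight each instance by P(group) * P(label) / P(group, label)."""
    n = len(label)
    n_group = Counter(group)
    n_label = Counter(label)
    n_joint = Counter(zip(group, label))
    return [
        (n_group[s] * n_label[y]) / (n * n_joint[(s, y)])
        for s, y in zip(group, label)
    ]


def weighted_positive_rate(group, label, weights, g):
    """Weighted fraction of positive labels within group g."""
    total = sum(w for s, w in zip(group, weights) if s == g)
    positive = sum(w for s, y, w in zip(group, label, weights) if s == g and y == 1)
    return positive / total


# Toy data: positives are over-represented in group 1 before reweighing.
group = [1, 1, 1, 1, 0, 0, 0, 0]
label = [1, 1, 1, 0, 1, 0, 0, 0]
weights = reweighing_weights(group, label)
```

Over-represented combinations (here, positives in group 1) receive weights below 1 and under-represented ones weights above 1. Such weights can then be supplied to a classifier that accepts per-sample weights, for example through the `sample_weight` argument of a scikit-learn estimator's `fit` method.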
To estimate the uncertainty due to the choice of training data, we use 5-fold cross-validation, with patient IDs as group identifiers to avoid using the same sample for development and testing. Within each fold, we again split the development set into 62.5% training and 37.5% validation, once again with patient IDs as group identifiers to avoid using the same sample for training and validation. We train the model on the training set, and use the validation set to compute the optimal classification threshold, which is the threshold that maximizes the balanced accuracy on the validation set. We then re-train the model on the entire development set, and compute the performance and fairness metrics on the test set. Finally, we compute the mean and standard deviation of all metrics across the 5 folds. The code used to generate the dataset and train the machine learning models is provided as a GitHub repository.
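The grouped splitting described above can be sketched as follows. This is a minimal plain-Python illustration under the assumption that folds are formed by shuffling the unique patient IDs and distributing them round-robin; the function name, the seed, and the toy IDs are hypothetical. Libraries such as scikit-learn offer equivalent utilities (for example `GroupKFold`), and the source does not state which one was used.

```python
import random


def grouped_folds(patient_ids, n_folds=5, seed=0):
    """Yield (dev_idx, test_idx) pairs in which no patient appears on both sides."""
    unique_ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique_ids)
    # Every admission of a given patient is assigned to the same fold.
    fold_of = {pid: i % n_folds for i, pid in enumerate(unique_ids)}
    for k in range(n_folds):
        test_idx = [i for i, pid in enumerate(patient_ids) if fold_of[pid] == k]
        dev_idx = [i for i, pid in enumerate(patient_ids) if fold_of[pid] != k]
        yield dev_idx, test_idx


# Toy data: ten admissions belonging to seven patients.
patient_ids = ["a", "a", "b", "c", "c", "c", "d", "e", "f", "g"]
folds = list(grouped_folds(patient_ids))
```

The inner split of each development set into training and validation data could be obtained in the same way, by assigning patients rather than individual admissions to each side.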

Results
Each of our classifiers outputs a continuous prediction for each test data point. We convert these to binary classifications by comparing with a classification threshold. Figures 2 through 7 show the trade-off between balanced accuracy and fairness metrics as a function of the classification threshold. Figures 2 and 3 show how the disparate impact error and average odds difference vary together with the balanced accuracy as a function of the classification threshold of a logistic regression model with no bias mitigation, for one of the folds of cross-validation. The corresponding plots for the random forest classifier show the same trends. The performance and fairness metrics after cross-validation are shown in Tables 4 and 5, respectively. Since we observe bias (see Section 4 for further discussion), we implement the mitigation strategies detailed in Section 2.4. Figures 4 and 5 show the validation plots for a logistic regression classifier with reweighing for one of the folds of cross-validation; the plots for the random forest classifier show similar trends. Figures 6 and 7 show the validation plots for a logistic regression classifier with prejudice remover.
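The conversion of continuous predictions into binary classifications, and the choice of the threshold that maximizes balanced accuracy, can be illustrated with a short sketch. The function names, the threshold grid, and the toy scores below are assumptions made for this example; the sketch also assumes that a score at or above the threshold is classified as positive and that both classes are present in the labels.

```python
def balanced_accuracy(y_true, y_pred):
    """Average of the true positive rate and the true negative rate."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)


def best_threshold(y_true, scores, thresholds):
    """Return the threshold with the highest balanced accuracy (first one wins ties)."""
    def score_at(threshold):
        preds = [int(s >= threshold) for s in scores]
        return balanced_accuracy(y_true, preds)
    return max(thresholds, key=score_at)


# Toy validation data: labels and continuous classifier outputs.
y_true = [0, 0, 1, 1, 1]
scores = [0.2, 0.6, 0.4, 0.7, 0.9]
grid = [i / 100 for i in range(1, 100)]
threshold = best_threshold(y_true, scores, grid)
```

For this toy data the selected threshold lies just above 0.6, where both negatives are rejected while two of the three positives are still accepted.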

Analysis of Results
As reported in Table 5, all fairness metrics show results favourable to the privileged group (see Section 2.3 for a discussion of the fairness metrics we use). Reweighing improved the fairness metrics for both classifiers. The Prejudice Remover also improved the fairness metrics, albeit at a cost in performance. There was no substantial difference in performance between the logistic regression and random forest classifiers. If fairness is crucial, the logistic regression classifier gives more options in terms of the mitigation strategies. The better mitigation strategy is the one closest to the data, for it requires less tinkering with the model, which can lead to worse explainability.
Figure 2: Balanced accuracy and disparate impact error versus classification threshold for a logistic regression classifier with no bias mitigation. The dotted vertical line is the threshold that maximizes balanced accuracy. The plot shown corresponds to one of the folds of cross-validation. Disparate impact error, equal to 1 - min(DI, 1/DI), where DI is the disparate impact, is the difference between disparate impact and its ideal value of 1.
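The disparate impact error used in the validation plots, 1 - min(DI, 1/DI), treats a disparate impact of 0.8 and one of 1.25 as equally far from the ideal value of 1. A minimal sketch of this quantity follows; the function name is an assumption made for this example.

```python
def disparate_impact_error(di):
    """1 - min(DI, 1/DI): 0 when DI equals 1, growing as DI departs from 1."""
    if di <= 0:
        raise ValueError("disparate impact must be positive")
    return 1.0 - min(di, 1.0 / di)


error_below = disparate_impact_error(0.8)   # unprivileged group selected less often
error_above = disparate_impact_error(1.25)  # unprivileged group selected more often
```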
In addition, we computed, for each fold of cross-validation, the difference for each performance and fairness metric between a model with a bias mitigator and the corresponding model without bias mitigation. We then take the mean and standard deviation of those differences, and report the results for performance and fairness metrics in Tables 6 and 7, respectively. We can see that differences in performance for reweighing are mostly small, while the gains in fairness metrics are statistically significant at 95% confidence level. Meanwhile, the prejudice remover incurs a greater cost in performance, with no apparent greater improvement to the fairness metrics.
Figure 3: Balanced accuracy and average odds difference versus classification threshold for a logistic regression classifier with no bias mitigation. The dotted vertical line is the threshold that maximizes balanced accuracy. The plot shown corresponds to one of the folds of cross-validation.
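The per-fold comparison described above amounts to a paired difference between the mitigated model and the baseline, summarized by its mean and standard deviation. The sketch below illustrates this with made-up numbers that are not results from this study; the function name is hypothetical, and the use of the sample standard deviation is an assumption, since the text does not specify which estimator was used.

```python
from statistics import mean, stdev


def paired_difference(mitigated, baseline):
    """Mean and sample standard deviation of per-fold differences (mitigated - baseline)."""
    diffs = [m - b for m, b in zip(mitigated, baseline)]
    return mean(diffs), stdev(diffs)


# Illustrative, made-up disparate impact values for five folds (not study results).
baseline_di = [0.80, 0.78, 0.82, 0.79, 0.81]
mitigated_di = [0.95, 0.97, 0.93, 0.96, 0.94]
diff_mean, diff_std = paired_difference(mitigated_di, baseline_di)
```

Pairing the folds in this way removes the variation shared by both models within a fold, so the spread of the differences reflects the effect of the mitigator rather than the choice of training data.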

Limitations
Some diagnoses did not have a diagnosis date filled out in the raw dataset. In those cases, we used the treatment end date. Some data points did not have a value for that variable either, and in those cases we used the treatment start date. This leads to an inconsistent definition of the diagnosis date, and hence to inconsistencies in the variables related to diagnoses during the first 14 days of admission. However, we re-did the analysis with only the diagnoses for which diagnosis dates were present in the raw data, and the results followed the same trends. On a similar note, we removed a few medication administrations that did not have an administering date. A better solution would have been to remove all data corresponding to those patients, albeit at the cost of having fewer data points. We re-did the analysis in that configuration, and obtained similar results.
Finally, this work considered only diagnoses that took place within the first 14 days of admission. It might have been interesting to also consider diagnoses that took place before admission. We leave this option for future work.

Future Work
The present work considered benzodiazepine prescriptions administered during the remainder of each patient's admission. To make the prediction task fairer for the computer, we could consider predicting benzodiazepines administered during a specific time window, for example, days 15 through 28 of an admission.
Figure 4: Balanced accuracy and disparate impact error versus classification threshold for a logistic regression classifier with reweighing. The dotted vertical line is the threshold that maximizes balanced accuracy. The plot shown corresponds to one of the folds of cross-validation. Disparate impact error, equal to 1 - min(DI, 1/DI), where DI is the disparate impact, is the difference between disparate impact and its ideal value of 1.
Previous work noted a possible bias between the gender of the prescriber and the prescriptions of benzodiazepines [30,28]. It would be interesting to look into this correlation in our dataset as well; one could train a model to predict, on the basis of patient and prescriber data, whether benzodiazepines will be prescribed. If there are correlations between the gender of the prescriber and the prescription of benzodiazepines, we could raise a warning to let the practitioner know that the model thinks there might be a bias.
Finally, there are other medications for which experts suspect there could be gender biases in the prescriptions and administrations, such as antipsychotics and antidepressants. It would be beneficial to also study those administrations using a similar pipeline to the one developed here.
As a final note, [38] warned against the use of blind applications of fairness frameworks in healthcare. Thus, the present study should be considered only as a demonstration of the importance of considering bias and mitigation in clinical psychiatry machine learning models. Further work is necessary to understand these biases on a deeper level, and what should be done about them.
Figure 5: Balanced accuracy and average odds difference versus classification threshold for a logistic regression classifier with reweighing. The dotted vertical line is the threshold that maximizes balanced accuracy. The plot shown corresponds to one of the folds of cross-validation.

Conclusions
Given our results (Section 3) and discussion thereof (Section 4.1), we can conclude that a model trained to predict future administrations of benzodiazepines based on past data is biased by the patients' genders. Perhaps surprisingly, reweighing the data (a pre-processing step) seems to mitigate this bias quite significantly, without loss of performance. The in-processing method Prejudice Remover also mitigated this bias, but at a cost to performance. This is the first fairness evaluation of a machine learning model trained on real clinical psychiatric data. Future researchers working with such models should consider computing fairness metrics and, when necessary, adopt mitigation strategies to ensure patient treatment is not biased with respect to protected attributes.

Notes
The study was approved by the UMC ethics committee as part of Psy-Data, a team of data scientists and clinicians working at the psychiatry department of the UMC Utrecht.
The datasets generated for this study cannot be shared, to protect patient privacy and comply with institutional regulations.
The core content of this study is drawn from the Master in Business Informatics thesis of Jesse Kuiper [24].
Figure 6: Balanced accuracy and disparate impact error versus classification threshold for a logistic regression classifier with prejudice remover. The dotted vertical line is the threshold that maximizes balanced accuracy. The plot shown corresponds to one of the folds of cross-validation. Disparate impact error, equal to 1 - min(DI, 1/DI), where DI is the disparate impact, is the difference between disparate impact and its ideal value of 1.
Figure 7: Balanced accuracy and average odds difference versus classification threshold for a logistic regression classifier with prejudice remover. The dotted vertical line is the threshold that maximizes balanced accuracy. The plot shown corresponds to one of the folds of cross-validation.
Table 7: Fairness metric differences of models with bias mitigators reweighing (RW) and prejudice remover (PR) compared to a baseline without bias mitigation, for logistic regression (LR) and random forest (RF) classifiers. The fairness metrics are disparate impact (DI), average odds difference (AOD), statistical parity difference (SPD) and equal opportunity difference (EOD). The errors shown are standard deviations. Differences significant at 95% confidence level are shown in bold.