Multiple-Choice Question Answering Models for Automatic Depression Severity Estimation †

Abstract: Depression is one of the most prevalent mental health diseases. Although there are effective treatments, the main problem lies in providing early and effective risk detection. Medical experts use self-reporting questionnaires to elaborate their diagnosis


Introduction
The World Health Organization (WHO) [1] regards mental health as one of the most relevant components of health. Depression is one of the most common mental disorders. By itself, it affects more than 270 million people. Despite its many harmful effects, effective treatments are known. The main problem lies in providing early and effective risk detection.
One of the most reliable and frequently used methods to measure depression severity is the Beck Depression Inventory-II (BDI-II) [2]. Although significant evidence exists regarding its performance, some aspects often affect the results of these questionnaires.
These days, health organizations publish these questionnaires online so that users can fill them in by themselves. However, people with mental disorders are often reluctant to visit those web pages and fill in the questionnaires. In this new communication era, people use social networks to share their feelings and emotions. Hence, these platforms are a valuable source of data for identifying disorders like depression [3].
In this context, we describe an approach to improve the automatic estimation of the degree of depression of social media users. Our study uses pre-trained language models [4] to predict the depression degree of subjects. We evaluated these models on the task "Measuring the Severity of the Signs of Depression" of the CLEF 2020 eRisk Track [5]. Our models achieved moderate performance relative to the other participants in the task.

Experiments

Datasets
In this study, we use the datasets provided by eRisk 2019 and 2020 for the task "Measuring the Severity of the Signs of Depression" [5,6]. The datasets contain the histories of 20 and 70 users, respectively, and provide each user's actual responses to the questionnaire together with their complete posting history. We used the 2019 dataset as training data and the 2020 dataset for testing.
We also used RACE (Large-scale ReAding Comprehension Dataset From Examinations) [7] and SWAG (Large-Scale Adversarial Dataset for Grounded Commonsense Inference) [8], two general-purpose multiple-choice question answering datasets. After some preliminary comparisons, we selected the RACE dataset for the first fine-tuning of BERT, as the results obtained with it were slightly better.

Beck Depression Inventory-II (BDI-II)
The Beck Depression Inventory-II (BDI-II) is a 21-item questionnaire that measures depression severity. For each item, the BDI-II provides four options (except items 15 and 17, which provide seven options) and sentences that explain their meanings. These options represent a scale from the absence of the symptom to a total identification with it.

Models
We used a modified BERT [9] model for Multiple-Choice Question Answering (MCQA). This model was built on top of the pre-trained bert-base-uncased model, modifying it to allow multiple-choice question answering. The process followed to build the model, and its comparison with other baseline models, is described in [4].
We also tried to pre-train the MCQA models provided by the Hugging Face library (such as RoBERTa for multiple-choice), but the results obtained were much worse than those obtained using the adaptation described above.

Our Approach
Pre-trained language models are usually trained on a large text corpus and then fine-tuned on a downstream task. Following this approach, in the training phase, we first fine-tuned a pre-trained model using the RACE dataset. We then fine-tuned the model further using training data from the 2019 eRisk task. For that, we built a custom dataset that contains every post from each user combined with each question from the BDI-II questionnaire, all of its options (0-3), and a label representing the option actually chosen by the user. After analyzing the results obtained, we decided to filter the training data, as it contained too much noise. Therefore, we computed post and question embeddings and used only the 50 posts most similar to each question as training data in the fine-tuning process.
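The filtering step can be illustrated with a minimal sketch. The text does not state which embedding model or similarity measure was used, so the code below assumes precomputed embedding vectors and cosine similarity; the function name `top_k_similar_posts` and its parameters are hypothetical.

```python
import numpy as np


def top_k_similar_posts(post_embeddings, question_embedding, k=50):
    """Return the indices of the k posts most similar to one question.

    Assumption: similarity is cosine similarity over precomputed
    embeddings; the original text does not specify the measure.
    """
    posts = np.asarray(post_embeddings, dtype=float)
    question = np.asarray(question_embedding, dtype=float)

    # Normalise both sides so the dot product equals cosine similarity.
    post_norms = np.linalg.norm(posts, axis=1, keepdims=True)
    posts = posts / np.clip(post_norms, 1e-12, None)
    question = question / max(np.linalg.norm(question), 1e-12)

    similarities = posts @ question
    k = min(k, len(similarities))

    # Sort by descending similarity and keep the first k indices.
    return np.argsort(-similarities, kind="stable")[:k].tolist()
```

Under these assumptions, the function would be called once per BDI-II question for each user, and only the returned posts would be paired with that question in the training data.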
To run both fine-tunings, we used a batch size of eight (four when fine-tuning with the seven-option dataset), a maximum sequence length of 320, a learning rate of 5 × 10⁻⁵, two epochs, and two gradient accumulation steps.
To carry out inference, we feed the model every post from each user combined with each question from the BDI-II questionnaire and all of its options. As a result, we obtain the model's confidence in each option for each post-question pair. Given those confidences, we infer the answer for each user-question pair by selecting the option that is predicted most often across the user's posts.
In this phase, we used the following parameters: a batch size of 48, a maximum sequence length of 320, and a minimum option probability of 0.4 (0.2 for seven-option questions). If no post reaches that minimum probability, we subtract 0.01 from it.
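The voting procedure can be sketched as follows. This is one possible reading of the description, not the authors' implementation: we assume that each post votes for its most probable option, that a vote counts only if its probability reaches the threshold, and that the threshold is lowered by 0.01 repeatedly until at least one post qualifies. The function name `infer_answer` is hypothetical.

```python
from collections import Counter


def infer_answer(option_probs_per_post, min_prob=0.4, step=0.01):
    """Choose one answer for a user-question pair by voting over posts.

    option_probs_per_post: one list of option probabilities per post.
    Assumption: each post votes for its most probable option, and only
    votes whose probability reaches the threshold are counted.
    """
    if not option_probs_per_post:
        raise ValueError("at least one post is required")

    # (option index, probability) of the most probable option per post.
    best = [
        max(enumerate(probs), key=lambda pair: pair[1])
        for probs in option_probs_per_post
    ]

    threshold = min_prob
    while True:
        votes = [option for option, prob in best if prob >= threshold]
        if votes:
            break
        # No post is confident enough: relax the threshold and retry.
        threshold -= step

    # The option with the most appearances among the counted votes.
    return Counter(votes).most_common(1)[0][0]
```

For seven-option questions, the same function would be called with `min_prob=0.2`, following the parameters given above.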
Finally, to facilitate the whole inference process, we built another dataset from the test data of the 2020 task, following the same approach as for the training data.
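As a rough illustration of how such a dataset could be assembled, the sketch below pairs every post of every user with every questionnaire item. The field names, the function name `build_examples`, and the input layout are assumptions made for this example; labels are attached only when the user's answers are available, as in the training data.

```python
def build_examples(user_posts, questions, answers=None):
    """Pair every post of every user with every questionnaire item.

    user_posts: {user_id: [post_text, ...]}
    questions:  [(question_text, [option_text, ...]), ...]
    answers:    optional {user_id: [chosen option index per question]}
    All field names below are hypothetical.
    """
    examples = []
    for user_id, posts in user_posts.items():
        for q_idx, (question, options) in enumerate(questions):
            # Labels exist only when the user's real answers are known.
            label = answers[user_id][q_idx] if answers is not None else None
            for post in posts:
                examples.append(
                    {
                        "user": user_id,
                        "question_id": q_idx,
                        "context": post,
                        "question": question,
                        "options": list(options),
                        "label": label,
                    }
                )
    return examples
```

Calling the function without `answers` produces unlabeled examples of the same shape, which is what the inference phase needs.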

Results and Discussion
Table 1 shows the results obtained with the approach described above. On the one hand, we obtain good results in both the ADODL and DCHR metrics. On the other hand, the results obtained in the AHR and ACR metrics were poor compared with the best results of the 2020 task.
Inspecting the model's answers to the BDI-II, we could see that it overestimates depression severity. This is likely because both the training and test data are still noisy. With this in mind, in future work, we plan to design more effective filtering processes for both the training and test data.
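The four metrics are not defined in this text. The sketch below follows our understanding of the definitions given in the eRisk task overviews [5,6]: AHR is the average fraction of exactly matched answers, ACR averages a per-item closeness score, ADODL compares overall questionnaire totals, and DCHR is the fraction of users assigned to the correct severity category (minimal 0-9, mild 10-18, moderate 19-29, severe 30-63). It assumes that every answer, including those of the seven-option items, has already been mapped to an integer from 0 to 3; the function names are hypothetical.

```python
def bdi_category(total):
    """Map a BDI-II total score to its severity category."""
    if total <= 9:
        return "minimal"
    if total <= 18:
        return "mild"
    if total <= 29:
        return "moderate"
    return "severe"


def evaluate(real, predicted, max_answer=3):
    """Compute AHR, ACR, ADODL and DCHR over a set of users.

    real, predicted: one list of integer answers (0..max_answer) per user.
    Assumption: definitions follow the eRisk task overviews.
    """
    n_users = len(real)
    ahr = acr = adodl = dchr = 0.0
    for r, p in zip(real, predicted):
        n_items = len(r)
        max_total = max_answer * n_items  # 63 for the 21-item BDI-II

        # Hit rate: fraction of items answered exactly as the user did.
        ahr += sum(a == b for a, b in zip(r, p)) / n_items
        # Closeness rate: per-item score that decreases with distance.
        acr += sum((max_answer - abs(a - b)) / max_answer
                   for a, b in zip(r, p)) / n_items
        # Difference between overall depression levels (total scores).
        adodl += (max_total - abs(sum(r) - sum(p))) / max_total
        # Category hit: same severity category for both totals.
        dchr += bdi_category(sum(r)) == bdi_category(sum(p))

    return {
        "AHR": ahr / n_users,
        "ACR": acr / n_users,
        "ADODL": adodl / n_users,
        "DCHR": dchr / n_users,
    }
```

Under these definitions, a model that is consistently one point too high on every item can still score well on ADODL while scoring zero on AHR, which is consistent with the pattern described above.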

Conclusions
In this article, we studied the application of pre-trained multiple-choice question answering models to automatically estimate the depression severity of users on social media. The results obtained are promising and provide a good starting point for further research on this type of model.

Table 1. Results of our model, along with the best baselines of eRisk 2020. Bold values correspond to the best result for each metric.