ContextMental: A Sociocultural Benchmark for Arabic Mental Health Understanding

Ayash, Lama; Alasmari, Ashwag; Alhuzali, Hassan

doi:10.3390/electronics15122558

by Lama Ayash ^1,*, Ashwag Alasmari ^1,2 and Hassan Alhuzali ^3

^1 Center for Artificial Intelligence, King Khalid University, Abha 62521, Saudi Arabia

^2 Department of Informatics and Computing Systems, King Khalid University, Abha 62521, Saudi Arabia

^3 Department of Computer Science and Artificial Intelligence, Umm Al-Qura University, Makkah 24382, Saudi Arabia

^* Author to whom correspondence should be addressed.

Electronics 2026, 15(12), 2558; https://doi.org/10.3390/electronics15122558 (registering DOI)

Submission received: 11 April 2026 / Revised: 3 June 2026 / Accepted: 4 June 2026 / Published: 10 June 2026

(This article belongs to the Special Issue Low-Resource Languages in the Age of Large Language Models)

Abstract

Mental health discourse may reflect social relationships, cultural norms, and religious factors that shape how individuals express and interpret distress. Existing NLP research on mental health has advanced the detection of depression, anxiety, suicide risk, and related clinical signals using text mining, neural classification, transformer-based models, and, more recently, large language models. However, most systems treat text primarily as a clinical signal rather than examining the social and cultural contexts in which distress is expressed. Arabic NLP research remains even more limited, largely focusing on detecting clinical conditions while overlooking contextual factors that shape mental health questions. This work introduces ContextMental, a multi-label annotation schema and benchmark dataset for modeling sociocultural context in Arabic mental health questions. The dataset contains 2677 questions, including 552 instances with contextual labels, enabling fine-grained analysis of social, cultural, and religious dimensions. An AraBERT-based classification framework is further developed using imbalance-aware optimization, semi-supervised pseudo-labeling, and adaptive threshold calibration. Experimental results indicate that pseudo-label augmentation improves overall classification performance, suggesting that semi-supervised learning can support context-aware Arabic mental health classification. This study provides a context-aware annotation framework, a benchmark dataset, and an AraBERT-based baseline modeling pipeline for Arabic mental health NLP, thereby supporting future research on socially, culturally, and religiously grounded language technologies.

Keywords: Arabic natural language processing; Arabic mental health; contextual factors; multi-label classification; pseudo-labeling; semi-supervised learning; benchmark dataset

1. Introduction

Mental health discourse is inherently shaped by contextual factors, including social relationships, cultural norms, and religious factors [1,2,3]. Individuals often express distress through interpersonal experiences, cultural expectations, and religious frames that influence how symptoms are described and interpreted. However, prior mental health NLP research has primarily focused on detecting clinical signals, such as depression, anxiety, and suicidal ideation, often treating text as decontextualized input rather than as a contextually grounded expression of lived experience [4,5]. This limitation is especially pronounced in Arabic mental health NLP, where existing work has largely emphasized clinical-condition detection while giving less attention to social, cultural, and religious contexts. In Arabic settings, these contextual dimensions are particularly important because social, cultural, and religious structures can strongly influence how mental health concerns are articulated.

Recent advances in transformer-based models, including AraBERT, have improved performance in Arabic text classification [6]. Despite these gains, current models may prioritize linguistic and biomedical features when fine-tuned on datasets that mainly provide clinical labels, limiting their ability to capture contextual dimensions that govern meaning. At the same time, existing datasets do not provide explicit annotations of social, cultural, and religious factors, limiting the ability to study or model them directly [7]. As a result, models may produce linguistically accurate outputs that remain socially, culturally, or religiously misaligned.

This gap reflects a broader limitation in language modeling. Systems trained without contextual awareness may fail to capture the norms, values, and relational dynamics embedded in human communication [8,9]. In mental health applications, this limitation is particularly critical because expressions of distress often depend on interpersonal context, cultural framing, and religious beliefs. The task is therefore challenging because it lies at the intersection of mental health analysis and NLP, requiring the interpretation of subtle linguistic, social, cultural, and religious cues rather than simple surface-level text classification.

To address this problem, this work introduces ContextMental, a context-aware annotation schema and benchmark dataset for Arabic mental health questions. The proposed framework formulates contextual modeling as a multi-label classification task, enabling the identification of social, cultural, and religious dimensions at multiple levels of granularity.

The contributions of ContextMental are threefold:

(i): Context-aware annotation schema: A structured multi-label framework for capturing social, cultural, and religious dimensions at both coarse and fine-grained levels.
(ii): Benchmark dataset and analysis: A corpus of 2677 Arabic mental health questions, including 552 contextually labeled instances, accompanied by detailed distributional and co-occurrence analyses.
(iii): Context-aware classification framework: An AraBERT-based multi-label classification pipeline incorporating imbalance-aware optimization, pseudo-labeling, and adaptive threshold calibration.

The remainder of this paper is organized as follows: Section 2 reviews related work; Section 3 describes the dataset, annotation schema, and modeling framework; Section 4 presents the experimental setup; Section 5 reports the results and analysis; Section 6 discusses the findings and limitations; and Section 7 concludes the paper.

2. Related Work

Research in mental health NLP has primarily focused on identifying psychological conditions, such as depression, anxiety, and suicidal ideation, from text data [10,11,12,13]. Recent advances have extended this line of inquiry to knowledge-intensive question answering and counseling dialogue systems [14]. However, most existing approaches rely heavily on English-language datasets and often overlook the social, cultural, and religious framing of mental health discourse. This limitation becomes more pronounced when methods are transferred to languages with different morphological structures and sociocultural contexts. In Arabic, for example, mental health expressions may be shaped by complex morphology, dialectal variation, interpersonal norms, cultural expectations, and religious references. A recent review on the use of large language models in mental health [15] highlights their diagnostic and therapeutic potential but also cautions against cultural bias, ethical risks, and limited suitability in low-resource settings.

Arabic mental health NLP remains an emerging area due to limited annotated corpora and linguistic diversity. Existing work has addressed depression and suicidal ideation detection in Arabic social media, including AraDepSu [16] and deep-learning-based Arabic depression detection studies [17]. Efforts such as MentalQA [7,18] and the AraHealthQA 2025 shared task [19] have begun addressing Arabic mental health question answering and health-related reasoning. Recent work has also examined sociocultural dimensions of Arabic mental health discourse in condition-specific X communities, highlighting religious, relational, identity, emotional-distress, and medical vocabulary patterns across online communities [20].

In parallel, Arabic pretrained language models such as AraBERT [6], MARBERT [21], and CAMeLBERT [21] have improved Arabic text classification and representation learning across Modern Standard Arabic and dialectal Arabic. However, these resources and models primarily focus on clinical-condition detection, question answering, social media discourse characterization, or general Arabic language understanding rather than the explicit annotation of social, cultural, and religious contextual factors in Arabic mental health questions.

Sociocultural Context in NLP

Culturally aware NLP has gained increasing attention as researchers have recognized the limitations of Anglocentric modeling. Hovy and Yang [8] argue for the explicit modeling of social and cultural factors in language understanding. Recent large-scale studies [22,23,24] show that multilingual language models often reflect cultural asymmetries and value misalignments. The emerging taxonomy of Culturally Aware and Adapted NLP [25] formalizes this research space by identifying dimensions such as cultural embedding, adaptation, and alignment. Within healthcare dialogue, incorporating social and religious context has been shown to improve alignment and empathy in model responses [26].

This study extends these research directions by introducing a multi-label annotation schema that explicitly integrates social, cultural, and religious dimensions into Arabic mental health question classification. Unlike prior Arabic NLP work that has focused mainly on medical or linguistic aspects, the proposed framework captures the relational, cultural, and religious reasoning underlying patient expressions. It further contributes a scalable semi-supervised pipeline based on pseudo-labeling and adaptive threshold calibration, bridging socially aware NLP and Arabic healthcare classification.

3. Methodology

In this study, a comprehensive framework is introduced for multi-label classification of Arabic mental health questions, addressing data scarcity and label imbalance. The approach integrates contextually grounded annotation, semi-supervised data expansion, and imbalance-aware modeling. Curated online questions are annotated using a multi-level schema to produce a gold-standard dataset, which is subsequently expanded through pseudo-labeling. An AraBERT-based multi-label classifier is fine-tuned to identify social, cultural, and religious contextual dimensions, while adaptive threshold calibration is applied during inference to convert class probabilities into final label assignments. The resulting pipeline provides an end-to-end workflow from annotation to classification, as shown in Figure 1.

3.1. Data Collection

This study uses data collected from Altibbi.com (https://altibbi.com), a well-established Arabic medical platform that hosts thousands of health-related articles, glossaries, and Q&A discussions. From this platform, 2677 unique question–answer pairs focusing on mental health were collected between 2020 and 2021 [7]. These pairs reflect genuine patient inquiries and corresponding responses provided by licensed medical professionals. For this work, only the patient-authored questions were analyzed. Physician responses were excluded because the objective of this study is to examine how patients articulate mental health concerns through social, cultural, and religious framing, rather than to analyze clinical advice or provider communication.

3.2. Schema Development

The annotation schema was developed to capture contextual factors that shape how mental health concerns are expressed in Arabic discourse beyond purely clinical descriptions. In Arabic mental health contexts, individuals often articulate distress through cultural traditions, social expectations, family roles, community relations, and religious values. Ignoring these dimensions risks overlooking the broader context in which mental health concerns are experienced, communicated, and interpreted.

Social factors were included because many Arabic mental health questions are shaped by interpersonal relationships, family expectations, marital concerns, demographic position, and perceived life satisfaction. Prior work in socially aware NLP argues that linguistic meaning is shaped by social factors such as speakers, audiences, norms, and ideology [27]. This motivated the inclusion of social context as a major dimension of the schema.

Cultural factors were included because prior multilingual and cross-cultural NLP studies show that language technologies, particularly models adapted through transfer learning or fine-tuning, can underrepresent non-Western and culturally marginalized perspectives [22]. Research on cultural alignment further shows that cultural values and social norms should be explicitly represented when modeling language use across societies [23,26]. This motivated the inclusion of cultural context as a separate dimension of the schema.

Religious factors were modeled as a separate category because religious reasoning, spiritual coping, and faith-based interpretations appeared explicitly in Arabic mental health questions and often provided a distinct explanatory frame for distress. This distinction was also supported by public health work emphasizing that mental health and help-seeking are shaped by social determinants, including family structure, community relations, cultural values, religious context, and other contextual factors [28].

The schema was constructed through iterative exploratory analysis and pilot annotation of the collected dataset. Recurring contextual patterns clustered around three major dimensions: social, cultural, and religious. These dimensions were selected because they consistently captured the main contextual signals observed in the questions while preserving interpretability and annotation reliability.

The sub-categories were similarly derived through iterative refinement of recurring patterns observed during annotation. Within the social dimension, the dominant themes involved relationships, demographic factors, and expressions related to life satisfaction. Within the cultural dimension, recurring themes centered on information, values, and norms shaped by cultural expectations. Within the religious dimension, recurring themes reflected religious reasoning, spiritual coping, and faith-based interpretations of distress. The final schema was therefore designed to balance contextual coverage, interpretability, and annotation reliability while remaining grounded in the linguistic and sociocultural characteristics of Arabic mental health discourse.

3.3. Schema Categories and Definitions

The annotation schema is structured in two stages. First, each question is assessed using a binary decision (Yes/No) to determine whether cultural, social, or religious framing is present. A question is marked Yes when it contains an explicit or implicit reference to cultural norms, social circumstances, interpersonal context, demographic conditions, or religious beliefs and practices. It is marked No when the question is limited to symptoms, diagnosis, treatment, or general medical advice without such contextual framing. If the answer is Yes, the question is further annotated using the categories and sub-categories defined in Table 1. This design allows the schema to capture broader cultural, social, and religious influences while maintaining flexibility for multi-label annotation.

3.4. Annotation Process

The gold-standard dataset was constructed through a controlled human annotation process. Three native Arabic-speaking Saudi annotators participated in the labeling task. The annotation team consisted of two female annotators and one male annotator, including two PhD-level annotators and one master’s-level graduate. All annotators were familiar with Arabic dialectal variation, including Saudi Arabic, as well as the style of Arabic mental health questions.

Each question was independently annotated by two annotators to support inter-annotator agreement assessment and ensure labeling reliability. Disagreements were resolved through discussion and adjudication to produce the final labels.

The annotation process followed a predefined multi-level schema supported by detailed written guidelines specifying the criteria for each label (Supplementary File S1). These guidelines were iteratively refined during the early annotation rounds to improve consistency and reduce ambiguity. Through this process, 500 questions were manually annotated, forming the gold-standard subset used for model training and evaluation.

3.5. Inter-Annotator Agreement

To assess the reliability of the manual annotations, agreement was measured across three labeling dimensions: binary relevance (Yes/No), main category (Culture, Social, Religion), and sub-category assignment (Information, Values, Norms and Morals, Relationship, Demographics, Life Satisfaction). The results are shown in Table 2.

Cohen’s Kappa (κ) was used to evaluate pairwise agreement, while Krippendorff’s Alpha (α) was used as a complementary reliability measure suitable for annotation assessment [29,30]. Agreement was computed under an exact-match criterion, where each multi-label combination was treated as a unique class.

Strong agreement was observed for binary relevance (κ = 0.80), indicating reliable identification of contextual presence. Agreement at the main-category level was substantial (κ = 0.76), reflecting consistent identification of cultural, social, and religious dimensions. Agreement declined for sub-categories (κ = 0.58), which is expected given the finer granularity of the labels and the fact that multiple contextual interpretations may coexist in the same question.
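The exact-match criterion can be illustrated with a minimal sketch of Cohen's Kappa in which every label set is treated as one class. The function name `cohen_kappa` and the toy annotations are hypothetical and are not drawn from the dataset; the sketch is not the authors' evaluation script.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Exact-match criterion: each label set is one class, so {"Social", "Culture"}
    # and {"Social"} count as a disagreement.
    a = [frozenset(x) for x in labels_a]
    b = [frozenset(x) for x in labels_b]
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's class frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical class throughout
    return (observed - expected) / (1.0 - expected)

# Toy annotations, invented for illustration only.
annotator_1 = [{"No"}, {"Social"}, {"Social", "Culture"}, {"No"}, {"Religion"}, {"No"}]
annotator_2 = [{"No"}, {"Social"}, {"Social"}, {"No"}, {"Religion"}, {"Social"}]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Because partial overlaps count as full disagreements under this criterion, agreement at finer label granularity tends to be lower than at the binary level.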

3.6. Pseudo-Labeling

Manual annotation of contextual factors in Arabic mental health questions is costly because it requires culturally aware annotators and careful interpretation of social, cultural, and religious cues. Although the 500 manually annotated questions provide a high-quality gold-standard subset, this subset alone is limited for training a robust multi-label classifier, particularly for infrequent contextual categories. Therefore, a semi-supervised pseudo-labeling strategy was adopted to expand supervision to the remaining questions in the dataset.

To expand beyond the gold-standard subset, a model trained on the manually annotated questions was used to generate probability scores for the remaining 2177 unlabeled questions. Pseudo-labels were then assigned using confidence-based thresholds: labels exceeding the threshold were treated as positive, while labels below the threshold were treated as negative. This approach allowed the full dataset to contribute to training while maintaining a controlled decision boundary for positive label assignment.
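The thresholding step described above can be sketched as follows. The label names, probability scores, and threshold values are hypothetical placeholders, and the `>=` comparison follows the convention of Equation (1); this is an illustration, not the authors' implementation.

```python
def assign_pseudo_labels(probabilities, thresholds):
    """Convert per-label probabilities into a 0/1 label vector per question."""
    labels = []
    for row in probabilities:
        if len(row) != len(thresholds):
            raise ValueError("each row needs one probability per label")
        # A label is positive when its score reaches the threshold, else negative.
        labels.append([1 if p >= t else 0 for p, t in zip(row, thresholds)])
    return labels

LABELS = ["Social", "Culture", "Religion"]   # illustrative label order
THRESHOLDS = [0.50, 0.30, 0.35]              # hypothetical values

# Hypothetical seed-model scores for three unlabeled questions.
seed_probs = [
    [0.91, 0.10, 0.02],
    [0.20, 0.33, 0.05],
    [0.48, 0.29, 0.40],
]
pseudo = assign_pseudo_labels(seed_probs, THRESHOLDS)
print(pseudo)
```

Every unlabeled question receives a label vector under this rule, so no question is discarded for low confidence; a question whose scores all fall below the thresholds simply receives no positive contextual label.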

3.7. Model Architecture

The framework uses AraBERT [6] as the backbone model for multi-label classification of Arabic mental health questions. Each input question is tokenized and encoded into contextualized representations. The [CLS] token representation is used as a global sequence representation and passed to a task-specific linear classification layer, as illustrated in Figure 2.

The classification head produces an independent probability score for each label using sigmoid activations, allowing multiple labels to be assigned to the same question. Weighted binary cross-entropy loss is applied independently to each label.
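A minimal sketch of a weighted binary cross-entropy over independent sigmoid outputs is given below. The text does not state how the class weights were computed, so the negatives-to-positives ratio used here is an assumption; it mirrors the common use of the `pos_weight` argument of PyTorch's `BCEWithLogitsLoss`. The positive counts are the main-category counts reported for the 500 gold-standard questions and serve only as an illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def weighted_bce(logits, targets, pos_weights):
    """Mean weighted binary cross-entropy over the labels of one question."""
    total = 0.0
    for z, y, w in zip(logits, targets, pos_weights):
        p = min(max(sigmoid(z), 1e-12), 1.0 - 1e-12)  # clamp so log stays finite
        # The positive term is scaled by w; the negative term is left unscaled.
        total += -(w * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(logits)

# Assumed weighting scheme: negatives / positives per label.
n_gold = 500
positives = {"Social": 108, "Culture": 19, "Religion": 14}
pos_weights = {k: (n_gold - v) / v for k, v in positives.items()}
print({k: round(w, 2) for k, w in pos_weights.items()})

# Loss for one question with uninformative logits and a single positive label.
loss = weighted_bce([0.0, 0.0], [1, 0], [2.0, 1.0])
print(round(loss, 4))
```

Under this assumed scheme, a missed positive for a rare label such as Religion is penalized far more heavily than one for Social, which is the intended effect of imbalance-aware optimization.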

Validation-Based Threshold Calibration

In multi-label classification, using a single global probability threshold may not be suitable for all labels, particularly under severe class imbalance. Therefore, class-specific thresholds were calibrated on the validation set and applied during inference to convert predicted probabilities into final label assignments [31].

Let $p_{i,c}$ denote the predicted probability for class $c$ in sample $i$. A label is assigned when the predicted probability exceeds the corresponding class-specific threshold $\tau_{c}$:

$$\hat{y}_{i,c} = I(p_{i,c} \geq \tau_{c}), \tag{1}$$

where $I(\cdot)$ denotes the indicator function.

The threshold for each label was selected empirically using validation predictions to improve the balance between precision and recall across labels with different frequencies. This calibration allows different labels to use different decision boundaries rather than relying on a single default threshold for all classes. The final validation-calibrated thresholds used during inference are summarized in Table 3.
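One way to select a class-specific threshold is a grid search that maximizes validation F1 for each label. The text states only that thresholds were chosen empirically on validation predictions, so the F1 criterion, the grid, and the toy scores below are assumptions made for illustration.

```python
def f1_score(y_true, y_pred):
    """Binary F1 from 0/1 lists; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def predict(probs, tau):
    """Apply Equation (1) for a single label."""
    return [1 if p >= tau else 0 for p in probs]

def calibrate_threshold(val_probs, val_true, grid):
    """Return the grid value with the best validation F1 for one label."""
    best_tau, best_f1 = grid[0], -1.0
    for tau in grid:
        score = f1_score(val_true, predict(val_probs, tau))
        if score > best_f1:  # strict ">" keeps the lowest threshold on ties
            best_tau, best_f1 = tau, score
    return best_tau, best_f1

grid = [round(0.05 * k, 2) for k in range(1, 20)]  # 0.05, 0.10, ..., 0.95

# Hypothetical validation scores for a rare label whose probabilities stay low.
val_probs = [0.08, 0.12, 0.31, 0.36, 0.22, 0.05]
val_true = [0, 0, 1, 1, 0, 0]

tau, best = calibrate_threshold(val_probs, val_true, grid)
default_f1 = f1_score(val_true, predict(val_probs, 0.5))
print(tau, best, default_f1)
```

In this toy case the default threshold of 0.5 misses both positives, while the calibrated threshold recovers them, which is the situation that motivates per-label calibration under severe imbalance.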

4. Experimental Setup

The experimental setup was designed to assess the effectiveness of the proposed framework for multi-label classification of contextual factors in Arabic mental health questions. It covers both the baseline setting using gold-standard annotations and the semi-supervised setting incorporating pseudo-labeled data. The setup includes a controlled training protocol, a semi-supervised augmentation strategy, consistent model configurations, and a comprehensive evaluation scheme to ensure a fair comparison between approaches.

4.1. Training Protocol

The experimental design follows two training stages. In the first stage, the model is trained only on the 500 manually annotated questions, which form the gold-standard dataset. The task is formulated as a multi-label classification problem, where each contextual label is treated as an independent binary classification target.

The experiments use five-fold multi-label stratified cross-validation [32] to preserve label co-occurrence patterns across splits. In the gold-only stage, each fold reserves 100 samples as the test set, while the remaining 400 samples are divided into 360 training samples and 40 validation samples. The validation set is used for model selection and threshold calibration, while the test set remains unseen during training and validation.

To address the limited size of the gold-standard dataset, a semi-supervised pseudo-labeling strategy is applied to expand the training corpus. After training a seed model on the manually annotated gold-standard data, inference is performed on the remaining 2177 unlabeled questions to generate probability scores for each label. Pseudo-labels are then assigned using validation-calibrated class-specific thresholds: labels exceeding the corresponding threshold are treated as positive, while labels below the threshold are treated as negative. In this setting, the complete unlabeled subset contributes to weak supervision, with thresholding serving as the decision rule for label assignment.

The validation and test sets remain unchanged and contain only manually annotated gold-standard samples. This setup keeps the evaluation splits fixed while expanding only the training data through semi-supervised augmentation, enabling a clear assessment of the impact of additional pseudo-labeled samples on model performance.

4.2. Training Configuration

Both the seed model and the augmented model used identical hyperparameter settings to ensure a fair comparison, as shown in Table 4. The training process used the AdamW optimizer implemented in PyTorch (version 2.12.0) [33] with a learning rate of $2 \times 10^{-5}$ and a weight decay of 0.01. Training was conducted for 15 epochs with a batch size of 8 and a maximum input sequence length of 192 tokens. Mixed-precision training was performed using FP16. The [CLS] token representation was passed to a linear classification head with sigmoid activations to produce independent probability scores for each label.
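The reported settings can be collected in a small configuration object, sketched below. The class name `TrainingConfig` and the `steps_per_epoch` helper are hypothetical, and the step count assumes no gradient accumulation, which the text does not specify.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingConfig:
    # Values as reported in the text.
    optimizer: str = "AdamW"
    learning_rate: float = 2e-5
    weight_decay: float = 0.01
    epochs: int = 15
    batch_size: int = 8
    max_seq_length: int = 192
    fp16: bool = True

    def steps_per_epoch(self, n_train: int) -> int:
        """Optimizer steps per epoch, assuming no gradient accumulation."""
        return math.ceil(n_train / self.batch_size)

config = TrainingConfig()
# Gold-only stage: 360 training samples per fold.
gold_steps = config.steps_per_epoch(360)
print(gold_steps, gold_steps * config.epochs)
```

Sharing one frozen configuration between the seed and augmented runs keeps the comparison limited to the training data, which is the stated purpose of using identical hyperparameters.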

4.3. Evaluation Metrics

The evaluation uses standard multi-label classification metrics to provide a comprehensive assessment of model performance [34]:

Micro-F1 aggregates true positives, false positives, and false negatives across all labels. It reflects overall performance and is more influenced by frequent classes.
Macro-F1 computes the F1-score independently for each label and then averages the scores across labels, giving equal weight to both frequent and rare classes.
Subset Accuracy measures the proportion of samples for which the predicted label set exactly matches the ground-truth label set. This metric is strict because all labels for a given instance must be predicted correctly.
Jaccard Index evaluates the similarity between predicted and ground-truth label sets. It allows partial matches and provides a less strict alternative to subset accuracy.
Hamming Loss measures the fraction of incorrectly predicted labels over the total number of label decisions. Lower values indicate better label-wise performance.
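The five metrics can be computed from binary label matrices as in the sketch below. The Jaccard Index is averaged over samples here, which is an assumption because the text does not state the averaging mode, and the toy labels are invented for illustration.

```python
def f1(tp, fp, fn):
    """F1 from counts; returns 0.0 when the label never occurs or is predicted."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def multilabel_metrics(y_true, y_pred):
    """Micro-F1, Macro-F1, Subset Accuracy, Jaccard Index, and Hamming Loss."""
    n, k = len(y_true), len(y_true[0])
    tp, fp, fn = [0] * k, [0] * k, [0] * k
    exact, jaccard_sum, wrong = 0, 0.0, 0
    for t_row, p_row in zip(y_true, y_pred):
        inter = sum(1 for t, p in zip(t_row, p_row) if t and p)
        union = sum(1 for t, p in zip(t_row, p_row) if t or p)
        jaccard_sum += inter / union if union else 1.0  # two empty sets match
        exact += int(t_row == p_row)
        for c, (t, p) in enumerate(zip(t_row, p_row)):
            tp[c] += int(t == 1 and p == 1)
            fp[c] += int(t == 0 and p == 1)
            fn[c] += int(t == 1 and p == 0)
            wrong += int(t != p)
    return {
        "micro_f1": f1(sum(tp), sum(fp), sum(fn)),          # pooled counts
        "macro_f1": sum(f1(tp[c], fp[c], fn[c]) for c in range(k)) / k,
        "subset_accuracy": exact / n,
        "jaccard": jaccard_sum / n,
        "hamming_loss": wrong / (n * k),
    }

# Toy example; columns are [No, Social, Culture, Religion].
y_true = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 1, 1, 0], [0, 0, 0, 1]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0]]
m = multilabel_metrics(y_true, y_pred)
print({name: round(value, 4) for name, value in m.items()})
```

In the toy example the two rare labels are never predicted, so Macro-F1 falls well below Micro-F1, the same qualitative pattern that arises when frequent labels dominate a dataset.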

5. Results

The experimental results evaluate both the modeling performance and the annotation characteristics of the proposed framework. The analysis examines the effects of pseudo-label augmentation, class-imbalance handling, and validation-based threshold calibration, followed by a detailed analysis of contextual label distributions, overlaps, and representative examples from the dataset.

5.1. Impact of Pseudo-Labeled Data on Model Performance

The pseudo-labeled training configuration improved overall performance compared with the gold-only configuration, particularly on metrics influenced by high-support labels. Mean performance across five folds showed that Micro-F1 increased from 0.72 to 0.84, while Macro-F1 improved from 0.19 to 0.22. Subset Accuracy increased from 0.70 to 0.84, and the Jaccard Index improved from 0.72 to 0.84. In addition, Hamming Loss decreased from 0.07 to 0.04, suggesting fewer label-wise prediction errors overall. These results are reported using the held-out gold-standard test sets in Table 5 and Figure 3.

5.2. Per-Class Performance Analysis

Per-class analysis was conducted to provide a clearer interpretation of model behavior across individual contextual labels. As shown in Table 6, pseudo-label augmentation improves performance on high-support labels, particularly No and Social|Relationship. The F1-score for Social|Relationship increases from 0.55 in the gold-only setting to 0.68 after adding pseudo-labeled samples, suggesting improved learning of frequent social-context patterns.

However, performance on low-support categories remains limited. Social|Life Satisfaction, Culture|Information, and Culture|Values each have only one positive instance in the held-out test fold and obtain F1-scores of 0.00 under both configurations. Among the cultural sub-categories, only Culture|Norms and Morals achieves a non-zero F1, which remains low despite improving from 0.18 to 0.31 under pseudo-label augmentation. Religion is likewise difficult to predict because of its low support, and its F1 decreases from 0.20 to 0.13 after pseudo-label augmentation. These results make the severity of class imbalance explicit: pseudo-label augmentation helps frequent labels but remains limited for rare cultural and religious sub-categories, which should be addressed in future work through targeted annotation and stronger imbalance-aware learning.

5.3. Error Analysis

Error analysis shows that prediction errors mainly occur in three cases. First, some contextual questions are predicted as No, especially when the contextual cue is implicit rather than directly stated. This suggests that the model may miss social, cultural, or religious signals when they are expressed indirectly.

Second, cultural cases may be confused with Social|Relationship because cultural expectations are often expressed through family, marriage, interpersonal obligations, or social pressure. As a result, the model may capture the social surface meaning while missing the deeper cultural framing.

Third, religious cases may be confused with either No or Social|Relationship. This occurs when religious cues are sparse, indirect, or embedded within broader emotional or social distress.

5.4. Ablation Study

An ablation study was conducted to examine the individual and combined effects of class-weighted optimization and pseudo-label augmentation. Four configurations were compared: a standard AraBERT baseline, AraBERT with weighted BCE loss, AraBERT with pseudo-label augmentation, and the complete framework combining weighted BCE with pseudo-label augmentation. All configurations were evaluated using the same held-out gold-standard test sets under five-fold multi-label stratified cross-validation.

The standard AraBERT baseline achieved a Micro-F1 score of 0.78 and a Macro-F1 score of 0.19. Adding weighted BCE improved Macro-F1 to 0.21, indicating slightly better sensitivity to minority contextual categories under severe class imbalance. However, this improvement was accompanied by lower Micro-F1, Subset Accuracy, and Jaccard scores, suggesting reduced overall prediction stability, as reported in Table 7.

Pseudo-label augmentation alone improved the main performance metrics, increasing Micro-F1 from 0.78 to 0.83 and Subset Accuracy from 0.76 to 0.82. This suggests that additional pseudo-labeled contextual samples improve representation learning and prediction consistency, even without imbalance-aware optimization.

The strongest overall performance was achieved by combining weighted BCE with pseudo-label augmentation. This configuration achieved the highest scores across all reported metrics, including a Micro-F1 score of 0.84, a Macro-F1 score of 0.22, a Subset Accuracy of 0.84, and a Jaccard score of 0.84. Overall, the results suggest that pseudo-label augmentation provides the largest performance improvement, while weighted BCE provides a smaller additional gain in minority-label sensitivity when combined with additional contextual data.

5.5. Binary Annotation

The binary annotation task determines the presence or absence of contextual factors within each question. The human annotation phase covers 500 questions, of which 131 are labeled Yes and 369 are labeled No. Pseudo-labeling is subsequently applied to the remaining 2177 unlabeled questions, yielding an additional 421 Yes and 1756 No labels. The combined annotation results in 552 Yes and 2125 No instances across all 2677 questions, as summarized in Table 8. This distribution indicates that approximately 20.6% of questions contain at least one culturally, socially, or religiously grounded dimension. Figure 4 further illustrates the distribution of questions with and without contextual factors.
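The combined counts and the reported share can be reproduced with a short arithmetic check using only the figures stated above; the variable names are illustrative.

```python
# Counts as reported in the text.
human = {"Yes": 131, "No": 369}
pseudo = {"Yes": 421, "No": 1756}

combined = {label: human[label] + pseudo[label] for label in human}
total = sum(combined.values())
yes_share = combined["Yes"] / total

print(combined, total, f"{yes_share:.1%}")
```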

5.6. Distribution of Cultural, Social, and Religious Annotations

Among the 131 human-annotated positive cases, social aspects were the most prevalent, appearing in 108 questions. Cultural aspects were identified in 19 instances, while religious aspects appeared in 14. To better understand label co-occurrence, the annotations were further analyzed based on mutually exclusive label combinations. The majority of samples (98) were labeled as social only. In contrast, 14 questions were labeled as cultural only and 9 as religious only. A smaller number of instances exhibited overlapping labels, with 5 questions combining social and cultural aspects and another 5 combining social and religious aspects. No samples were annotated with all three categories simultaneously. Overall, these counts sum to 131, confirming that each instance is assigned to exactly one label combination. While the main categories themselves are not mutually exclusive, their combinations form a complete and non-overlapping partition of the dataset.

Among the 421 pseudo-labeled positive cases, social aspects again dominated, appearing in 417 questions. Cultural and religious aspects appeared in 9 and 5 cases, respectively.

The overlap distribution was computed over mutually exclusive label combinations. Specifically, 408 questions were labeled as social only, 2 as cultural only, and 2 as religious only. Additionally, 6 questions combined social and cultural labels, 2 combined social and religious labels, and 1 instance contained all three categories. These counts sum to 421, confirming a complete and non-overlapping partition of samples across label combinations. A detailed breakdown of label co-occurrence is provided in Table 9, presenting the distribution across human annotations (131), pseudo-labeled instances (421), and the combined dataset (552).
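The relationship between the mutually exclusive label combinations and the per-category totals can be checked with a short sketch. The counts are those reported above; the helper name `marginals` is hypothetical.

```python
def marginals(combos):
    """Per-category totals recovered from mutually exclusive label combinations."""
    totals = {}
    for labels, count in combos.items():
        for label in labels:
            totals[label] = totals.get(label, 0) + count
    return totals

# Human-annotated positive cases (131 in total).
human_combos = {
    frozenset({"Social"}): 98,
    frozenset({"Culture"}): 14,
    frozenset({"Religion"}): 9,
    frozenset({"Social", "Culture"}): 5,
    frozenset({"Social", "Religion"}): 5,
}

# Pseudo-labeled positive cases (421 in total).
pseudo_combos = {
    frozenset({"Social"}): 408,
    frozenset({"Culture"}): 2,
    frozenset({"Religion"}): 2,
    frozenset({"Social", "Culture"}): 6,
    frozenset({"Social", "Religion"}): 2,
    frozenset({"Social", "Culture", "Religion"}): 1,
}

human_totals = marginals(human_combos)
pseudo_totals = marginals(pseudo_combos)
print(sum(human_combos.values()), human_totals)
print(sum(pseudo_combos.values()), pseudo_totals)
```

The combination counts partition the positive cases, whereas the per-category totals count a multi-label question once per category, which is why the category totals add up to more than the number of questions.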

5.7. Sub-Category Distribution and Overlap Analysis

At the sub-category level, Social|Relationship is the most frequent label. As illustrated in Figure 5, overlap among social sub-categories is limited. The largest overlap is observed between Social|Relationship and Social|Demographics, while overlaps involving Social|Life satisfaction are rare. No instance contains all three social sub-categories.

Within the cultural dimension (Figure 6), Culture|Norms and Morals appears most frequently, whereas Culture|Information and Culture|Values occur in fewer instances. The only overlap is between Norms and Morals and Values, with no overlap involving Information.

Cross-category co-occurrence between social and cultural sub-categories is shown in Figure 7. The highest overlap occurs between Social|Relationship and Culture|Norms and Morals. Additional overlaps are observed between social and religious categories, as reflected in the heatmap.
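
A co-occurrence matrix of the kind shown in Figure 7 can be derived from multi-hot annotations: the diagonal holds each label's total frequency and the off-diagonal cells hold pairwise overlaps. The sketch below illustrates the computation on a small, hypothetical set of rows; the values are invented for illustration and do not come from the dataset.

```python
labels = ["S|Rel", "S|Demo", "C|Norms"]

# Hypothetical multi-hot annotations, one row per question (not real data).
rows = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 1],
]

def cooccurrence(matrix):
    """Return M where M[i][j] counts rows in which labels i and j are both set."""
    k = len(matrix[0])
    return [[sum(r[i] * r[j] for r in matrix) for j in range(k)] for i in range(k)]

M = cooccurrence(rows)
for name, line in zip(labels, M):
    print(f"{name:>8}", line)
```

The result is symmetric, and each diagonal entry equals the number of rows carrying that label, matching the reading convention given in the Figure 7 caption.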

5.8. Representative Examples

A range of sociocultural frames can be observed in how Arabic speakers articulate psychological distress. Religious framing appears in the first example, where the patient’s compulsive repetition of words is interpreted through a spiritual lens and managed via religious coping strategies, such as remembrance and seeking forgiveness (Table 10).

Social dimensions are prominent in multiple entries, particularly those involving relational and emotional difficulties. For instance, the question concerning marriage between two individuals diagnosed with bipolar disorder illustrates the intersection of mental health and familial expectations, where love, stigma, and responsibility intertwine. Similarly, the example expressing chronic sadness despite material stability captures a form of emotional dissatisfaction linked to life satisfaction rather than socioeconomic status.

Cultural labeling is evident in questions centered on self-image and social perception. The example of body dissatisfaction and comparison with others demonstrates how cultural ideals of beauty and moral self-worth interact to shape psychological vulnerability. The case involving withdrawal from escitalopram also reveals culturally specific health behaviors, where pharmacological decisions are influenced by concerns about motherhood and family values. Finally, the example describing fear of judgment and self-consciousness represents the sub-category of Values, emphasizing the moral and social gaze that governs self-presentation in collectivist settings.

Overall, the selected questions illustrate the interplay between religious beliefs, cultural values, and social relationships in Arabic expressions of distress. Such sociocultural grounding challenges purely clinical interpretations and underlines the necessity of culturally aligned NLP models for mental health understanding in Arabic contexts.

6. Discussion

This work introduces ContextMental, a benchmark and annotation framework designed to capture contextual dimensions in Arabic mental health question classification. The results indicate that contextual information shapes how psychological distress is expressed in roughly one in five Arabic patient-authored questions, particularly through socially grounded narratives and interpersonal concerns.

The annotation distribution shows that social contextual factors are the most common across both manually annotated and pseudo-labeled data, especially the Social|Relationship category: of the 552 positive questions, 525 carry a social label, compared with 28 cultural and 19 religious (Table 9). This pattern reflects the tendency of users to describe psychological experiences through family dynamics, interpersonal conflict, emotional attachment, and social interaction rather than through isolated clinical symptoms. Cultural and religious contextual signals appear less frequently, but they still provide complementary information that shapes the interpretation of mental health narratives within Arabic-speaking communities.

The overlap analysis shows that contextual dimensions can co-occur, although such overlaps remain limited compared with single-category annotations. The clearest cross-category overlap appears between social and cultural dimensions (11 questions, against 7 for social and religious; Table 9), particularly when relationship-related distress is expressed alongside social expectations, family norms, or culturally shaped behavioral constraints. This suggests that contextual signals in Arabic mental health questions are sometimes interconnected rather than strictly separable. Accordingly, context-aware mental health NLP may benefit from modeling dependencies between contextual dimensions instead of treating labels as fully independent categories.

The semi-supervised experiments indicate that pseudo-labeling improves overall predictive performance, mainly for high-support labels such as No (F1 from 0.85 to 0.89) and Social|Relationship (F1 from 0.55 to 0.68; Table 6). The pseudo-labeled data remain concentrated in social-context labels and add comparatively fewer examples for cultural and religious categories. This indicates that the model is more effective at extending frequent contextual patterns than at expanding low-support categories. The improvements in Micro-F1 (0.72 to 0.84), Jaccard Index (0.72 to 0.84), and Subset Accuracy (0.70 to 0.84) reported in Table 5 further suggest that pseudo-labeling improves global prediction consistency and generalization for common contextual structures.

However, the results also show important limitations of pseudo-labeling under severe class imbalance. Rare contextual categories, including Culture|Values, Culture|Norms and Morals, and Religion, remain difficult to predict reliably despite the larger training set: with pseudo-labels, their F1 scores are 0.00, 0.31, and 0.13, respectively, and the score for Religion is lower than in the gold-only configuration (0.20; Table 6). The pseudo-labeled data provide only limited additional support for these categories, which restricts the model’s ability to learn stable decision boundaries for minority labels. Consequently, Macro-F1 improves more modestly than Micro-F1 (0.19 to 0.22 versus 0.72 to 0.84; Table 5), indicating that the main gains are concentrated in dominant categories rather than distributed evenly across all labels. This reflects a known limitation of self-training approaches, where pseudo-labeling may provide greater benefit for frequent patterns while offering less improvement for underrepresented classes.
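
The gap between Micro-F1 and Macro-F1 under class imbalance can be illustrated with a toy calculation. The sketch below uses hypothetical per-label confusion counts (they are not the paper's results) and only shows why one well-predicted frequent label dominates the micro average while rare labels pull the macro average down.

```python
# Hypothetical per-label confusion counts (tp, fp, fn); not taken from the paper.
per_label = {
    "No": (90, 10, 10),
    "Culture|Values": (0, 1, 4),
    "Religion": (1, 1, 3),
}

def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0 when the label is never predicted or present."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Macro: average the per-label F1 scores, so every label weighs the same.
macro_f1 = sum(f1(*c) for c in per_label.values()) / len(per_label)

# Micro: pool the counts over labels first, so frequent labels dominate.
tp, fp, fn = (sum(c[i] for c in per_label.values()) for i in range(3))
micro_f1 = f1(tp, fp, fn)

print(f"micro-F1 = {micro_f1:.2f}, macro-F1 = {macro_f1:.2f}")
```

In this toy setting the micro average stays high because the frequent label is predicted well, whereas the macro average is pulled down by the two rare labels.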

The annotation reliability analysis supports the consistency of the proposed framework, particularly at the binary and main-category levels, where inter-annotator agreement remains strong (κ = 0.80 and κ = 0.76, respectively; Table 2). Nevertheless, agreement decreases for fine-grained and infrequent sub-categories (κ = 0.58), reflecting the inherent ambiguity of contextual interpretation in mental health language. Distinguishing between overlapping cultural, social, and religious contextual signals often requires subjective judgment and nuanced interpretation of implicit linguistic cues. These challenges become more pronounced in short or sparsely contextualized questions, where limited textual evidence constrains annotation certainty.

Several limitations remain. First, the dataset size for minority contextual categories remains relatively small, which restricts reliable evaluation and stable learning for rare labels. Second, the pseudo-labeling strategy depends on model-generated annotations, which may introduce noise and reflect the distributional tendencies of the seed model. Third, the benchmark focuses specifically on Arabic mental health questions and therefore does not necessarily generalize to broader Arabic mental health discourse, conversational dialogue, or long-form clinical narratives. In addition, the current framework models contextual dimensions independently and does not explicitly capture hierarchical or temporal relationships between contextual factors.

Despite these limitations, the proposed benchmark provides a structured resource for contextual mental health understanding in Arabic NLP. The dataset, annotation framework, and semi-supervised learning setup establish a foundation for future research on context-aware mental health modeling, culturally informed NLP systems, and socially grounded language understanding in low-resource mental health settings.

6.1. Implications

The findings of this study have several methodological, computational, and societal implications for Arabic mental health NLP. Methodologically, the results suggest that a multidimensional annotation schema for Arabic mental health questions can be constructed through an iterative, guideline-driven annotation process by a small research team, without requiring large-scale infrastructure. The schema captures social, cultural, and religious aspects embedded in patient discourse while preserving interpretability and annotation consistency. However, scaling the schema to larger datasets or clinical deployment would require additional annotation resources and expert validation. The observed inter-annotator agreement and overlap analysis suggest that subjective constructs, such as cultural and religious reasoning, can be consistently annotated through carefully defined guidelines and iterative calibration. This framework may support the development of similar sociocultural annotation schemes in other healthcare domains and underrepresented languages.

From a computational perspective, the use of pseudo-labeling demonstrates the practicality of semi-supervised learning in low-resource and sensitive domains. Leveraging validation-calibrated model predictions enables the expansion of labeled data while preserving key distributional patterns observed in human annotations. Such approaches are particularly relevant in settings where data availability is constrained by privacy, ethical considerations, or linguistic diversity. Incorporating contextual categories within pseudo-labeling pipelines may further support the development of models that generalize across different social contexts while maintaining interpretability [35].

At the societal level, the results highlight the importance of accounting for contextual factors in the development of NLP systems for mental health. By capturing how individuals express distress through relational, cultural, and religious frames, the dataset provides a structured foundation for more context-aware analysis. These insights may inform the design of systems that better reflect the diversity of user experiences in Arabic-speaking contexts. Future work can explore how such models can be integrated into real-world applications while ensuring transparency and cultural sensitivity.

6.2. Ethical Considerations

Given the sensitive nature of mental health data, ethical considerations are central to this study. The dataset was derived from publicly available question–answer content, and no personally identifiable information is included in the released dataset. The analysis is limited to patient-authored questions and focuses on the thematic categorization of contextual factors rather than diagnosis, risk assessment, or evaluation of individual users. Annotators were instructed to treat the questions respectfully and to label only observable textual cues related to social, cultural, or religious framing.

Several risks remain despite anonymization. Mental health questions may contain sensitive self-disclosures, and contextual labels may reflect vulnerable social circumstances, religious concerns, family relations, or culturally specific experiences. In addition, pseudo-labeling may introduce noise, and model predictions may reproduce biases present in the source platform or annotation process. Therefore, the resulting models should be interpreted as research tools for corpus analysis and should not be used for clinical decision-making, diagnosis, triage, or automated mental health advice without expert oversight.

The schema is intended to support culturally aware analysis, not to stereotype Arabic-speaking users or reduce complex experiences to fixed categories. Any downstream use should include transparency about model limitations, human review, cultural sensitivity, and fairness evaluation across demographic and dialectal groups. Future work should further examine bias, annotation uncertainty, and safeguards for responsible deployment in mental health NLP.

7. Conclusions

This work introduces ContextMental, a socioculturally informed framework for Arabic mental health question classification that integrates cultural, social, and religious dimensions into both annotation and modeling. The proposed schema supports multi-label analysis at two levels of granularity—main categories and sub-categories—and is instantiated in a curated corpus of patient questions. A baseline classification framework based on AraBERT, combined with imbalance-aware optimization, pseudo-labeling, and adaptive threshold calibration, provides preliminary evidence that contextual-factor classification is feasible for high-support labels, while rare cultural and religious categories require further annotated data and stronger imbalance-aware methods.

The results highlight the importance of contextual framing in Arabic mental health discourse. Patient expressions are frequently shaped by interpersonal, cultural, and religious considerations, which are not fully captured by conventional clinical categorizations. Modeling these dimensions may support a more context-aware representation of how distress is articulated in real-world Arabic mental health questions.

From a methodological perspective, this work presents a framework that combines structured sociocultural annotation with semi-supervised learning. This approach can support dataset expansion while preserving the main distributional patterns observed in human annotations. However, the classification results should be interpreted cautiously, as improvements are concentrated mainly in high-support labels and do not yet indicate robust prediction of rare sociocultural categories.

Future work could explore the integration of multi-modal signals, such as speech and sentiment, as well as more advanced adaptation and imbalance-aware techniques to improve performance on minority labels. Additional human-centered evaluation is also needed to assess the practical implications of applying such models in real-world settings. Overall, this study contributes a structured resource and baseline modeling approach for context-aware Arabic mental health NLP, supporting future research on systems that better account for selected social, cultural, and religious dimensions of user experiences.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/electronics15122558/s1, File S1: Annotation Guidelines.

Author Contributions

Conceptualization, L.A., A.A. and H.A.; methodology, L.A., A.A. and H.A.; data curation, L.A.; annotation, L.A., A.A. and H.A.; validation, A.A. and H.A.; formal analysis, L.A.; investigation, L.A.; visualization, L.A.; writing—original draft preparation, L.A.; writing—review and editing, A.A. and H.A.; supervision, H.A. All authors have read and agreed to the published version of the manuscript.

Funding

The authors would like to extend their appreciation to the Deanship of Research and Graduate Studies at King Khalid University for funding this work through small group research under grant number RGP1/69/46.

Data Availability Statement

The dataset supporting the findings of this study is openly available in the GitHub repository at https://github.com/LamaAy/ContextMental.git (accessed on 3 June 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Adebayo, Y.O.; Adesiyan, R.E.; Amadi, C.S.; Ipede, O.; Karakitie, L.O.; Adebayo, K.T. Cross-cultural perspectives on mental health: Understanding variations and promoting cultural competence. World J. Adv. Res. Rev. 2024, 23, 432–439.
2. Konidaris, M.; Petrakis, M. Cultural humility training in mental health service provision: A scoping review of the foundational and conceptual literature. Healthcare 2025, 13, 1342.
3. Lyons, P.; Edwardes, A.; Bladon, L.; Abel, K.M. Culturally sensitive mental health research: A scoping review. BMC Psychiatry 2025, 25, 190.
4. De Choudhury, M.; Gamon, M.; Counts, S.; Horvitz, E. Predicting depression via social media. In Proceedings of the International AAAI Conference on Web and Social Media; AAAI Press: Washington, DC, USA, 2013; Volume 7, pp. 128–137.
5. Coppersmith, G.; Dredze, M.; Harman, C.; Hollingshead, K.; Mitchell, M. CLPsych 2015 shared task: Depression and PTSD on Twitter. In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality; Association for Computational Linguistics: Stroudsburg, PA, USA, 2015; pp. 31–39.
6. Antoun, W.; Baly, F.; Hajj, H. AraBERT: Transformer-based model for Arabic language understanding. arXiv 2020, arXiv:2003.00104.
7. Alhuzali, H.; Alasmari, A.; Alsaleh, H. Mentalqa: An annotated arabic corpus for questions and answers of mental healthcare. IEEE Access 2024, 12, 101155–101165.
8. Hovy, D.; Yang, D. The importance of modeling social factors of language: Theory and practice. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 588–602.
9. Liu, Z.; Ferianc, M.; Treleaven, P.C.; Rodrigues, M. Cultural Alignment in Large Language Models: An Explanatory Analysis Based on Hofstede’s Cultural Dimensions. arXiv 2024, arXiv:2309.12342v2.
10. Plutchik, R. A general psychoevolutionary theory of emotion. In Theories of Emotion; Elsevier: Amsterdam, The Netherlands, 1980; pp. 3–33.
11. Zhang, T.; Yang, K.; Alhuzali, H.; Liu, B.; Ananiadou, S. PHQ-aware depressive symptoms identification with similarity contrastive learning on social media. Inf. Process. Manag. 2023, 60, 103417.
12. Chaturvedi, J.; Velupillai, S.; Stewart, R.; Roberts, A. Identifying mentions of pain in mental health records text: A natural language processing approach. arXiv 2023, arXiv:2304.01240.
13. Garg, M.; Saxena, C.; Saha, S.; Krishnan, V.; Joshi, R.; Mago, V. CAMS: An Annotated Corpus for Causal Analysis of Mental Health Issues in Social Media Posts. In Proceedings of the Thirteenth Language Resources and Evaluation Conference; European Language Resources Association: Paris, France, 2022; pp. 6387–6396.
14. Racha, S.; Joshi, P.; Raman, A.; Jangid, N.; Sharma, M.; Ramakrishnan, G.; Punjabi, N. MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models. arXiv 2025, arXiv:2502.15418.
15. Kim, H.; Wang, J.; Liu, R. Applications of Large Language Models in Mental Health: Opportunities and Challenges. J. Med. Internet Res. 2025, 27, e69284.
16. Hassib, M.; Hossam, N.; Sameh, J.; Torki, M. Aradepsu: Detecting depression and suicidal ideation in arabic tweets using transformers. In Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP); Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 302–311.
17. Alghamdi, A.; Alshutayri, A.; Alharbi, B. Deep bidirectional transformers for Arabic dialect identification. In Proceedings of the 6th International Conference on Future Networks & Distributed Systems; Association for Computing Machinery: New York, NY, USA, 2022; pp. 265–272.
18. Alhuzali, H.; Alasmari, A. Pre-Trained Language Models for Mental Health: An Empirical Study on Arabic Q&A Classification. Healthcare 2025, 13, 985.
19. Alhuzali, H.; Al-Eisawi, W.; Abdul-Mageed, M.; Abouzahir, C.; Abu-Daoud, M.; Alasmari, A.; Al-Monef, R.; Alqahtani, A.; Ayash, L.; Kharouf, L.; et al. AraHealthQA 2025: The First Shared Task on Arabic Health Question Answering. In Proceedings of the Third Arabic Natural Language Processing Conference: Shared Tasks; Association for Computational Linguistics: Suzhou, China, 2025; pp. 107–118.
20. Alqahtani, A.; Salama, R.; Diab, M. Understanding the Sociocultural Dimensions of Mental Health Discourse in Arabic-Language X Communities. In Proceedings of the 11th Social Media Mining for Health Applications and Health Real-World Data (SMM4H-HeaRD 2026) Workshop and Shared Tasks; Association for Computational Linguistics: Online, 2026; Available online: https://github.com/amalqahtani/arabic-x-mental-health-discourse (accessed on 3 June 2026).
21. Abdul-Mageed, M.; Elmadany, A.; Nagoudi, E.M.B. ARBERT & MARBERT: Deep bidirectional transformers for Arabic. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 7088–7105.
22. Havaldar, S.; Rai, S.; Singhal, B.; Liu, L.; Guntuku, S.C.; Ungar, L. Multilingual Language Models are not Multicultural: A Case Study in Emotion. In Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis; Barnes, J., De Clercq, O., Klinger, R., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 202–214.
23. Giuliani, N.; Ma, C.C.; Pradeep, P.; Ippolito, D. CAVA: A Tool for Cultural Alignment Visualization & Analysis. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations; Hernandez Farias, D.I., Hope, T., Li, M., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 153–161.
24. Rai, P.; Gupta, K.; Fung, P. Cross-Cultural Alignment in Large Language Models: A Systematic Study of Value Representation. ACL Roll. Rev. 2025, preprint.
25. Li, Z.; Giuliani, L.; Pavlick, E. Culturally Aware and Adapted NLP: A Taxonomy and a Framework. Trans. Assoc. Comput. Linguist. 2025, 13, 101–120.
26. Cao, Y.; Chen, M.; Hershcovich, D. Bridging Cultural Nuances in Dialogue Agents through Cultural Value Surveys. In Proceedings of the Findings of the Association for Computational Linguistics: EACL 2024; Graham, Y., Purver, M., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 929–945.
27. Yang, D.; Hovy, D.; Jurgens, D.; Plank, B. Socially Aware Language Technologies: Perspectives and Practices. Comput. Linguist. 2025, 51, 689–703.
28. World Health Organization. Social Determinants of Mental Health; World Health Organization: Geneva, Switzerland, 2014.
29. Krippendorff, K. Content Analysis: An Introduction to Its Methodology, 4th ed.; Sage Publications: Thousand Oaks, CA, USA, 2018.
30. Artstein, R.; Poesio, M. Inter-coder agreement for computational linguistics. Comput. Linguist. 2008, 34, 555–596.
31. Gao, S.; Dong, W.; Cheng, K.; Yang, X.; Zheng, S.; Yu, H. Adaptive decision threshold-based extreme learning machine for classifying imbalanced multi-label data. Neural Process. Lett. 2020, 52, 2151–2173.
32. Sechidis, K.; Tsoumakas, G.; Vlahavas, I. On the stratification of multi-label data. In Machine Learning and Knowledge Discovery in Databases; Springer: Berlin/Heidelberg, Germany, 2011.
33. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101.
34. Read, J.; Pfahringer, B.; Holmes, G.; Frank, E. Classifier chains for multi-label classification. Mach. Learn. 2011, 85, 333–359.
35. Lee, D.-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Proceedings of the ICML 2013 Workshop: Challenges in Representation Learning (WREPL), Atlanta, GA, USA, 16–21 June 2013.

Figure 1. Methodological pipeline for extending and applying an annotation schema to Arabic mental health questions. The Arabic text/logo in the data collection panel refers to Altibbi, the source platform of the Arabic medical questions.

Figure 2. Training and semi-supervised augmentation pipeline. The model is fine-tuned on gold-standard data using weighted BCE loss; adaptive thresholds are then applied to model predictions to pseudo-label unlabeled questions.

Figure 3. Comparison between the gold-only and pseudo-labeled training configurations using mean performance across five folds.

Figure 4. Distribution of questions with and without contextual factors.

Figure 5. Overlap among social sub-categories (Relationship, Demographics, and Life Satisfaction).

Figure 6. Overlap among cultural sub-categories (Information, Values, and Norms and Morals).

Figure 7. Co-occurrence matrix of sub-categories. Diagonal entries represent the total frequency of each sub-category, while off-diagonal values indicate pairwise overlaps. Abbreviations: S|Rel (Social–Relationship), S|Demo (Social–Demographics), S|Life (Social–Life Satisfaction), C|Norms (Culture–Norms and Morals), C|Info (Culture–Information), C|Values (Culture–Values).

Table 1. Categories and definitions of the proposed annotation schema.

Main Category	Sub-Category	Definition
Cultural	Information	References to culturally shaped knowledge, assumptions, or explanations about mental health, symptoms, causes, treatment, or appropriate behavior.
	Values	References to culturally valued goals, expectations, or judgments, such as honor, family reputation, obedience, marriage expectations, independence, or acceptable roles.
	Norms & Morals	References to socially shared rules about proper or improper behavior, moral responsibility, shame, blame, obligation, or what is considered acceptable within the community.
Social	Relationship	References to interpersonal relations that shape the question, including family, spouse, parents, children, friends, peers, workplace relations, or community interactions.
	Demographics	References to personal or social-position attributes such as age, gender, marital status, nationality, income, education, employment, or family status.
	Life Satisfaction	References to the questioner’s perceived quality of life, happiness, loneliness, dissatisfaction, hopelessness, social functioning, or overall well-being.
Religious	–	References to religious belief, practice, obligation, spiritual coping, sin, guilt, prayer, divine will, or faith-based interpretation of distress or treatment.

Table 2. Inter-annotator agreement across labeling dimensions.

Annotation Level	$κ$	$α$
Yes/No	0.80	0.80
Category	0.76	0.68
Sub-category	0.58	0.58
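
For readers unfamiliar with chance-corrected agreement, the sketch below computes a kappa statistic on a small example. It assumes Cohen's kappa between two annotators, which Table 2 does not specify, and the label sequences are hypothetical rather than drawn from the annotation data.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    n = len(a)
    # Observed agreement: share of items on which the two annotators agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independence, from each annotator's label rates.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical Yes/No decisions (1 = contextual factor present) for ten questions.
annotator_1 = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]
annotator_2 = [1, 0, 0, 0, 0, 1, 0, 0, 0, 1]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

Here the annotators agree on 8 of 10 items, but because No is the majority label for both, much of that agreement is expected by chance and kappa is well below the raw 0.80 agreement rate.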

Table 3. Validation-calibrated class-specific thresholds used during inference.

Label	Threshold ( $τ_{c}$ )
No Contextual Factors	0.94
Social|Relationship	0.52
Social|Demographics	0.40
Social|Life Satisfaction	0.40
Culture|Information	0.40
Culture|Values	0.40
Culture|Norms and Morals	0.40
Religion	0.40
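
The class-specific thresholds in Table 3 can be applied to per-label probabilities as sketched below. The function name, the use of `>=` as the comparison, and the example probabilities are assumptions made for illustration; the handling of questions for which no label passes its threshold is not specified here and is left to the caller.

```python
# Thresholds copied from Table 3.
THRESHOLDS = {
    "No Contextual Factors": 0.94,
    "Social|Relationship": 0.52,
    "Social|Demographics": 0.40,
    "Social|Life Satisfaction": 0.40,
    "Culture|Information": 0.40,
    "Culture|Values": 0.40,
    "Culture|Norms and Morals": 0.40,
    "Religion": 0.40,
}

def apply_thresholds(probs, thresholds=THRESHOLDS):
    """Return every label whose probability reaches its own threshold.

    Labels missing from `probs` are treated as probability 0 (an assumption).
    """
    return [label for label, tau in thresholds.items() if probs.get(label, 0.0) >= tau]

# Hypothetical model output for one question.
example = {"No Contextual Factors": 0.90, "Social|Relationship": 0.61, "Religion": 0.35}
predicted = apply_thresholds(example)
print(predicted)
```

With these thresholds, a No prediction requires much higher confidence (0.94) than any contextual label, so in the example only Social|Relationship is returned even though the No probability is the largest value.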

Table 4. Training hyperparameters used for both seed and augmented models.

Hyperparameter	Value
Base model	`bert-base-arabertv02`
Max sequence length	192 tokens
Epochs	15
Batch size	8
Learning rate	$2 \times 10^{-5}$
Weight decay	0.01
Optimizer	AdamW
Loss function	Weighted BCE (per-label)
Cross-validation	5-fold multi-label stratified k-fold (MLSKF)
Mixed precision	fp16
Random seed	42
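
The per-label weighted BCE loss listed in Table 4 can be sketched in plain Python as follows. The weighting rule used in the example (negatives divided by positives) is a common convention and an assumption here; the table does not state how the weights were derived, and the function name is illustrative.

```python
import math

def weighted_bce(probs, targets, pos_weights):
    """Mean binary cross-entropy over labels, with a weight on each positive term."""
    eps = 1e-12  # keeps log() finite when a probability is exactly 0 or 1
    total = 0.0
    for p, y, w in zip(probs, targets, pos_weights):
        p = min(max(p, eps), 1 - eps)
        total += -(w * y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

# Assumed weighting rule: negatives / positives for a label with 22 positives in 100 items.
pos_weight = (100 - 22) / 22

# Missing a positive (p = 0.5, y = 1) costs pos_weight times the unweighted loss.
plain = weighted_bce([0.5], [1], [1.0])
weighted = weighted_bce([0.5], [1], [pos_weight])
print(f"unweighted = {plain:.3f}, weighted = {weighted:.3f}")
```

Raising the weight on a rare label's positive term makes missed positives more expensive, which is the usual motivation for this loss under class imbalance.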

Table 5. Mean performance across five-fold multi-label stratified cross-validation on the held-out gold-standard test sets.

Configuration	Micro-F1	Macro-F1	Subset Acc.	Jaccard	Hamming Loss
Gold Only	0.72	0.19	0.70	0.72	0.07
Gold & Pseudo	0.84	0.22	0.84	0.84	0.04
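
The example-based metrics in Table 5 can be computed from multi-hot label vectors as sketched below. The toy labels are hypothetical, and the Jaccard score is sample-averaged here, which is an assumption since the averaging mode is not stated in the table.

```python
def subset_accuracy(y_true, y_pred):
    """Share of questions whose full label set is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def hamming_loss(y_true, y_pred):
    """Share of individual label decisions that are wrong."""
    n_labels = len(y_true[0])
    wrong = sum(ti != pi for t, p in zip(y_true, y_pred) for ti, pi in zip(t, p))
    return wrong / (len(y_true) * n_labels)

def jaccard_samples(y_true, y_pred):
    """Mean per-question intersection over union of true and predicted label sets."""
    scores = []
    for t, p in zip(y_true, y_pred):
        inter = sum(ti and pi for ti, pi in zip(t, p))
        union = sum(ti or pi for ti, pi in zip(t, p))
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)

# Hypothetical multi-hot labels for four questions and three labels.
y_true = [[1, 0, 0], [0, 1, 0], [0, 1, 1], [1, 0, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 1, 0]]

print(subset_accuracy(y_true, y_pred), hamming_loss(y_true, y_pred), jaccard_samples(y_true, y_pred))
```

Subset accuracy is the strictest of the three because a single wrong label makes the whole question count as an error, whereas the Jaccard score gives partial credit and Hamming loss counts each label decision separately.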

Table 6. Per-label performance on the held-out gold-standard test sets. Support is the number of positive instances in each held-out test fold. Values are reported as mean ± standard deviation across five folds. Bold values indicate the better result between the two configurations for the same label and metric.

Config.	Label	Support	Precision	Recall	F1
Gold Only	No	71	0.76 ± 0.05	0.96 ± 0.04	0.85 ± 0.02
Gold Only	Social|Relationship	22	0.46 ± 0.07	0.70 ± 0.09	0.55 ± 0.05
Gold Only	Social|Demographics	3	0.12 ± 0.12	0.40 ± 0.15	0.16 ± 0.11
Gold Only	Social|Life Satisfaction	1	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00
Gold Only	Culture|Information	1	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00
Gold Only	Culture|Values	1	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00
Gold Only	Culture|Norms and Morals	4	0.18 ± 0.20	0.35 ± 0.22	0.18 ± 0.13
Gold Only	Religion	3	0.14 ± 0.13	0.47 ± 0.30	0.20 ± 0.17
Gold & Pseudo	No	71	0.86 ± 0.06	0.93 ± 0.05	0.89 ± 0.01
Gold & Pseudo	Social|Relationship	22	0.67 ± 0.10	0.74 ± 0.19	0.68 ± 0.08
Gold & Pseudo	Social|Demographics	3	0.26 ± 0.23	0.27 ± 0.15	0.24 ± 0.17
Gold & Pseudo	Social|Life Satisfaction	1	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00
Gold & Pseudo	Culture|Information	1	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00
Gold & Pseudo	Culture|Values	1	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00
Gold & Pseudo	Culture|Norms and Morals	4	0.57 ± 0.41	0.25 ± 0.00	0.31 ± 0.10
Gold & Pseudo	Religion	3	0.13 ± 0.18	0.13 ± 0.18	0.13 ± 0.18

Table 7. Ablation study results averaged across five folds on the held-out gold-standard test sets. Bold values indicate the best performance for each metric.

Configuration	Micro-F1	Macro-F1	Subset Acc.	Jaccard
AraBERT	0.78	0.19	0.76	0.76
AraBERT + Weighted BCE	0.73	0.21	0.70	0.72
AraBERT + Pseudo-labels	0.83	0.21	0.82	0.83
Full Framework	0.84	0.22	0.84	0.84

Table 8. Binary annotation summary for human and pseudo-labeled data.

Label	Human	Pseudo	Total
No	369	1756	2125
Yes	131	421	552
Total	500	2177	2677

Table 9. Overlap distribution between main categories for human and pseudo-labeled data. Rows sum to the total number of positive instances per annotation type.

Annotation	Only Social	Only Culture	Only Religion	Social & Culture	Social & Religion	All Three	Total
Human Annotators	98	14	9	5	5	0	131
Pseudo-labeling	408	2	2	6	2	1	421
Total	506	16	11	11	7	1	552

Table 10. Examples of mental health questions from ContextMental dataset, originally written in Arabic and translated into English for clarity. Each entry includes its assigned main and sub-category.

Question (English Translation from Arabic)	Main Category	Sub-Category
My problem is that I keep repeating a certain word for a long time. If I hear someone repeat a word, I start repeating it to myself. I try to stop but I can’t and I get frustrated. This happened before and I got over it, but it came back. OCD runs in my family. My mother told me to say astaghfirallah instead of repeating.	Religion	–
I have been diagnosed with bipolar disorder type I and II. Should I proceed with marriage to my fiancée who has the same condition? I love her deeply; she is my safe place. Our wedding is in three months.	Social	Relationship
I am not satisfied with my appearance. I feel ugly and ashamed of myself. I compare myself to other girls and envy their beauty. People say I am pretty, but I don’t believe them. My siblings sometimes make fun of me, which makes me feel hopeless.	Culture	Norms and Morals
No feeling of happiness, inability to sleep, no desire to do anything, only negative thoughts. I just want isolation. I can’t feel joy or eat, and I feel like I lost the purpose of life.	Social	Life Satisfaction
The withdrawal symptoms from escitalopram are very strong after I stopped it due to fear of its effect on pregnancy. Tests were normal, but I relapsed after two months. I cannot currently see a psychiatrist.	Culture	Information
I feel constant sadness despite being young and having a very stable, happy marriage and good finances. I am stuck in past problems and can’t move forward positively; I cry often and don’t know if this is depression.	Social	Demographics
I have thoughts telling me You are not being yourself; you are fake. I tried to get rid of this, but it didn’t work, even when I’m alone. I care too much about what others think when I do something in front of people.	Culture	Values
