1. Introduction
Mental health discourse is inherently shaped by contextual factors, including social relationships, cultural norms, and religious beliefs [1, 2, 3]. Individuals often express distress through interpersonal experiences, cultural expectations, and religious frames that influence how symptoms are described and interpreted. However, prior mental health NLP research has primarily focused on detecting clinical signals, such as depression, anxiety, and suicidal ideation, often treating text as decontextualized input rather than as a contextually grounded expression of lived experience [4, 5]. This limitation is especially pronounced in Arabic mental health NLP, where existing work has largely emphasized clinical-condition detection while giving less attention to social, cultural, and religious contexts. In Arabic settings, these contextual dimensions are particularly important because social, cultural, and religious structures can strongly influence how mental health concerns are articulated.
Recent advances in transformer-based models, including AraBERT, have improved performance in Arabic text classification [6]. Despite these gains, current models may prioritize linguistic and biomedical features when fine-tuned on datasets that mainly provide clinical labels, which limits their ability to capture the contextual dimensions that govern meaning. At the same time, existing datasets do not provide explicit annotations of social, cultural, and religious factors, so these factors cannot be studied or modeled directly [7]. As a result, models may produce linguistically accurate outputs that remain socially, culturally, or religiously misaligned.
This gap reflects a broader limitation in language modeling. Systems trained without contextual awareness may fail to capture the norms, values, and relational dynamics embedded in human communication [8, 9]. In mental health applications, this limitation is particularly critical because expressions of distress often depend on interpersonal context, cultural framing, and religious beliefs. The task is therefore challenging because it lies at the intersection of mental health analysis and NLP, requiring the interpretation of subtle linguistic, social, cultural, and religious cues rather than simple surface-level text classification.
To address this problem, this work introduces ContextMental, a context-aware annotation schema and benchmark dataset for Arabic mental health questions. The proposed framework formulates contextual modeling as a multi-label classification task, enabling the identification of social, cultural, and religious dimensions at multiple levels of granularity.
The contributions of ContextMental are threefold:
- (i) Context-aware annotation schema: A structured multi-label framework for capturing social, cultural, and religious dimensions at both coarse and fine-grained levels.
- (ii) Benchmark dataset and analysis: A corpus of 2677 Arabic mental health questions, including 552 contextually labeled instances, accompanied by detailed distributional and co-occurrence analyses.
- (iii) Context-aware classification framework: An AraBERT-based multi-label classification pipeline incorporating imbalance-aware optimization, pseudo-labeling, and adaptive threshold calibration.
The remainder of this paper is organized as follows: Section 2 reviews related work; Section 3 describes the dataset, annotation schema, and modeling framework; Section 4 presents the experimental setup; Section 5 reports the results and analysis; Section 6 discusses the findings and limitations; and Section 7 concludes the paper.
2. Related Work
Research in mental health NLP has primarily focused on identifying psychological conditions, such as depression, anxiety, and suicidal ideation, from text data [10, 11, 12, 13]. Recent advances have extended this line of inquiry to knowledge-intensive question answering and counseling dialogue systems [14]. However, most existing approaches rely heavily on English-language datasets and often overlook the social, cultural, and religious framing of mental health discourse. This limitation becomes more pronounced when methods are transferred to languages with different morphological structures and sociocultural contexts. In Arabic, for example, mental health expressions may be shaped by complex morphology, dialectal variation, interpersonal norms, cultural expectations, and religious references. A recent review on the use of large language models in mental health [15] highlights their diagnostic and therapeutic potential but also cautions against cultural bias, ethical risks, and limited suitability in low-resource settings.
Arabic mental health NLP remains an emerging area due to limited annotated corpora and linguistic diversity. Existing work has addressed depression and suicidal ideation detection in Arabic social media, including AraDepSu [16] and deep-learning-based Arabic depression detection studies [17]. Efforts such as MentalQA [7, 18] and the AraHealthQA 2025 shared task [19] have begun addressing Arabic mental health question answering and health-related reasoning. Recent work has also examined sociocultural dimensions of Arabic mental health discourse in condition-specific X communities, highlighting religious, relational, identity, emotional-distress, and medical vocabulary patterns across online communities [20].
In parallel, Arabic pretrained language models such as AraBERT [6], MARBERT [21], and CAMeLBERT [21] have improved Arabic text classification and representation learning across Modern Standard Arabic and dialectal Arabic. However, these resources and models primarily focus on clinical-condition detection, question answering, social media discourse characterization, or general Arabic language understanding rather than the explicit annotation of social, cultural, and religious contextual factors in Arabic mental health questions.
Sociocultural Context in NLP
Culturally aware NLP has gained increasing attention as researchers have recognized the limitations of Anglocentric modeling. Hovy and Yang [8] argue for the explicit modeling of social and cultural factors in language understanding. Recent large-scale studies [22, 23, 24] show that multilingual language models often reflect cultural asymmetries and value misalignments. The emerging taxonomy of Culturally Aware and Adapted NLP [25] formalizes this research space by identifying dimensions such as cultural embedding, adaptation, and alignment. Within healthcare dialogue, incorporating social and religious context has been shown to improve alignment and empathy in model responses [26].
This study extends these research directions by introducing a multi-label annotation schema that explicitly integrates social, cultural, and religious dimensions into Arabic mental health question classification. Unlike prior Arabic NLP work that has focused mainly on medical or linguistic aspects, the proposed framework captures the relational, cultural, and religious reasoning underlying patient expressions. It further contributes a scalable semi-supervised pipeline based on pseudo-labeling and adaptive threshold calibration, bridging socially aware NLP and Arabic healthcare classification.
3. Methodology
In this study, a comprehensive framework is introduced for multi-label classification of Arabic mental health questions, addressing data scarcity and label imbalance. The approach integrates contextually grounded annotation, semi-supervised data expansion, and imbalance-aware modeling. Curated online questions are annotated using a multi-level schema to produce a gold-standard dataset, which is subsequently expanded through pseudo-labeling. An AraBERT-based multi-label classifier is fine-tuned to identify social, cultural, and religious contextual dimensions, while adaptive threshold calibration is applied during inference to convert class probabilities into final label assignments. The resulting pipeline provides an end-to-end workflow from annotation to classification, as shown in Figure 1.
3.1. Data Collection
This study uses data collected from Altibbi.com (https://altibbi.com), a well-established Arabic medical platform that hosts thousands of health-related articles, glossaries, and Q&A discussions. From this platform, 2677 unique question–answer pairs focusing on mental health were collected between 2020 and 2021 [7]. These pairs reflect genuine patient inquiries and corresponding responses provided by licensed medical professionals. For this work, only the patient-authored questions were analyzed. Physician responses were excluded because the objective of this study is to examine how patients articulate mental health concerns through social, cultural, and religious framing, rather than to analyze clinical advice or provider communication.
3.2. Schema Development
The annotation schema was developed to capture contextual factors that shape how mental health concerns are expressed in Arabic discourse beyond purely clinical descriptions. In Arabic mental health contexts, individuals often articulate distress through cultural traditions, social expectations, family roles, community relations, and religious values. Ignoring these dimensions risks overlooking the broader context in which mental health concerns are experienced, communicated, and interpreted.
Social factors were included because many Arabic mental health questions are shaped by interpersonal relationships, family expectations, marital concerns, demographic position, and perceived life satisfaction. Prior work in socially aware NLP argues that linguistic meaning is shaped by social factors such as speakers, audiences, norms, and ideology [27]. This motivated the inclusion of social context as a major dimension of the schema.
Cultural factors were included because prior multilingual and cross-cultural NLP studies show that language technologies, particularly models adapted through transfer learning or fine-tuning, can underrepresent non-Western and culturally marginalized perspectives [22]. Research on cultural alignment further shows that cultural values and social norms should be explicitly represented when modeling language use across societies [23, 26]. This motivated the inclusion of cultural context as a separate dimension of the schema.
Religious factors were modeled as a separate category because religious reasoning, spiritual coping, and faith-based interpretations appeared explicitly in Arabic mental health questions and often provided a distinct explanatory frame for distress. This distinction was also supported by public health work emphasizing that mental health and help-seeking are shaped by social determinants, including family structure, community relations, cultural values, religious context, and other contextual factors [28].
The schema was constructed through iterative exploratory analysis and pilot annotation of the collected dataset. Recurring contextual patterns clustered around three major dimensions: social, cultural, and religious. These dimensions were selected because they consistently captured the main contextual signals observed in the questions while preserving interpretability and annotation reliability.
The sub-categories were similarly derived through iterative refinement of recurring patterns observed during annotation. Within the social dimension, the dominant themes involved relationships, demographic factors, and expressions related to life satisfaction. Within the cultural dimension, recurring themes centered on information, values, and norms shaped by cultural expectations. Within the religious dimension, recurring themes reflected religious reasoning, spiritual coping, and faith-based interpretations of distress. The final schema was therefore designed to balance contextual coverage, interpretability, and annotation reliability while remaining grounded in the linguistic and sociocultural characteristics of Arabic mental health discourse.
3.3. Schema Categories and Definitions
The annotation schema is structured in two stages. First, each question is assessed using a binary decision (Yes/No) to determine whether cultural, social, or religious framing is present. A question is marked Yes when it contains an explicit or implicit reference to cultural norms, social circumstances, interpersonal context, demographic conditions, or religious beliefs and practices. It is marked No when the question is limited to symptoms, diagnosis, treatment, or general medical advice without such contextual framing. If the answer is Yes, the question is further annotated using the categories and sub-categories defined in Table 1. This design allows the schema to capture broader cultural, social, and religious influences while maintaining flexibility for multi-label annotation.
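The two-stage decision described above can be sketched as a small data structure. This is a minimal illustration, not the annotation tooling used in the study: the `SCHEMA` dictionary, the `Annotation` class, and the `all_labels` helper are hypothetical names, the label strings follow the `Category|Sub-category` form used later in the paper, and Religion is assumed here to have no sub-categories.

```python
from dataclasses import dataclass, field

# Category -> sub-categories, using the label names that appear in the paper.
# Assumption: Religion is treated as a single label without sub-categories.
SCHEMA = {
    "Culture": ("Information", "Values", "Norms and Morals"),
    "Social": ("Relationship", "Demographics", "Life Satisfaction"),
    "Religion": (),
}


def all_labels():
    """Flat label list: 'No' plus every 'Category|Sub-category' (or bare category)."""
    labels = ["No"]
    for category, subs in SCHEMA.items():
        if subs:
            labels.extend(f"{category}|{sub}" for sub in subs)
        else:
            labels.append(category)
    return labels


@dataclass
class Annotation:
    """Stage 1 is the binary Yes/No decision; stage 2 holds the contextual labels."""
    has_context: bool
    labels: frozenset = field(default_factory=frozenset)

    def __post_init__(self):
        valid = set(all_labels()) - {"No"}
        if self.has_context:
            # A Yes question must carry at least one label from the schema.
            if not self.labels or not self.labels <= valid:
                raise ValueError("A 'Yes' question needs at least one valid label")
        elif self.labels:
            raise ValueError("A 'No' question cannot carry contextual labels")

    def to_multi_hot(self):
        """Encode the annotation as a 0/1 vector aligned with all_labels()."""
        active = self.labels if self.has_context else {"No"}
        return [1 if label in active else 0 for label in all_labels()]
```

Encoding No as its own position in the vector mirrors the way No is reported as a label in the per-class results; a design that drops the No column and represents it as an all-zero vector would be equally plausible.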
3.4. Annotation Process
The gold-standard dataset was constructed through a controlled human annotation process. Three native Arabic-speaking Saudi annotators participated in the labeling task. The annotation team consisted of two female annotators and one male annotator, including two PhD-level annotators and one master’s-level graduate. All annotators were familiar with Arabic dialectal variation, including Saudi Arabic, as well as the style of Arabic mental health questions.
Each question was independently annotated by two annotators to support inter-annotator agreement assessment and ensure labeling reliability. Disagreements were resolved through discussion and adjudication to produce the final labels.
The annotation process followed a predefined multi-level schema supported by detailed written guidelines specifying the criteria for each label (Supplementary File S1). These guidelines were iteratively refined during the early annotation rounds to improve consistency and reduce ambiguity. Through this process, 500 questions were manually annotated, forming the gold-standard subset used for model training and evaluation.
3.5. Inter-Annotator Agreement
To assess the reliability of the manual annotations, agreement was measured across three labeling dimensions: binary relevance (Yes/No), main category (Culture, Social, Religion), and sub-category assignment (Information, Values, Norms and Morals, Relationship, Demographics, Life Satisfaction). The results are shown in Table 2.
Cohen’s Kappa ($\kappa$) was used to evaluate pairwise agreement, while Krippendorff’s Alpha ($\alpha$) was used as a complementary reliability measure suitable for annotation assessment [29, 30]. Agreement was computed under an exact-match criterion, where each multi-label combination was treated as a unique class.
Strong agreement was observed for binary relevance, indicating reliable identification of contextual presence. Agreement at the main-category level was substantial, reflecting consistent identification of cultural, social, and religious dimensions. Agreement declined for sub-categories, which is expected given the finer granularity of the labels and the fact that multiple contextual interpretations may coexist in the same question. The corresponding $\kappa$ and $\alpha$ values are listed in Table 2.
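The exact-match criterion can be illustrated with a short sketch of Cohen's Kappa in which every distinct label combination is treated as one nominal class. The function name `cohen_kappa_exact_match` and the toy annotations are illustrative assumptions; the study's own agreement scripts are not described in the text, and Krippendorff's Alpha is not covered here.

```python
from collections import Counter


def cohen_kappa_exact_match(ann_a, ann_b):
    """Cohen's kappa where each label *combination* counts as one nominal class."""
    if len(ann_a) != len(ann_b) or not ann_a:
        raise ValueError("Both annotators must label the same non-empty item list")
    # Freeze each label set so identical combinations compare (and hash) equal.
    a = [frozenset(x) for x in ann_a]
    b = [frozenset(x) for x in ann_b]
    n = len(a)
    # Observed agreement: share of items with identical label combinations.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal class frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical class throughout
    return (observed - expected) / (1.0 - expected)


# Toy example with hypothetical annotations for four questions.
annotator_1 = [{"No"}, {"No"}, {"Social|Relationship"}, {"Social|Relationship", "Religion"}]
annotator_2 = [{"No"}, {"Social|Relationship"}, {"Social|Relationship"}, {"Social|Relationship", "Religion"}]
kappa = cohen_kappa_exact_match(annotator_1, annotator_2)
```

Under this criterion a partial overlap, such as one annotator choosing Social|Relationship and the other choosing Social|Relationship together with Religion, counts as a full disagreement, which is one reason agreement tends to drop at the sub-category level.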
3.6. Pseudo-Labeling
Manual annotation of contextual factors in Arabic mental health questions is costly because it requires culturally aware annotators and careful interpretation of social, cultural, and religious cues. Although the 500 manually annotated questions provide a high-quality gold-standard subset, this subset alone is limited for training a robust multi-label classifier, particularly for infrequent contextual categories. Therefore, a semi-supervised pseudo-labeling strategy was adopted to expand supervision to the remaining questions in the dataset.
To expand beyond the gold-standard subset, a model trained on the manually annotated questions was used to generate probability scores for the remaining 2177 unlabeled questions. Pseudo-labels were then assigned using confidence-based thresholds: labels exceeding the threshold were treated as positive, while labels below the threshold were treated as negative. This approach allowed the full dataset to contribute to training while maintaining a controlled decision boundary for positive label assignment.
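The thresholding step can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `assign_pseudo_labels`, the three-label subset, the probability values, and the thresholds of 0.5 are all hypothetical, since the text does not report the confidence thresholds used for pseudo-labeling.

```python
def assign_pseudo_labels(probabilities, thresholds):
    """Convert per-label probabilities into 0/1 pseudo-labels.

    A label is positive only when its probability exceeds its threshold;
    everything else is treated as negative.
    """
    pseudo_labels = []
    for row in probabilities:
        if len(row) != len(thresholds):
            raise ValueError("Each row needs exactly one probability per label")
        pseudo_labels.append([1 if p > t else 0 for p, t in zip(row, thresholds)])
    return pseudo_labels


# Hypothetical label subset, thresholds, and seed-model outputs for three questions.
LABELS = ["No", "Social|Relationship", "Religion"]
THRESHOLDS = [0.5, 0.5, 0.5]
seed_model_probs = [
    [0.91, 0.06, 0.02],
    [0.12, 0.88, 0.55],
    [0.40, 0.45, 0.10],
]
pseudo = assign_pseudo_labels(seed_model_probs, THRESHOLDS)
```

In this toy input the third question receives no positive label at all, because none of its scores exceeds the threshold. How such low-confidence cases are handled (for example, discarded or assigned to No) is a design decision that the sketch leaves open.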
3.7. Model Architecture
The framework uses AraBERT [6] as the backbone model for multi-label classification of Arabic mental health questions. Each input question is tokenized and encoded into contextualized representations. The [CLS] token representation is used as a global sequence representation and passed to a task-specific linear classification layer, as illustrated in Figure 2.
The classification head produces an independent probability score for each label using sigmoid activations, allowing multiple labels to be assigned to the same question. Weighted binary cross-entropy loss is applied independently to each label.
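The per-label loss can be illustrated in plain Python. The sketch assumes a positive-class weighting of the form $-[w_c\, y \log p + (1 - y) \log(1 - p)]$, similar in spirit to the `pos_weight` argument of PyTorch's `BCEWithLogitsLoss`; the exact weighting scheme used in the study is not specified in this section, and the function names and weights below are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def weighted_bce(logits, targets, pos_weights, eps=1e-12):
    """Mean over labels of -[w_c * y * log(p) + (1 - y) * log(1 - p)].

    Each label has its own sigmoid, so several labels can be active at once.
    pos_weights holds one weight per label (assumed form of class weighting).
    """
    if not (len(logits) == len(targets) == len(pos_weights)):
        raise ValueError("logits, targets, and pos_weights must have equal length")
    total = 0.0
    for z, y, w in zip(logits, targets, pos_weights):
        p = min(max(sigmoid(z), eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(w * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(logits)


# One question, two labels: the first is a rare positive, the second a negative.
unweighted = weighted_bce([0.0, 0.0], [1, 0], [1.0, 1.0])
upweighted = weighted_bce([0.0, 0.0], [1, 0], [3.0, 1.0])
```

With the hypothetical weight of 3 on the first label, a missed positive for that label contributes three times as much to the loss, which is the mechanism by which class weighting increases sensitivity to minority labels.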
Validation-Based Threshold Calibration
In multi-label classification, using a single global probability threshold may not be suitable for all labels, particularly under severe class imbalance. Therefore, class-specific thresholds were calibrated on the validation set and applied during inference to convert predicted probabilities into final label assignments [31].
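Let $p_{i,c}$ denote the predicted probability for class $c$ in sample $i$. A label is assigned when the predicted probability exceeds the corresponding class-specific threshold $\tau_c$:
$$\hat{y}_{i,c} = \mathbb{1}\left[p_{i,c} > \tau_c\right]$$
where $\mathbb{1}[\cdot]$ denotes the indicator function.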
The threshold for each label was selected empirically using validation predictions to improve the balance between precision and recall across labels with different frequencies. This calibration allows different labels to use different decision boundaries rather than relying on a single default threshold for all classes. The final validation-calibrated thresholds used during inference are summarized in Table 3.
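One way to realize this calibration is a per-label grid search that keeps the threshold with the best validation F1. The sketch below is an assumption about the selection rule, since the text states only that thresholds were chosen empirically to balance precision and recall; the function names, the grid from 0.05 to 0.95, and the fallback default of 0.5 are hypothetical.

```python
def f1_binary(y_true, y_pred):
    """F1 for one label; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def calibrate_thresholds(val_probs, val_true, grid=None, default=0.5):
    """Pick one threshold per label by maximizing validation F1 over a grid.

    The default threshold is kept unless a grid value is strictly better,
    so labels without validation positives fall back to the default.
    """
    if grid is None:
        grid = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    n_labels = len(val_true[0])
    thresholds = []
    for c in range(n_labels):
        truths = [row[c] for row in val_true]
        probs = [row[c] for row in val_probs]
        best_t = default
        best_f1 = f1_binary(truths, [1 if p > default else 0 for p in probs])
        for t in grid:
            score = f1_binary(truths, [1 if p > t else 0 for p in probs])
            if score > best_f1:
                best_t, best_f1 = t, score
        thresholds.append(best_t)
    return thresholds


def predict(probs, thresholds):
    """Assign a label when its probability exceeds the class-specific threshold."""
    return [[1 if p > t else 0 for p, t in zip(row, thresholds)] for row in probs]


# Toy validation set: label 1 is scored conservatively, so it needs a lower threshold.
val_probs = [[0.9, 0.30], [0.8, 0.35], [0.2, 0.10], [0.1, 0.05]]
val_true = [[1, 1], [1, 1], [0, 0], [0, 0]]
taus = calibrate_thresholds(val_probs, val_true)
```

In the toy data the second label never reaches 0.5 even for true positives, so a global threshold would miss it entirely, whereas the calibrated threshold recovers both positives.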
5. Results
The experimental results evaluate both the modeling performance and the annotation characteristics of the proposed framework. The analysis examines the effects of pseudo-label augmentation, class-imbalance handling, and validation-based threshold calibration, followed by a detailed analysis of contextual label distributions, overlaps, and representative examples from the dataset.
5.1. Impact of Pseudo-Labeled Data on Model Performance
The pseudo-labeled training configuration improved overall performance compared with the gold-only configuration, particularly on metrics influenced by high-support labels. Mean performance across five folds showed that Micro-F1 increased from 0.72 to 0.84, while Macro-F1 improved from 0.19 to 0.22. Subset Accuracy increased from 0.70 to 0.84, and the Jaccard Index improved from 0.72 to 0.84. In addition, Hamming Loss decreased from 0.07 to 0.04, suggesting fewer label-wise prediction errors overall. These results are reported using the held-out gold-standard test sets in Table 5 and Figure 3.
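For readers less familiar with these multi-label metrics, the sketch below shows how they differ on a toy example. It is an illustration only: the function name `multilabel_metrics` is hypothetical, the Jaccard Index is assumed to be averaged per sample, and labels without any true or predicted positives are assigned an F1 of 0, which may differ from the conventions used to produce the reported scores.

```python
def multilabel_metrics(y_true, y_pred):
    """Micro/Macro-F1, Subset Accuracy, sample-averaged Jaccard, Hamming Loss."""
    n_samples, n_labels = len(y_true), len(y_true[0])
    tp, fp, fn = [0] * n_labels, [0] * n_labels, [0] * n_labels
    subset_hits, jaccard_sum, mismatches = 0, 0.0, 0
    for t_row, p_row in zip(y_true, y_pred):
        inter = union = 0
        for c, (t, p) in enumerate(zip(t_row, p_row)):
            tp[c] += t & p
            fp[c] += (1 - t) & p
            fn[c] += t & (1 - p)
            inter += t & p
            union += t | p
            mismatches += t ^ p
        subset_hits += int(list(t_row) == list(p_row))  # every label must match
        jaccard_sum += inter / union if union else 1.0

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    return {
        # Micro-F1 pools counts over labels, so frequent labels dominate.
        "micro_f1": f1(sum(tp), sum(fp), sum(fn)),
        # Macro-F1 averages per-label F1, so rare labels weigh as much as frequent ones.
        "macro_f1": sum(f1(tp[c], fp[c], fn[c]) for c in range(n_labels)) / n_labels,
        "subset_accuracy": subset_hits / n_samples,
        "jaccard": jaccard_sum / n_samples,
        "hamming_loss": mismatches / (n_samples * n_labels),
    }


# Toy case: the rare third label is missed once; the frequent labels are all correct.
y_true = [[1, 0, 0], [0, 1, 0], [0, 1, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [0, 1, 0]]
scores = multilabel_metrics(y_true, y_pred)
```

Even in this small example a single missed rare label lowers Macro-F1 far more than Micro-F1, which is the same pattern that separates the two metrics in the reported results.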
5.2. Per-Class Performance Analysis
Per-class analysis was conducted to provide a clearer interpretation of model behavior across individual contextual labels. As shown in Table 6, pseudo-label augmentation improves performance on high-support labels, particularly No and Social|Relationship. The F1-score for Social|Relationship increases from 0.55 in the gold-only setting to 0.68 after adding pseudo-labeled samples, suggesting improved learning of frequent social-context patterns.
However, performance on low-support categories remains limited. Social|Life Satisfaction, Culture|Information, and Culture|Values each have only one positive instance in the held-out test fold and obtain F1-scores of 0.00 under both configurations. Among the cultural sub-categories, only Culture|Norms and Morals obtains a non-zero score, and it remains low, with F1 improving from 0.18 to 0.31 under pseudo-label augmentation. Religion likewise remains difficult to predict because of its low support, with F1 decreasing from 0.20 to 0.13 after pseudo-label augmentation. These results make the severity of class imbalance explicit and explain why cultural and religious context prediction remains challenging despite the use of pseudo-labeled data: pseudo-label augmentation helps frequent labels but offers limited benefit for rare cultural and religious sub-categories, which should be addressed in future work through targeted annotation and stronger imbalance-aware learning.
5.3. Error Analysis
Error analysis shows that prediction errors mainly occur in three cases. First, some contextual questions are predicted as No, especially when the contextual cue is implicit rather than directly stated. This suggests that the model may miss social, cultural, or religious signals when they are expressed indirectly.
Second, cultural cases may be confused with Social|Relationship because cultural expectations are often expressed through family, marriage, interpersonal obligations, or social pressure. As a result, the model may capture the social surface meaning while missing the deeper cultural framing.
Third, religious cases may be confused with either No or Social|Relationship. This occurs when religious cues are sparse, indirect, or embedded within broader emotional or social distress.
5.4. Ablation Study
An ablation study was conducted to examine the individual and combined effects of class-weighted optimization and pseudo-label augmentation. Four configurations were compared: a standard AraBERT baseline, AraBERT with weighted BCE loss, AraBERT with pseudo-label augmentation, and the complete framework combining weighted BCE with pseudo-label augmentation. All configurations were evaluated using the same held-out gold-standard test sets under five-fold multi-label stratified cross-validation.
The standard AraBERT baseline achieved a Micro-F1 score of 0.78 and a Macro-F1 score of 0.19. Adding weighted BCE improved Macro-F1 to 0.21, indicating slightly better sensitivity to minority contextual categories under severe class imbalance. However, this improvement was accompanied by lower Micro-F1, Subset Accuracy, and Jaccard scores, suggesting reduced overall prediction stability, as reported in Table 7.
Pseudo-label augmentation alone improved the main performance metrics, increasing Micro-F1 from 0.78 to 0.83 and Subset Accuracy from 0.76 to 0.82. This suggests that additional pseudo-labeled contextual samples improve representation learning and prediction consistency, even without imbalance-aware optimization.
The strongest overall performance was achieved by combining weighted BCE with pseudo-label augmentation. This configuration achieved the highest scores across all reported metrics, including a Micro-F1 score of 0.84, a Macro-F1 score of 0.22, a Subset Accuracy of 0.84, and a Jaccard score of 0.84. Overall, the results suggest that pseudo-label augmentation provides the largest performance improvement, while weighted BCE provides a smaller additional gain in minority-label sensitivity when combined with additional contextual data.
5.5. Binary Annotation
The binary annotation task determines the presence or absence of contextual factors within each question. The human annotation phase covers 500 questions, of which 131 are labeled Yes and 369 are labeled No. Pseudo-labeling is subsequently applied to the remaining 2177 unlabeled questions, yielding an additional 421 Yes and 1756 No labels. The combined annotation results in 552 Yes and 2125 No instances across all 2677 questions, as summarized in Table 8. This distribution indicates that approximately 20.6% of questions contain at least one culturally, socially, or religiously grounded dimension. Figure 4 further illustrates the distribution of questions with and without contextual factors.
5.6. Distribution of Cultural, Social, and Religious Annotations
Among the 131 human-annotated positive cases, social aspects were the most prevalent, appearing in 108 questions. Cultural aspects were identified in 19 instances, while religious aspects appeared in 14. To better understand label co-occurrence, the annotations were further analyzed based on mutually exclusive label combinations. The majority of samples (98) were labeled as social only. In contrast, 14 questions were labeled as cultural only and 9 as religious only. A smaller number of instances exhibited overlapping labels, with 5 questions combining social and cultural aspects and another 5 combining social and religious aspects. No samples were annotated with all three categories simultaneously. Overall, these counts sum to 131, confirming that each instance is assigned to exactly one label combination. While the main categories themselves are not mutually exclusive, their combinations form a complete and non-overlapping partition of the dataset.
Among the 421 pseudo-labeled positive cases, social aspects again dominated, appearing in 417 questions. Cultural and religious aspects appeared in 9 and 5 cases, respectively.
The overlap distribution was computed over mutually exclusive label combinations. Specifically, 408 questions were labeled as social only, 2 as cultural only, and 2 as religious only. Additionally, 6 questions combined social and cultural labels, 2 combined social and religious labels, and 1 instance contained all three categories. These counts sum to 421, confirming a complete and non-overlapping partition of samples across label combinations. A detailed breakdown of label co-occurrence is provided in Table 9, presenting the distribution across human annotations (131), pseudo-labeled instances (421), and the combined dataset (552).
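The relationship between the per-category counts and the mutually exclusive combinations can be checked with a few lines of arithmetic. The sketch below uses only the counts stated in this section; the variable and function names are illustrative.

```python
from collections import Counter

# Mutually exclusive label combinations reported for the human-annotated positives.
human = {
    ("Social",): 98,
    ("Culture",): 14,
    ("Religion",): 9,
    ("Social", "Culture"): 5,
    ("Social", "Religion"): 5,
}

# Mutually exclusive label combinations reported for the pseudo-labeled positives.
pseudo = {
    ("Social",): 408,
    ("Culture",): 2,
    ("Religion",): 2,
    ("Social", "Culture"): 6,
    ("Social", "Religion"): 2,
    ("Social", "Culture", "Religion"): 1,
}


def category_totals(combos):
    """Per-category counts: a question counts once for each category it carries."""
    totals = Counter()
    for combo, count in combos.items():
        for category in combo:
            totals[category] += count
    return dict(totals)


human_totals = category_totals(human)    # Social 108, Culture 19, Religion 14
pseudo_totals = category_totals(pseudo)  # Social 417, Culture 9, Religion 5
combined = Counter(human) + Counter(pseudo)
```

Because a question with two categories contributes to two per-category counts, the per-category totals add up to more than the number of questions, while the combination counts sum exactly to 131, 421, and 552.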
5.7. Sub-Category Distribution and Overlap Analysis
At the sub-category level, Social|Relationship is the most frequent label. As illustrated in Figure 5, overlap among social sub-categories is limited. The largest overlap is observed between Social|Relationship and Social|Demographics, while overlaps involving Social|Life Satisfaction are rare. No instance contains all three social sub-categories.
Within the cultural dimension (Figure 6), Culture|Norms and Morals appears most frequently, whereas Culture|Information and Culture|Values occur in fewer instances. The only overlap is between Norms and Morals and Values, with no overlap involving Information.
Cross-category co-occurrence between social and cultural sub-categories is shown in Figure 7. The highest overlap occurs between Social|Relationship and Culture|Norms and Morals. Additional overlaps are observed between social and religious categories, as reflected in the heatmap.
5.8. Representative Examples
A range of sociocultural frames can be observed in how Arabic speakers articulate psychological distress. Religious framing appears in the first example, where the patient’s compulsive repetition of words is interpreted through a spiritual lens and managed via religious coping strategies, such as remembrance and seeking forgiveness (Table 10).
Social dimensions are prominent in multiple entries, particularly those involving relational and emotional difficulties. For instance, the question concerning marriage between two individuals diagnosed with bipolar disorder illustrates the intersection of mental health and familial expectations, where love, stigma, and responsibility intertwine. Similarly, the example expressing chronic sadness despite material stability captures a form of emotional dissatisfaction linked to life satisfaction rather than socioeconomic status.
Cultural labeling is evident in questions centered on self-image and social perception. The example of body dissatisfaction and comparison with others demonstrates how cultural ideals of beauty and moral self-worth interact to shape psychological vulnerability. The case involving withdrawal from escitalopram also reveals culturally specific health behaviors, where pharmacological decisions are influenced by concerns about motherhood and family values. Finally, the example describing fear of judgment and self-consciousness represents the sub-category of Values, emphasizing the moral and social gaze that governs self-presentation in collectivist settings.
Overall, the selected questions illustrate the interplay between religious beliefs, cultural values, and social relationships in Arabic expressions of distress. Such sociocultural grounding challenges purely clinical interpretations and underlines the necessity of culturally aligned NLP models for mental health understanding in Arabic contexts.
6. Discussion
This work introduces ContextMental, a benchmark and annotation framework designed to capture contextual dimensions in Arabic mental health question classification. The results demonstrate that contextual information plays a central role in how psychological distress is expressed in Arabic patient-authored questions, particularly through socially grounded narratives and interpersonal concerns.
The annotation distribution shows that social contextual factors are the most common, especially the Social|Relationship category, across both manually annotated and pseudo-labeled data. This pattern reflects the tendency of users to describe psychological experiences through family dynamics, interpersonal conflict, emotional attachment, and social interaction rather than through isolated clinical symptoms. In contrast, cultural and religious contextual signals appear less frequently. However, these dimensions still provide important complementary information that shapes the interpretation of mental health narratives within Arabic-speaking communities.
The overlap analysis shows that contextual dimensions can co-occur, although such overlaps remain limited compared with single-category annotations. The clearest cross-category overlap appears between social and cultural dimensions, particularly when relationship-related distress is expressed alongside social expectations, family norms, or culturally shaped behavioral constraints. This suggests that contextual signals in Arabic mental health questions are sometimes interconnected rather than strictly separable. Accordingly, context-aware mental health NLP may benefit from modeling dependencies between contextual dimensions instead of treating labels as fully independent categories.
The semi-supervised experiments indicate that pseudo-labeling improves overall predictive performance, mainly for high-support labels such as No and Social|Relationship. The pseudo-labeled data remain concentrated in social-context labels and add comparatively fewer examples for cultural and religious categories. This indicates that the model is more effective at extending frequent contextual patterns than at expanding low-support categories. The consistent improvements in Micro-F1, Jaccard Index, and Subset Accuracy further suggest that pseudo-labeling improves global prediction consistency and generalization for common contextual structures.
However, the results also show important limitations of pseudo-labeling under severe class imbalance. Rare contextual categories, including Culture|Values, Culture|Norms and Morals, and Religion, remain difficult to predict reliably despite the larger training set. The pseudo-labeled data provide only limited additional support for these categories, which restricts the model’s ability to learn stable decision boundaries for minority labels. Consequently, Macro-F1 improves more modestly than Micro-F1, indicating that the main gains are concentrated in dominant categories rather than distributed evenly across all labels. This reflects a known limitation of self-training approaches, where pseudo-labeling may provide greater benefit for frequent patterns while offering less improvement for underrepresented classes.
The annotation reliability analysis supports the consistency of the proposed framework, particularly at the main-category level, where inter-annotator agreement remains substantial. Nevertheless, agreement decreases for fine-grained and infrequent sub-categories, reflecting the inherent ambiguity of contextual interpretation in mental health language. Distinguishing between overlapping cultural, social, and religious contextual signals often requires subjective judgment and nuanced interpretation of implicit linguistic cues. These challenges become more pronounced in short or sparsely contextualized questions, where limited textual evidence constrains annotation certainty.
Several limitations remain. First, the number of examples for minority contextual categories remains small, which restricts reliable evaluation and stable learning for rare labels. Second, the pseudo-labeling strategy depends on model-generated annotations, which may introduce noise and reflect the distributional tendencies of the seed model. Third, the benchmark focuses specifically on Arabic mental health questions and therefore does not necessarily generalize to broader Arabic mental health discourse, conversational dialogue, or long-form clinical narratives. Fourth, the current framework models contextual dimensions independently and does not explicitly capture hierarchical or temporal relationships between contextual factors.
Despite these limitations, the proposed benchmark provides a structured resource for contextual mental health understanding in Arabic NLP. The dataset, annotation framework, and semi-supervised learning setup establish a foundation for future research on context-aware mental health modeling, culturally informed NLP systems, and socially grounded language understanding in low-resource mental health settings.
6.1. Implications
The findings of this study have several methodological, computational, and societal implications for Arabic mental health NLP. Methodologically, the results suggest that a multidimensional annotation schema for Arabic mental health questions can be constructed through an iterative, guideline-driven annotation process by a small research team, without requiring large-scale infrastructure. The schema captures social, cultural, and religious aspects embedded in patient discourse while preserving interpretability and annotation consistency. However, scaling the schema to larger datasets or clinical deployment would require additional annotation resources and expert validation. The observed inter-annotator agreement and overlap analysis suggest that subjective constructs, such as cultural and religious reasoning, can be consistently annotated through carefully defined guidelines and iterative calibration. This framework may support the development of similar sociocultural annotation schemes in other healthcare domains and underrepresented languages.
From a computational perspective, the use of pseudo-labeling demonstrates the practicality of semi-supervised learning in low-resource and sensitive domains. Leveraging validation-calibrated model predictions enables the expansion of labeled data while preserving key distributional patterns observed in human annotations. Such approaches are particularly relevant in settings where data availability is constrained by privacy, ethical considerations, or linguistic diversity. Incorporating contextual categories within pseudo-labeling pipelines may further support the development of models that generalize across different social contexts while maintaining interpretability [35].
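One way to read "validation-calibrated model predictions" is sketched below: per-label decision thresholds are tuned on a held-out validation set, and unlabeled questions are kept as pseudo-labeled examples only when every label probability lies clearly away from its threshold. This is a minimal sketch under stated assumptions, not the procedure used in this study: the threshold grid, the `margin` parameter, the function names, and all probabilities are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Probs = Dict[str, float]


def f1_at(threshold: float, scores: List[float], gold: List[int]) -> float:
    """F1 for one label when predicting 1 wherever score >= threshold."""
    tp = fp = fn = 0
    for s, g in zip(scores, gold):
        p = s >= threshold
        if p and g:
            tp += 1
        elif p:
            fp += 1
        elif g:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def calibrate_thresholds(val_probs: List[Probs], val_gold: List[Dict[str, int]],
                         labels: List[str], grid: List[float]) -> Dict[str, float]:
    """Pick, per label, the grid threshold with the best validation F1.

    Ties go to the first (lowest) threshold in the grid.
    """
    thresholds = {}
    for lab in labels:
        scores = [p[lab] for p in val_probs]
        gold = [g[lab] for g in val_gold]
        thresholds[lab] = max(grid, key=lambda t: f1_at(t, scores, gold))
    return thresholds


def select_pseudo_labels(unlabeled: List[Probs], thresholds: Dict[str, float],
                         margin: float = 0.1) -> List[Tuple[int, Set[str]]]:
    """Keep examples whose probabilities are all at least `margin` from the threshold."""
    selected = []
    for idx, probs in enumerate(unlabeled):
        if all(abs(probs[lab] - t) >= margin for lab, t in thresholds.items()):
            assigned = {lab for lab, t in thresholds.items() if probs[lab] >= t}
            selected.append((idx, assigned))
    return selected


# Invented validation predictions and gold labels (assumption).
labels = ["Social|Relationship", "Religion"]
val_probs = [
    {"Social|Relationship": 0.9, "Religion": 0.2},
    {"Social|Relationship": 0.7, "Religion": 0.4},
    {"Social|Relationship": 0.4, "Religion": 0.35},
    {"Social|Relationship": 0.2, "Religion": 0.1},
]
val_gold = [
    {"Social|Relationship": 1, "Religion": 0},
    {"Social|Relationship": 1, "Religion": 1},
    {"Social|Relationship": 0, "Religion": 1},
    {"Social|Relationship": 0, "Religion": 0},
]
thresholds = calibrate_thresholds(val_probs, val_gold, labels, grid=[0.3, 0.5, 0.7])

# Invented predictions on unlabeled questions (assumption).
unlabeled = [
    {"Social|Relationship": 0.95, "Religion": 0.05},  # confident: kept
    {"Social|Relationship": 0.55, "Religion": 0.02},  # too close to threshold: skipped
    {"Social|Relationship": 0.10, "Religion": 0.80},  # confident: kept
]
selected = select_pseudo_labels(unlabeled, thresholds)
print(thresholds)
print(selected)
```

A wider `margin` admits fewer but cleaner pseudo-labels. Because rare labels tend to receive less confident scores, a filter of this kind would be expected to pass fewer examples for them, which matches the concentration in frequent labels noted in the discussion.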
At the societal level, the results highlight the importance of accounting for contextual factors in the development of NLP systems for mental health. By capturing how individuals express distress through relational, cultural, and religious frames, the dataset provides a structured foundation for more context-aware analysis. These insights may inform the design of systems that better reflect the diversity of user experiences in Arabic-speaking contexts. Future work could explore how such models might be integrated into real-world applications while ensuring transparency and cultural sensitivity.
6.2. Ethical Considerations
Given the sensitive nature of mental health data, ethical considerations are central to this study. The dataset was derived from publicly available question–answer content, and no personally identifiable information is included in the released dataset. The analysis is limited to patient-authored questions and focuses on the thematic categorization of contextual factors rather than diagnosis, risk assessment, or evaluation of individual users. Annotators were instructed to treat the questions respectfully and to label only observable textual cues related to social, cultural, or religious framing.
Several risks remain despite anonymization. Mental health questions may contain sensitive self-disclosures, and contextual labels may reflect vulnerable social circumstances, religious concerns, family relations, or culturally specific experiences. In addition, pseudo-labeling may introduce noise, and model predictions may reproduce biases present in the source platform or annotation process. Therefore, the resulting models should be interpreted as research tools for corpus analysis and should not be used for clinical decision-making, diagnosis, triage, or automated mental health advice without expert oversight.
The schema is intended to support culturally aware analysis, not to stereotype Arabic-speaking users or reduce complex experiences to fixed categories. Any downstream use should include transparency about model limitations, human review, cultural sensitivity, and fairness evaluation across demographic and dialectal groups. Future work should further examine bias, annotation uncertainty, and safeguards for responsible deployment in mental health NLP.
7. Conclusions
This work introduces ContextMental, a socioculturally informed framework for Arabic mental health question classification that integrates cultural, social, and religious dimensions into both annotation and modeling. The proposed schema supports multi-label analysis at two levels of granularity (main categories and sub-categories) and is instantiated in a curated corpus of patient questions. A baseline classification framework based on AraBERT, combined with imbalance-aware optimization, pseudo-labeling, and adaptive threshold calibration, provides preliminary evidence that contextual-factor classification is feasible for high-support labels, while rare cultural and religious categories require further annotated data and stronger imbalance-aware methods.
The results highlight the importance of contextual framing in Arabic mental health discourse. Patient expressions are frequently shaped by interpersonal, cultural, and religious considerations, which are not fully captured by conventional clinical categorizations. Modeling these dimensions may support a more context-aware representation of how distress is articulated in real-world Arabic mental health questions.
From a methodological perspective, this work presents a framework that combines structured sociocultural annotation with semi-supervised learning. This approach can support dataset expansion while preserving the main distributional patterns observed in human annotations. However, the classification results should be interpreted cautiously, as improvements are concentrated mainly in high-support labels and do not yet indicate robust prediction of rare sociocultural categories.
Future work could explore the integration of additional signals, such as speech and sentiment, as well as more advanced adaptation and imbalance-aware techniques to improve performance on minority labels. Additional human-centered evaluation is also needed to assess the practical implications of applying such models in real-world settings. Overall, this study contributes a structured resource and baseline modeling approach for context-aware Arabic mental health NLP, supporting future research on systems that better account for selected social, cultural, and religious dimensions of user experiences.