Emotional Analysis in a Morphologically Rich Language: Enhancing Machine Learning with Psychological Feature Lexicons

Keinan, Ron; Margalit, Efraim; Bouhnik, Dan

doi:10.3390/electronics14153067


by Ron Keinan ¹,²,*, Efraim Margalit ¹ and Dan Bouhnik ¹

¹ Department of Computer Science, Lev Academic Center, Jerusalem College of Technology, Jerusalem 9116001, Israel

² Bar Ilan University, Ramat Gan 5290002, Israel

* Author to whom correspondence should be addressed.

Electronics 2025, 14(15), 3067; https://doi.org/10.3390/electronics14153067

Submission received: 23 June 2025 / Revised: 29 July 2025 / Accepted: 30 July 2025 / Published: 31 July 2025

(This article belongs to the Special Issue Techniques and Applications of Multimodal Data Fusion)


Abstract

This paper explores emotional analysis in Hebrew texts, focusing on improving machine learning techniques for depression detection by integrating psychological feature lexicons. Hebrew’s complex morphology makes emotional analysis challenging, and this study seeks to address that by combining traditional machine learning methods with sentiment lexicons. The dataset consists of over 350,000 posts from 25,000 users on the health-focused social network “Camoni” from 2010 to 2021. Various machine learning models—SVM, Random Forest, Logistic Regression, and Multi-Layer Perceptron—were used, alongside ensemble techniques like Bagging, Boosting, and Stacking. TF-IDF was applied for feature selection, with word and character n-grams, and pre-processing steps like punctuation removal, stop word elimination, and lemmatization were performed to handle Hebrew’s linguistic complexity. The models were enriched with sentiment lexicons curated by professional psychologists. The study demonstrates that integrating sentiment lexicons significantly improves classification accuracy. Specific lexicons—such as those for negative and positive emojis, hostile words, anxiety words, and no-trust words—were particularly effective in enhancing model performance. Our best model classified depression with an accuracy of 84.1%. These findings offer insights into depression detection, suggesting that practitioners in mental health and social work can improve their machine learning models for detecting depression in online discourse by incorporating emotion-based lexicons. The societal impact of this work lies in its potential to improve the detection of depression in online Hebrew discourse, offering more accurate and efficient methods for mental health interventions in online communities.

Keywords:

emotional analysis; Hebrew texts; machine learning; psychological feature lexicons; depression detection

1. Introduction

1.1. Research Background and Significance

The proliferation of online communication platforms and social networks has generated vast amounts of user-generated textual data. This digital discourse represents an invaluable resource for understanding public sentiment, opinions, and, critically, mental well-being. Consequently, emotional analysis—the computational task of detecting and interpreting human emotions in text—has emerged as a vital area of research with far-reaching applications, from enhanced customer service and market intelligence to public health monitoring [1,2]. A particularly significant application lies in the early detection of mental health conditions, such as depression. By analyzing language patterns in online communities, it is possible to identify at-risk individuals non-intrusively, paving the way for timely interventions and support [3].

Health-related social networks, where users discuss their conditions and share experiences, are especially rich sources of data for this purpose. Our study leverages a unique dataset from “Camoni,” a prominent Israeli health-focused social network [4,5]. However, extracting meaningful emotional signals from this data presents substantial computational and linguistic challenges, particularly for a language as complex as Hebrew.

1.2. The Main Problem: Challenges in Hebrew Natural Language Processing

While machine learning has shown promise in sentiment analysis, its application to Hebrew texts for nuanced tasks like depression detection is fraught with difficulties. The primary obstacle stems from the unique linguistic characteristics of the Hebrew language. Hebrew is a morphologically rich language, characterized by a non-concatenative templatic morphology and a rich derivation system [6]. This means that a single word can be composed of multiple subword units (root, pattern, prefixes, suffixes), leading to significant ambiguity and a vast vocabulary space. For example, a single written form can correspond to multiple distinct meanings depending on context and vocalization.

Furthermore, compared to high-resource languages like English, Hebrew is considered a low-resource language in the field of Natural Language Processing (NLP). This translates to a scarcity of large-scale annotated datasets, pre-trained models, and sophisticated computational tools specifically tailored for emotional analysis [7,8,9,10]. The combination of high morphological complexity and low resource availability severely hampers the performance of conventional machine learning models, which often fail to capture the subtle emotional cues essential for accurately identifying psychological states like depression from online discourse.

1.3. Limitations of Existing Methods

Standard approaches to text classification in this domain typically involve a pipeline of pre-processing steps—such as punctuation removal, stop-word elimination, and lemmatization [11]—followed by feature extraction using methods like Term Frequency-Inverse Document Frequency (TF-IDF). These features are then fed into various classifiers, including Support Vector Machine (SVM), Random Forest, and ensemble methods like Bagging, Boosting, and Stacking [12,13].

While these methods are foundational, they suffer from a critical limitation: they primarily operate on a lexical level, capturing word frequencies and distributions rather than deeper semantic or psychological meaning. Lemmatization can mitigate some morphological challenges, but it does not resolve the underlying issue of semantic nuance. Consequently, these models often struggle to differentiate between texts that are topically similar but emotionally distinct. They can miss sarcasm, context-dependent emotional expressions, and the subtle linguistic markers indicative of a persistent depressive state. Simply put, the feature space generated by TF-IDF alone is often insufficient for the fine-grained task of clinical state detection.

1.4. Our Contribution: Research Goal, Method, and Innovation

The primary goal of this research is to overcome the aforementioned limitations by enhancing machine learning models for depression detection in Hebrew texts. We propose a novel approach that integrates domain-specific psychological knowledge directly into the feature engineering process. Our central hypothesis is that by enriching the feature set with features derived from expert-curated psychological lexicons, we can provide the model with the explicit emotional and psychological signals that conventional methods miss.

To achieve this, we utilize a large, unique dataset scraped from the “Camoni” social network, comprising over 350,000 posts from 25,000 users spanning a decade [4,5]. The core innovation of our methodology lies in the extraction and integration of features from a suite of specialized lexicons, including categories such as negative emojis, positive emojis, hostile words, anxiety words, and words indicating a lack of trust [1,2,14]. These lexicons, developed and validated in psychological research, are designed to capture subtle emotional cues often overlooked by data-driven methods.

By combining these lexicon-based features with traditional TF-IDF features and employing advanced machine learning models with rigorous hyperparameter tuning, we aim to significantly improve the accuracy of depression classification. The contribution of this paper is twofold. It presents a robust methodology for improving emotional analysis in a morphologically rich, low-resource language. Moreover, it provides empirical evidence that the integration of psychological lexicons offers a powerful pathway to developing more accurate and clinically relevant computational tools for mental health monitoring. This work not only advances our understanding of depression detection in online communities but also lays the groundwork for future research in computational psychiatry.

2. Related Work

Text classification in Hebrew follows the same foundational principles as in any other language, involving the categorization of textual documents based on their content. This task is essential for automating the analysis and organization of Hebrew text data, offering valuable applications in diverse fields. However, text classification in Hebrew presents unique challenges, primarily due to the scarcity of language resources, including lexicons and tools. The limited availability of these resources makes training models more challenging, potentially leading to lower accuracy in classification. The insufficiency of linguistic tools hampers the ability to effectively capture the intricacies of the Hebrew text, hindering the performance of text classification models [7].

Adding to the complexity is Hebrew’s intricate morphology. The language boasts a rich system of derivation, encompassing multiple forms for verbs, nouns, and other grammatical elements. This morphological richness introduces additional layers of complexity during the training of models, demanding a fine understanding of the language’s structure to achieve accurate classification [6]. Moreover, Hebrew exhibits distinctive grammar and a relatively free word order. This linguistic characteristic adds an extra layer of complexity to text classification. The flexible word order and unique grammatical structures can pose challenges in extracting relevant features and patterns from the data. The non-standard syntax can make it more challenging for models to discern the contextual meaning of words, potentially impacting the precision and recall of classification outcomes.

In recent years, there has been significant research into identifying mental health disorders through machine learning, particularly focusing on detecting depression in social media texts. De Choudhury [15] highlights the potential of online social networking as a tool for public health by identifying signs of depression through behavioral attributes extracted from social media posts. Similar research [16] emphasizes Twitter’s role in detecting psychological well-being, including depression and suicidality, using both human coders and automated classifiers. Another notable work [17] highlights the increasing use of social network sites for self-expression, while the authors of [18] achieved high accuracy in distinguishing between depressed and non-depressed users on Reddit using various machine learning methods.

A feature lexicon is a list of words and phrases, each associated with the sentiment or topic it reflects. For example, words like “love,” “wonderful,” and “delightful” might have strong positive sentiment scores, while words like “hate,” “disgusting,” and “terrible” might have strong negative sentiment scores [19]. The coverage and quality of a feature lexicon can significantly contribute to the success of various tasks like opinion mining and sentiment analysis [1,2,14]. Since sentiment analysis involves determining the emotional tone or attitude expressed in a piece of text, and feature lexicons provide a predefined set of words and their corresponding sentiments, lexicons can be used as features for text classification models. Feature lexicons can be created through various methods, including manual annotation, crowdsourcing, and NLP techniques. These lexicons provide a quick and efficient way to extract relevant features from text data [10].

The issue of negation is a crucial aspect of sentiment analysis that needs to be accounted for when using feature lexicons. Negation refers to the use of words that change the meaning of a sentence to its opposite, such as “not,” “no,” and “never.” Negation can completely flip the sentiment of a sentence and affect the accuracy of sentiment analysis. Some feature lexicons include special notation for negation, and advanced natural language processing techniques may also be used to better account for negation and other linguistic nuances in a text.

Most recent depression-detection work has focused on English or other languages, often using lexicon integration with neural models. For example, Milintsevich [20] augmented a Transformer model with multiple sentiment and emotion lexicons and obtained state-of-the-art depression-symptom estimation performance. Ogunleye [21] combined Sentence-BERT embeddings with ensemble learning and sentiment indicators, improving F1 scores to roughly 69–76% on benchmark social media datasets. Similarly, Chiong [22] used 90 handcrafted lexicon-based and content features to achieve >96% accuracy (and >98% with Gradient Boosting) in English Twitter depression data. Our work differs by focusing on Hebrew, a morphologically rich and resource-scarce language. We leverage Hebrew-specific psychological lexicons (e.g., those introduced by [3]), which were curated by experts for emotions, anxiety, trust, hostility, etc., and integrate these lexicons with TF–IDF and n-gram features in classical ML models (SVM, RF, MLP). In line with the above studies, we find that adding these emotion-based lexicon features significantly boosts classification accuracy in our Hebrew depression dataset.

However, our approach also has limitations. Unlike many recent methods that employ deep Transformer architectures, we apply lexicon features in traditional ML pipelines and do not use pretrained deep models. In particular, recent Hebrew pretrained models like HeRo [23] provide state-of-the-art results on sentiment and classification tasks, but they were not incorporated here.

3. Methodology

The research deals with computerized and large-scale quantitative content analysis. The data processing workflow began with the collection of a dataset comprising Hebrew-language posts, each binary classified to indicate the presence or absence of depressive content. Subsequently, a machine learning-based classification process was implemented, incorporating enhancements such as feature selection, preprocessing, and parameter tuning. The primary phase of this process involved enriching the models with external features derived from specialized sentiment lexicons, aimed at improving classification accuracy.

3.1. The Data

The dataset was created by scraping the Camoni website, Israel’s largest platform for digital health communities on various topics. Camoni was established to empower patients and their families to take an active part in managing their disease, and to offer social interaction that may improve their lives, help participants make informed choices, and help them cope better with their condition.

We downloaded all information about posts, comments, and user data. The labeling of posts as ‘depressive’ or ‘non-depressive’ was performed using an automated, source-based methodology. The labeled dataset used for classification comprised 3000 posts drawn from this collection. For the ‘depressive’ class, 1500 posts were sourced from the ‘depression and anxiety’ community. These were automatically labeled as positive (indicative of depression) based on the heuristic that individuals posting in such a community are highly likely to be discussing depression. Notably, these were posts, not comments, implying a solicitation for help. To validate this automated labeling approach, we conducted a manual review on a sample of approximately 100 posts, which confirmed that the community-based labels were consistent with human assessment.

To establish a robust comparison, the remaining 1500 ‘non-depressive’ posts were automatically labeled based on their origin from communities divergent from the mental health domain, such as ‘obesity’ or ‘trauma from terrorism and war’. This deliberate selection aimed to introduce a thematic distance from the content typically associated with depression, enriching our dataset with a diverse array of linguistic expressions.

Following the meticulous labeling process and dataset curation, our workflow transitioned into the crucial phase of data partitioning. This involved a strategic division of the dataset into two distinct groups: the training set and the test set. The rationale behind this partitioning strategy was to facilitate a robust model training process while ensuring an unbiased and independent evaluation of model accuracy.

The training set, constituting a substantial portion of the dataset, was used to train the various depression detection models. During training, the models learned the patterns and features in the labeled posts that are associated with depression-related content in Hebrew digital health communities.

Meanwhile, the test group played a pivotal role in assessing the generalizability and performance of the trained models. This subset of the dataset, distinct from the training set, provided a simulated environment for testing the models’ accuracy without the risk of overfitting the training data. By subjecting the models to the test set, we gauged their ability to discern depression-related content beyond the confines of the specific training examples, thus ensuring their efficacy in real-world scenarios.

3.2. Classification Methodology

The full flow of classification is described in Figure 1:

In the first stage, we aimed to adapt an optimal machine learning model and feature structure to achieve the highest possible accuracy. We compared many classifiers, including Decision Tree, Random Forest, Support Vector Machine (SVM), Logistic Regression, k-Nearest Neighbors (KNN), Naive Bayes, and ensemble learning methods like Bagging, Boosting, and Stacking.

Subsequently, we evaluated different text preprocessing methods to enhance model accuracy. We tested each method individually and in combination, focusing on removing HTML and URLs to eliminate extraneous web-related information, removing punctuation to streamline the text, removing stopwords to focus on content-bearing words, reducing repeated characters to address writing variations, removing non-Hebrew characters to ensure linguistic relevance, removing numbers to refine the text, removing redundant spaces to improve consistency, and lemmatizing the generic Hebrew words. Each preprocessing method was meticulously tested to determine its impact on classification accuracy, ultimately guiding us to an optimal preprocessing strategy.
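The preprocessing steps listed above can be sketched as a set of independently switchable functions, which makes it easy to test each step alone or in combination. This is a minimal illustration using only the Python standard library, not the authors' implementation: the step names, the regular expressions, and the choice to collapse runs of three or more repeated characters to a single character are all assumptions. Stop-word removal and lemmatization are omitted because they depend on Hebrew language resources that the text does not specify.

```python
import re

# Each step is a small text -> text function (names and patterns are illustrative).
STEPS = {
    # Strip HTML tags and URLs.
    "html_urls": lambda t: re.sub(r"<[^>]+>|https?://\S+|www\.\S+", " ", t),
    # Replace anything that is neither a word character nor whitespace.
    "punctuation": lambda t: re.sub(r"[^\w\s]", " ", t),
    # Collapse runs of 3+ identical characters to one (assumed threshold).
    "repeated_chars": lambda t: re.sub(r"(.)\1{2,}", r"\1", t),
    # Keep only Hebrew letters (U+05D0..U+05EA) and whitespace.
    "non_hebrew": lambda t: re.sub(r"[^\u05D0-\u05EA\s]", " ", t),
    # Remove digits.
    "numbers": lambda t: re.sub(r"\d+", " ", t),
    # Collapse redundant whitespace.
    "spaces": lambda t: re.sub(r"\s+", " ", t).strip(),
}


def preprocess(text, steps):
    """Apply the named preprocessing steps to text, in the given order."""
    for name in steps:
        text = STEPS[name](text)
    return text


cleaned = preprocess("sooo   good!!!", ["punctuation", "repeated_chars", "spaces"])
print(cleaned)
```

Note that the `non_hebrew` step as written would also strip emojis, so it would have to be applied after any emoji-based features are extracted.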

We then transformed the texts into a vector space using the Term Frequency-Inverse Document Frequency (TF-IDF) approach, allowing the models to process and analyze the texts effectively. Our approach focused on identifying features characteristic of depression-related texts. To achieve this, we employed TF-IDF to weight words, creating a dictionary of all words and assigning a TF-IDF score to each word based on its occurrences across the dataset. TF reflects a term’s frequency in a post, while IDF is inversely related to the number of posts containing that term, so that words appearing in many posts receive a lower weight. This adjustment balances word frequency and importance, enhancing feature precision and relevance for our depression detection analysis. A critical consideration was the careful selection of significant features within depression classification [24,25].
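As an illustration of the weighting just described, the following sketch computes TF-IDF scores for tokenized posts. It uses the textbook form (relative term frequency multiplied by the logarithm of the inverse document ratio); the paper does not state which TF-IDF variant, smoothing, or normalization was used, so this formula and the function name `tfidf` are assumptions.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of posts that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # TF (relative frequency in this post) times IDF (log inverse ratio).
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


posts = [["a", "b"], ["a", "c"]]
scores = tfidf(posts)
```

In this toy example the term that occurs in every post receives a weight of zero, while terms unique to one post receive a positive weight.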

To enhance our methodology, we incorporated char-grams as features, which represent sequences of characters of varying lengths. This approach is particularly relevant for Hebrew due to its complex morphology and grammar [11,26]. We employed classifiers using char-grams ranging from 3 to 10 characters, generating eight types of character-based feature tables and three types of word-based feature tables (unigrams, bigrams, trigrams).
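The character and word n-gram features can be sketched as follows. The helper names are hypothetical; the ranges (character n-grams of length 3 to 10, word n-grams of length 1 to 3) come from the text, which yields eight character-based and three word-based feature tables.

```python
def char_ngrams(text, n):
    """All contiguous character sequences of length n (empty if text is shorter)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """All contiguous sequences of n whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def feature_tables(text):
    """One n-gram list per configuration: c3..c10 for characters, w1..w3 for words."""
    tables = {f"c{n}": char_ngrams(text, n) for n in range(3, 11)}
    tables.update({f"w{n}": word_ngrams(text, n) for n in range(1, 4)})
    return tables


tables = feature_tables("a short example post")
```

Character n-grams can capture shared subword material (such as prefixes and suffixes attached to a stem) without a morphological analyzer, which is one reason they are often tried for morphologically rich languages.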

Parameter tuning was conducted by selecting a range of values for each parameter and performing a grid search or random search to identify the optimal settings. Cross-validation was employed to evaluate model performance during the tuning process, and the best parameters were selected based on validation results. Once the parameters were optimized, the model was trained on the training dataset. The trained models were then evaluated on the testing dataset using an accuracy metric. The full experimental setup is described in detail in Appendix D.
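The grid search mentioned above amounts to scoring every combination of candidate parameter values and keeping the best one. The sketch below shows that loop in plain Python. The parameter grid and the scoring function are placeholders invented for illustration; in practice the score would be the cross-validated accuracy of a model trained with the given parameters.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid; return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)  # e.g. mean cross-validation accuracy
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid and toy scoring function, for illustration only.
grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
toy_score = lambda p: -abs(p["C"] - 1) + (0.5 if p["kernel"] == "linear" else 0.0)
best_params, best_score = grid_search(grid, toy_score)
```

A random search follows the same pattern but samples a fixed number of combinations instead of enumerating all of them.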

3.3. Feature Lexicons Enrichment

In the next stage of our analysis, we added a new dimension to our models by incorporating features from a rich set of feature lexicons. This augmentation is designed to investigate the potential impact of sentiment-related information on the classification accuracy of our models.

Most of the selected lexicons, numbering 31 in total, have been carefully crafted by experts in the field of psychology [3], constituting an extensive collection of Hebrew lexicons specifically tailored to capture various psychological aspects. We also added another lexicon that contains a large pool of words with negative sentiment [27], as well as 3 more lexicons that are based solely on emojis, without words; each of these focuses on a specific sentiment: positive, negative, or neutral emojis [28]. Emojis are universal and do not depend on the language, and therefore, research on them can have extensive significance beyond the Hebrew language.

These lexicons prove to be invaluable assets in the realm of psychology applications, providing insights into diverse domains such as emotional states, well-being, conversation dynamics (including relationship quality), and the identification of topics spanning family and work discussions, among others. Each lexicon within this curated set is dedicated to a specific topic, allowing us to systematically explore the influence of words within distinct emotional domains on the performance of our depression classification models.

Our approach involves the step-by-step integration of each model with one feature lexicon at a time from the comprehensive list. This process enables us to discern the contribution of each lexicon to the model’s classification capabilities, shedding light on the interplay between sentiment-related features and the accurate identification of depressive content in the Hebrew language. We selected the best 140 models up to this point, and for each of them, we added each of the lexicons separately to test its effect.
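One simple way to turn a lexicon into a model feature is sketched below. The paper does not specify how the lexicon features were computed, so the choice made here (the fraction of a post's tokens that appear in the lexicon, appended to the existing feature vector) is an assumption, as are the function names and the placeholder lexicon entries.

```python
def lexicon_feature(tokens, lexicon):
    """Fraction of tokens that appear in the lexicon (0.0 for an empty post)."""
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token in lexicon) / len(tokens)


def enrich(base_features, tokens, lexicons, names):
    """Append one lexicon feature per selected lexicon to a base feature vector."""
    return base_features + [lexicon_feature(tokens, lexicons[name]) for name in names]


# Placeholder entries; the real lexicons contain Hebrew words or emojis.
lexicons = {"ANXIETY": {"worry", "fear"}, "HOSTILE": {"hate"}}
tokens = ["i", "fear", "the", "worry"]
vector = enrich([0.12, 0.40], tokens, lexicons, ["ANXIETY"])
```

Testing one lexicon at a time then means calling `enrich` with a single name per run and comparing the resulting accuracy with that of the unenriched model.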

The list of lexicons, their description, and size are described in Table 1:

3.4. Combined Feature Lexicons

Building upon the preceding exploration, our focus shifted to distilling the most impactful feature lexicons that exhibited a notable influence on improving depression classification. To identify the feature lexicons most impactful for depression classification, we conducted a systematic contribution analysis. Our methodology involved establishing a baseline model using only TF-IDF features. Subsequently, we iteratively augmented this baseline by adding the features from each lexicon, one at a time. A classifier was then trained and evaluated on each of these new, combined feature sets. The performance of each lexicon was measured by the classification accuracy achieved by the model. This quantitative analysis allowed us to rank the lexicons based on their direct contribution to improving model performance. The six lexicons that yielded the highest individual increases in accuracy were selected for our final model. These lexicons, namely [‘EMOJI_NEG’, ‘NOTTRUST’, ‘HOSTILE’, ‘ANXIETY’, ‘EMOJI_POS’, ‘EMOJI_NEU’], were thus empirically identified as containing the key emotional dimensions most relevant to this classification task.

In the subsequent phase, we integrated these six chosen lexicons into each model. The integration was carried out systematically, exploring every possible combination of the selected lexicons. The objective was to assess whether incorporating these lexicons, individually and in combination, would yield a more substantial improvement in the models’ classification performance.
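The exhaustive exploration of lexicon combinations can be sketched with `itertools.combinations`. This assumes that the order in which lexicons are added does not matter, so each experiment corresponds to a non-empty subset of the six lexicons; under that assumption there are 2^6 − 1 = 63 configurations per model.

```python
from itertools import combinations

SELECTED = ["EMOJI_NEG", "NOTTRUST", "HOSTILE", "ANXIETY", "EMOJI_POS", "EMOJI_NEU"]


def lexicon_subsets(names):
    """Yield every non-empty, order-independent subset of the lexicon names."""
    for size in range(1, len(names) + 1):
        yield from combinations(names, size)


subsets = list(lexicon_subsets(SELECTED))
print(len(subsets))  # 63
```

Each subset would then be passed to the feature-enrichment step and evaluated with the same cross-validation procedure as the baseline.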

To ensure the robustness of our findings and avoid reliance on a single data partition, we employed a stratified 10-fold cross-validation strategy for all experiments. The dataset was partitioned into ten folds, maintaining the same percentage of ‘depressive’ and ‘non-depressive’ posts in each fold. The models were trained on nine folds and evaluated on the held-out fold. This process was repeated ten times, with each fold serving as the test set once. The final accuracy scores reported are the average accuracy across these ten folds. This approach provides a more reliable estimate of model performance and ensures our results are generalizable.
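The stratified 10-fold procedure can be sketched in plain Python as follows. The function names and the fixed random seed are illustrative, and `train_and_score` stands in for whatever routine trains a model on the training indices and returns its accuracy on the held-out fold.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Return k lists of indices, each with roughly the same class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal each class round-robin across folds
    return folds


def cross_validate(labels, train_and_score, k=10):
    """Average train_and_score(train_idx, test_idx) over k stratified folds."""
    folds = stratified_folds(labels, k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / k


labels = [1] * 1500 + [0] * 1500  # balanced classes, as in the labeled dataset
folds = stratified_folds(labels)
```

With 1500 posts per class, each fold contains 150 depressive and 150 non-depressive posts, so every model is trained on 2700 posts and evaluated on 300.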

4. Results

To present our findings, we first detail the rigorous experimental process undertaken. In our analysis, we introduced features from a rich set of 34 lexicons, comprising 31 psychological lexicons curated by experts and 3 emoji-based lexicons (positive, negative, and neutral). Our approach involved systematically integrating each model with one feature lexicon at a time to discern its contribution to classification accuracy. This extensive evaluation was conducted on the 140 best-performing models identified in prior stages of our research.

Table 2 showcases the top 50 models from this extensive evaluation, highlighting the specific lexicon that yielded the most significant improvement for each model. The “Feature” column clarifies the number (3 or 4) and type of n-gram feature used (c for character n-grams, w for word n-grams). The comprehensive list of all 50 results can be found in Appendix A.

To illustrate the difference between the lexicons, Table 3 presents a detailed breakdown of the best model’s performance (the model appearing in the first row of Table 2) when integrated with each of the 34 lexicons separately.

As Table 3 shows, some lexicons reduce the model’s accuracy, while others improve it.

Figure 2 displays the lexicons that contributed the most to accuracy improvement. In other words, we counted how many times each lexicon appeared in Table 2 as the most improving lexicon for a model. Fewer than half of the lexicons appear here, which means that most of the lexicons were never chosen as the most improving.

Next, we conducted a systematic contribution analysis to identify the most impactful feature lexicons. This involved establishing a baseline model using only TF-IDF features and then iteratively adding features from each lexicon. Based on their direct contribution to improving model performance, we empirically selected the six lexicons that yielded the highest individual increases in accuracy: ‘EMOJI_NEG’, ‘NOTTRUST’, ‘HOSTILE’, ‘ANXIETY’, ‘EMOJI_POS’, and ‘EMOJI_NEU’. Subsequently, we integrated these six chosen lexicons into each model, exploring every possible combination of them. Due to the computational resources required for such an extensive evaluation, this process was performed only for the first 50 models. This methodical integration allowed us to assess whether the inclusion of specific lexicons, either separately or in combination, would lead to a more substantial improvement in the models’ ability to detect depressive content. To ensure the robustness and generalizability of our findings, all experiments employed a stratified 10-fold cross-validation strategy, where the final accuracy scores reported are the average across these ten folds. The full results of these combined lexicon experiments are detailed in Appendix B, with Table 4 showcasing the best 10 results.

Appendix C illustrates the differences between the lexicon combinations for the best model, that is, the one that appears in the first row of Table 4. It displays the result of that model with each combination of lexicons separately. Some combinations harm the performance of the model, while others improve it.

The best result achieved was 84.1% accuracy. To assess whether these improvements were meaningful and not due to random variation, we conducted a paired t-test. This statistical test compares two sets of results (in our case, the model’s accuracy scores across the 10 cross-validation folds before and after lexicon enrichment) while accounting for the fact that the same data splits are used in both cases. The test produced a t-statistic of −5.50 and a p-value of 0.0015. The negative t-value indicates that the accuracy scores after enrichment were, on average, higher than those of the model without lexicons. The small p-value (p < 0.01) indicates that this difference is statistically significant: if lexicon enrichment had no real effect, a difference at least this large would be expected in fewer than 0.2% of cases. These results support the conclusion that lexicon enrichment has a real and beneficial impact on model performance.
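The paired t-test described above reduces to a one-sample test on the per-fold differences. The sketch below computes the t-statistic and degrees of freedom with the standard library only. The fold accuracies are invented for illustration and are not the paper's values; the differences are taken as "before minus after", so an improvement yields a negative statistic, matching the sign convention in the text.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(before, after):
    """Return (t, degrees of freedom) for paired samples, using before - after."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Hypothetical per-fold accuracies, for illustration only.
before = [0.80, 0.81, 0.79, 0.82]
after = [0.83, 0.83, 0.82, 0.84]
t_stat, dof = paired_t_statistic(before, after)
```

Converting the statistic into a p-value requires the cumulative distribution function of Student's t-distribution with `dof` degrees of freedom, which is available in statistical libraries such as SciPy (`scipy.stats.ttest_rel` performs the whole test).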

Figure 3 shows the lexicon groups that contributed the most to accuracy improvement. In other words, we counted how many times each lexicon group yielded the largest improvement for a model in the combined-lexicon experiments (Appendix B). Fewer than half of the groups appear here, which means that most were never chosen as the most improving.

5. Discussion and Conclusions

5.1. The Role of Feature Lexicons in Depression Detection

The incorporation of feature lexicons, as explored in this study, has been shown to significantly impact the accuracy of algorithms in detecting depression. Among the various feature lexicons utilized, those constructed from positive, negative, and neutral emojis demonstrated substantial success. This success highlights the prevalent role of emojis in online communication and social media platforms, where individuals often use them to express a range of emotions and sentiments, including those associated with depression. Our findings align with the conclusions of previous studies [29,30], which also identified a direct relationship between text content and the accompanying emojis. This is not limited to Hebrew because emojis are universal and are used in all languages.

Beyond emojis, certain emotion-based lexicons were particularly effective in highlighting sentiments characteristic of depression. Notable among these were emotions such as NOTTRUST, ANXIETY, NOTPROUD, PROUD, and HOSTILE. These lexicons identified distinctive markers of depression, such as feelings of anxiety, mistrust, lack of pride, and hostility, which are often expressed in posts within digital health communities. The inclusion of these emotion-based lexicons significantly enhanced the algorithm’s ability to accurately identify depressive content.

5.2. Understanding Negative Feature Interactions

However, it is also crucial to address the finding that some combinations of lexicons led to a decrease in model performance, as detailed in Appendix C. This suggests that simply adding more features does not guarantee better results. One potential reason for this is negative-feature interaction. For instance, combining lexicons with overlapping emotional themes, such as ‘ANXIETY’ and ‘HOSTILE,’ might introduce noise or redundancy, confusing the model rather than refining its predictive power. As shown in the results for the best-performing model, combining ‘HOSTILE’ and ‘NOTTRUST’ reduced the accuracy to 0.811, which is lower than the accuracy achieved when using the ‘HOSTILE’ lexicon alone (0.822). As demonstrated in Table 3, adding certain lexicons like ‘FATIGUE’ or ‘POS’ to the best SVM model decreased its accuracy from the baseline, while emoji-based lexicons provided a substantial boost to 0.839. Figure 2 further illustrates this, showing that a small subset of the 34 lexicons—primarily those related to emojis and strong sentiment states like ‘NOTPROUD’ and ‘ANXIETY’—were responsible for nearly all performance gains. This phenomenon can occur when the signals from two lexicons are not complementary; one lexicon might capture a general sentiment that conflicts with the more nuanced context provided by another, thereby diminishing the model’s ability to discern the correct classification. This highlights the importance of a careful, selective approach to feature engineering, where the interaction between different semantic features is considered to avoid degrading the model’s performance.

5.3. Contributions to Depression Detection

The study contributes to the growing body of knowledge on sentiment analysis and depression detection by underscoring the importance of diverse linguistic features in developing robust and nuanced classification models. The findings suggest that the careful selection and integration of feature lexicons, particularly those capturing nuanced emotional expressions through emojis and specific emotion-based terms, can substantially improve the performance of algorithms in identifying depression. This advancement in methodology not only enhances the accuracy of depression detection but also provides valuable insights into the emotional and psychological states expressed in digital health communities.

5.4. Limitations and Practical Implications

While our best model achieved 84.1% accuracy, an analysis of the misclassified posts reveals several challenges. The models were most often confused by posts containing sarcasm, where negative words are used to express a positive or neutral statement, and vice versa. Furthermore, very short posts lacking distinctive emotional cues were difficult to classify correctly. For example, a post simply stating “I need to talk to someone” could be flagged as non-depressive by the algorithm if it lacks specific keywords from the enriched lexicons, despite its implicit urgency. These findings suggest that future work could benefit from more advanced techniques capable of understanding nuanced context and pragmatics, which are limitations of the current TF-IDF and classical ML approach.

From a practical standpoint, an accuracy of 84.1% is a promising result for an automated screening tool, but it is not yet sufficient for standalone clinical diagnosis. The primary utility of such a model would be to assist human moderators in social health networks by flagging users who may be at risk, thereby prioritizing outreach. The 15.9% error rate means that clinical oversight remains essential.

6. Summary and Future Research

6.1. Summary of the Research

This study focused on improving the accuracy of depression detection in Hebrew social media posts from digital health communities. Our core approach involved enriching traditional text analysis with sentiment lexicons. We explored various machine learning (ML) algorithms, including Logistic Regression (LR), Random Forest (RF), Support Vector Classifier (SVC), Multilayer Perceptron (MLP), and ensemble learning methods, alongside fundamental preprocessing techniques.

A key innovation was the comprehensive integration of sentiment lexicons, specifically emoji-based and emotion-based lexicons, into the feature extraction process. These lexicons provided nuanced emotional and psychological insights crucial for identifying depression-related content. The inclusion of lexicons capturing emotions such as NOTTRUST, ANXIETY, PROUD, and HOSTILE significantly enhanced our models’ performance. Through careful selection and application, we achieved a more refined and accurate classification of posts, culminating in an optimal model configuration that yielded an accuracy of 84.1%.

6.2. Limitations of the Research

Despite the promising results, this study has several limitations. Firstly, the research was conducted on specific Hebrew datasets from digital health communities. The models’ performance might vary when applied to more diverse or general Hebrew text sources, limiting their immediate generalizability. Secondly, while sentiment lexicons proved beneficial, the scope of the lexicons explored was finite. There may be other clinical or psychological concept lexicons that could further enhance detection accuracy. Lastly, the study primarily focused on traditional machine learning algorithms. While providing a strong baseline, these models may not fully capture the complex semantic and contextual nuances that advanced deep learning architectures are capable of discerning.

6.3. Future Work

Looking ahead, future research should delve deeper into the enrichment process using sentiment lexicons. Exploring additional clinical concept lexicons and incorporating more comprehensive sets of emotional and psychological indicators could further refine depression detection.

An important next step will be to compare the performance of our enhanced machine learning models against advanced deep learning architectures, such as BERT, RoBERTa, and other Transformer-based models. Such a comparison would clarify the trade-offs between traditional ML pipelines and more complex deep learning approaches for this specific task. Additionally, expanding the scope to include the integration of our sentiment lexicons into these deep learning frameworks could offer new perspectives on how sentiment analysis can be leveraged for mental health diagnostics.

Moreover, it is essential to test these refined models on diverse Hebrew datasets to assess their generalizability and robustness across different contexts and data sources. These future steps are crucial for advancing depression detection methodologies and enhancing their practical application in real-world settings.

Author Contributions

Conceptualization, E.M.; Methodology, E.M.; Software, R.K.; Resources, R.K.; Data curation, R.K.; Writing—original draft, R.K.; Writing—review & editing, E.M. and D.B.; Supervision, E.M. and D.B.; Project administration, D.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Feature Lexicons Enrichment Results

Model	Feature	Max Features	Best Feature Lexicon	ACC
SVM	3-c	6000	EMOJI_NEG	0.839
Stacking	3-c	6000	EMOJI_NEG	0.838
SVM	3-c	6000	EMOJI_NEG	0.838
SVM	4-c	9000	EMOJI_NEG	0.838
Stacking	3-c	5000	EMOJI_POS	0.837
Stacking	3-c	12,000	EMOJI_POS	0.837
SVM	4-c	9000	EMOJI_NEG	0.837
Voting	3-c	13,000	EMOJI_POS	0.837
SVM	3-c	6000	EMOJI_NEG	0.836
BG SVM	4-c	9000	EMOJI_NEG	0.836
BG SVM	4-c	9000	EMOJI_NEG	0.836
Stacking	5-c	15,000	EMOJI_POS	0.835
SVM	3-c	6000	EMOJI_NEG	0.835
Stacking	4-c	20,000	ANXIETY	0.835
Stacking	4-c	20,000	ANXIETY	0.835
SVM	3-c	5000	EMOJI_NEG	0.834
SVM	3-c	5000	EMOJI_NEG	0.834
SVM	5-c	9000	EMOJI_NEG	0.833
Stacking	5-c	11,000	EMOJI_NEU	0.833
Stacking	5-c	11,000	NOTVIGOR	0.833
Stacking	5-c	11,000	NOTPROUD	0.832
SVM	5-c	9000	EMOJI_NEG	0.832
SVM	5-c	9000	EMOJI_NEG	0.832
SVM	3-c	3000	EMOJI_NEG	0.831
SVM	4-c	9000	EMOJI_NEG	0.831
ERF	3-c	5000	POS	0.830
ERF	4-c	9000	EMOJI_POS	0.830
Voting	5-c	13,000	EMOJI_NEG	0.830
Voting	5-c	9000	EMOJI_POS	0.830
Stacking	5-c	16,000	NOTPROUD	0.830
ERF	3-c	20,000	NOTCALM	0.830
Stacking	5-c	11,000	EMOJI_POS	0.829
Stacking	5-c	11,000	EMOJI_NEU	0.829
Stacking	5-c	11,000	EMOJI_NEU	0.829
Voting	3-c	13,000	EMOJI_NEU	0.829
SVM	5-c	9000	EMOJI_NEG	0.829
ERF	4-c	12,000	PROUD	0.829
MLP	5-c	12,000	CALM	0.829
Stacking	5-c	15,000	DEPRESSIVE	0.829
SVM	5-c	9000	EMOJI_NEG	0.828
Stacking	5-c	11,000	NOTTRUST	0.828
ERF	4-c	19,000	NOTANTICIPATION	0.828
SVM	5-c	9000	EMOJI_NEG	0.828
ERF	5-c	14,000	PROUD	0.828
GB	3-c	8000	NOTTRUST	0.828
ERF	3-c	15,000	NEG2	0.828
Stacking	5-c	11,000	HOSTILE	0.828
BG SVM	5-c	11,000	HOSTILE	0.828
SVM	5-c	11,000	EMOJI_NEG	0.828
Stacking	5-c	11,000	NOTTRUST	0.828

Appendix B. Combined Feature Lexicons Results

Model	Feature	Best Lexicon	Combined Lexicon	Acc
Stacking	3-c	EMOJI_NEG	(‘EMOJI_NEG’, ‘EMOJI_NEU’, ‘EMOJI_POS’)	0.841
SVM	3-c	EMOJI_NEG	(‘EMOJI_NEU’,)	0.839
SVM	3-c	EMOJI_NEG	(‘EMOJI_POS’,)	0.838
Voting	3-c	EMOJI_POS	(‘EMOJI_NEG’, ‘EMOJI_NEU’)	0.838
Stacking	3-c	EMOJI_POS	(‘EMOJI_NEU’, ‘EMOJI_NEG’, ‘EMOJI_POS’)	0.838
SVM	4-c	EMOJI_NEG	(‘EMOJI_POS’,)	0.838
Stacking	4-c	ANXIETY	(‘ANXIETY’, ‘EMOJI_NEU’)	0.838
SVM	4-c	EMOJI_NEG	(‘EMOJI_NEU’, ‘EMOJI_NEG’)	0.837
Stacking	5-c	EMOJI_POS	(‘EMOJI_NEU’, ‘EMOJI_POS’)	0.837
Stacking	4-c	ANXIETY	(‘ANXIETY’, ‘EMOJI_NEG’, ‘EMOJI_POS’)	0.836
SVM	3-c	EMOJI_NEG	(‘EMOJI_NEG’,)	0.836
Stacking	3-c	EMOJI_POS	(‘EMOJI_NEU’, ‘EMOJI_POS’, ‘EMOJI_NEG’)	0.836
SVM	3-c	EMOJI_NEG	(‘EMOJI_NEG’,)	0.835
SVM	3-c	EMOJI_NEG	(‘EMOJI_NEU’,)	0.834
SVM	3-c	EMOJI_NEG	(‘EMOJI_NEU’,)	0.834
Stacking	5-c	NOTVIGOR	(‘NOTVIGOR’, ‘EMOJI_POS’)	0.834
SVM	5-c	EMOJI_NEG	(‘EMOJI_POS’,)	0.833
Stacking	5-c	EMOJI_NEU	(‘EMOJI_POS’,)	0.833
Stacking	5-c	NOTPROUD	(‘NOTPROUD’, ‘NOTTRUST’, ‘EMOJI_NEU’, ‘EMOJI_NEG’)	0.833
ERF	4-c	EMOJI_POS	(‘NOTTRUST’,)	0.832
MLP	4-c	EMOJI_NEG	(‘NERVOUS’, ‘ANXIETY’)	0.832
Stacking	5-c	DEPRESSIVE	(‘DEPRESSIVE’, ‘EMOJI_POS’, ‘EMOJI_NEU’)	0.832
SVM	5-c	EMOJI_NEG	(‘EMOJI_POS’,)	0.832
SVM	5-c	EMOJI_NEG	(‘EMOJI_NEU’,)	0.832
SVM	4-c	EMOJI_NEG	(‘EMOJI_NEU’,)	0.832
Stacking	5-c	NOTTRUST	(‘EMOJI_NEU’, ‘EMOJI_POS’, ‘NOTTRUST’)	0.831
SVM	3-c	EMOJI_NEG	(‘EMOJI_NEG’,)	0.831
Stacking	5-c	EMOJI_NEU	(‘NOTTRUST’, ‘EMOJI_NEG’, ‘HOSTILE’)	0.831
ERF	5-c	PROUD	(‘ANXIETY’, ‘EMOJI_NEU’)	0.831
Stacking	5-c	NOTTRUST	(‘EMOJI_NEU’, ‘NOTTRUST’)	0.831
Voting	5-c	EMOJI_POS	(‘EMOJI_NEG’, ‘EMOJI_POS’)	0.831
Stacking	5-c	NOTPROUD	(‘EMOJI_NEG’, ‘EMOJI_POS’)	0.831
ERF	4-c	PROUD	(‘EMOJI_POS’, ‘NOTTRUST’, ‘ANXIETY’)	0.831
Voting	5-c	EMOJI_NEG	(‘EMOJI_NEU’, ‘EMOJI_NEG’)	0.830
ERF	3-c	NOTCALM	(‘NOTTRUST’, ‘EMOJI_NEG’, ‘EMOJI_NEU’, ‘NOTCALM’)	0.830
Stacking	5-c	EMOJI_POS	(‘NOTTRUST’, ‘EMOJI_NEG’, ‘EMOJI_POS’, ‘EMOJI_NEU’)	0.830
MLP	5-c	CALM	(‘ANXIETY’, ‘EMOJI_NEG’, ‘EMOJI_POS’, ‘HOSTILE’)	0.830
ERF	3-c	POS	(‘NERVOUS’, ‘EMOJI_NEU’)	0.830
Stacking	5-c	EMOJI_NEU	(‘EMOJI_NEU’,)	0.829
Voting	3-c	EMOJI_NEU	(‘EMOJI_NEG’,)	0.829
SVM	5-c	EMOJI_NEG	(‘EMOJI_NEU’,)	0.829
Stacking	5-c	HOSTILE	(‘EMOJI_NEU’, ‘NOTTRUST’, ‘EMOJI_POS’, ‘EMOJI_NEG’)	0.829
SVM	5-c	EMOJI_NEG	(‘EMOJI_POS’,)	0.828
ERF	3-c	NEG2	(‘ANXIETY’, ‘NOTTRUST’, ‘EMOJI_NEG’)	0.828
SVM	5-c	EMOJI_NEG	(‘EMOJI_NEG’,)	0.828
ERF	4-c	NOTANTICIPATION	(‘ANXIETY’, ‘HOSTILE’, ‘EMOJI_NEU’, ‘NOTTRUST’)	0.828
SVM	5-c	EMOJI_NEG	(‘EMOJI_NEG’,)	0.828
Voting	5-c	EMOJI_NEG	(‘EMOJI_NEG’, ‘EMOJI_POS’)	0.827
GB	3-c	NOTTRUST	(‘EMOJI_NEG’, ‘EMOJI_POS’, ‘ANXIETY’)	0.825

Appendix C. Combined Lexicon Results for First Model

Lexicon Combination	Acc
(‘EMOJI_NEU’,)	0.839
(‘EMOJI_POS’,)	0.839
(‘ANXIETY’,)	0.818
(‘HOSTILE’,)	0.822
(‘NOTTRUST’,)	0.815
(‘EMOJI_NEG’,)	0.839
(‘EMOJI_POS’, ‘EMOJI_NEU’)	0.839
(‘ANXIETY’, ‘EMOJI_NEU’)	0.818
(‘HOSTILE’, ‘EMOJI_NEU’)	0.822
(‘NOTTRUST’, ‘EMOJI_NEU’)	0.815
(‘EMOJI_NEG’, ‘EMOJI_NEU’)	0.839
(‘ANXIETY’, ‘EMOJI_POS’)	0.818
(‘HOSTILE’, ‘EMOJI_POS’)	0.822
(‘NOTTRUST’, ‘EMOJI_POS’)	0.815
(‘EMOJI_NEG’, ‘EMOJI_POS’)	0.839
(‘ANXIETY’, ‘HOSTILE’)	0.818
(‘ANXIETY’, ‘NOTTRUST’)	0.813
(‘ANXIETY’, ‘EMOJI_NEG’)	0.818
(‘HOSTILE’, ‘NOTTRUST’)	0.811
(‘HOSTILE’, ‘EMOJI_NEG’)	0.822
(‘NOTTRUST’, ‘EMOJI_NEG’)	0.815
(‘ANXIETY’, ‘EMOJI_POS’, ‘EMOJI_NEU’)	0.818
(‘HOSTILE’, ‘EMOJI_POS’, ‘EMOJI_NEU’)	0.822
(‘NOTTRUST’, ‘EMOJI_POS’, ‘EMOJI_NEU’)	0.815
(‘EMOJI_NEG’, ‘EMOJI_POS’, ‘EMOJI_NEU’)	0.839
(‘ANXIETY’, ‘HOSTILE’, ‘EMOJI_NEU’)	0.818
(‘ANXIETY’, ‘NOTTRUST’, ‘EMOJI_NEU’)	0.813
(‘ANXIETY’, ‘EMOJI_NEG’, ‘EMOJI_NEU’)	0.818
(‘HOSTILE’, ‘NOTTRUST’, ‘EMOJI_NEU’)	0.811
(‘HOSTILE’, ‘EMOJI_NEG’, ‘EMOJI_NEU’)	0.822
(‘EMOJI_NEG’, ‘NOTTRUST’, ‘EMOJI_NEU’)	0.815
(‘ANXIETY’, ‘HOSTILE’, ‘EMOJI_POS’)	0.818
(‘ANXIETY’, ‘NOTTRUST’, ‘EMOJI_POS’)	0.813
(‘ANXIETY’, ‘EMOJI_NEG’, ‘EMOJI_POS’)	0.818
(‘HOSTILE’, ‘NOTTRUST’, ‘EMOJI_POS’)	0.811
(‘HOSTILE’, ‘EMOJI_NEG’, ‘EMOJI_POS’)	0.822
(‘EMOJI_NEG’, ‘NOTTRUST’, ‘EMOJI_POS’)	0.815
(‘ANXIETY’, ‘HOSTILE’, ‘NOTTRUST’)	0.814
(‘ANXIETY’, ‘HOSTILE’, ‘EMOJI_NEG’)	0.818
(‘ANXIETY’, ‘NOTTRUST’, ‘EMOJI_NEG’)	0.813
(‘HOSTILE’, ‘NOTTRUST’, ‘EMOJI_NEG’)	0.811
(‘ANXIETY’, ‘HOSTILE’, ‘EMOJI_POS’, ‘EMOJI_NEU’)	0.818
(‘ANXIETY’, ‘NOTTRUST’, ‘EMOJI_POS’, ‘EMOJI_NEU’)	0.813
(‘ANXIETY’, ‘EMOJI_NEG’, ‘EMOJI_POS’, ‘EMOJI_NEU’)	0.818
(‘HOSTILE’, ‘NOTTRUST’, ‘EMOJI_POS’, ‘EMOJI_NEU’)	0.811
(‘HOSTILE’, ‘EMOJI_NEG’, ‘EMOJI_POS’, ‘EMOJI_NEU’)	0.822
(‘EMOJI_NEG’, ‘NOTTRUST’, ‘EMOJI_POS’, ‘EMOJI_NEU’)	0.815
(‘ANXIETY’, ‘HOSTILE’, ‘NOTTRUST’, ‘EMOJI_NEU’)	0.814
(‘ANXIETY’, ‘HOSTILE’, ‘EMOJI_NEG’, ‘EMOJI_NEU’)	0.818
(‘EMOJI_NEG’, ‘ANXIETY’, ‘NOTTRUST’, ‘EMOJI_NEU’)	0.813
(‘EMOJI_NEG’, ‘HOSTILE’, ‘NOTTRUST’, ‘EMOJI_NEU’)	0.811
(‘ANXIETY’, ‘HOSTILE’, ‘NOTTRUST’, ‘EMOJI_POS’)	0.814
(‘ANXIETY’, ‘HOSTILE’, ‘EMOJI_NEG’, ‘EMOJI_POS’)	0.818
(‘EMOJI_NEG’, ‘ANXIETY’, ‘NOTTRUST’, ‘EMOJI_POS’)	0.813
(‘EMOJI_NEG’, ‘HOSTILE’, ‘NOTTRUST’, ‘EMOJI_POS’)	0.811
(‘ANXIETY’, ‘HOSTILE’, ‘NOTTRUST’, ‘EMOJI_NEG’)	0.814
(‘EMOJI_POS’, ‘ANXIETY’, ‘HOSTILE’, ‘NOTTRUST’, ‘EMOJI_NEU’)	0.814
(‘EMOJI_NEU’, ‘EMOJI_POS’, ‘ANXIETY’, ‘HOSTILE’, ‘EMOJI_NEG’)	0.818
(‘EMOJI_NEU’, ‘EMOJI_POS’, ‘ANXIETY’, ‘NOTTRUST’, ‘EMOJI_NEG’)	0.813
(‘EMOJI_NEU’, ‘EMOJI_POS’, ‘HOSTILE’, ‘NOTTRUST’, ‘EMOJI_NEG’)	0.811
(‘EMOJI_NEU’, ‘ANXIETY’, ‘HOSTILE’, ‘NOTTRUST’, ‘EMOJI_NEG’)	0.813
(‘EMOJI_POS’, ‘ANXIETY’, ‘HOSTILE’, ‘NOTTRUST’, ‘EMOJI_NEG’)	0.813
(‘EMOJI_NEG’, ‘EMOJI_POS’, ‘ANXIETY’, ‘HOSTILE’, ‘NOTTRUST’, ‘EMOJI_NEU’)	0.813

Appendix D. Experimental Setup

This section provides a detailed account of the experimental framework, including the machine learning models evaluated, the text preprocessing pipeline, feature extraction methods, and the hyperparameter tuning ranges explored. This is intended to ensure the full reproducibility of our research.

Cross-Validation and Evaluation: To ensure a robust and generalizable evaluation of our models, we employed a stratified 10-fold cross-validation strategy for all experiments. This method partitions the dataset into ten subsets, or ‘folds’, ensuring that each fold maintains the original distribution of ‘depressive’ and ‘non-depressive’ posts. The models were trained on nine folds and tested on the remaining fold, with this process repeated ten times so that each fold served as the test set once. The final accuracy reported for each model is the average across these ten folds. Because every post is used for testing exactly once, this approach reduces the risk that a result depends on one favorable split and provides a more reliable estimate of model performance on unseen data than a single train-test split. For full reproducibility, the random seed for all shuffling and model initialization was fixed at 42.
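The evaluation protocol described above can be sketched with scikit-learn as follows. This is an illustrative sketch under stated assumptions, not the study's code: the data are synthetic stand-ins generated with `make_classification`, and a linear `SVC` is used only as an example classifier. The 10 folds, stratification, averaging, and seed 42 follow the text.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

SEED = 42  # seed stated in the text

# Synthetic, imbalanced stand-in data (assumption); the Hebrew posts are not reproduced here.
X, y = make_classification(
    n_samples=500, n_features=20, weights=[0.7, 0.3], random_state=SEED
)

# Stratified 10-fold split: each fold keeps the overall class distribution.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=SEED)

# Example classifier (assumption); any scikit-learn estimator could be substituted.
clf = SVC(kernel="linear", random_state=SEED)

# One accuracy per fold; the reported figure is their mean.
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
mean_accuracy = scores.mean()

# Number of positive-class samples in each test fold, to inspect the stratification.
positives_per_fold = [int(y[test_idx].sum()) for _, test_idx in cv.split(X, y)]

print(f"folds: {len(scores)}, mean accuracy: {mean_accuracy:.3f}")
print("positives per fold:", positives_per_fold)
```

Printing the per-fold positive counts is a quick way to confirm that stratification keeps the class balance nearly identical across folds.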

Comprehensive Model Evaluation: In our initial phase, we conducted a broad survey of 16 different machine learning classifiers to identify the most effective model for depression detection in Hebrew text.

Base Classifiers:

k-Nearest Neighbors (KNN).
Logistic Regression (LR).
Multinomial Naive Bayes (MNB).
Support Vector Machine (SVM).
Decision Tree (DT).
A simple Multilayer Perceptron (MLP).

Ensemble Methods: To enhance performance and robustness, a wide range of ensemble techniques was also evaluated:

Bagging: BaggingClassifier with SVM, LR, and DT as base estimators.
Random Forest: ExtraTreesClassifier.
Boosting: GradientBoostingClassifier, XGBoost, and AdaBoost (with DT and LR as base estimators).
Hybrid Ensembles:
A VotingClassifier combining predictions from LR, SVC, DT, and MNB.
A StackingClassifier using LR, SVC, DT, and MNB as base learners and a meta-classifier to combine their outputs.

After initial evaluation, underperforming models (KNN, Decision Tree, AdaBoost) were excluded from later stages to focus computational resources on the most promising candidates.
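The two hybrid ensembles listed above might be assembled in scikit-learn roughly as shown below. This is a sketch under assumptions rather than the authors' configuration: the text does not name the meta-classifier, the voting mode, or any hyperparameters, so Logistic Regression as `final_estimator`, hard voting, and the default settings are placeholders. The data are synthetic and rescaled to be non-negative, since Multinomial Naive Bayes requires non-negative inputs (as TF-IDF features are).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

SEED = 42


def base_learners():
    """Fresh copies of the four base learners named in the text."""
    return [
        ("lr", LogisticRegression(max_iter=1000, random_state=SEED)),
        ("svc", SVC(kernel="linear", random_state=SEED)),
        ("dt", DecisionTreeClassifier(random_state=SEED)),
        ("mnb", MultinomialNB()),
    ]


# Majority vote over the base learners' predicted labels (voting mode is an assumption).
voting = VotingClassifier(estimators=base_learners(), voting="hard")

# Base learners feed a meta-classifier; Logistic Regression is an assumed choice.
stacking = StackingClassifier(
    estimators=base_learners(),
    final_estimator=LogisticRegression(max_iter=1000, random_state=SEED),
)

# Synthetic stand-in data, rescaled to [0, 1] so MultinomialNB accepts it.
X, y = make_classification(n_samples=300, n_features=20, random_state=SEED)
X = MinMaxScaler().fit_transform(X)

voting.fit(X, y)
stacking.fit(X, y)

voting_pred = voting.predict(X)
stacking_pred = stacking.predict(X)
print(voting_pred[:10], stacking_pred[:10])
```

Voting combines the base predictions with a fixed rule, whereas stacking learns how to weight them from out-of-fold predictions, which is the design difference between the two hybrid ensembles.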

Preprocessing Pipeline: We implemented and tested a multi-step preprocessing pipeline to clean and normalize the raw text data. An optimal combination of the following techniques was determined through iterative testing:

Removal of HTML tags and URLs.
Removal of punctuation.
Removal of Hebrew stopwords.
Reduction of repeated characters (e.g., “!!!!” to “!”).
Removal of non-Hebrew characters.
Removal of numbers.
Normalization of whitespace.
Lemmatization: Hebrew words were reduced to their root form using the YAP (Yet Another Parser) library.
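Most of the cleaning steps listed above can be expressed as small regular-expression functions. The sketch below is illustrative only and rests on several assumptions: the step order follows the list, the stopword set is a tiny hypothetical sample rather than the study's list, runs of three or more identical characters are collapsed to one, and lemmatization is omitted because it depends on the external YAP parser.

```python
import re

# Tiny hypothetical sample of Hebrew stopwords; the study's actual list is not given.
HEBREW_STOPWORDS = frozenset({"של", "על", "את", "גם"})


def strip_html_and_urls(text):
    text = re.sub(r"<[^>]+>", " ", text)                 # HTML tags
    return re.sub(r"(?:https?://|www\.)\S+", " ", text)  # URLs


def strip_punctuation(text):
    # Drop every character that is neither a word character nor whitespace.
    return re.sub(r"[^\w\s]", " ", text)


def remove_stopwords(text):
    return " ".join(tok for tok in text.split() if tok not in HEBREW_STOPWORDS)


def squeeze_repeats(text):
    # Collapse runs of 3+ identical characters to one (threshold is an assumption).
    return re.sub(r"(.)\1{2,}", r"\1", text)


def keep_hebrew_only(text):
    # Keep Hebrew letters (U+05D0 to U+05EA) and whitespace.
    return re.sub(r"[^\u05D0-\u05EA\s]", " ", text)


def strip_numbers(text):
    return re.sub(r"\d+", " ", text)


def normalize_whitespace(text):
    return re.sub(r"\s+", " ", text).strip()


PIPELINE = (
    strip_html_and_urls,
    strip_punctuation,
    remove_stopwords,
    squeeze_repeats,
    keep_hebrew_only,
    strip_numbers,
    normalize_whitespace,
)


def preprocess(text):
    """Apply the cleaning steps in the order they are listed."""
    for step in PIPELINE:
        text = step(text)
    return text


sample = "<p>הכאב של הלב!!!! כןןןן https://example.com 123</p>"
cleaned = preprocess(sample)
print(cleaned)
```

Keeping each step as a separate function makes it straightforward to switch individual steps on or off when searching for the best-performing combination.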

Feature Extraction and Hyperparameter Tuning: Text was vectorized using the Term Frequency-Inverse Document Frequency (TF-IDF) method. An extensive hyperparameter search was performed to optimize feature representation.

The TfidfVectorizer was tuned across the following parameter ranges:

Parameter	Tuning Range/Values	Description
min_df	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]	Tested integer values from 1 to 10 to filter out very rare terms.
max_df	[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]	Tested 10 levels to filter out very common terms.
ngram_range	Word and Character n-grams	Explored various word (1–3) and character (3–10) n-gram ranges.
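The search space in the table can be enumerated and a single configuration instantiated as in the sketch below. It is a hedged illustration, not the authors' tuning code: it assumes that the "3-c" feature label in the appendices denotes character 3-grams, takes the cap of 6000 features from the first row of Appendix A, and uses three short placeholder documents instead of the study's data.

```python
from itertools import product

from sklearn.feature_extraction.text import TfidfVectorizer

# Ranges from the table: min_df as absolute counts, max_df as document proportions.
min_df_values = list(range(1, 11))
max_df_values = [round(0.1 * k, 1) for k in range(1, 11)]

# Every (min_df, max_df) pair, before n-gram settings are varied.
df_grid = list(product(min_df_values, max_df_values))
n_configs = len(df_grid)

# One illustrative point: character 3-grams ("3-c" is assumed to mean this),
# capped at 6000 features as in the first row of Appendix A.
vectorizer = TfidfVectorizer(
    analyzer="char",
    ngram_range=(3, 3),
    max_features=6000,
    min_df=1,
    max_df=1.0,
)

# Placeholder documents (assumption), not taken from the study's dataset.
docs = ["שלום עולם", "שלום לכולם", "עולם קטן"]
X = vectorizer.fit_transform(docs)

print(n_configs, X.shape)
```

Character n-grams are extracted from the raw string, so they capture prefixes and suffixes attached to Hebrew words without requiring a tokenizer.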


Figure 1. Machine Learning Classification flow chart.

Figure 2. 140 Final Results Distribution of Best Feature Lexicon.

Figure 3. Top 50 Models, Distribution of Combined Feature Lexicons.

Table 1. Description of the Feature Lexicons.

Name	Description	Number
ANGER	Words related to expressing anger or irritation	230
ANXIETY	Words associated with feelings of anxiety and nervousness	241
ASHAMED	Words reflecting a sense of shame or embarrassment	175
CALM	Words indicative of a calm and composed emotional state	160
CONFUSION	Words representing a state of confusion or bewilderment	175
DEPRESSIVE	Words associated with feelings of depression	162
DISGUST	Words expressing a sense of strong dislike or revulsion	191
EMOJI_NEG	Emojis conveying negative emotions	297
EMOJI_NEU	Emojis conveying neutral emotions	226
EMOJI_POS	Emojis conveying positive emotions	511
FATIGUE	Words related to feelings of tiredness or exhaustion	212
GUILTY	Words indicating a sense of guilt or remorse	183
HOSTILE	Words reflecting a hostile or aggressive attitude	179
JOY	Words associated with feelings of joy and happiness	207
NEG	Negative sentiment words	1626
NEG2	Additional negative sentiment words	115
NERVOUS	Words expressing nervousness or apprehension	214
NOTAMUSED	Words conveying a lack of amusement or boredom	111
NOTANTICIPATION	Words indicating a lack of anticipation	106
NOTCALM	Words suggesting a lack of calmness or tranquility	187
NOTCONTENTMENT	Words indicating a lack of contentment	201
NOTINTERESTED	Words conveying a lack of interest or enthusiasm	151
NOTJOY	Words indicating a lack of joy	365
NOTNERVOUS	Words reflecting a lack of nervousness	158
NOTPROUD	Words indicating a lack of pride	117
NOTTRUST	Words suggesting a lack of trust	141
NOTVIGOR	Words indicating a lack of vigor or energy	172
PARALINGUISTIC	Words related to paralinguistic features, such as intonation	150
POS	Positive-sentiment words	906
POS2	Additional positive-sentiment words	82
PROUD	Words expressing a sense of pride	153
SAD	Words associated with feelings of sadness	203
SURPRISE	Words reflecting a sense of surprise	138
TRUST	Words associated with feelings of trust and confidence	156

Table 2. Feature Lexicons Enrichment Best Results.

Model	Feature	Max Features	Best Feature Lexicon	ACC
SVM	3-c	6000	EMOJI_NEG	0.839
Stacking	3-c	6000	EMOJI_NEG	0.838
SVM	3-c	6000	EMOJI_NEG	0.838
SVM	4-c	9000	EMOJI_NEG	0.838
Stacking	3-c	5000	EMOJI_POS	0.837
Stacking	3-c	12,000	EMOJI_POS	0.837

Table 3. Accuracy of Lexicons on A Single Model.

Lexicon	Acc
ANGER	0.820
ANXIETY	0.818
ASHAMED	0.817
CALM	0.822
CONFUSION	0.810
DEPRESSIVE	0.822
DISGUST	0.819
EMOJI_NEG	0.839
EMOJI_NEU	0.839
EMOJI_POS	0.839
FATIGUE	0.803
GUILTY	0.818
HOSTILE	0.822
JOY	0.811
NEG	0.817
NEG2	0.808
NERVOUS	0.812
NOTAMUSED	0.818
NOTANTICIPATION	0.809
NOTCALM	0.813
NOTCONTENTMENT	0.816
NOTINTERESTED	0.813
NOTJOY	0.817
NOTNERVOUS	0.820
NOTPROUD	0.829
NOTTRUST	0.815
NOTVIGOR	0.823
PARALINGUISTIC	0.828
POS	0.802
POS2	0.813
PROUD	0.820
SAD	0.820
SURPRISE	0.823
TRUST	0.808

Table 4. Combined Feature Lexicons Best Results.

Model	Feature	Best Lexicon	Combined Lexicon	Acc
Stacking	3-c	EMOJI_NEG	(‘EMOJI_NEG’, ‘EMOJI_NEU’, ‘EMOJI_POS’)	0.841
SVM	3-c	EMOJI_NEG	(‘EMOJI_NEU’,)	0.839
SVM	3-c	EMOJI_NEG	(‘EMOJI_POS’,)	0.838
Voting	3-c	EMOJI_POS	(‘EMOJI_NEG’, ‘EMOJI_NEU’)	0.838
Stacking	3-c	EMOJI_POS	(‘EMOJI_NEU’, ‘EMOJI_NEG’, ‘EMOJI_POS’)	0.838
SVM	4-c	EMOJI_NEG	(‘EMOJI_POS’,)	0.838
Stacking	4-c	ANXIETY	(‘ANXIETY’, ‘EMOJI_NEU’)	0.838
SVM	4-c	EMOJI_NEG	(‘EMOJI_NEU’, ‘EMOJI_NEG’)	0.837
Stacking	5-c	EMOJI_POS	(‘EMOJI_NEU’, ‘EMOJI_POS’)	0.837
Stacking	4-c	ANXIETY	(‘ANXIETY’, ‘EMOJI_NEG’, ‘EMOJI_POS’)	0.836

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
