1. Introduction
The relationship between language and society has long been a central concern in the social sciences [1]. Language does not merely reflect social structures but actively participates in their construction and reproduction [2], shaping how individuals present themselves and are interpreted by others. Linguistic expression operates simultaneously at multiple levels, encompassing both what is said and how it is conveyed through emotional framing, stylistic choices, and behavioral signals. While recent advances in natural language processing (NLP) have improved the modeling of semantic content, important challenges remain in capturing these contextual dimensions of language use. Approaches that focus on content alone, or that reduce emotion to coarse labels, risk overlooking how meaning is situated within social and institutional contexts. These limitations become particularly consequential when NLP systems are applied in real-world decision-making settings, where incomplete representations of language may lead to systematic misinterpretation and reinforce existing biases.
One specific domain where these challenges manifest is the analysis of biographical narratives, i.e., personal reflections on significant life events that are widely used in university admissions, scholarships, hiring, and promotion. In such high-stakes settings, narratives are not merely descriptive texts but key inputs into evaluative processes that affect individuals’ educational, economic, and professional trajectories. College application essays provide a compelling example [3,4]. These essays have become a standard and increasingly important requirement for admission, particularly at selective institutions. Through these narratives, students describe their life stories and draw on them to illustrate challenges they have faced and overcome, or to articulate experiences of disadvantage that have shaped their paths. The ways in which students construct and present these stories carry significant consequences for their admission prospects and future opportunities. Indeed, application essays have been described as ritualized performances through which merit is constructed and evaluated [5].
In this domain, factors such as agentive positioning, rhetorical strategies, and temporal framing are often sociolinguistic resources through which individuals position themselves within institutional contexts [6,7,8,9,10]. As studies on gender and racial disparities in college applications show [3,4], applicants often modify not only what they disclose but how they disclose it. Some applicants adopt confident, assertive language, such as definitive statements of achievement (“I led,” “I designed,” “I created”) and direct claims of causality and individual attribution (“my work resulted in,” “I increased”), while others employ humble, self-effacing expressions, including hedging qualifiers (“I think,” “perhaps,” “somewhat”), achievement minimization (“I was fortunate to,” “by God’s will”), collective rather than individual attribution (“my family,” “we achieved”), downplaying of personal contributions, apologetic framing, and self-deprecating qualifiers. These choices reflect both personal communication styles and cultural dispositions. As a result, evaluators may implicitly reward particular emotional framings or narrative styles, even when applicants describe comparable experiences.
Building on these observations, we hypothesize that emotional expression in biographical narratives cannot be adequately captured using predefined emotion taxonomies or standalone emotion classifiers. Instead, emotional meaning depends on context and arises from the interaction between thematic content and linguistic expression. Specifically, we hypothesize that (i) narratives describing similar life events may differ systematically in their emotional and behavioral framing, and (ii) emotionally relevant patterns can be identified only when emotion is analyzed together with underlying thematic structure. Under this hypothesis, emotion is not treated as a fixed label assigned to a text, but as a property that is embedded within topics and shaped by cultural, institutional, and linguistic contexts.
In this paper, we introduce a framework for Behavioral and Emotional Theme Detection (BET) to illuminate the relationship between thematic content, stance, and emotional expression in sociocultural narratives. BET explicitly separates and then fuses two complementary forms of knowledge: latent knowledge, captured by contextual sentence and document embeddings that encode implicit semantic structure, and explicit knowledge, provided by expert-defined lexicon categories from the Linguistic Inquiry and Word Count (LIWC) lexicon. Concretely, we use pre-trained language models to obtain latent embedding representations of narratives, and we operationalize explicit knowledge by embedding LIWC category keywords in the same space. BET then estimates the alignment between latent textual representations and explicit category definitions via semantic similarity, yielding signals of behavioral and emotional themes within topic-organized narratives.
Unlike traditional text analysis methods, BET emphasizes sociological dimensions within the thematic context. For example, analyzing narratives of previous educational experiences reveals not only the content but also the emotional frames through which these experiences are interpreted and conveyed, whether through expressions of anger, achievement, conflict, or other sentimental orientations and perspectives. Importantly, unlike many approaches to related problems in NLP, such as sentiment analysis and emotion classification, which formulate emotional interpretation as a supervised learning task and rely on labeled training data (e.g., positive or negative sentiment), BET does not require annotated datasets. Instead, it leverages expert-defined lexical categories aligned with latent embeddings, enabling the identification of emotional and behavioral themes without task-specific labels. This design enables access to emotional and sociolinguistic expressions that would otherwise remain undetected, as no model has been explicitly trained to capture these categories.
The proposed methodology is particularly compelling when applied to diverse languages. Drawing on established research [11,12,13], we recognize that culture shapes emotional expression and narrative construction, so that narratives reflect the social contexts of their linguistic environments. Hence, we demonstrate the application of BET on two datasets: an English-language synthetic student profile dataset from a public Kaggle repository and a Hebrew-language financial aid application dataset provided to us by a well-known fellowship foundation.
The choice of two datasets in this study, one written in English (Synthetic Student Profile Dataset) and the other in Hebrew (Financial Aid Application Dataset), was methodologically motivated. English was selected because it is the dominant language in NLP research and serves as a widely used reference point for evaluating methodological applicability and comparability with prior work. Hebrew was chosen as a contrasting case due to its morphological richness and right-to-left script, which pose well-documented challenges for NLP methods. Analyzing Hebrew narratives therefore provides a meaningful test of the robustness of the proposed framework beyond English, particularly in high-stakes narrative settings. Together, these two languages allow us to demonstrate that BET can operate across languages with substantially different linguistic properties while remaining grounded in language-specific lexica and expert knowledge. Importantly, BET is language-agnostic in principle and can be extended to additional languages given appropriate lexical resources and domain expertise.
Our contributions are as follows:
We introduce a novel framework for detecting behavioral and emotional themes by integrating topic detection with linguistic patterns.
We develop a hybrid methodology that combines predefined categories from an official lexicon (e.g., LIWC) with embedding techniques for fine-grained emotional theme detection.
We propose an unsupervised, label-free approach to emotional and behavioral theme analysis that does not require annotated data, relying instead on explicit lexical knowledge.
We demonstrate the method’s adaptability using datasets in English and Hebrew, addressing varying morphological and syntactical complexities.
We illustrate how the method improves content analysis and provides new insights for social science research.
2. Related Work
2.1. Topic Modeling
Traditional topic detection models use supervised and unsupervised methods to identify implicit knowledge such as the topics and main themes in texts [14,15,16,17]. Recent unsupervised methods, such as BERTopic [18] and Top2Vec [19], integrate contextual embeddings to improve topic coherence and applicability to diverse datasets. Embedding-based topic modeling has shown strong performance because dense representations help capture interpretable latent semantic structure at the document level [20]. In particular, pretrained embeddings can improve document representations and, consequently, enhance topic coherence [21]. Embedding representations have also been shown to better encode multi-word expressions and collocations, enabling more accurate modeling of shared topical structure across documents [22]. In addition, the use of multilingual embeddings supports cross-lingual topic modeling, allowing topics to be aligned and compared across languages [23]. More recently, large language models (LLMs) have been used to support topic modeling via guided prompting and iterative refinement procedures, offering an alternative path to generating and consolidating topic representations [24].
2.2. Sentiment and Emotion Analysis
Sentiment analysis models aim to identify the emotional tone expressed in text, typically by classifying polarity into categories such as positive, neutral, and negative [25]. These models have been applied across a range of domains, including economics, healthcare, and law, where large volumes of textual data are analyzed to support decision-making processes [26]. Several related tasks extend this basic framework. One prominent example is aspect-based sentiment analysis, which leverages sentiment lexica to determine the polarity associated with specific targets or aspects mentioned in the text [27]. Sentiment analysis has been particularly common in studies of social media, where it is often applied to short, informal texts to infer public opinion, stance, or affect [28,29].
Topic-sentiment detection approaches integrate topic modeling with sentiment analysis to uncover how sentiments are expressed with respect to specific themes, thereby moving beyond document-level polarity toward more nuanced interpretations of opinion and stance [30,31,32]. In applied settings such as social media, these methods have been shown to reveal how public sentiment toward the same topic can shift over time or differ across subgroups, for example in discussions of climate change [33,34,35].
Early probabilistic models, including the Topic Sentiment Mixture (TSM) model [36] and the Joint Sentiment–Topic (JST) model [37], established a foundational link between thematic content and sentiment by jointly modeling topics and affective labels. Subsequent work extended this line of research by introducing more expressive representations of emotional dynamics. For instance, Tang et al. [38] proposed the Hidden Topic–Emotion Transition Model to capture interactions between topics and multi-level emotions, while more recent approaches have incorporated contextual embeddings and external knowledge through knowledge-aware transformers to improve coherence and interpretability in dialogue and narrative analysis [39].
Emotion analysis is an important extension of sentiment analysis that provides a more fine-grained representation of emotional expression beyond polarity-based classification. Most existing studies rely on predefined emotion taxonomies, most notably Ekman’s model [40], which defines six basic emotions: anger, fear, sadness, joy, disgust, and surprise. Plutchik’s framework [41] extends this set by introducing additional categories such as trust and anticipation, thereby broadening the range of emotional states considered. In NLP research, emotion detection is often treated as a subtask of sentiment analysis and focuses on identifying emotions expressed in textual input [42]. However, emotion detection is conceptually distinct from sentiment analysis, as it aims to identify specific emotional categories rather than assigning coarse-grained positive, neutral, or negative polarity labels [43]. Importantly, accurately identifying emotional meaning in text remains challenging, as emotional expression is highly context-dependent. This challenge is illustrated by Ocal et al. [44], who showed that emotions encoded in film scripts frequently diverge from those expressed in audience reviews, highlighting the relational and contextual nature of emotional expression.
With the advent of BERT-based models, emotion detection has gained substantial attention in recent years due to notable performance improvements across benchmarks [45]. For example, Wang et al. [46] demonstrated sentence-level emotion detection by leveraging BERT within a Siamese neural network architecture. In addition, multilingual BERT-based models have enabled emotion detection across languages, often using English as a source language for transfer learning [47]. Relatedly, transformer-based models such as RoBERTa [48] have been used to predict predefined emotional dimensions, including valence, arousal, and dominance [49].
Beyond emotion classification, emotional information has also been shown to be valuable for related tasks such as stance detection, which aims to classify the attitude expressed in an opinionated text (e.g., favor, against, or none) [50]. More recently, Zhang et al. [51] demonstrated that incorporating LLMs into hypergraph convolutional networks can further improve stance prediction accuracy, highlighting the important role of emotion-aware and knowledge-enhanced representations.
2.3. Limitations of Existing Approaches and Research Gaps
Prior work reveals three main limitations in narrative text analysis that motivate the design of our framework, BET.
Separation of theme and emotion. Existing approaches typically treat thematic structure and emotional expression as independent analytical layers, overlooking how emotional framing operates within shared topics. BET addresses this by explicitly separating topic detection from emotional theme detection, then integrating them through topic-level emotional profiles.
Rigid emotion taxonomies and annotation dependence. Most emotion detection methods rely on supervised learning with fixed taxonomies, limiting their applicability where annotated data are unavailable or where relevant sociocultural categories extend beyond standard emotion labels. BET removes the need for manual annotation through a fully label-free approach, avoiding the cost, subjectivity, and domain constraints of human labeling. By aligning expert-defined emotional and sociocultural categories with latent sentence embeddings, it discovers richer, more context-sensitive emotional themes.
Limited cross-linguistic generalizability. Prior methods are predominantly developed for English and short, informal genres, raising concerns about robustness in morphologically rich languages and long-form narratives. BET applies the same framework to both English and Hebrew, demonstrating its ability to extract coherent structures and interpretable patterns across linguistically distinct contexts. Additionally, while LLMs offer analytical power, their black-box nature obscures the linguistic features driving classifications, and their reliance on distributional semantics conflates what applicants say with how they say it—the pragmatic dimensions central to stance and self-presentation.
These gaps define BET’s design and the empirical evaluations presented here, positioning the framework as a general, interpretable approach for jointly analyzing thematic content and emotional framing in high-stakes narrative contexts.
3. Materials and Methods
Our method comprises four primary steps. First, we transform documents into dense embeddings that capture their latent semantic structures, providing continuous vector representations of narrative content. Second, these embeddings serve as input to unsupervised topic identification through clustering techniques, grouping narratives by thematic similarity. Third, we detect explicit emotional and sociocultural dimensions by computing semantic similarity between sentence embeddings and predefined categories from an established psychological and sociological lexicon. Finally, we integrate the detected emotional patterns with document-level topic assignments to generate a distribution of behavioral and emotional themes that characterizes each narrative.
Figure 1 illustrates our framework, and the following subsections provide a detailed description of the steps.
3.1. Latent Representation
To capture the latent semantic structure of the text, we transform each document, $D_i$, into a dense vector representation, or embedding, $\mathbf{e}_i \in \mathbb{R}^d$, where $d$ is the dimensionality of the embedding space, using Sentence-Transformers [52], a framework for producing dense embeddings: $\mathbf{e}_i = \mathrm{SentenceTransformer}(D_i)$. These embeddings encode the semantic meaning of the text and are subsequently used as input for the topic detection model. In their paper, Reimers and Gurevych [52] emphasize that their sentence embedding model captures sentiment information better than other sentence embedding techniques; hence, we use it in our framework.
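The embedding step can be illustrated with a minimal sketch. It assumes the `sentence-transformers` package and the model name reported in the experimental setup; the function name `embed_documents`, its `encode` parameter, and the toy encoder are hypothetical and serve only to make the example runnable without downloading a model.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def embed_documents(
    texts: Sequence[str],
    encode: Optional[Callable[[List[str]], Sequence[Sequence[float]]]] = None,
    model_name: str = "paraphrase-multilingual-MiniLM-L12-v2",
) -> List[Vector]:
    """Map each text to a dense vector of fixed dimensionality d."""
    if encode is None:
        # Lazy import: only needed when no encoder is supplied by the caller.
        from sentence_transformers import SentenceTransformer

        encode = SentenceTransformer(model_name).encode
    return [[float(x) for x in vec] for vec in encode(list(texts))]


def toy_encode(batch: List[str]) -> List[Vector]:
    """Toy stand-in encoder (illustration only): character and word counts."""
    return [[float(len(t)), float(t.count(" ") + 1)] for t in batch]


embeddings = embed_documents(
    ["I led the team", "We were fortunate"], encode=toy_encode
)
```

Calling `embed_documents(texts)` without an `encode` argument would load the pretrained model instead of the toy encoder.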
3.2. Topic Detection
We identify document topics by clustering the document embeddings obtained during the latent representation phase, $\{\mathbf{e}_1, \ldots, \mathbf{e}_n\}$, where $n$ is the number of documents. For this purpose, we employ BERTopic [18], a topic modeling framework that leverages dynamic topic representation and unsupervised clustering to uncover meaningful patterns within the data. The embedding step can be applied at the sentence level, allowing multiple topics to be assigned to each document.
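Clustering is performed with the HDBSCAN algorithm [53], which groups embeddings $\mathbf{e}_i$ into a cluster $C_l$ if they satisfy the density threshold determined by the algorithm parameters. Embeddings that do not meet the density criteria are labeled as noise. Clusters $C_l$ are formed by grouping points where the local density exceeds a threshold. The local density at $\mathbf{e}_i$ is inversely proportional to the Mutual Reachability ($MR$) distance:
$$MR(\mathbf{e}_i, \mathbf{e}_j) = \max\{\mathrm{core}_m(\mathbf{e}_i),\ \mathrm{core}_m(\mathbf{e}_j),\ \lVert \mathbf{e}_i - \mathbf{e}_j \rVert\},$$
where $\mathrm{core}_m(\mathbf{e})$ is the distance from $\mathbf{e}$ to its $m$-th nearest neighbor, and $m$ is the minimal number of points required to form a dense region (a hyperparameter of HDBSCAN).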
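The mutual reachability distance can be sketched in a few lines of plain Python. The version below follows the standard HDBSCAN definition (the maximum of the two core distances and the pairwise distance); the function names and toy points are illustrative, and the core distance here excludes the point itself from its neighbors, a convention that differs between implementations.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def core_distance(points: List[Point], i: int, k: int) -> float:
    """Distance from points[i] to its k-th nearest neighbor (itself excluded)."""
    dists = sorted(
        euclidean(points[i], p) for j, p in enumerate(points) if j != i
    )
    return dists[k - 1]


def mutual_reachability(points: List[Point], i: int, j: int, k: int) -> float:
    """max(core_k(i), core_k(j), d(i, j)): large whenever either point is sparse."""
    return max(
        core_distance(points, i, k),
        core_distance(points, j, k),
        euclidean(points[i], points[j]),
    )


# Three nearby points and one isolated point (toy data).
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0]]
mr_close = mutual_reachability(points, 0, 1, k=2)
mr_far = mutual_reachability(points, 0, 3, k=2)
```

The isolated point is pushed far from every other point under this distance, which is why low-density narratives tend to be labeled as noise rather than absorbed into a topic.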
We select HDBSCAN because it does not require specifying the number of clusters in advance, which is particularly important in exploratory analyses of narrative data where the number and granularity of topics are not known a priori and may differ across datasets. In addition, HDBSCAN can identify clusters of varying density while explicitly labeling low-density points as noise. This property helps avoid forcing marginal or atypical narratives into spurious topics, a risk that is especially pronounced in biographical and institutional texts where some narratives intentionally deviate from dominant themes. In combination with contextual sentence embeddings, HDBSCAN therefore supports the extraction of coherent topic structures while preserving narrative heterogeneity and reducing contamination from outlier documents.
A notable feature of BERTopic is its ability to support dynamic topic modeling, which involves adjusting the topic representations over time across different contexts. This adaptability ensures that the extracted topics remain accurate and relevant even when applied to datasets that evolve or exhibit temporal variations. Dynamic topic modeling is particularly useful for capturing trends, shifts in discourse, and changes in thematic emphasis.
3.3. BET: Behavioral and Emotional Theme Detection
To detect emotional themes, we use the LIWC lexicon [54], a psycholinguistic tool that organizes keywords into predefined emotional, cognitive, and linguistic categories. LIWC, which serves as the source of explicit knowledge in our method, is a valuable resource for understanding the psychological and sociological aspects of the text, providing a structured framework to connect textual content with thematic categories such as “Drives,” “States,” and “Emotions.”
Table 1 presents the categories and keywords employed in this study to identify emotional themes for English.¹ The analysis incorporates validated lexical categories (with 62 keywords) derived from the English version of LIWC2022 Version 1.10.0. For other languages, we recommend either using existing multilingual versions of LIWC² or translating the keywords and categories from the English LIWC. For the Hebrew corpus analyzed in this paper, we use the Hebrew version of LIWC defined by Shapira et al. [55], and in consultation with a domain expert, we draw upon contextually adapted semantic categories (with 7019 keywords).
In our proposed methodology, each keyword $w$ within a category $k$ from the lexicon is represented with an embedding using the same pre-trained embedding model applied to generate document and sentence embeddings, i.e., $\mathbf{e}_w = \mathrm{SentenceTransformer}(w)$. Let $W_k$ be the set of lexicon keywords for category $k$ and let $E_k = \{\mathbf{e}_w : w \in W_k\}$ be their corresponding embeddings. For each document $D_i$, the similarity score between a sentence embedding $\mathbf{s}_{ij}$ (the $j$-th sentence of $D_i$) and a keyword embedding $\mathbf{e}_w \in E_k$ is computed using cosine similarity. The strength of theme category $k$ in document $D_i$ is determined by the highest similarity score across all sentences:
$$\mathrm{score}_k(D_i) = \max_{j}\ \max_{\mathbf{e}_w \in E_k} \cos(\mathbf{s}_{ij}, \mathbf{e}_w).$$
For example, in document $D_i$, if the highest of the cosine similarity scores for the keywords “Affiliation,” “Achieve,” and “Power” is 0.745, the emotional theme corresponding to the “Drives” category in this document would be assigned this score. At the same time, the full set of maximum similarity scores is preserved, yielding a distribution of emotional-theme signals at the document level rather than a single categorical assignment. This approach enables context-aware identification of thematic elements across documents while effectively capturing multiple emotional and cognitive dimensions within the text.
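The scoring rule described above (maximum cosine similarity over all sentences and all keywords of a category) can be sketched as follows. The two-dimensional vectors are toy values rather than real embeddings, and the names `cosine`, `theme_strengths`, and the category-to-keyword dictionary layout are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def theme_strengths(
    sentence_embs: List[Vector],
    keyword_embs: Dict[str, Dict[str, Vector]],
) -> Dict[str, float]:
    """Per category: highest cosine similarity over all sentence-keyword pairs."""
    return {
        category: max(
            cosine(s, e) for s in sentence_embs for e in keywords.values()
        )
        for category, keywords in keyword_embs.items()
    }


# Toy sentence embeddings for one document and toy keyword embeddings.
sentences = [[1.0, 0.0], [0.6, 0.8]]
lexicon = {
    "Drives": {"Affiliation": [0.0, 1.0], "Achieve": [1.0, 1.0]},
    "Emotions": {"Anger": [-1.0, 0.0]},
}
strengths = theme_strengths(sentences, lexicon)
```

The returned dictionary keeps one graded score per category, so a document carries a full profile of theme strengths rather than a single label.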
By combining the explicit structured knowledge of LIWC with latent embeddings, we are able to create a methodology for emotional theme detection. Importantly, this integration does not enforce a one-to-one mapping between topics and emotions, but instead supports a multidimensional representation in which emotional themes are expressed with varying strength within each topic. This integration allows us to bridge the gap between traditional lexicon-based approaches and modern contextualized embeddings, enhancing the adaptability of the results. Moreover, the combination ensures that emotional themes are also sensitive to contextual subtleties present in the text.
BET does not assume that semantic similarity directly represents emotional tone in a categorical sense. Rather, similarity between sentence embeddings and lexicon-based category embeddings is interpreted as contextual alignment between a localized linguistic expression and an expert-defined emotional or sociocultural construct. By operating at the sentence level and preserving graded similarity scores rather than discrete labels, BET treats emotional tone as an emergent, probabilistic signal rather than a fixed property of either topics or texts.
The methodological framework yields a multidimensional matrix that interweaves topics with their associated affective dimensions, with emotional strength represented through similarity scores. Using this output, a domain expert, such as a social science researcher or evaluator, can produce an analysis that connects thematic content with its emotional undercurrents, supporting interpretation of the data and its implications for the research question or decision at hand.
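One way to build such a topic-by-theme matrix is to average the document-level theme strengths within each topic. The sketch below assumes mean aggregation, which the text does not specify; `topic_theme_matrix`, `doc_topics`, and `doc_strengths` are hypothetical names, and the scores are toy values.

```python
from collections import defaultdict
from typing import Dict, List


def topic_theme_matrix(
    doc_topics: List[int],
    doc_strengths: List[Dict[str, float]],
) -> Dict[int, Dict[str, float]]:
    """Average each theme category's strength over the documents of a topic."""
    sums: Dict[int, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    counts: Dict[int, int] = defaultdict(int)
    for topic, strengths in zip(doc_topics, doc_strengths):
        counts[topic] += 1
        for category, score in strengths.items():
            sums[topic][category] += score
    # Assumes every document reports a score for every category.
    return {
        topic: {cat: total / counts[topic] for cat, total in cats.items()}
        for topic, cats in sums.items()
    }


# Three documents: two assigned to topic 0, one to topic 1 (toy values).
doc_topics = [0, 0, 1]
doc_strengths = [
    {"Drives": 0.8, "Emotions": 0.2},
    {"Drives": 0.6, "Emotions": 0.4},
    {"Drives": 0.1, "Emotions": 0.9},
]
matrix = topic_theme_matrix(doc_topics, doc_strengths)
```

Other aggregations (median, or the share of documents above a threshold) would fit the same interface; the mean is used here only because it is the simplest choice.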
3.4. Methodological Design Choices
We selected Sentence Transformers, BERTopic, and the LIWC lexicon to align with the conceptual goals of BET, namely domain adaptability and applicability to long-form, high-stakes narratives without reliance on labeled training data.
Sentence Transformers were chosen to generate dense semantic representations because they are optimized for capturing sentence-level meaning in a shared embedding space. Unlike token-level representations, sentence embeddings enable direct semantic comparison between narrative segments and abstract category definitions, which is essential for aligning latent textual representations with explicit emotional and sociocultural categories. In addition, the availability of multilingual pretrained models supports consistent representation across languages with different morphological properties, which is critical for the cross-linguistic analysis presented in this work.
BERTopic was selected for topic detection because it combines contextual embeddings with unsupervised clustering, allowing topics to emerge from semantic similarity rather than surface-level word co-occurrence. This property is particularly important for biographical and institutional narratives, where thematically related texts may use diverse vocabularies. Moreover, BERTopic supports dynamic topic modeling and integrates density-based clustering through HDBSCAN, enabling the identification of coherent topics while avoiding forced assignment of atypical or marginal narratives to spurious clusters.
The LIWC lexicon was chosen as the source of explicit knowledge due to its long-standing use and validation in psychology and social science research. LIWC provides theoretically grounded categories that capture emotional, cognitive, and sociocultural dimensions of language, making it well suited for analyzing narratives in institutional and evaluative contexts. Importantly, using a lexicon-based approach allows BET to remain unsupervised and label-free, avoiding the need for annotated emotion datasets that are often unavailable or inappropriate for high-stakes personal narratives. Integrating LIWC categories into the same semantic space as the narratives enables flexible, context-sensitive alignment between textual expressions and expert-defined constructs without enforcing predefined emotion taxonomies at the document level.
4. Experimental Setup
4.1. Datasets
4.1.1. Synthetic Student Profile Dataset (English)
We used a Kaggle dataset³ that provides a collection of student profiles showcasing a wide spectrum of academic and personal characteristics. Each profile encapsulates demographic information, academic details, hobbies, unique qualities, and personal narratives. From this dataset, consisting of 23,236 synthetically generated observations, we used the “Story” field, which contains a narrative or background story about a student, to extract sociocultural themes. The narratives average 428 words in length, with a median of 427 words and a word count range between 282 and 614.
4.1.2. Financial Aid Application Dataset (Hebrew)
The financial aid application dataset was collected by a non-governmental organization that facilitates educational access for socioeconomically disadvantaged students through monetary support mechanisms. The dataset comprises 28,424 financial aid applications spanning 2012–2024, with an annual submission rate of approximately 2200 applications. A key feature of these applications is the requirement for students to write personal narratives describing significant life challenges, their responses to these challenges, and their future goals.⁴ These Hebrew-written narratives contain, on average, 474 words, with a median length of 473 words. The word count ranges from 23 to 1632. To protect the privacy of participants and maintain data security, access to this dataset is restricted.
4.2. Rationale for Dataset Selection
The two datasets serve complementary methodological roles in evaluating the proposed framework. The Hebrew Financial Aid Application Dataset constitutes the primary empirical setting of this study. It consists of real, human-written narratives produced in a high-stakes institutional context, and it exhibits the full complexity of natural language use, including emotional nuance, cultural grounding, linguistic variability, and writing imperfections. The analysis of this dataset therefore provides the main evidence for the practical usefulness of BET in extracting meaningful behavioral and emotional themes from authentic narratives.
The English Synthetic Student Profile Dataset was intentionally included for a different purpose. Due to its synthetic nature, these narratives are expected to exhibit lower overall emotional valence than real-world texts. Nevertheless, this dataset offers a controlled environment in which the methodological behavior of the framework can be examined independently of domain-specific institutional constraints. Its inclusion allows us to demonstrate the applicability of BET to English-language data, which remains the dominant language in NLP research, and to illustrate that the framework can still uncover interpretable thematic and emotional patterns even when emotional expression is comparatively muted.
We also note that access to real-world, high-stakes narrative datasets is often restricted due to privacy, ethical, and legal considerations. In this respect, the Hebrew dataset represents a rare example of such data made available for research under strict confidentiality conditions. Accordingly, the framework’s usefulness should be evaluated primarily based on its performance on the real Hebrew narratives, with the synthetic English dataset understood as a complementary illustrative case rather than a substitute for authentic emotional expression.
4.3. Preprocessing and Parameters
4.3.1. Preprocessing
The preprocessing pipeline was designed to clean the textual data and prepare it for further analysis. We begin this process by tokenizing the text and removing stop words based on the Hebrew/English stop word list from the NLTK library. We then filter tokens to retain only alphabetic characters and numbers, while excluding tokens with a length of one character and specific cases such as line breaks and dashes.
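The filtering rules above can be sketched with the standard library alone. The stop word set below is a tiny placeholder standing in for the NLTK lists, lowercasing is an added assumption needed to match English stop words, and `preprocess` is a hypothetical name.

```python
import re
from typing import FrozenSet, List

# Placeholder list for illustration; the paper uses NLTK's stop word lists.
STOP_WORDS: FrozenSet[str] = frozenset({"the", "a", "and", "of", "to", "in"})

# Runs of Unicode letters and digits; dashes and line breaks never match.
TOKEN_RE = re.compile(r"[^\W_]+")


def preprocess(text: str, stop_words: FrozenSet[str] = STOP_WORDS) -> List[str]:
    """Tokenize, drop stop words, and drop one-character tokens."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if len(t) > 1 and t not in stop_words]


english = preprocess("I led the robotics-team in 2019 -\n a win")
hebrew = preprocess("שלום עולם - א")
```

Because the token pattern is Unicode-aware, the same function handles Hebrew and English input; only the stop word list needs to change per language.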
4.3.2. Sentence Embeddings
Embeddings were generated using the Sentence Transformers model paraphrase-multilingual-MiniLM-L12-v2.⁵ We used cosine similarity to assess the semantic similarity between sentences and emotional categories.
4.3.3. Topic Detection Models
We employ BERTopic with all-MiniLM-L6-v2 for English and paraphrase-multilingual-MiniLM-L12-v2 for Hebrew as the underlying embedding models and use HDBSCAN for document clustering. The HDBSCAN configuration includes a minimum cluster size of 5, a minimum sample size of 5, Euclidean distance, and the “excess of mass” clustering method. Additionally, we utilize a custom vectorizer with dynamically determined document frequency thresholds based on the dataset size, with a maximum threshold set to 0.9. The vectorizer incorporates both unigrams and bigrams to capture contextual relationships. For the Financial Aid Application Dataset, we set the number of topics to 100, selected based on an optimal trade-off between coherence and diversity scores. This choice was further validated by a domain expert in sociology. For the Synthetic Student Profile Dataset, the number of topics is determined automatically by the HDBSCAN clustering method, resulting in 180 topics.
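The reported configuration might be assembled roughly as follows, assuming the `bertopic`, `hdbscan`, and `scikit-learn` packages. The rule used for the lower document frequency threshold is purely hypothetical, since the text only says it is determined from the dataset size; `vectorizer_params` and `build_topic_model` are illustrative names.

```python
from typing import Optional

# HDBSCAN settings as reported in the text ("eom" = excess of mass).
HDBSCAN_PARAMS = {
    "min_cluster_size": 5,
    "min_samples": 5,
    "metric": "euclidean",
    "cluster_selection_method": "eom",
}


def vectorizer_params(n_docs: int) -> dict:
    """Unigrams and bigrams, max_df = 0.9; the min_df rule is an assumption."""
    min_df = 1 if n_docs < 1000 else 5  # hypothetical size-based threshold
    return {"ngram_range": (1, 2), "min_df": min_df, "max_df": 0.9}


def build_topic_model(
    n_docs: int,
    embedding_model_name: str,
    nr_topics: Optional[int] = None,
):
    """Assemble a BERTopic model; heavy imports are deferred to call time."""
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sklearn.feature_extraction.text import CountVectorizer

    return BERTopic(
        embedding_model=embedding_model_name,
        hdbscan_model=HDBSCAN(**HDBSCAN_PARAMS),
        vectorizer_model=CountVectorizer(**vectorizer_params(n_docs)),
        nr_topics=nr_topics,  # None keeps the clusters HDBSCAN finds
    )
```

Under these assumptions, the Hebrew setting would correspond to `build_topic_model(28424, "paraphrase-multilingual-MiniLM-L12-v2", nr_topics=100)` and the English setting to `build_topic_model(23236, "all-MiniLM-L6-v2")`.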
4.3.4. Baseline
Although there is no direct baseline for comparison, since our method focuses on extracting emotional themes, a closely related approach involves topic detection followed by sentiment analysis, particularly at a fine-grained level. This approach has been employed in prior studies [56,57,58], which motivate our choice of baseline. Specifically, we adopt a method that combines Latent Dirichlet Allocation (LDA) for topic modeling with sentence-level sentiment analysis. For the sentiment component, we employ BERT-based models: bert-base-multilingual-uncased-sentiment for English texts and HeBERT [59] for Hebrew texts.
5. Results
5.1. Synthetic Student Profile Dataset (English)
Before discussing individual cases, we emphasize how BET operationalizes the joint modeling of topics and emotional themes. Topics identified by BERTopic serve as a thematic grouping mechanism, while emotional themes are detected independently at the sentence level. For each document, emotional signals are aggregated within topic assignments, yielding topic-specific emotional profiles rather than assigning a single emotion to an entire topic or document. This design allows emotional framing to be analyzed within shared thematic contexts, enabling comparison across narratives that discuss similar content but differ in emotional and behavioral expression.
5.1.1. Topic and Emotion Detection
The analysis of the Kaggle dataset uncovers distinct thematic and emotional patterns within the narratives. BERTopic identifies 180 topics, with professional trajectories emerging as the most prevalent theme, followed by narratives centered on marine and ocean exploration, fashion and design, environmental consciousness, and dance and ballet. This distribution reflects a diverse spectrum of student self-representations, ranging from career-oriented aspirations to narratives emphasizing lifestyle, leisure, and personal identity formation.
Given that the English texts are synthetically generated rather than written by real individuals, we expect overall emotional expression to be more attenuated than in real-world narratives, as the generation process prioritizes content diversity over authentic emotional articulation. Consistent with this expectation, the semantic analysis of the emotional keywords listed in Table 1 reveals relatively low average affect scores across this dataset.
The Drives category exhibits the highest salience (average cosine similarity score of 0.25). Within this category, the drive for achievement (0.32) outweighs both affiliation (0.24) and power orientations (0.19), underscoring the centrality of individualistic success narratives. To demonstrate the analytical utility of our methodological approach, we examine four cases exhibiting heightened emotional articulation, subjecting them to detailed BET analysis.
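As a quick consistency check, the reported Drives score matches the unweighted mean of its three sub-drive scores. The aggregation rule is an assumption on our part; the text does not state how the category-level score is derived.

```python
# Sub-drive scores as reported in the text.
sub_drives = {"achievement": 0.32, "affiliation": 0.24, "power": 0.19}

# Unweighted mean (an assumed aggregation rule, not one stated in the paper).
drives_mean = sum(sub_drives.values()) / len(sub_drives)
print(round(drives_mean, 2))  # 0.25
```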
5.1.2. Behavioral and Emotional Theme Analysis
Figure 2 presents a radar plot illustrating the maximum similarity scores of four selected documents from the synthetic student profile dataset. Each document pertains to a distinct topic: document #546 is associated with Photography & Film, document #16140 with Nutrition & Wellness, document #3676 with Dance & Ballet, and document #8366 with Yoga & Certification Training (footnote 6).
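One plausible reading of the "maximum similarity scores" plotted per document is the highest sentence-level score for each theme. The sketch below implements that reading; the function name and all numbers are invented for illustration and are not taken from the dataset.

```python
def document_theme_profile(sentence_scores):
    """For each theme, return the maximum similarity over a document's sentences."""
    profile = {}
    for scores in sentence_scores:
        for theme, value in scores.items():
            # Keep the largest value seen so far for this theme.
            profile[theme] = max(value, profile.get(theme, float("-inf")))
    return profile

# Toy per-sentence scores for a single document (invented values).
sentences = [
    {"Physical": 0.50, "Motive": 0.20},
    {"Physical": 0.41, "Motive": 0.35},
]
profile = document_theme_profile(sentences)
print(profile)
```

Each resulting dictionary supplies one polygon of a radar plot, with themes as axes.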
Focusing on document #546 (Photography & Film), we observe a combination of positive and negative emotions alongside motivational elements. This emotional distribution is reflected in sentences such as “He was nervous but hopeful” and “Bruce remained humble and continued to pursue his studies with the same dedication and passion.” Additionally, the document aligns with its designated topic, as evidenced by sentences like “But what truly set him apart was his love for photography” and “Bruce was a photography aficionado, always carrying his camera with him wherever he went.”
The analysis of this document with BET allows us to demonstrate how emotional and motivational themes relate to the topic of Photography & Film. Similarly, document #16140 (Nutrition & Wellness) exhibits a strong association with the Physical theme, as reflected in its high similarity score (0.63). This correlation is likely influenced by text such as “Allison was a health and wellness advocate, always promoting a balanced and mindful lifestyle.” and “She believed that taking care of one’s physical and mental well-being was crucial for success and happiness.”
Furthermore, we find that the Motive theme is strongly linked to documents #3676 (0.62) and #8366 (0.61). For document #3676 (Dance & Ballet), this association can be explained by sentences such as “One day, while walking to his Environmental Science class, William heard loud music coming from the auditorium. Curiosity got the better of him, and he decided to check it out.” Similarly, document #8366 (Yoga & Certification Training) also aligns with the Motive theme, likely due to excerpts like “He was thrilled to see how his passion for yoga had brought people together and helped them find inner peace.” Interestingly, both Dance & Ballet and Yoga & Certification Training are disciplines that emphasize achievement within competitive or structured training environments. Both documents demonstrate high scores in the Motive theme, which may suggest a broader connection between motivation and engagement in structured physical and artistic pursuits.
Importantly, comparing emotional profiles across topics highlights how similar emotional themes emerge in distinct thematic contexts, and vice versa. For example, the Motive theme appears prominently in both Dance & Ballet and Yoga & Certification Training, despite their different topical content. In Dance & Ballet, motivational expressions are tied to curiosity and discovery, whereas in Yoga & Certification Training they are associated with fulfillment and community-oriented achievement. Conversely, the Physical theme is salient in Nutrition & Wellness but largely absent from Photography & Film, even though both narratives exhibit a generally positive orientation when viewed through coarse-grained sentiment measures. These contrasts demonstrate that BET captures how emotional and behavioral dimensions are selectively activated within different topics, rather than being reducible to topic identity or overall sentiment polarity.
To understand the contribution of the BET framework, we ran the LDA and sentiment analysis baseline on the four documents. The results are presented in Table 2. Using the baseline approach, which relies on LDA for topic detection, documents #546 and #3676 were both assigned to topic 20, characterized by generic academic terms such as univers, major, senior, science, and freshman. Document #16140 was assigned to topic 52 (represented by team, fit, accomplish, member, motiv), while document #8366 was assigned to topic 90 (represented by unique, made, quality, stand, among). This suggests that the LDA-based baseline tends to emphasize frequently occurring or broadly shared words rather than capturing the central emotional themes expressed in each document. In contrast, the BET framework provides topic assignments that are more aligned with the substantive content of the profiles.
Turning to the sentiment analysis results, we observe a mixed distribution in which all three sentiments (positive, negative, and neutral) receive high values, with a slight preference for positive sentiment in all documents. It is difficult to draw a clear, interpretable conclusion from this distribution or to analyze how sentiment relates to the topic. In contrast, our methodology allows the researcher to reveal rich emotional and behavioral themes. For instance, document #16140, which our method clearly links to the Physical theme, receives only a generic positive sentiment score in the baseline model, entirely missing the subtler emotional perspective. Our method thus captures emotions and themes that are overlooked by traditional sentiment analysis and topic modeling approaches.
5.2. Financial Aid Application Dataset (Hebrew)
We now turn to our main dataset to demonstrate the method. BET enables the construction of topic-level emotional profiles by aggregating sentence-level emotional signals within each topic. This structure allows us to examine how applicants describe comparable life circumstances using different emotional framings, and how multiple emotional themes coexist within the same topic rather than collapsing into a single sentiment label.
5.2.1. Topic and Emotion Detection
Analysis of the financial aid dataset reveals distinct patterns in human-generated narratives. Among the 100 extracted topics, education emerges as the dominant theme, appearing in 1193 documents. These narratives particularly emphasize high school experiences, matriculation certificates, and academic achievements, elements that directly correspond to the agency’s financial aid selection criteria. National service constitutes the second most prevalent topic, present in 811 documents, where applicants detail their military service experiences and related life events. The narratives also prominently feature immigration stories, with rich accounts from diverse ethnic communities, including Ethiopian, former Soviet Union, and French immigrants. Emotional analysis of all applications yields an average cosine similarity score of 0.530, characterized primarily by expressions of trust and interest, with notably low levels of hostility.
5.2.2. Behavioral and Emotional Theme Analysis
To demonstrate how BET jointly models thematic and emotional dimensions, we aggregate document-level emotional theme scores within each topic. Specifically, for each topic identified by BERTopic, we compute the average semantic similarity score between sentences assigned to that topic and each emotional or sociocultural category. This yields a topic-by-theme matrix, in which each topic is represented not only by its semantic content but also by a structured emotional and behavioral profile.
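The aggregation described above can be sketched as a grouped mean. The record layout (one topic label plus a dictionary of theme similarities per sentence) and the function name are our assumptions, and the numbers are toy values rather than results from the dataset.

```python
from collections import defaultdict

def topic_theme_matrix(records):
    """records: iterable of (topic, {theme: similarity}) pairs, one per sentence.

    Returns {topic: {theme: mean similarity}}, i.e. a topic-by-theme matrix.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for topic, scores in records:
        for theme, value in scores.items():
            sums[topic][theme] += value
            counts[topic][theme] += 1
    return {
        topic: {theme: sums[topic][theme] / counts[topic][theme] for theme in themes}
        for topic, themes in sums.items()
    }

# Toy sentence-level records (invented values).
records = [
    ("Ballet & Dance", {"joy": 0.6, "guilt": 0.2}),
    ("Ballet & Dance", {"joy": 0.5, "guilt": 0.4}),
    ("Economic Hardship", {"joy": 0.3, "guilt": 0.7}),
]
matrix = topic_theme_matrix(records)
print(matrix["Ballet & Dance"])
```

Each row of the resulting matrix is a topic-specific emotional profile, so a single topic can carry several themes at once instead of one sentiment label.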
The intersection of topics and emotional categories generates an analytical matrix that reveals notable patterns in how emotions and themes co-occur. Figure 3 visualizes the similarity scores of a subset of 10 topics and 10 emotional valences, from which we analyze two topics in depth: Ballet & Dance and Economic Hardship.
Ballet & Dance
Within the broader landscape of extracurricular engagement, the Ballet and Dance topic represents one compelling window into how students articulate their academic trajectories. Analysis of semantic patterns, as presented in Figure 3, shows correlations between the Ballet/Dance topic and multiple positive affect indicators. The Joy marker exhibits the highest average similarity coefficient (0.593), with three additional dimensions ranking high: absence of hostility (0.582), non-confusion (0.579), and enthusiasm (0.556).
Comparing Ballet & Dance with other topics such as Economic Hardship reveals that positive markers such as joy and enthusiasm are topic-dependent rather than uniformly distributed across narratives. While both topics involve sustained effort and adversity, they differ markedly in emotional framing, underscoring the value of analyzing emotions in conjunction with thematic structure rather than in isolation.
Further qualitative analysis reveals dance training’s dual functionality in participants’ educational trajectories: it serves both as a mobilizing resource facilitating access to educational opportunities and as a transformative mechanism shaping academic development. The data indicate consistent patterns of long-term dance engagement among participants, with involvement typically initiating in early childhood and sustaining through adolescence into early adulthood via community performance groups and institutional programs. As exemplified by one participant who articulates with energy and enthusiasm: “From a young age, I developed a special fondness for dance, first in ballet and then in a folk dance troupe. In fifth grade, I joined a folklore ensemble, which was my city’s official folk dance group. Being accepted into the troupe was the first significant formative event in my life. The troupe operates like a youth movement that encourages social involvement, as it combines representative performances and dance competitions with community activities.”
Analysis of participant narratives reveals a consistent conceptualization of dance practice through two primary mechanisms: first, as a structured environment fostering discipline, self-efficacy, and expressive capabilities; and second, as a pathway facilitating access to advanced educational opportunities through professional training. These dual functions align with the previously noted affect markers of reduced hostility and non-confusion. This pattern is exemplified in one participant’s reflection: “The dance teacher’s demands for persistence and dedication are extremely demanding. From a young age, he has instilled in us values of discipline, punctuality, professionalism, and teamwork. This is while teaching us that there are no limits to what we can accomplish if we only desire it. Since the troupe is built in part on hidden competition between members for solo roles, and I don’t always get the lead role, I learn to experience both successes and disappointments, this toughens me for the future.”
Economic Hardship
Analysis of the Economic Hardship topic also reveals pronounced affective dimensions. Selected correlations across emotional valences, including lack of vigor (0.595), not joy (0.623), guilt (0.624), disgust (0.604), and trust (0.670), indicate the breadth of the topic’s emotional spectrum. Participants’ narratives consistently articulate internalized perceptions of familial burden, manifesting in assumed financial responsibilities and labor participation to support household economies.
Accounts of witnessing parental navigation of economic precarity, health challenges, and financial obligations emerge as significant catalysts for emotional distress, correlating with elevated measures of sadness (0.621) and anger (0.594).
The data indicate heightened confusion markers (0.614) regarding educational trajectory maintenance amid financial instability, particularly in relation to the complex negotiation of competing demands across employment, academic pursuits, and familial obligations. However, the findings simultaneously reveal countervailing narratives of aspiration and achievement. These manifest in expressions of pride (0.627) associated with first-generation college attendance and the anticipated disruption of intergenerational poverty cycles.
5.2.3. Temporal Dynamics of Emotional Themes
To examine the temporal dynamics of the emotional landscape, we analyze five key emotions in relation to the topic of Youth Movements in the Financial Aid Application Dataset. We focus on pride, anxiety, anger, interest, and trust. Figure 4 illustrates the average similarity score for each emotional theme per year, together with bar charts showing the distribution of documents related to youth movements over the years.
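The per-year averages behind such a plot can be computed with a simple grouping step. The sketch below assumes each document carries a year and a dictionary of theme scores; the function name and the toy values are ours, not the paper's.

```python
from collections import defaultdict
from statistics import mean

def yearly_theme_means(docs, themes):
    """docs: iterable of (year, {theme: score}) pairs for a single topic.

    Returns ({year: {theme: mean score}}, {year: document count}), ordered by year.
    """
    by_year = defaultdict(list)
    for year, scores in docs:
        by_year[year].append(scores)
    means = {
        year: {t: mean(s[t] for s in by_year[year]) for t in themes}
        for year in sorted(by_year)
    }
    doc_counts = {year: len(by_year[year]) for year in sorted(by_year)}
    return means, doc_counts

# Toy documents for one topic (invented values).
docs = [
    (2019, {"trust": 0.6, "anger": 0.3}),
    (2019, {"trust": 0.5, "anger": 0.2}),
    (2020, {"trust": 0.4, "anger": 0.2}),
]
trend, doc_counts = yearly_theme_means(docs, ["trust", "anger"])
print(trend[2019], doc_counts)
```

Returning the document counts alongside the means makes it easy to see when a yearly average rests on only a few narratives.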
Throughout the examined period, trust and interest emerge as dominant emotional markers, highlighting the aspirational and engaged nature of youth movements. In contrast, anger exhibits the lowest scores over the years, suggesting its limited role in the core experiences and contributions of these movements. The marked decline in emotional expression across all categories during 2020–2021 aligns with the imposition of COVID-19 restrictions and the subsequent shift to digital mobilization. Although a modest resurgence followed, emotional intensity remained subdued compared to pre-pandemic levels. This longitudinal analysis not only demonstrates the methodology’s capacity to capture fine emotional dynamics that conventional topic-based approaches might overlook, but also situates the analyzed narratives within the context of macro-level societal events. Because emotional themes are tracked within a fixed topical structure, these temporal patterns reflect shifts in emotional framing rather than changes in topic prevalence, further illustrating how BET jointly models thematic and emotional dimensions.
To exemplify the value of emotional theme extraction and why conventional methods often overlook such themes, we examine the baseline results for documents associated with youth movements (denoted by the black dashed line in Figure 4). Similarly to BERTopic, the LDA model identifies a coherent topic labeled Youth Movements, and sentence-level sentiment analysis reveals predominantly positive sentiment scores associated with this topic. For example, between 2013 and 2014, there is a slight increase in average sentiment, indicating a generally optimistic tone. However, this surface-level positivity fails to capture the richness and complexity of the emotional content expressed in the texts.
While sentiment scores provide a broad directional signal (e.g., positive or negative), they miss differences in emotional framing that are essential for understanding the lived experiences conveyed in the youth movement documents. The baseline’s inability to distinguish between varied emotional expressions and a generic sense of positivity underscores the need for an approach that models emotional framing beyond a single sentiment polarity.
6. Discussion
In this study we examine how thematic content and emotional framing interact in long-form narratives and evaluate the extent to which existing computational approaches capture this interaction. Our results indicate that, in biographical texts, emotional expression plays a substantive role in shaping how experiences are presented and interpreted, beyond topical content alone. Narratives that describe comparable life events may converge thematically while differing substantially in emotional and behavioral orientation, a distinction that is often missed by methods that reduce emotion to a single label or treat sentiment as an inherent property of a topic. These nuances matter especially in high-stakes contexts such as college admissions or financial aid decisions. Biographies convey not only what happened but also how individuals frame those experiences—whether in terms of success, failure, resilience, or agency. This attitudinal layer adds meaning that proves critical for decision-making yet remains difficult for current methods to capture.
We present the BET framework as a response to this challenge by explicitly decoupling the thematic layer from emotional theme detection. The thematic layer is modeled using unsupervised topic representations derived from contextual embeddings, which capture latent semantic regularities across narratives. Emotional and sociocultural information is introduced separately through expert-defined lexicon categories derived from LIWC and aligned with sentence-level embeddings. This separation allows us to represent what is discussed and how it is framed as distinct but related analytical dimensions, rather than conflating content and affect within a single representation.
A central implication of this design is that BET does not infer emotional tone from topic embeddings themselves. Topics extracted using BERTopic serve solely as an organizational scaffold for grouping semantically related content. Emotional themes are identified at the sentence level by computing semantic similarity between latent sentence representations and embeddings of explicit emotional and sociocultural categories. Similarity scores, therefore, indicate alignment between an utterance and a given category, rather than attributing a single emotional value to an entire topic. This formulation enables BET to capture mixed and even contrasting emotional framings within the same topic. For example, narratives describing adverse circumstances, such as illness or economic hardship, may simultaneously express distress, hope, determination, or pride. Instead of enforcing a single dominant emotional label, BET preserves a distribution of emotional-theme signals within topic-specific narratives. We therefore conceptualize emotional themes as contextual and relational properties of discourse that emerge from localized linguistic choices, rather than as fixed attributes of thematic content.
These capabilities have direct relevance for high-stakes institutional evaluation, such as holistic college admissions, in which personal qualities, experiences, and context are assessed alongside previous achievements [60]. BET incorporates emotional and sociocultural framing into narrative analysis, enabling evaluators to distinguish how applicants describe similar life events: whether they emphasize agency or circumstance, resilience or hardship, or integrate multiple orientations. This distinction matters for two reasons. First, research on social mobility emphasizes that goal-engagement strategies (particularly agency, resilience, and social engagement) predict success in the transition to higher education [61, 62]. BET systematically identifies these behavioral orientations in naturally occurring narratives and tracks how they evolve over time. Second, because certain themes co-occur with specific emotional orientations, evaluators unfamiliar with the full distribution of these patterns may inadvertently introduce bias. BET makes these distributions explicit, supporting more equitable assessment and helping identify applicants who may need additional institutional support.
The multilingual and lexicon-aware design of BET further strengthens its applicability. Many existing narrative analysis tools are optimized for English and relatively homogeneous text genres, which limits their usefulness in linguistically diverse settings. By demonstrating the applicability of BET to both English and Hebrew and grounding emotional themes in language-specific lexica curated with domain experts, we show that it is feasible to construct analysis pipelines that respect linguistic and cultural variation rather than treating it as noise. This is an important consideration for the development of equitable computational systems in multilingual societies.
While we use emotion and theme as our case study, the framework is generalizable: researchers can define any lexicon relevant to their analytic goals, enabling applications beyond affective framing to other dimensions of narrative structure and meaning.
Despite these contributions, several limitations should be acknowledged. First, due to privacy, ethical, and legal constraints, the financial aid application dataset cannot be publicly released, nor can full-length narrative examples be provided. This restriction limits reproducibility and prevents external researchers from directly validating the framework within this specific institutional context.
Second, the outputs produced by BET are not self-interpreting and require analysis by domain experts. Interpreting emotional and behavioral themes necessarily depends on knowledge of the relevant institutional, cultural, and social contexts, and the framework is intended to support, rather than replace, qualitative and sociological interpretation.
Third, while LIWC provides a well-established foundation for incorporating explicit psychological and sociocultural knowledge, its full English lexicon remains proprietary. Restricted access may limit coverage of certain emotional or sociocultural dimensions and constrain reproducibility in settings where licensed resources are unavailable. Broader availability of open, multilingual lexica would facilitate wider adoption of the framework and enable more comprehensive and transparent category definitions, particularly for underrepresented languages and domains.
7. Conclusions
The exponential growth of large-scale textual data within the social sciences presents significant methodological challenges regarding the extraction of meaningful insights through empirically validated analytical frameworks. Current topic modeling approaches predominantly emphasize content-based thematic classification while failing to account for the writer’s sociocultural positionality, a critical dimension through which meaning is constructed and interpreted. Social locations and subjective standpoints can imbue ostensibly similar topics with distinctly different meanings, yet these latent dimensions remain underexplored in existing frameworks.
Addressing this methodological gap, this paper proposes a framework for identifying and analyzing emotional themes within topic structures. Our approach evaluates the thematic composition of documents by integrating explicit knowledge from researcher-defined word sets or official lexicons with latent semantic representations, using semantic similarity as the evaluation metric. Furthermore, our method is language-agnostic and adaptable to any official lexicon, as evidenced by its successful application to both Hebrew, a morphologically rich language, and English. Future research can expand our method not only to include granular analysis of sociocultural themes but also to integrate intersectional dimensions of social stratification, including gender, race, and social class, to more comprehensively theorize and empirically examine the situated nature of human-generated text. Understanding writers’ social locations and sociocultural positionality is critical when analyzing high-stakes personal narratives. Such texts play decisive roles in evaluating candidates across multiple domains: college application essays and financial aid statements in higher education, job applications and promotion materials in professional settings, and research statements or teaching philosophies in academia. Differences in how individuals from different social backgrounds articulate their stories can significantly affect evaluation outcomes, potentially reinforcing existing inequalities. The proposed framework can help mitigate bias by validating marginalized narrative styles and creating space for alternative expressions of merit and struggle.
As directions for future research, several extensions of the proposed framework are worth pursuing. First, BET can be applied to additional languages and sociocultural contexts, provided that appropriate lexical resources and domain expertise are available. This includes extending the framework beyond educational and financial aid narratives to other high-stakes domains such as healthcare communication, employment-related texts, and legal narratives, where emotional framing and sociolinguistic positioning play a critical role in decision-making.
Second, future work may incorporate richer or fully open multilingual lexica, enabling broader coverage of emotional, behavioral, and sociocultural categories and improving transparency and reproducibility. In parallel, expert-guided refinement of category definitions could be explored to tailor emotional and behavioral themes to specific institutional or cultural settings.
Finally, BET lends itself naturally to longitudinal and intervention-based analyses. Future studies could examine how emotional and behavioral framing evolves over time, for example before and after participation in educational, social, or therapeutic programs, or across major societal events. Evaluating the framework in such settings would further clarify its utility for understanding narrative change and for supporting reflective and equitable decision-making processes.