Article

Does Variation in Lexical Sentiment Scores Reflect Emotional Polysemy and Ambivalence?

by Andreas Baumann 1,2,3
1 Department of German Studies, University of Vienna, 1010 Vienna, Austria
2 Research Network Data Science, University of Vienna, 1010 Vienna, Austria
3 Austrian Centre for Digital Humanities, Austrian Academy of Sciences, 1010 Vienna, Austria
Languages 2026, 11(6), 118; https://doi.org/10.3390/languages11060118
Submission received: 5 December 2025 / Revised: 6 May 2026 / Accepted: 5 June 2026 / Published: 11 June 2026

Abstract

To measure the emotional meaning of words, numerical sentiment scores are ascribed to them. However, within individual words these scores show variation: within and across annotation surveys, across linguistic contexts, and across semantic neighbors. While such variation could be set aside as undesirable noise, this study examines to what extent variation in lexical sentiment scores is in fact informative about the degree of emotional ambiguity of words. Four different ways of estimating emotional polysemy and ambivalence are employed to analyze a set of 117 German words. Data from 16 sentiment dictionaries, an additional sentiment survey conducted for this study, automatically annotated contexts drawn from a contemporary German corpus, and pre-trained word embeddings were used for this purpose. These estimates are compared against subjectively rated ambivalence collected through crowdsourcing. It is shown that only variation within and across surveys robustly relates to subjective ambivalence. Context and neighborhood-based estimates, both of which are inherently sensitive to lexical frequency, cannot be shown to be related to ambivalence. This suggests (i) that variation in lexical sentiment scores across dictionaries and annotators, but not across semantic neighbors and contexts, carries information about emotional meaning, and is hence valuable for cognitive-variationist research, and (ii) that speakers’ retrieval of positive and negative senses of words, in judging the degree of ambivalence, is not strongly affected by frequency, which is fundamental to NLP methods that build on distributional semantics. This implicitly challenges usage-based approaches to semantics that consider frequency as a predominant factor.

1. Introduction

Words convey emotional meaning. When we are angry about a particular person or situation, we make different lexical choices than when we indulge in joy. Similarly, words evoke emotions when we hear or read them (Kauschke et al., 2019). Perhaps the most fundamental aspect of emotion is that of valence, also referred to as polarity or sentiment, which is about evaluating whether something is ‘pleasant’ or ‘unpleasant’ (or equivalently: ‘good’ or ‘bad’; ‘positive’ or ‘negative’), or indeed something in between. It is fundamental on the evolutionary level in that it relates to approaching vs. withdrawing from something (Wielgopolan & Imbir, 2023), in extreme cases a matter of life and death (consider the actual words safe and danger). And it is fundamental on the level of cultural evolution of language, where shifts in the emotional meaning of words belong to the most evident and well-studied phenomena in language change and inform us through the linguistic lens about how societies and cultures change (Blank, 1993; Keller, 1994; Cook & Stevenson, 2010; Acerbi et al., 2013; Hilpert, 2019; Buechel et al., 2016; Tiessler et al., 2025). Likewise, lexical sentiment modulates processing (Kousta et al., 2009; Kauschke et al., 2019), reading (Kissler & Herbert, 2013), and age of acquisition of words (Ponari et al., 2018) as well as their semantic development over an individual’s life span (Martínez-Huertas et al., 2025).
Information about emotional properties of words, and most prominently about lexical sentiment, is typically harvested through surveys (Schröder, 2011; Warriner et al., 2013), sometimes in combination with computational propagation methods through which numerical values that represent the emotional meaning of a word are inferred from those of its semantic neighbors (Buechel et al., 2016; Hamilton et al., 2016a; Köper & Schulte im Walde, 2016). In such surveys, participants are asked to rate the emotional dimension of interest, such as sentiment, on a given scale (e.g., from −1 to 1; on a 7-point Likert scale; in terms of nominal categories ‘negative’, ‘neutral’, ‘positive’, etc.). Typically, more than one rating is collected per word. The resulting sentiment (or more generally, emotion) dictionaries are, in essence, lists of words together with aggregated scores (such as the arithmetic mean of all scores collected per word). Such sentiment dictionaries are used for psycholinguistic or cognitive research (e.g., to address the question of how emotional meaning and lexical processing interact; see Kauschke et al. (2019), and others cited above); or they serve as input for unsupervised emotion detection algorithms (Taboada et al., 2011).
There is variation in sentiment lexica on two levels. First, sentiment ratings (annotations) collected from multiple individuals for a single word could differ from each other. That is, there could be low inter-annotator agreement for some words within a sentiment dictionary. Second, aggregated sentiment scores from multiple dictionaries for a single word could disagree. That is, sentiment dictionaries as models of the emotional meaning of (a share of) the lexicon could diverge. In both cases, if sentiment is measured on a cardinal scale, the respective score distributions display a comparatively high standard deviation. For the purpose of unsupervised emotion detection, this kind of variation is undesirable. Neither does one want to have unreliable aggregated scores, as errors would be transferred to words (or texts) that need to be labeled with respect to their sentiment, and hence reduce an algorithm’s accuracy; nor does one want an emotion detection algorithm to be overly sensitive to the sentiment dictionary that it is based on. It would, at first sight, not speak to the reliability of a method or analysis, were it to predict a negative sentiment for some target sentence if one sentiment dictionary is used, but a positive sentiment if informed by a different sentiment dictionary.
As a consequence of this, researchers in computational linguistics have invested efforts in finding methods that aim at maximizing agreement in sentiment annotations, hence minimizing word-level variation (Kiritchenko & Mohammad, 2016), or filtering words depending on their amount of variation (Knupleš et al., 2023). Likewise, psycholinguistic research has framed the observation that the ratings of midscale words (i.e., words that on average score in the middle of the available spectrum, e.g., 0 on a scale from −1 to 1; 4 on a 7-point Likert scale) display a high standard deviation as problematic, in fact dubbing it the “midscale disagreement problem” (Paisios et al., 2023, p. 1). In a nutshell, variation, as manifested in terms of high standard deviations, is seen as a methodological artifact that potentially biases analyses involving such ratings.
While these concerns are of course legitimate, this variation is potentially informative if one adopts a cognitive-variationist perspective (Tagliamonte, 2011; Geeraerts, 2005). After all, each survey represents a variety of speakers, each of whom is characterized by their own linguistic setup construed by individual linguistic experiences. Moreover, multiple surveys represent a variety of subsets of the speaker population, which differ in their age, gender, and socio-geographic as well as cultural profile (Ruette et al., 2014), a view that is also embraced in a subbranch of computational linguistics referred to as “perspectivism” (Plank, 2022). For instance, it becomes evident from the meta-analysis of German sentiment dictionaries by Kern et al. (2021) that they differ considerably in how and from whom sentiment scores were retrieved.
At the same time, variation in sentiment scores could inform us about emotional semantics and cognition. Consider, for instance, a word that displays a neutral sentiment score when aggregating over all collected sentiment scores. While it is possible that the word’s sentiment is indeed neutral, it could in fact also be that it is emotionally ambivalent and is perceived as positive in some situations but negative in others (Cacioppo et al., 2011). That is, it can have positive as well as negative connotations. The German adjective scharf, for instance, has the negative meaning ‘sharp/hot/fierce’ as well as the positive meaning ‘attractive’. Of course, it is not necessary that a word’s senses are located at opposing ends of the valence spectrum. A word can be positive on average while having, say, one mildly positive and one extremely positive reading, and the same applies, mutatis mutandis, to the negative scale. The German noun Teufel, for example, means ‘devil’ as well as ‘poor guy’, the latter of which is slightly less negative than the former. Still, both senses are negative. We will refer to the former cases as emotional ambivalence (Kuhlmann et al., 2017; Wielgopolan & Imbir, 2023), and to the latter more general case in which senses may or may not have opposing valence as emotional polysemy (and in line with the literature, this will be operationalized by means of the standard deviation). Evidently, both phenomena represent instances of ambiguity.
This variation is not only present in surveys, but it also manifests itself in text data, which consists of expressed words and their semantics. An emotionally ambivalent word is expected to surface in both negative and positive contexts, and likewise its semantic neighbors (i.e., other words it shares its contexts with) must also be partially negative and partially positive. Thus, surveys and texts represent two sides of the same coin. While variation in (and across) sentiment surveys informs about perceptive ambiguity (ambivalence, polysemy), variation in the sentiment of contexts and that of a word’s neighbors corresponds to ambiguity (ambivalence, polysemy) in production (cf. Kuhlmann et al., 2017).
The question to be asked in this contribution is this: to what extent does variation in lexical sentiment scores reflect emotional polysemy and ambivalence when considering surveys, semantic neighborhoods, and contexts? To address this question, multiple data resources will be used. Variation will be measured across sentiment dictionaries, within a single survey, in semantic neighborhoods derived through word embeddings, i.e., numerical representations of lexical meaning, and in actual contexts that a word surfaces in. This will be complemented with a crowd-sourced gold-standard dataset of subjective ambivalence ratings. Several covariates will be considered as well. Note that we will restrict our analysis to the dimension of valence, i.e., sentiment. Other emotional dimensions, such as arousal or dominance (Russell, 1980; Bakker et al., 2014), will not be covered in this study (although ambivalence in them has been subject to research as well; Wielgopolan & Imbir, 2023).
As will be seen, subjective ambivalence corresponds well (i) to variation within and across surveys, but (ii) not to variation across semantic neighbors and linguistic contexts. We take this to show, first, that variation in survey-based sentiment scores is a feature rather than a bug in that it provides relevant information about emotional ambiguity (because of (i)); and second, that cognitive representations of positive and negative senses are only weakly affected by frequency distributions (because of (ii)). Individuals can retrieve the sentiment of senses even though they are rare.
The paper is structured as follows. Section 2 describes the data and how emotional polysemy and ambivalence were measured. Section 3 presents the results of the correlational analysis. Section 4 and Section 5 discuss and contextualize the results, outline limitations, and summarize the findings of this study.

2. Materials and Methods

Different ways of measuring emotional polysemy and ambivalence were adopted in this study: (i) through variation across sentiment dictionaries (dct), (ii) through variation across sentiment scores of semantically neighboring words (nbh), (iii) through variation in sentiment scores across the contexts that words are used in (ctx), (iv) through variation across participant ratings within a single survey (srv), and (v) through collecting subjective ratings about the extent to which words are perceived as emotionally ambivalent. All approaches will be described in the subsequent sections, followed by a description of the statistical modeling procedure employed for uncovering relationships among measures and covariates. Figure 1 illustrates approaches (i–iv) as well as subjective ratings (v) by means of sample data for the German adjective scharf (‘sharp/hot’), representing an ambivalent word. Data and code are available at https://gitlab.com/andreas.baumann/variation_and_emotional_ambivalence/ (accessed on 4 June 2026). See Supplement File S1 for a table of aggregated data.

2.1. Dictionary-Based Polysemy and Ambivalence

A selection of 16 German sentiment dictionaries was assessed, including all sentiment dictionaries reviewed in Kern et al. (2021) as well as scores compiled by Schröter and Schroeder (2017): AffDict (Schröder, 2011), AffMeaning (Ambrasat et al., 2014), AffNorms (Köper & Schulte im Walde, 2016), ALPIN (Kolb et al., 2022), ANGST (Schmidtke et al., 2014), BAWL.R (Võ et al., 2009), EmotionDict (Klinger et al., 2016), LANG (Kanske & Kotz, 2010), Morph (Ruppenhofer et al., 2017), PolarityClues (Waltinger, 2010), Polart (Klenner et al., 2009), SentiMerge (Emerson & Declerck, 2014), SentiWS (Remus et al., 2010), SePL (Rill et al., 2012), WordNorms (Lahl et al., 2009), and DeveL (Schröter & Schroeder, 2017).¹ In line with Kern et al. (2021), all scores were scaled to the interval [−1,1] going from negative to positive with 0 denoting a neutral score. Outliers, defined as scores belonging to the bottom and top 2.5%, were excluded prior to scaling. All dictionaries were subsequently merged.
The whole dataset consists of more than 400,000 words, but most of them (N = 347,503) only surface in a single dictionary. However, since we are interested in how far scores vary across dictionaries, we need to limit the selection of words to those that surface in a suitable number of dictionaries. For the purpose of this study, a limit of at least 10 dictionaries per word was adopted, yielding a set of 117 words. Most of them (N = 86) surface in ten dictionaries, 19 surface in 11 dictionaries, 7 in 12 dictionaries, 3 in 13 dictionaries, and only 2 (gewinnen, ‘to win’; traurig, ‘sad’) in 14 dictionaries. None of them surface in all dictionaries. See Figure 1 (red box) for sample data of a single word.
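The merging and filtering steps described above can be sketched in a few lines of Python. This is a minimal illustration and not the study's code: the function names, the toy dictionaries, and the assumption that every dictionary has known scale endpoints that are mapped linearly onto [−1,1] are hypothetical, and the exclusion of the bottom and top 2.5% of scores is omitted for brevity.

```python
from collections import defaultdict


def rescale(score, lo, hi):
    """Map a score from the scale [lo, hi] linearly onto [-1, 1]."""
    return 2.0 * (score - lo) / (hi - lo) - 1.0


def merge_dictionaries(dictionaries):
    """Collect, per word, the rescaled scores of every dictionary listing it.

    `dictionaries` maps a dictionary name to a tuple (lo, hi, {word: score}).
    """
    merged = defaultdict(dict)
    for name, (lo, hi, scores) in dictionaries.items():
        for word, score in scores.items():
            merged[word][name] = rescale(score, lo, hi)
    return merged


def select_targets(merged, min_dicts=10):
    """Keep only words that surface in at least `min_dicts` dictionaries."""
    return {w: s for w, s in merged.items() if len(s) >= min_dicts}


# Toy data with invented values (not taken from the study).
toy = {
    "A": (1, 7, {"scharf": 4, "traurig": 1}),         # 7-point scale
    "B": (-1, 1, {"scharf": 0.5, "traurig": -0.8}),   # already in [-1, 1]
    "C": (0, 100, {"traurig": 10}),                   # slider scale
}
merged = merge_dictionaries(toy)
targets = select_targets(merged, min_dicts=3)
```

With the threshold reported in the text, a call such as `select_targets(merged, min_dicts=10)` would play the role of the selection that yields the 117 target words.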
The mean sentiment score $\mu_{dct}$ was computed for each word. Two different measures were used to operationalize emotional polysemy and ambivalence. Dictionary-based emotional polysemy was straightforwardly measured as the standard deviation $\sigma_{dct}$ across all scores present for each word separately. If the standard deviation of a word’s scores is small, scores from different dictionaries center around the mean sentiment score of that word. If it is large, then scores are more dispersed around $\mu_{dct}$. This reflects the concept of emotional polysemy well: highly polysemous words do not necessarily need to have senses on both ends of the sentiment spectrum. It is perfectly viable for an emotionally polysemous word to have, say, one mildly positive sense alongside a very positive one.
Measuring emotional ambivalence requires a bit more work. Intuitively, an emotionally ambivalent word should have negative as well as positive senses, and negative and positive senses should be represented equally. To capture this intuition, we first operationalize neutrality $\nu_{dct}$ as the closeness of a word’s mean sentiment to 0, measured as $\nu_{dct} = 1/\exp(|\mu_{dct}|)$ (Deza & Deza, 2009), and ambivalence as the product of standard deviation and neutrality, i.e.,
$$\alpha_{dct} = \sigma_{dct} \cdot \nu_{dct} = \sigma_{dct} / \exp(|\mu_{dct}|). \tag{1}$$
Thus, a word is highly ambivalent when it is strongly emotionally ambiguous, but only if its mean sentiment is close to neutrality, i.e., if its senses occupy the negative and the positive regime of the sentiment spectrum (Kuhlmann et al., 2017). Emotional polysemy $\sigma_{dct}$ and ambivalence $\alpha_{dct}$ were computed for all 117 words.
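The two measures can be illustrated with a short Python sketch. It rests on two assumptions that the text does not settle: neutrality is computed from the absolute value of the mean (so that it expresses closeness to 0 on both sides of the scale), and the standard deviation is the sample standard deviation. Function names and toy scores are invented for illustration.

```python
import math
import statistics


def polysemy(scores):
    """Emotional polysemy: standard deviation of a word's sentiment scores.

    Assumption: sample standard deviation (n - 1 in the denominator).
    """
    return statistics.stdev(scores)


def neutrality(scores):
    """Closeness of the mean score to 0; equals 1 for a perfectly neutral mean.

    Assumption: the absolute value of the mean is used.
    """
    return 1.0 / math.exp(abs(statistics.mean(scores)))


def ambivalence(scores):
    """Emotional ambivalence: polysemy weighted by neutrality."""
    return polysemy(scores) * neutrality(scores)


# Invented scores on the [-1, 1] scale.
mixed = [-0.8, -0.6, 0.6, 0.8]      # senses on both sides, mean of 0
one_sided = [0.1, 0.3, 0.7, 0.9]    # dispersed, but positive throughout

print(polysemy(mixed), ambivalence(mixed))
print(polysemy(one_sided), ambivalence(one_sided))
```

Both toy words are polysemous in the sense of a non-zero standard deviation, but only the first receives a high ambivalence value, because the second is penalized for its clearly positive mean.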

2.2. Neighborhood-Based Polysemy and Ambivalence

The emotional content of a word can be inferred from that of its semantic neighbors. Indeed, many emotion dictionaries were built based on this insight: Köper and Schulte im Walde (2016) (AffNorms) and Li et al. (2017) have used word embeddings based on semantic similarities to regress emotional (and other semantic) scores for a large number of words given a relatively small set of seed words for which ratings are already available. Buechel et al. (2016) and Hamilton et al. (2016a) use similar methods to infer emotion scores for historical periods, and Buechel et al. (2020) further develop techniques to propagate emotion scores to compile emotion dictionaries for 91 languages. In all these studies, the goal was to infer an average score for a semantic dimension of interest per word, reflecting that word’s prototypical valence (i.e., sentiment), arousal, dominance, etc.
If information about the distribution of the sentiment of a word’s semantic neighbors can be used to learn about its own average sentiment, then the same information should be of use for learning about the extent to which the sentiment of a word is dispersed, i.e., emotional polysemy and ambivalence. For this study, we analyze the distribution of sentiment scores of all words in the near semantic neighborhoods of each of the 117 target words, proceeding as follows.
We used the German CoNLL17 model from the NLPL embedding repository (Fares et al., 2017), retaining only embeddings for those words that surface in the combined sentiment lexicon of 400,000 words derived in Section 2.1 above. Embeddings in that model have 100 dimensions and were trained with word2vec (Mikolov et al., 2013). Note that words are lowercased in this model, which potentially induces ambiguities by merging nouns (which are capitalized in German) with items of other (typically lowercased) word classes that then become homographs. For that reason, all words in the combined sentiment lexicon, including the 117 target words, needed to be lowercased as well.
For each target word, the 50 closest words (excluding the target word itself) by means of cosine similarity to the target were extracted (Figure 1, blue box) and the corresponding mean sentiment scores for each of the 50 neighbors were retrieved from the combined sentiment lexicon (see Figure A1 for a robustness check covering a range of different neighborhood sizes). Subsequently, neighborhood-based emotional polysemy $\sigma_{nbh}$ was computed as the standard deviation of all 50 sentiment scores in the neighborhood, and neighborhood-based emotional ambivalence $\alpha_{nbh}$ was computed, mutatis mutandis, following Equation (1) above.
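The neighborhood procedure can be sketched as follows. The sketch uses plain Python with tiny two-dimensional toy vectors instead of the 100-dimensional embeddings used in the study; the function names, the toy vocabulary, and all values are hypothetical, and the sample standard deviation is an assumption.

```python
import math
import statistics


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def nearest_neighbors(target, embeddings, k=50):
    """The k words most similar to `target`, excluding the target itself."""
    sims = [(cosine(embeddings[target], vec), word)
            for word, vec in embeddings.items() if word != target]
    sims.sort(reverse=True)
    return [word for _, word in sims[:k]]


def neighborhood_polysemy(target, embeddings, lexicon, k=50):
    """Standard deviation of the mean sentiment scores of the k neighbors."""
    neighbors = nearest_neighbors(target, embeddings, k)
    return statistics.stdev([lexicon[word] for word in neighbors])


# Toy embeddings and mean sentiment scores (invented values).
embeddings = {
    "scharf": (1.0, 0.0),
    "heiss": (0.9, 0.1),
    "attraktiv": (0.8, 0.3),
    "tisch": (0.0, 1.0),
}
lexicon = {"heiss": -0.4, "attraktiv": 0.8, "tisch": 0.0}

print(nearest_neighbors("scharf", embeddings, k=2))
print(neighborhood_polysemy("scharf", embeddings, lexicon, k=2))
```

The sketch presupposes that every embedded word has an entry in the sentiment lexicon, which mirrors the restriction of the embedding vocabulary described above.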

2.3. Context-Based Polysemy and Ambivalence

Lexical sentiment is often derived indirectly from labeled text snippets (e.g., sentences, paragraphs, or other text chunks) serving as contexts that a word surfaces in (van Atteveldt et al., 2021). For instance, Rill et al. (2012) infer word-level sentiment scores (SePL) from review ratings, and in Kolb et al. (2022) we apply the SPLM algorithm (Almatarneh & Gamallo, 2018) to text-chunk-level human sentiment annotations collected through crowdsourcing to construct a domain-specific sentiment dictionary.
Arguably, then, the variation in emotional usage of a word, and hence emotional polysemy and ambivalence, should be reflected in differential contexts that the word surfaces in. So, it is expected that a highly ambivalent word occurs both in negative as well as in positive utterances. We applied a sentiment classification model to samples of sentences to measure context-based polysemy and ambivalence in the following steps.
First, a sample of 50 sentences was randomly drawn from the DWDS Core 21 corpus² for each word separately. This corpus is balanced with respect to genre (fiction, practical literature, academic, journalism) and contains German texts (15 million tokens) from the first decade of the 21st century. Each of the sampled sentences shows at least one occurrence of the respective target word. In a second step, we made use of the model implemented in the Python library germansentiment 1.1.0 (Guhr et al., 2020), a BERT-based sentiment classifier. It was developed as a domain-general sentiment classifier, having been trained on diverse data resources (reviews, social media messages, encyclopedic texts, utterances from human–computer interactions). The model outputs the most likely of three sentiment categories (‘negative’, ‘neutral’, ‘positive’) together with a three-dimensional probability distribution over these categories. To make the model output comparable with the scores on the one-dimensional sentiment scale used in our study, a scalar sentiment score was computed as the difference between the probabilities Pr(‘positive’) and Pr(‘negative’), naturally yielding sentiment scores in the interval [−1,1].
Given the 50 sentiment scores computed for each word in this way (Figure 1, purple box), context-based emotional polysemy $\sigma_{ctx}$ and ambivalence $\alpha_{ctx}$ were again computed as the standard deviation of all scalar sentiment scores and as per Equation (1), respectively.
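The mapping from class probabilities to a scalar score can be illustrated as below. The sketch does not call the classifier itself; it assumes that the per-sentence probabilities are already available as dictionaries keyed by class label, which is a hypothetical data layout chosen for illustration, and it uses invented probabilities.

```python
import statistics


def scalar_sentiment(probs):
    """Pr('positive') minus Pr('negative'); lies in [-1, 1] by construction."""
    return probs["positive"] - probs["negative"]


def context_polysemy(prob_list):
    """Standard deviation of the scalar scores over a word's sampled contexts.

    Assumption: sample standard deviation.
    """
    return statistics.stdev([scalar_sentiment(p) for p in prob_list])


# Invented classifier outputs for three sentences containing one target word.
contexts = [
    {"negative": 0.10, "neutral": 0.20, "positive": 0.70},
    {"negative": 0.05, "neutral": 0.90, "positive": 0.05},
    {"negative": 0.80, "neutral": 0.15, "positive": 0.05},
]
scores = [scalar_sentiment(p) for p in contexts]
print(scores, context_polysemy(contexts))
```

Because a sentence classified as clearly neutral receives a score near 0, a word that mostly occurs in neutral sentences ends up with scores clustered around zero, whatever its lexical sentiment.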

2.4. Survey-Based Polysemy and Ambivalence

In Section 2.1, variation across dictionaries, many of which were collected through surveys, was examined for measuring emotional polysemy and ambivalence. It could be, however, that emotional polysemy (and ambivalence) can already be derived from the distribution of ratings collected within a single survey. Thus, a survey was conducted to collect a number of ratings per word in line with the sets of scores in Section 2.2 and Section 2.3. For this survey, which was implemented in SoSciSurvey, 50 anonymous participants were recruited through the crowdsourcing platform Prolific Academic (one of whom did not meet the quality criteria, see below). Only Prolific users with German as their first language living in Germany, Austria, and Switzerland were admitted. The final sample consists of 46 individuals from Germany, two from Austria, and one from Switzerland. There were 21 female and 28 male participants with a mean age of 38.2 years (range: 21 to 73 years).
After a legal disclaimer and an introduction to the task, individuals were first asked to rate 140 words regarding their sentiment with a slider going from “sehr negativ” [‘very negative’] to “sehr positiv” [‘very positive’], with the middle of the scale representing neutral sentiment (initial question: “Wie negativ oder positiv sind die folgenden Wörter für Sie?” [‘How negative or positive are the following words for you?’]). No contexts were provided during the annotation process so as to avoid biases towards specific senses. In the backend, slider positions were mapped to a scale from 0 (negative) to 100 (positive). Of these 140 words, 117 were the target words and 23 were test words for quality control. For this purpose, 11 words were picked from the most negative items and 12 words were picked from the most positive items in the combined sentiment lexicon of 400,000 words (Section 2.1) beforehand. None of the test words belong to the target words.
In the second part of the survey, individuals were asked to indicate all target words that can have a negative as well as a positive meaning (“Welche der folgenden Wörter können für Sie sowohl eine positive als auch eine negative Bedeutung haben? Klicken Sie die jeweiligen Wörter bitte an!” [‘Which of the following words can have a positive as well as a negative meaning for you? Please, click on the respective words!’]). In the final part of the survey, individuals were asked to provide additional information (age, gender, country). On average, filling in the survey took 11.49 min (sd = 2.53). Participants received GBP 2.70 for their efforts.
Quality was assessed by comparing ratings for the positive and negative test words against each other, and by examining the number of missing sentiment ratings. More specifically, Cohen’s d of the difference between ratings for positive test words and ratings for negative test words together with a 95% confidence interval for d was computed. In order to pass the quality check, d must be a significant effect of at least medium strength, and more than 90% of all sentiment scores must be present. All except one individual met these criteria, so that we ended up with 49 participants, i.e., mostly 49 ratings per target word (Figure 1, green box).
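A per-participant quality check of this kind can be sketched as follows. Several details are assumptions that the text does not specify: Cohen's d is computed with the pooled standard deviation, the 95% interval uses a common large-sample approximation of the standard error of d, "significant" is read as the interval excluding zero, and "medium strength" is read as d of at least 0.5. All names and ratings are invented.

```python
import math
import statistics


def cohens_d(a, b):
    """Cohen's d for two independent samples (pooled standard deviation)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)


def d_interval(d, na, nb, z=1.96):
    """Approximate 95% interval for d (large-sample standard error)."""
    se = math.sqrt((na + nb) / (na * nb) + d * d / (2 * (na + nb)))
    return d - z * se, d + z * se


def passes_quality_check(pos, neg, n_rated, n_items, min_d=0.5, min_rated=0.9):
    """True if positive test words are rated reliably higher than negative ones
    and more than `min_rated` of all items received a rating."""
    d = cohens_d(pos, neg)
    lower, _ = d_interval(d, len(pos), len(neg))
    return lower > 0 and d >= min_d and n_rated / n_items > min_rated


# Invented slider ratings (0-100) for 12 positive and 11 negative test words.
attentive_pos = [80, 85, 90, 75, 95, 88, 92, 79, 84, 91, 87, 83]
attentive_neg = [10, 20, 15, 25, 5, 12, 18, 22, 8, 14, 16]
careless_pos = [50, 60, 40, 55, 45, 52, 48, 58, 42, 50, 51, 49]
careless_neg = [51, 59, 41, 54, 46, 50, 49, 57, 43, 50, 52]

print(passes_quality_check(attentive_pos, attentive_neg, 140, 140))
print(passes_quality_check(careless_pos, careless_neg, 140, 140))
```

A participant who rates positive and negative test words alike fails the check regardless of how many items were rated, and an attentive participant fails it if too many ratings are missing.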
As for the previous measures, survey-based emotional polysemy $\sigma_{srv}$ and ambivalence $\alpha_{srv}$ were computed as the standard deviation of all sentiment ratings collected in this way (rescaled to the interval [−1,1]) and according to Equation (1), respectively. In addition, rated ambivalence was measured for each word as the fraction of all individuals that considered that word as having both a negative and a positive meaning, given the results from the second part of the survey (see pie-chart in Figure 1 for the distribution of answers for the word scharf, ‘sharp/hot’). Rated ambivalence hence represents a direct subjective estimate of how emotionally ambivalent a word is. Thus, rated ambivalence corresponds to “explicit ambiguity”, while survey-based emotional polysemy corresponds to what was referred to as “implicit ambiguity” by Poesio and Artstein (2005, p. 83) and Uma et al. (2021, p. 1402).
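The rescaling of slider positions and the computation of rated ambivalence amount to two one-line operations, sketched here with invented data. The representation of the second survey part as one set of selected words per participant is a hypothetical layout chosen for illustration.

```python
def rescale_slider(position):
    """Map a slider position in [0, 100] linearly onto [-1, 1]."""
    return position / 50.0 - 1.0


def rated_ambivalence(selections, word):
    """Fraction of participants who marked `word` as having both a positive
    and a negative meaning; `selections` holds one set of words per participant."""
    return sum(word in chosen for chosen in selections) / len(selections)


# Invented answers of four participants to the second part of the survey.
selections = [{"scharf", "wild"}, {"scharf"}, set(), {"wild", "Feuer"}]

print(rescale_slider(75))
print(rated_ambivalence(selections, "scharf"))
```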

2.5. Covariates and Modeling Procedure

Three additional covariates were considered in this study: (i) the number of senses, as listed in DWDS for each word; (ii) token frequency, fetched from DWDS and subsequently log transformed; and (iii) word class (noun, verb, adjective). It is possible that rated ambivalence correlates with these covariates, e.g., because it might be easier for participants to retrieve positive and negative senses if a word is highly frequent. For this reason, a controlled version of rated ambivalence was computed. This was done by fitting a linear model of rated ambivalence depending on the number of senses, (log) frequency, and word class. The residuals of this model (i.e., all variation in the data that cannot be explained by covariates (i–iii)) were then taken to define controlled rated ambivalence.
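Residualizing a variable on covariates can be sketched with ordinary least squares in plain Python. The sketch assumes that word class enters the model as two dummy variables with nouns as the reference level (including all three indicator variables alongside an intercept would make the design matrix singular); this coding, the function name, and all values are assumptions made for illustration.

```python
def ols_residuals(X, y):
    """Residuals of an ordinary least squares fit of y on the columns of X.

    X is a list of predictor rows; an intercept column is added here.
    """
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        A[c] = [v / A[c][c] for v in A[c]]
        for r in range(p):
            if r != c:
                factor = A[r][c]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[c])]
    beta = [A[i][p] for i in range(p)]
    return [yi - sum(b * x for b, x in zip(beta, r)) for r, yi in zip(rows, y)]


# Invented data: (number of senses, log frequency, is_verb, is_adjective).
X = [
    (1, 2.0, 0, 0), (3, 3.5, 0, 0), (2, 1.0, 1, 0), (5, 4.0, 1, 0),
    (4, 2.5, 0, 1), (2, 3.0, 0, 1), (6, 4.5, 0, 1), (1, 1.5, 0, 0),
]
rated = [0.10, 0.30, 0.20, 0.60, 0.50, 0.20, 0.70, 0.05]

# The residuals play the role of a covariate-controlled version of `rated`.
residuals = ols_residuals(X, rated)
print(residuals)
```

By construction, the residuals sum to zero and are uncorrelated with each covariate, which is what makes them usable as a covariate-controlled version of the original variable.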
We restrict ourselves to a correlational analysis of the pairwise relationships between all polysemy and ambivalence measures (including rated ambivalence and controlled rated ambivalence) and their covariates. For each pairwise relationship, Pearson’s R was computed, which is equivalent to fitting a univariate linear model to the z-transformed variables (one acting as predictor, and one as outcome). Word class was one-hot encoded into three binary variables (for noun, verb, and adjective) for this purpose. Each way of estimating ambivalence was individually described with the help of violin charts ordered by decreasing ambivalence. In addition, pairwise correlations of all ambiguity measures, number of senses, and frequency were computed for each word class separately.
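The equivalence between Pearson's correlation coefficient and the slope of a simple linear model fitted to z-transformed variables can be checked numerically. The following sketch uses invented values and hypothetical function names.

```python
import statistics


def zscore(xs):
    """Standardize values to mean 0 and (sample) standard deviation 1."""
    m, s = statistics.mean(xs), statistics.stdev(xs)
    return [(x - m) / s for x in xs]


def pearson_r(xs, ys):
    """Pearson correlation as the mean product of z-scores (n - 1 denominator)."""
    zx, zy = zscore(xs), zscore(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


def ols_slope(xs, ys):
    """Slope of the least squares line of ys on xs."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Invented values for two measures across six words.
survey_based = [0.20, 0.35, 0.15, 0.50, 0.40, 0.10]
rated = [0.10, 0.45, 0.20, 0.60, 0.30, 0.05]

r = pearson_r(survey_based, rated)
slope = ols_slope(zscore(survey_based), zscore(rated))
print(r, slope)
```

The two printed values agree up to floating-point error, since standardizing both variables gives the predictor unit variance and turns the covariance into the correlation.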

3. Results

The distribution of sentiment scores for all words derived through all methods (dictionary based, neighborhood based, context based, survey based) is illustrated in Figure 2. In this visualization, words are ranked by decreasing ambivalence $\alpha_i$ (where $i$ is dct, nbh, ctx, or srv). It can be clearly seen that, in general, words with high ambivalence display a more dispersed distribution of sentiment scores and a median closer to zero than words with low ambivalence. Of course, this is to be expected, given how ambivalence was defined.
The most ambivalent words, i.e., the ones on the left-hand side of the four panels in Figure 2, are: wild (‘wild’), Feuer (‘fire’), scharf (‘hot/sharp’), frech (‘impudent/bold’), faul (‘lazy/rotten’) for dictionary-based sentiment; Wahrheit (‘truth’), loyal (‘loyal’), lieb (‘dear’), geduldig (‘patient’), froh (‘happy’) for neighborhood-based sentiment; nett (‘nice’), lieb (‘dear’), ehrlich (‘honest’), Lüge (‘lie’), kalt (‘cold’) for context-based sentiment; and Pflicht (‘duty’), sichern (‘secure’), dunkel (‘dark’), scharf (‘hot/sharp’), and Aufstand (‘insurrection’) for survey-based sentiment. The words with the highest values for rated ambivalence (medians colored from light to dark for increasing ambivalence in Figure 2) are wild (‘wild’), scharf (‘hot/sharp’), allein (‘alone’), Feuer (‘fire’), and dunkel (‘dark’).
The four methods differ from each other with respect to the distributional patterns that they yield. For dictionary-based sentiment, and to a lesser extent for survey-based sentiment followed by neighborhood-based sentiment, a clear separation can be seen for non-ambivalent words (words on the right-hand side in Figure 2) in that there are words that are mostly negative (e.g., Furcht, ‘fear’, Schmerz, ‘pain’) as well as words that are mostly positive (e.g., weise, ‘wise’, Hochzeit, ‘wedding’, Sicherheit, ‘security’). In contrast, context-based sentiment does not yield very distinctive locations of central sentiment scores (medians), all of which are essentially around zero. It is only the tails to the left and to the right of the median that indicate if a word is negative or positive (or both). Moreover, there are words that display almost no variation in context-based sentiment scores (e.g., weise, ‘wise’, sichern, ‘secure’, aktiv, ‘active’).
From the distribution of colors in each of the panels of Figure 2 one can learn about the relationship between ambivalence measured through each of the methods and subjectively rated ambivalence. In the case of dictionary-based ambivalence and survey-based ambivalence, a shift from dark (high rated ambivalence) to light (low rated ambivalence) is visible (i.e., there are more dark dots to the left than to the right). For neighborhood-based and context-based ambivalence the distribution of colors seems to be much more arbitrary, suggesting that the latter measures do not align well with subjectively assessed ambivalence.
The correlogram in Figure 3 displays pairwise correlation coefficients between all variables in the dataset. Only statistically non-trivial correlations (at a 95% confidence level) are displayed. The mean scores derived from the four methods (first four variables in the correlogram) show medium to strong positive correlations among each other for all combinations. That is, predictions for mean sentiment as such are matching.
The picture looks drastically different for measures of emotional polysemy (standard deviations $\sigma_i$) and ambivalence ($\alpha_i$, with $i$ being dct, nbh, ctx, or srv). In the case of emotional polysemy, there are no robust correlations among the four methods. In the case of emotional ambivalence, only dictionary-based ambivalence and survey-based ambivalence display a significantly positive correlation (R = 0.28). Unsurprisingly, emotional polysemy and ambivalence are strongly correlated within each method.
Subjectively rated ambivalence displays interesting differential relationships. There are robust and relatively strong correlations with dictionary-based and survey-based ambivalence (as well as with dictionary-based emotional polysemy), but no correlation whatsoever with measures derived through contexts or neighborhoods. The pattern is almost the same when rated ambivalence is controlled for frequency, number of senses, and word class. Indeed, controlling for these covariates does not matter much, seeing that rated ambivalence and its controlled version display a correlation of R = 0.97. In other words, rated ambivalence is not substantially affected by polysemy, frequency, and word class.
Regarding the other measures, some of the covariates show interesting interactions. Frequency is consistently positively correlated with mean sentiment, but negatively with context-based emotional polysemy. However, frequency is not correlated with any of the other polysemy and ambiguity measures (except for neighborhood-based ambivalence). Nouns tend to display lower emotional polysemy and ambivalence as per context-based and neighborhood-based scores. In contrast, adjectives seem to display, on average, higher context-based and neighborhood-based polysemy and ambivalence (e.g., nett, ‘nice’). Verbs seem to be somewhere in the middle. Lexical class does not affect dictionary-based and survey-based measures, though.
The word-class specific analysis (Figure 4) corroborates these results. In nouns (n = 63) and adjectives (n = 39), rated ambivalence is positively correlated with survey-based and dictionary-based ambivalence (and less strongly with dictionary-based polysemy). For verbs, no such correlations can be found, likely due to the small sample size (n = 15). However, there are no consistently significant associations between rated ambivalence and neighborhood-based or context-based measures. Notably, there is a moderate correlation (R = 0.25) between rated ambivalence and neighborhood-based ambivalence in nouns that is absent in the general case (Figure 3), in verbs, and in adjectives.

4. Discussion

4.1. Emotional Polysemy and Ambivalence

We have seen that the sentiment scores that each of the four methods provides match to a large extent. For the field of sentiment analysis, this is of course good news. It tells us that the methods are robust and, in particular, that human-annotated average scores of lexemes by and large coincide with unsupervised methods based on word embeddings as well as more complicated supervised models (here: BERT).
However, this was not the focus of this study. Rather, the goal was to test to what extent variation in scores is informative in the sense that it tells us about the degree to which words are emotionally polysemous and ambivalent. Here, ambivalence was conceptualized as a special case of emotional polysemy: a word is ambivalent if it features a comparable number of positive and negative senses. Both are instances of ambiguity.
The results are clear regarding the different methods of inferring emotional polysemy and ambivalence. To the extent that subjectively rated ambivalence can be seen as a gold standard against which other methods must be tested (see Section 4.2 for limitations, however), it was found that while dictionary-based and survey-based ambivalence match well with rated ambivalence, this was not the case for NLP-based methods, i.e., inferring sentiment through contexts and semantic neighborhoods.
A reason for this mismatch could be that neural-network models that build on distributional semantics (to which both word2vec and BERT, as used here, belong) overemphasize usage frequency, or, conversely, that emotional ambivalence as represented cognitively is more categorical than one might think. That is, it could be that individuals can retrieve negative/positive senses easily even if they are rare (and hence not prototypical). This would, at least to a mild extent, challenge approaches to cognitive linguistics that foreground the role of usage frequency (Bybee, 2006; Divjak & Caldwell-Harris, 2019; Schmid, 2010). In line with this, we find that frequency is not systematically correlated with emotional polysemy and ambivalence across methods (Figure 3).
At least for neighborhood-based sentiment, as measured here, it could be that the semantic field that a word belongs to is not as strongly modulated by sentiment as could be expected. For instance, sentiment as a semantic dimension could be less relevant, and hence overshadowed by other semantic aspects. However, this view would contradict findings about emotional prosody (Snefjella & Kuperman, 2016), and at the same time challenge sentiment-propagation methods that rely on semantic neighborhoods defined by word embeddings (Buechel et al., 2016; Hamilton et al., 2016a). The notable exception to the lack of correlation is the relationship between rated ambivalence and neighborhood-based ambivalence in nouns. Indeed, modeling lexical meaning in terms of semantic neighbors was suggested to be particularly useful for nouns (Hamilton et al., 2016b). Thus, it seems that the extent to which neighborhood-based measures reflect variation in sentiment is somewhat dependent on word class.
Much in contrast to NLP-based methods, ambivalence derived from sentiment scores across dictionaries as well as within surveys shows a consistently robust correlation with subjectively rated ambivalence, suggesting that both function as reliable proxies of the latter. From a variationist point of view (Tagliamonte, 2011; Geeraerts, 2005; Ruette et al., 2014), different emotion dictionaries reflect different groups of speakers, such as different age ranges (e.g., if annotators are students, as in ANGST, Schmidtke et al., 2014), education (e.g., when all annotators of a dictionary work in academia as in Ruppenhofer et al., 2017), or geographic varieties (e.g., German spoken in Austria in ALPIN, Kolb et al., 2022). This then translates into emotional variation between speakers. Crucially, it is plausible that speakers have knowledge of the different emotional specifications that a word can obtain in different socio-cultural settings, even if they do not actively use all of them. For example, the German adjective fett might be used in formal settings primarily to refer to ‘fat food’ or ‘bold letters’, in informal settings to derogatorily denote ‘adipose’ individuals, in conversations in younger age classes as a synonym for ‘impressive’, and in the Austrian variety of German to refer to being ‘intoxicated through alcohol’.
Similarly, variation between individuals (annotators) in the same survey seems to reflect ambivalence reasonably well. This is particularly interesting, given that we know from research on the annotation of lexical concreteness (Reijnierse et al., 2019) that annotators tend to rate the prototypical sense of a word; non-prototypical senses only influence concreteness ratings when individuals are specifically primed for them. In our context, this can mean two things. Either the variation in sentiment scores reflects the differential prototypical senses and their corresponding sentiment, or sentiment annotation works differently from the annotation of lexical concreteness, to the effect that sentiment annotations reflect a broader bandwidth of senses. Indeed, the strong relationship between rated ambivalence and survey-based ambivalence suggests that annotators are aware of the emotional variety that words can express. More nuanced experimental setups are required to satisfactorily disentangle these aspects.
Either way, what is evident is that variation between surveys and within surveys does have informative value. This matter has been acknowledged in current approaches in computational linguistics, operating under the umbrella term “perspectivism” (Plank, 2022), that take individual annotations into account rather than just relying on aggregated scores. For instance, Uma et al. (2021) demonstrate that learning from individual annotations (“soft labels”) can improve performance in NLP and computer-vision tasks. Of course, the methodological concerns raised by Pollock (2018) and Paisios et al. (2023), i.e., the midscale disagreement problem that was discussed with respect to semantic variables other than sentiment, remain; the high standard deviation of words in the middle of the rated spectrum emerges because many midscale words are in fact instances of ambivalent words which end up with aggregated midscale ratings. This leads to an overrepresentation of words in the middle of the scale and conflation with actually non-ambiguous midscale words. As a result, the relationship between the rated variable and rating standard deviation assumes a concave inverse U-shape (Pollock, 2018; a pattern that could not be consistently replicated in the present study, however, perhaps due to the limited sample; Figure A2). This is a methodological issue that mainly affects the usage of aggregated scores. In this view, it is all the more important to report both aggregated scores and corresponding measures of dispersion.
However, for interpreting variation in ratings within a survey it is important as well, since it is not a priori clear whether high variation entails semantic ambiguity (polysemy/ambivalence; see discussion below) or indeed sociolinguistic-variationist diversity. This stresses the relevance of additionally studying variation across surveys, as done in the present study, to disentangle these dimensions. Related to this, it is interesting to see that survey-based and dictionary-based ambiguity measures are not (or not strongly) correlated with each other (although both correlate with rated ambivalence), suggesting that they indeed capture different aspects of ambiguity (cognitive-semantic vs. variationist).
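The midscale disagreement problem can be made concrete with a small Python sketch. The ratings below are invented for illustration (a hypothetical scale from −1 to 1 and six hypothetical annotators per word) and are not taken from the survey. The sketch only shows why an aggregated score alone cannot separate a neutral word from an ambivalent one.

```python
from statistics import mean, stdev


def summarize(ratings):
    """Return (mean, sample standard deviation) of one word's ratings."""
    return mean(ratings), stdev(ratings)


# Hypothetical ratings on a scale from -1 (negative) to +1 (positive).
words = {
    "neutral": [0.0, 0.1, -0.1, 0.0, 0.1, -0.1],     # annotators agree on "neither"
    "ambivalent": [0.8, -0.8, 0.7, -0.7, 0.9, -0.9],  # annotators split into two camps
    "positive": [0.8, 0.7, 0.9, 0.8, 0.7, 0.9],       # annotators agree on "positive"
}

for label, ratings in words.items():
    m, sd = summarize(ratings)
    print(f"{label:>10}: mean = {m:+.2f}, sd = {sd:.2f}")
```

In this toy setup, the neutral and the ambivalent word both end up in the middle of the scale, and only the standard deviation tells them apart.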
The fact that rated ambivalence behaves almost exactly as controlled ambivalence is reassuring as well. Rated ambivalence is not significantly affected by frequency and word class, and only weakly so by actual polysemy (number of senses in DWDS). The latter is surprising; one would assume that many senses imply a high probability of a word having positive and negative senses. There is no strong evidence, however, for such a relationship, suggesting that the majority of senses of a word are typically found at the same end of the sentiment spectrum.
Related to that, it is also surprising that the number of senses is not systematically correlated with the emotional polysemy or ambiguity measures derived in Sections 2.1 to 2.4 (with the single exception of neighborhood-based ambivalence). One interpretation of this could be that emotional meaning is too connotative (vs. denotative) to be properly reflected in a lexicographically curated dictionary (Zgusta, 2010). Emotional semantics can be conceived as strongly context-dependent and hence part of pragmatic meaning (Scarantino, 2017). For a recent discussion of emotional meaning and the extent to which it is context-dependent, see Ferré et al. (2025).
Ambivalence seems to cover conceptually different types of semantic setups. Some of the words that were rated as highly ambivalent, like wild (‘wild’) and scharf (‘hot/sharp’), are ambivalent in the most technical sense. They can refer to both negative and positive concepts. The word wild can be used to denote a potentially dangerous animal or person (negative), but also a hilarious party or an extraordinary idea (positive). The word scharf can refer to a sharp and possibly harmful blade (negative) or hot food (potentially also negative), but also to a sexually attractive person (positive). In contrast to that, words like allein (‘alone’) or Pflicht (‘duty’), which have been attributed high ambivalence as well, rather tend to refer to a single concept that is intrinsically ambivalent (nota bene: both score low on polysemy in DWDS). There might be good and bad aspects about being alone and about fulfilling a duty. Such words are emotionally ambiguous in the sense of Wielgopolan and Imbir (2023) in that they display both high positivity and high negativity. Complementing the subjective ambivalence score with separate assessments of positivity and negativity would be necessary to provide more insights.

4.2. Limitations

Several aspects of this study can be improved. To begin with, the sample of words examined here was small (N = 117), hence limiting statistical power. The present study might therefore have missed weak interactions among variables that could nevertheless be theoretically interesting. Of course, sample size was constrained by the available resources, more specifically by the number and coverage of sentiment dictionaries available for German (Kern et al., 2021). That being said, the threshold of at least ten data points per word (cf. Section 2.1), which entailed a sample of 117 words, is low for the reliable estimation of means and standard deviations anyway. That is, there is a trade-off between the reliability of the derived estimates and sample size. Increasing this threshold would have reduced the number of words even further. Only the addition of sentiment dictionaries not covered in this study (or, indeed, switching from German to English, which certainly offers richer resources) can mitigate this issue.
It could also be that the final sample is biased towards emotionally monosemous and non-ambivalent words. It is imaginable that sentiment dictionaries, many of which were designed for being applied in unsupervised emotion detection, are designed in such a way that they include words that unambiguously relate to specific emotions, if only to reduce disagreement between annotators, i.e., to make emotion detection more reliable. However, the fact that we do see significant interactions with emotional ambivalence measures shows that there is a reasonable variation in ambivalence in the final dataset.
NLP-based methods did not show substantial interactions between (emotional) polysemy and ambivalence measures. While this could be grounded in the overemphasis of frequency information (cf. Section 4.1), it could also be that the employed methods simply lack accuracy. Only a single embedding model was tested (word2vec; cf. Section 2.2) and only one emotion-detection model was employed (cf. Section 2.3) for deriving sentiment scores based on sentence-level information. Other models could perform better (i.e., produce more nuanced distributions of scores). In particular, the pre-trained embedding model (Fares et al., 2017) was not an ideal fit for German, given that it ignores uppercasing. As far as the BERT-based sentiment classifier (Guhr et al., 2020) is concerned, it may well be that it performs excellently when the task is to assign discrete sentiment categories (‘positive’, ‘neutral’, ‘negative’) but lacks accuracy if sentiment is measured as a scalar (of course, the latter was neither claimed nor aimed for by Guhr et al., 2020).
In Section 2.2, Section 2.3 and Section 2.4, the number of data points per word was arbitrarily set to 50. This number could be criticized as being too small to provide reliable information about the distribution of scores. Table A1 shows the average margin of error at a 95% confidence level for the estimation of word-wise standard deviation, i.e., emotional polysemy $\sigma_i$, given the respective sample sizes and median $\sigma_i$ estimates. The margin of error is about 0.05 (with median estimates being around 0.23 to 0.29, depending on the method), which can be judged as reasonably precise. More data points are, of course, desirable; note, however, that 50 scores per word exceeds the number of annotations that is typically employed in sentiment-dictionary generation (where sometimes only two or three annotations per word are considered sufficient; e.g., Klinger et al. (2016), and Ruppenhofer et al. (2017)). It is rather word lists compiled for psycholinguistic research (Warriner et al., 2013; Schröter & Schroeder, 2017) that feature more annotations per word.
The procedure for querying subjective ambivalence in the survey was coarse-grained, since only binary answers (positive and negative meanings: ‘yes’/‘no’) were allowed in order to minimize time (and financial expenses). This ignores a potential relative dominance of one of the two ends of the sentiment spectrum (e.g., it could be that a word has one negative and ten positive senses, while another word has one sense each; the answer ‘yes’ would apply to both). Also, participants were not asked to provide a list of sense descriptions, although such information could be valuable. Finally, it needs to be stressed that rated ambivalence, as operationalized in this study, is a subjective measure. While this is not problematic per se, it could be complemented with a more objective assessment, e.g., by inferring the distribution of sentiment from sense descriptions in a lexicographically curated dictionary.

5. Conclusions

Variation in lexical sentiment scores is present across annotators within surveys, across surveys, across semantic neighbors of words, and across contexts that words surface in. We have argued in this study that this variation is indeed meaningful in that it informs about the degree of emotional ambiguity. Two aspects of ambiguity, emotional polysemy and ambivalence, were examined, and for each of these aspects four different approaches were used to provide quantitative estimates for a set of German words. While average sentiment scores derived through each of these methods correlate with each other at reasonably high levels, emotional polysemy scores do not show systematic pairwise correlations, and the same holds true for emotional ambivalence. Moreover, it was shown that subjective estimates of emotional ambivalence only correlate with scores derived from survey-related sentiment distributions. Neither context-based nor neighborhood-based variation is associated with subjectively rated ambivalence. The two main conclusions drawn from this are (i) that survey-based variability (both across and within surveys) informs about emotional ambivalence, and (ii) that context- and neighborhood-based variability might be overshadowed by frequency effects, or, put differently, that speakers can retrieve senses with negative and positive sentiment alike, even if their corresponding occurrence frequencies strongly differ from each other. What this study demonstrates, in particular, is that what could be misjudged as undesirable noise in sentiment scores contains valuable information about linguistic cognition and variation.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/languages11060118/s1, File S1: Table containing all estimates computed in this contribution for the list of 117 target words.

Funding

This research received no external funding.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

Data and code are available at https://gitlab.com/andreas.baumann/variation_and_emotional_ambivalence/ (accessed on 4 June 2026).

Acknowledgments

I would like to thank Bettina M. J. Kern for her efforts in compiling the database of German sentiment resources. GenAI (OpenAI, GPT 5.1) was used (i) for coding assistance and (ii) for formatting the list of references. No text in the main body, abstract, or appendix of this manuscript was generated and/or modified through GenAI. Open Access Funding by the University of Vienna.

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DWDS: Digitales Wörterbuch der deutschen Sprache
dct: dictionary (cf. Section 2.1)
nbh: neighborhood (cf. Section 2.2)
ctx: context (cf. Section 2.3)
srv: survey (cf. Section 2.4)

Appendix A

The average margin of error (at a 95% confidence level) was quantified for emotional polysemy $\sigma_i$ for all methods in Section 2.1, Section 2.2, Section 2.3 and Section 2.4 by computing the respective standard error as $SE_i = \mathrm{md}(\sigma_i) / \sqrt{2(N_i - 1)}$, i.e., the standard error of the sample standard deviation of median emotional polysemy $\sigma_i$ given an average sample size (number of scores) of $N_i$ per word for each method $i$ among dct, nbh, ctx, and srv. Values for the standard error and the implied margin of error are shown in Table A1 below.
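The computation can be retraced with a short Python sketch. It applies the usual standard error of a sample standard deviation, i.e., the median polysemy divided by the square root of 2(N − 1), to the medians reported for the neighborhood-based and context-based methods, using N = 50 as stated in Section 4.2. Turning the standard error into a margin of error via the normal quantile (about 1.96) is an assumption made here, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def se_of_sd(median_sd, n):
    """Standard error of a sample standard deviation: sd / sqrt(2 * (n - 1))."""
    return median_sd / math.sqrt(2 * (n - 1))


def margin_of_error(se, level=0.95):
    """Half-width of a normal-theory confidence interval (assumed here)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return z * se


# Median emotional polysemy as reported in Table A1; N = 50 scores per word.
for method, median_sd in {"nbh": 0.23, "ctx": 0.25}.items():
    se = se_of_sd(median_sd, n=50)
    print(f"{method}: SE = {se:.3f}, ME = ±{margin_of_error(se):.3f}")
```

Up to rounding of the reported medians, the results agree with the corresponding rows of Table A1.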
Figure A1 shows correlations among neighborhood-based scores for different neighborhood sizes. It can be seen that all pairwise correlations are high to the effect that neighborhood size does not substantially affect ambiguity estimates. However, since their SE still is affected by the number of data points (Table A1), neighborhood size is set to 50.
Figure A2 displays the relationship between mean sentiment and standard deviation (emotional polysemy) for all methods in Section 2.1, Section 2.2, Section 2.3 and Section 2.4. The inverse U-shape reported by Paisios et al. (2023) for Body–Object Interaction scores is not visible in all cases (and, in particular, not for survey-based scores), perhaps due to the constrained sample.
Table A1. Margin of error estimation for emotional polysemy.
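| Method | Median Emotional Polysemy | SE (ME in 95% CI) |
|--------|---------------------------|-------------------|
| dct    | 0.33                      | 0.078 (±0.153)    |
| nbh    | 0.23                      | 0.023 (±0.045)    |
| ctx    | 0.25                      | 0.025 (±0.049)    |
| srv    | 0.29                      | 0.030 (±0.059)    |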
Figure A1. Correlogram neighborhood-based polysemy (left) and ambivalence (right) for four different neighborhood sizes: 20, 30, 40, and 50 (i.e., nbh_sd_50 equals neighborhood-based polysemy in the main text). Scores represent Pearson’s R. All pairwise correlations are strong (all greater than 0.88) and significant at a 95% confidence level. Yellow (light) indicates a positive correlation, purple (dark) a negative correlation.
Figure A2. Relationship between mean sentiment and standard deviation (i.e., emotional polysemy) for all four methods.

Notes

1. Abbreviations of sentiment dictionaries are aligned with Kern et al. (2021).
2. https://www.dwds.de/d/korpora/kern21 (accessed on 29 April 2026).

References

  1. Acerbi, A., Lampos, V., Garnett, P., & Bentley, R. A. (2013). The expression of emotions in 20th century books. PLoS ONE, 8, e59030.
  2. Almatarneh, S., & Gamallo, P. (2018). Automatic construction of domain-specific sentiment lexicons for polarity classification. In F. De la Prieta, Z. Vale, L. Antunes, T. Pinto, A. T. Campbell, V. Julián, A. J. R. Neves, & M. N. Moreno (Eds.), Trends in cyber-physical multi-agent systems: The PAAMS collection—15th international conference, PAAMS 2017 (pp. 175–182). Springer.
  3. Ambrasat, J., von Scheve, C., Conrad, M., Schauenburg, G., & Schröder, T. (2014). Consensus and stratification in the affective meaning of human sociality. Proceedings of the National Academy of Sciences of the United States of America, 111(22), 8001–8006.
  4. Bakker, I., van der Voordt, T., Vink, P., & de Boon, J. (2014). Pleasure, arousal, dominance: Mehrabian and Russell revisited. Current Psychology, 33(3), 405–421.
  5. Blank, A. (1993). Zwei Phantome der historischen Semantik: Bedeutungsverbesserung und Bedeutungsverschlechterung. Romanistisches Jahrbuch, 44, 57–85.
  6. Buechel, S., Hellrich, J., & Hahn, U. (2016). Feelings from the past—Adapting affective lexicons for historical emotion analysis. In Proceedings of the workshop on language technology resources and tools for digital humanities (LT4DH) (pp. 54–61). The COLING 2016 Organizing Committee.
  7. Buechel, S., Rücker, S., & Hahn, U. (2020). Learning and evaluating emotion lexicons for 91 languages. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 1202–1217). Association for Computational Linguistics (ACL).
  8. Bybee, J. (2006). Frequency of use and the organization of language. Oxford University Press.
  9. Cacioppo, J. T., Berntson, G. G., Norris, C. J., & Gollan, J. K. (2011). The evaluative space model. In P. A. M. Van Lange, A. W. Kruglanski, & E. T. Higgins (Eds.), Handbook of theories of social psychology (Vol. 1). SAGE.
  10. Cook, P., & Stevenson, S. (2010). Automatically identifying changes in the semantic orientation of words. In Proceedings of LREC 2010. European Language Resources Association (ELRA).
  11. Deza, M. M., & Deza, E. (2009). Encyclopedia of distances. Springer.
  12. Divjak, D., & Caldwell-Harris, C. L. (2019). Frequency and entrenchment. In Cognitive linguistics: Foundations of language (pp. 61–86). De Gruyter Mouton.
  13. Emerson, G., & Declerck, T. (2014). Sentimerge: Combining sentiment lexicons in a Bayesian framework. In Proceedings of the workshop on lexical and grammatical resources for language processing (pp. 30–38). Association for Computational Linguistics (ACL).
  14. Fares, M., Kutuzov, A., Oepen, S., & Velldal, E. (2017). Word vectors, reuse, and replicability: Towards a community repository of large-text resources. In Proceedings of the 21st Nordic conference on computational linguistics (pp. 271–276). Association for Computational Linguistics (ACL).
  15. Ferré, P., Fraga, I., & Hinojosa, J. A. (2025). The interplay between language and emotion: Introduction to the special issue. Cognition and Emotion, 39(7), 1405–1417.
  16. Geeraerts, D. (2005). Lectal variation and empirical data in cognitive linguistics. In Cognitive linguistics: Internal dynamics and interdisciplinary interaction (Vol. 32, pp. 163–189). De Gruyter Mouton.
  17. Guhr, O., Schumann, A. K., Bahrmann, F., & Böhme, H. J. (2020). Training a broad-coverage German sentiment classification model for dialog systems. In Proceedings of the twelfth language resources and evaluation conference (pp. 1627–1632). European Language Resources Association (ELRA).
  18. Hamilton, W. L., Clark, K., Leskovec, J., & Jurafsky, D. (2016a). Inducing domain-specific sentiment lexicons from unlabeled corpora. In Proceedings of the conference on empirical methods in natural language processing (p. 595). Association for Computational Linguistics (ACL).
  19. Hamilton, W. L., Leskovec, J., & Jurafsky, D. (2016b). Cultural shift or linguistic drift? comparing two computational measures of semantic change. In Proceedings of the 2016 conference on empirical methods in natural language processing (pp. 2116–2121). Association for Computational Linguistics (ACL).
  20. Hilpert, M. (2019). Historical linguistics. In E. Dąbrowska, & D. Divjak (Eds.), Cognitive linguistics (Chapter 5). De Gruyter.
  21. Kanske, P., & Kotz, S. A. (2010). Leipzig affective norms for German: A reliability study. Behavior Research Methods, 42(4), 987–991.
  22. Kauschke, C., Bahn, D., Vesker, M., & Schwarzer, G. (2019). The role of emotional valence for the processing of facial and verbal stimuli: Positivity or negativity bias? Frontiers in Psychology, 10, 1654.
  23. Keller, R. (1994). On language change: The invisible hand in language. Routledge.
  24. Kern, B. M., Baumann, A., Kolb, T. E., Sekanina, K., Hofmann, K., Wissik, T., & Neidhardt, J. (2021). A review and cluster analysis of German polarity resources for sentiment analysis. In 3rd conference on language, data and knowledge (LDK 2021) (pp. 37:1–37:17). Schloss Dagstuhl–Leibniz-Zentrum für Informatik.
  25. Kiritchenko, S., & Mohammad, S. (2016). Capturing reliable fine-grained sentiment associations by crowdsourcing and best–worst scaling. In Proceedings of NAACL-HLT 2016 (pp. 811–817). Association for Computational Linguistics (ACL).
  26. Kissler, J., & Herbert, C. (2013). Emotion, etmnooi, or emitoon? Faster lexical access to emotional than to neutral words during reading. Biological Psychology, 92(3), 464–479.
  27. Klenner, M., Fahrni, A., & Petrakis, S. (2009). POLART: A robust tool for sentiment analysis. In Proceedings of the 17th Nordic conference of computational linguistics (NODALIDA 2009) (pp. 235–238). Northern European Association for Language Technology (NEALT).
  28. Klinger, R., Suliya, S. S., & Reiter, N. (2016). Automatic emotion detection for quantitative literary studies: A case study based on Franz Kafka’s Das Schloss and Amerika. In Proceedings of the digital humanities 2016. Alliance of Digital Humanities Organizations (ADHO).
  29. Knupleš, U., Frassinelli, D., & im Walde, S. S. (2023). Investigating the nature of disagreements on mid-scale ratings: A case study on the abstractness-concreteness continuum. In Proceedings of the 27th conference on computational natural language learning (CoNLL) (pp. 70–86). Association for Computational Linguistics (ACL).
  30. Kolb, T., Katharina, S., Kern, B. M. J., Neidhardt, J., Wissik, T., & Baumann, A. (2022). The ALPIN sentiment dictionary: Austrian language polarity in newspapers. In Proceedings of LREC 2022 (pp. 4708–4716). European Language Resources Association (ELRA).
  31. Kousta, S. T., Vinson, D. P., & Vigliocco, G. (2009). Emotion words, regardless of polarity, have a processing advantage over neutral words. Cognition, 112, 473–481.
  32. Köper, M., & Schulte im Walde, S. (2016). Automatically generated affective norms of abstractness, arousal, imageability and valence for 350,000 German lemmas. In Proceedings of LREC 2016 (pp. 2595–2598). European Language Resources Association (ELRA).
  33. Kuhlmann, M., Hofmann, M. J., & Jacobs, A. M. (2017). If you don’t have valence, ask your neighbor: Evaluation of neutral words as a function of affective semantic associates. Frontiers in Psychology, 8, 343.
  34. Lahl, O., Göritz, A. S., Pietrowsky, R., & Rosenberg, J. (2009). Using the world-wide web to obtain large-scale word norms: 190,212 ratings on 2654 German nouns. Behavior Research Methods, 41(1), 13–19.
  35. Li, M., Lu, Q., Long, Y., & Gui, L. (2017). Inferring affective meanings of words from word embedding. IEEE Transactions on Affective Computing, 8(4), 443–456.
  36. Martínez-Huertas, J. Á., Jorge-Botana, G., Martínez-Mingo, A., Iglesias, D., & Olmos, R. (2025). Are valence and arousal related to the development of amodal representations of words? A computational study. Cognition and Emotion, 39(7), 1465–1473.
  37. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv, arXiv:1301.3781.
  38. Paisios, D., Huet, N., & Labeye, E. (2023). Addressing the elephant in the middle: Implications of the midscale disagreement problem through the lens of body-object interaction ratings. Collabra: Psychology, 9(1), 84564.
  39. Plank, B. (2022, December). The “problem” of human label variation: On ground truth in data, modeling and evaluation. In Proceedings of the 2022 conference on empirical methods in natural language processing (pp. 10671–10682). Association for Computational Linguistics (ACL).
  40. Poesio, M., & Artstein, R. (2005). The reliability of anaphoric annotation, reconsidered: Taking ambiguity into account. In Proceedings of the workshop on frontiers in corpus annotations II: Pie in the sky (pp. 76–83). Association for Computational Linguistics (ACL).
  41. Pollock, L. (2018). Statistical and methodological problems with concreteness and other semantic variables: A list memory experiment case study. Behavior Research Methods, 50(3), 1198–1216.
  42. Ponari, M., Norbury, C. F., & Vigliocco, G. (2018). Acquisition of abstract concepts is influenced by emotional valence. Developmental Science, 21(2), e12549.
  43. Reijnierse, W. G., Burgers, C., Bolognesi, M., & Krennmayr, T. (2019). How polysemy affects concreteness ratings: The case of metaphor. Cognitive Science, 43(8), e12779.
  44. Remus, R., Quasthoff, U., & Heyer, G. (2010). SentiWS: A publicly available German-language resource for sentiment analysis. In Proceedings of LREC 2010. European Language Resources Association (ELRA).
  45. Rill, S., Adolph, S., Drescher, J., Reinel, D., Scheidt, J., Schütz, O., Wogenstein, F., Zicari, R. V., & Korfiatis, N. (2012). A phrase-based opinion list for the German language. In Proceedings of KONVENS (pp. 305–313). Österreichische Gesellschaft für Artificial Intelligence (ÖGAI).
  46. Ruette, T., Speelman, D., & Geeraerts, D. (2014). Lexical variation in aggregate perspective. In Pluricentricity: Language variation and sociocognitive dimensions (pp. 103–126). de Gruyter Berlin.
  47. Ruppenhofer, J., Steiner, P., & Wiegand, M. (2017). Evaluating the morphological compositionality of polarity. In Proceedings of the 11th international conference on recent advances in natural language processing, RANLP 2017, Varna, Bulgaria, September 2–8 (pp. 625–633). Incoma Ltd.
  48. Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161–1178.
  49. Scarantino, A. (2017). How to do things with emotional expressions: The theory of affective pragmatics. Psychological Inquiry, 28(2–3), 165–185.
  50. Schmid, H. J. (2010). Does frequency in text instantiate entrenchment in the cognitive system? In Quantitative methods in cognitive semantics: Corpus-driven approaches (pp. 101–133). De Gruyter Mouton.
  51. Schmidtke, D. S., Schröder, T., Jacobs, A. M., & Conrad, M. (2014). ANGST: Affective norms for German sentiment terms, derived from the affective norms for English words. Behavior Research Methods, 46(4), 1108–1118. [Google Scholar] [CrossRef]
  52. Schröder, T. (2011). A model of language-based impression formation and attribution among Germans. Journal of Language and Social Psychology, 30(1), 82–102. [Google Scholar] [CrossRef]
  53. Schröter, P., & Schroeder, S. (2017). The developmental lexicon project: A behavioral database to investigate visual word recognition across the lifespan. Behavior Research Methods, 49(6), 2183–2203. [Google Scholar] [CrossRef]
  54. Snefjella, B., & Kuperman, V. (2016). It’s all in the delivery: Effects of context valence, arousal, and concreteness on visual word processing. Cognition, 156, 135–146. [Google Scholar] [CrossRef]
  55. Taboada, M., Brooke, J., Tofiloski, M., Voll, K., & Stede, M. (2011). Lexicon-based methods for sentiment analysis. Computational Linguistics, 37(2), 267–307. [Google Scholar] [CrossRef]
  56. Tagliamonte, S. A. (2011). Variationist sociolinguistics: Change, observation, interpretation. Wiley-Blackwell. [Google Scholar]
  57. Tiessler, M., Motger, Q., Piroi, F., & Baumann, A. (2025). EmoTracker—A framework for modeling and forecasting diachronic emotion dynamics. Anthology of Computers and the Humanities, 3, 795–819. [Google Scholar]
  58. Uma, A. N., Fornaciari, T., Hovy, D., Paun, S., Plank, B., & Poesio, M. (2021). Learning from disagreement: A survey. Journal of Artificial Intelligence Research, 72, 1385–1470. [Google Scholar] [CrossRef]
  59. van Atteveldt, W., van der Velden, M. A. C. G., & Boukes, M. (2021). The validity of sentiment analysis: Comparing manual annotation, crowd-coding, dictionary approaches, and machine learning algorithms. Communication Methods and Measures, 15(2), 121–140. [Google Scholar] [CrossRef]
  60. Võ, M. L. H., Conrad, M., Kuchinke, L., Urton, K., Hofmann, M. J., & Jacobs, A. M. (2009). The Berlin affective word list reloaded (BAWL–R). Behavior Research Methods, 41(2), 534–538. [Google Scholar] [CrossRef]
  61. Waltinger, U. (2010). German polarity clues: A lexical resource for German sentiment analysis. In Proceedings of LREC 2010 (pp. 1638–1642). European Language Resources Association (ELRA). [Google Scholar]
  62. Warriner, A. B., Kuperman, V., & Brysbaert, M. (2013). Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45(4), 1191–1207. [Google Scholar] [CrossRef] [PubMed]
  63. Wielgopolan, A., & Imbir, K. K. (2023). Affective norms for emotional ambiguity in valence, origin, and activation spaces. Behavior Research Methods, 55(3), 1141–1156. [Google Scholar] [CrossRef] [PubMed]
  64. Zgusta, L. (2010). Manual of lexicography (Vol. 39). De Gruyter. [Google Scholar]
Figure 1. Sample data for the word scharf (‘sharp/hot/fierce/attractive’) collected through each of the four methods (dct: red; nbh: blue; ctx: green; srv: purple; Section 2.1, Section 2.2, Section 2.3 and Section 2.4; boxes), as well as the resulting distributions of all sentiment scores for that word, visualized as violin plots (bottom left). The more dispersed a distribution is, the higher the degree of ambivalence. The pie chart shows the share of ‘ambivalent’ vs. ‘not ambivalent’ annotations for scharf, which is used to measure subjectively rated ambivalence.
Figure 2. Distributions of sentiment scores for all target words and all methods (dct, nbh, ctx, srv). For each word, the distribution of sentiment scores is visualized as a violin plot. Words are ranked by decreasing ambivalence (from left to right), given the respective method. Color-coding of the medians (circles) reflects rated ambivalence (dark: high α; light: low α).
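
The ranking described in this caption can be illustrated with a minimal sketch. It assumes that dispersion is summarized by the sample standard deviation, which is only one possible choice and not necessarily the measure used in the study. The word labels and scores are hypothetical toy values, not data from the article.

```python
from statistics import stdev


def dispersion(scores):
    """Sample standard deviation of a word's sentiment scores.

    Assumption: standard deviation stands in for 'dispersion' here;
    the study may use a different measure.
    """
    return stdev(scores)


def rank_by_dispersion(scores_by_word):
    """Return words ordered from most to least dispersed scores."""
    return sorted(
        scores_by_word,
        key=lambda word: dispersion(scores_by_word[word]),
        reverse=True,
    )


# Hypothetical toy scores on a [-1, 1] sentiment scale.
toy_scores = {
    "word_a": [-0.8, 0.7, -0.6, 0.9],  # scores spread over both polarities
    "word_b": [0.5, 0.6, 0.4, 0.5],    # scores clustered together
}

for word in rank_by_dispersion(toy_scores):
    print(word, round(dispersion(toy_scores[word]), 3))
```

Under this assumption, a word whose scores fall on both sides of the scale ranks ahead of a word whose scores cluster around a single value.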
Figure 3. Correlogram of all main variables and covariates. Scores represent Pearson’s R. Only correlation coefficients that are significantly non-zero at a 95% confidence level are shown. Yellow (light) indicates a positive correlation, purple (dark) a negative correlation.
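
As a rough illustration of how a correlogram can be restricted to significant coefficients, the sketch below computes pairwise Pearson correlations and keeps only those whose 95% confidence interval excludes zero. The significance check uses the Fisher z approximation, which is an assumption made for this example; the caption does not state which test was applied. Variable names and values are hypothetical.

```python
import math
from itertools import combinations
from statistics import NormalDist, mean


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def is_significant(r, n, level=0.95):
    """Fisher z approximation (assumed here): does the CI for r exclude 0?"""
    if abs(r) >= 1.0:  # perfect correlation; atanh would be undefined
        return True
    half_width = NormalDist().inv_cdf(0.5 + level / 2) / math.sqrt(n - 3)
    return abs(math.atanh(r)) > half_width


def correlogram(columns, level=0.95):
    """Map each variable pair to r, or to None if not significant."""
    table = {}
    for a, b in combinations(columns, 2):
        r = pearson_r(columns[a], columns[b])
        table[(a, b)] = r if is_significant(r, len(columns[a]), level) else None
    return table


# Hypothetical toy variables (not data from the article).
toy = {
    "x": [1, 2, 3, 4, 5, 6, 7, 8],
    "y": [2, 3, 5, 4, 6, 8, 7, 9],
    "z": [1, -1, 1, -1, -1, 1, -1, 1],
}

for pair, r in correlogram(toy).items():
    print(pair, "not shown" if r is None else round(r, 3))
```

Cells mapped to `None` correspond to the blank cells of such a plot. With small samples, an exact t-test may be preferable to the normal approximation used here.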
Figure 4. Correlograms of all polysemy and ambivalence measures and covariates for nouns, verbs, and adjectives separately (sample sizes in parentheses). Scores represent Pearson’s R. Only correlation coefficients that are significantly non-zero at a 95% confidence level are shown. Yellow (light) indicates a positive correlation, purple (dark) a negative correlation.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Baumann, A. Does Variation in Lexical Sentiment Scores Reflect Emotional Polysemy and Ambivalence? Languages 2026, 11, 118. https://doi.org/10.3390/languages11060118