Article

Measuring Emotion Recognition Through Language: The Development and Validation of an English Productive Emotion Vocabulary Size Test

by Allen Jie Ein Chee 1,*, Csaba Zoltan Szabo 2 and Sharimila Ambrose 1
1 School of Education, The University of Nottingham Malaysia, Semenyih 43500, Malaysia
2 Institute for Language Education, The University of Edinburgh, Edinburgh EH16 4UU, UK
* Author to whom correspondence should be addressed.
Languages 2025, 10(9), 204; https://doi.org/10.3390/languages10090204
Submission received: 26 February 2025 / Revised: 14 August 2025 / Accepted: 18 August 2025 / Published: 25 August 2025

Abstract

Emotion vocabulary is essential for recognising, expressing, and regulating emotions, playing a critical role in language proficiency and emotional competence. However, traditional vocabulary assessments have largely overlooked emotion-specific lexicons, limiting their ability to identify learners’ gaps in this area. Therefore, this study addresses this gap by developing and validating the Productive Emotion Vocabulary Size Test (PEVST), a tool designed to evaluate productive emotion vocabulary knowledge in ESL/EFL contexts. The PEVST incorporates low-, mid-, and high-frequency emotion words, assessed through context-rich vignettes, offering a comprehensive tool for measuring productive emotion vocabulary knowledge. The study recruited 156 adult participants with varying language proficiency levels. Findings revealed that word frequency significantly influenced production accuracy: higher frequency words were more easily retrieved, while lower frequency words often elicited higher frequency synonyms. Rasch analysis provided validity evidence for the test’s scoring, highlighting the effectiveness of a granular scoring system that considers nuanced responses. However, some limitations arose from misfitting items and the homogeneity of participants’ language proficiency, calling for further evidence with a more linguaculturally diverse target group and careful control for individual differences. Future iterations should address these challenges by incorporating cultural adaptations and accounting for individual differences. The PEVST offers a robust foundation for advancing emotion vocabulary assessment, deepening our understanding of the interplay between language, emotions, and cognition, and informing emotion-focused language pedagogy.

1. Introduction

Language proficiency is often associated with knowledge of generic, academic, and technical vocabulary (see Milton, 2013; Qian & Lin, 2019a). Despite this, how learners acquire and produce emotion vocabulary in an additional language has received comparatively little attention (Lindquist, 2021; Nook et al., 2017). From a language learning perspective, the descriptors for the Common European Framework of Reference at the B2 level specify that learners should be able to express various degrees of emotions, i.e., “Can compose letters conveying degrees of emotion and highlighting the personal significance of events and experiences…” (Council of Europe, 2020, p. 83). However, the efficacy of emotional expression is contingent on one’s proficiency in recognising and articulating emotions, an ability deeply rooted in possessing a rich and precise repertoire of emotion vocabulary (Ekman, 1999; Hoemann et al., 2019; Kopp, 1989; Pérez-García & Sánchez, 2020). Despite its importance, emotion vocabulary often receives insufficient attention in educational resources, which may contribute to learners’ limited ability to verbalise and express emotions effectively in a second or additional language (Sánchez & Pérez-García, 2020). Another major challenge arising from this lack of focus is assessing learners’ knowledge of emotion vocabulary.
While traditional tests such as the Vocabulary Size Test (Nation & Beglar, 2007) and the Vocabulary Levels Test (Webb et al., 2017) have been validated and widely adopted, the same rigour and standards have not been applied to emotion vocabulary tests, especially in English L2 contexts. Similarly, existing emotion vocabulary measures, including the label generation task (e.g., Bazhydai et al., 2019; Ebert et al., 2014; Mavrou, 2021), Mayer Salovey Caruso Emotional Intelligence Test (MSCEIT; Mayer et al., 2016), Children’s Emotion-Specific Vocabulary Vignettes Test (CEVVT; Streubel et al., 2020), and Productive Emotion Vocabulary Test (PEVT; Szabo et al., forthcoming), lack validation evidence in ESL/EFL contexts and only assess a limited number of emotions. While the PEVT was a novel adaptation of the CEVVT for the Malaysian ESL/EFL context, both tests have a limited number of target emotions and do not control for lexical characteristics such as frequency. The limitations of these tests significantly impact researchers and practitioners alike, making it challenging to accurately identify gaps in learners’ recognition and production of emotion vocabulary.
Recognising the integral role of emotion vocabulary in both language proficiency and emotional intelligence (Ekman, 1999; Hoemann et al., 2019; Pentón Herrera & Darragh, 2024), this study aims to develop and validate the Productive Emotion Vocabulary Size Test (PEVST). Designed to systematically evaluate emotion vocabulary in English L2 learners, the PEVST expands target emotions beyond existing tools by controlling for word frequency, valence, and arousal, specifically targeting subtler emotions. By offering a practical and reliable assessment tool, this study seeks to enhance the understanding of emotion vocabulary recognition and production, bridging a significant gap in the field and contributing to the advancement of language learning research and pedagogy.

2. Literature Review

2.1. Emotion Vocabulary in Language Learning

“Without being able to name feelings, it is harder to distinguish them, think about them, plan regarding them, etc.” (Ekman, 1999, p. 317). Learners therefore need a wide and precise repertoire of vocabulary to recognise and express emotions in various contexts (Kopp, 1989; Pérez-García & Sánchez, 2020). For example, being able to use superlative emotion words to distinguish between “happy” and “overjoyed” or “elated” shows one’s ability to understand and recognise those nuanced differences in varied contexts. This ability to label emotions may influence other emotion-related competencies. For instance, learners with high levels of emotion awareness or knowledge tend to exhibit prosocial behaviours (e.g., peer likability and demonstrating socially appropriate behaviours) and enhanced social competence (e.g., recognising others’ emotional facial expressions) (Denham et al., 2015; Fabes et al., 2001; Israelashvili et al., 2019; Sette et al., 2017). In other words, the ability to recognise and understand one’s own emotions also contributes to understanding emotions in others. Therefore, it is essential to expand emotional concept representations from a dichotomous ‘positive or negative’ experience to a multi-dimensional organisation of emotions, enabling the comprehension and expression of a wide range of nuanced emotions (Nook et al., 2017, 2020). However, for this translational relationship to work, learners must first be cognizant of the emotional words that can be used to describe their feelings (Hoemann et al., 2019).
While understanding these cognitive effects is fundamental, L2 learners face unique challenges in acquiring and understanding emotion vocabulary compared to L1 users. Ferré et al. (2022) investigated how L1 and L2 users rate words based on both the meaning and the emotions they evoke. They employed a 7-point Likert scale for both tasks, where 1 = very low/weak and 7 = very high/strong. In the meaning task, participants judged each word’s clarity or familiarity; in the feeling task, they rated how strongly the word made them feel a certain way (valence). Their findings revealed that L2 users consistently rated words with lower emotional intensity than L1 users, particularly in feeling-focused ratings. This pattern of attenuated “meaning” and “feeling” ratings supports Pavlenko’s theory of “disembodied cognition”, which implies that L2 users may have a fragmented or imprecise understanding of L2 emotion words. According to this theory, knowing the emotional meaning of words in L2 does not guarantee the same intensity and understanding as L1 users, underscoring the challenges L2 users may face in emotion vocabulary learning (Ferré et al., 2022; Pavlenko, 2012). This means that while L2 users may know the dictionary meaning of an emotion word, they may not feel that emotion as intensely when encountering the word in L2. For instance, hearing the word “love” in L1 might trigger vivid memories and a racing heart, but the same word in L2 might feel more abstract or muted. L2 learners may thus experience greater difficulty with emotion recognition and expression, given differences in how they process L2 words and the proficiency required to express them.
In essence, a rich emotion vocabulary not only aids in cognitive processes but also plays a crucial role in learners’ emotional development and the nuances of language use in diverse linguistic settings. Therefore, it is imperative to further understand and investigate meaningful ways to identify gaps in English L2 learners’ emotion vocabulary knowledge.

2.2. Challenges in Identifying Emotion Vocabulary

Distinguishing emotion concepts has long perplexed researchers and teachers across psychology, applied linguistics, and education, as “emotion concepts consisting of prototypical scripts are typically more diffuse, more fuzzy, and therefore harder to pin down” (Dewaele, 2008, p. 173). Unlike general, academic, and technical vocabulary, which is relatively straightforward to identify from existing word lists (Baumann & Graves, 2010; Ha & Hyland, 2017), identifying emotion words is more complex due to the lack of a standardised definition across disciplines (Berscheid et al., 1990; Pavlenko, 2008). This has led to a proliferation of emotion word lists, particularly within psychology; some of the more commonly used lists include the Dictionary of Affect (DAL; Whissel, 1989), Affective Norms for English Words (ANEW; Bradley & Lang, 1999), Words for Emotion Vocabulary List (Encyclopaedia Britannica, 2022), and Linguistic Inquiry and Word Count Dictionary (LIWC; Boyd et al., 2022).
A critical challenge emerges, however, when attempting to leverage emotion lists for educational or language assessments. This is largely due to the different methods of classification and rating criteria employed by the different lists. For instance, the DAL (Whissel, 1989) uses valence, arousal, and imagery, while the ANEW (Bradley & Lang, 1999) uses pleasure, arousal, and dominance in their categorisation of emotion words. Defining “emotion words” in a way that is universally accepted and meets all necessary criteria remains challenging (Lakoff, 2016; Pavlenko, 2008). These differences mean that some lists include words that evoke emotions (e.g., “spider”, “death”) rather than directly describing them (Pavlenko, 2008; Weidman et al., 2017). The lack of a standard definition for emotion words further complicates their identification and introduces methodological barriers, such as sampling words to develop emotion vocabulary tests.
Pavlenko’s (2008) influential framework paved the way for researchers to better identify emotion words by classifying them into three categories: (1) emotion words, which directly define and refer to an emotional state (e.g., “happy”, “sad”, “angry”) or a process (e.g., “to rage”, “to worry”); (2) emotion-related words, which describe behaviours linked to a particular emotion without explicitly naming it, such as facial expressions (e.g., “smile”, “sullen”), bodily reactions (e.g., “giddy”, “cry”), and behavioural tendencies (e.g., “scream”, “tantrum”) (Ng et al., 2019; Pavlenko, 2008); and (3) emotion-laden words, which do not refer to emotions directly but rather express or elicit emotions from the interlocutors, such as taboo words and swearwords, insults, reprimands, terms of endearment, aversive words, and interjections (Pavlenko, 2008). Therefore, to prevent confusion, emotion word lists should include only emotion and emotion-related words, but not emotion-laden words, as the latter do not define an emotional state. Consequently, Szabo et al. (forthcoming) used this framework to compile a list consisting of only emotion and emotion-related words from over 40 existing emotion word lists. The words were then matched with their respective headwords using the BNC/COCA word family lists (Nation, 2004), frequency values from SUBTLEX-UK (van Heuven et al., 2014), and emotionality. This resulted in a total of 4236 emotion words, from which 1607 unique headwords were derived. At the time of writing, this list remains the most comprehensive list consisting solely of emotion and emotion-related words and is adopted herein for lexical sampling.

2.3. Existing Measures of Emotion Vocabulary

Emotion vocabulary measures, like other vocabulary tests, are generally classified into two types: receptive and productive. Although several measures for emotion vocabulary exist in distinct fields, each catering to different purposes, they all have limitations or lack validation evidence, especially in L2 contexts, thus making it problematic for assessment. One widely used receptive task for measuring emotion vocabulary is the Mayer Salovey Caruso Emotional Intelligence Test (MSCEIT; Mayer et al., 2012, 2016), which includes a receptive emotion vocabulary section. This section consists of multiple-choice questions where test-takers choose the most appropriate emotion word to describe an image of a facial expression. Similarly, the Reading the Mind in the Eyes Test (Baron-Cohen et al., 2001) assesses the ability to perceive others’ mental states using grayscale images of only the eyes.
Another widely used instrument is the label generation task (see Bazhydai et al., 2019; Dylman et al., 2020; Ebert et al., 2014; Mavrou, 2021). This productive task prompts participants to generate as many emotion words as possible based on a given emotion category (e.g., “happy”, “sad”). However, there is no consensus on the optimal scoring method, as studies employ varying coding schemes. For instance, Bazhydai et al. (2019) coded responses into five main emotion categories (“happy”, “sad”, “angry”, “nervous”, and “relaxed”), whereas Dylman et al. (2020) used three main emotional states (positive, negative, and neutral). While this flexibility allows for addressing diverse research needs, it also risks compromising the instrument’s validity by introducing subjectivity and bias due to the absence of a standardised coding scheme. Although Bazhydai et al.’s (2019) comparison of the two instruments (label generation and MSCEIT) indicates that the label generation task is more effective at eliciting a wider range of discrete emotions, the lack of an automated coding and scoring system makes their adaptation to L2 contexts problematic.
Despite their widespread use, both receptive and productive emotion vocabulary tests share a fundamental limitation: lack of context. These tests often require test-takers to make provisional judgments based solely on partial or incomplete information (e.g., facial expression) in the stimuli. This does not align with the Theory of Constructed Emotion (Barrett, 2017a, 2017b; Barrett & Westlin, 2021), which posits that emotions are deeply rooted in experiences and are contextually dependent. Emotions can be expressed or manifested differently in their physical and perceptual attributes (Hoemann et al., 2019). Essentially, this means that individual instances of an emotion within the same category may manifest different forms of affective and psychological characteristics. For instance, we can “cry” when we are “happy” but also when we are “sad” or “overwhelmed”. In other words, while broad labels can capture the general gist of an emotion, finer-grained terms help us convey exactly how we feel in a specific context. Given that emotions are highly subjective and often tied to personal contexts (Barrett, 2017a; Grosse et al., 2021; Vine et al., 2020), expecting accurate responses without contextual information challenges the ecological validity of these tests.
In contrast, vignette methodology (Goetze, 2023b) offers an alternative approach to assessing emotion vocabulary, as it provides rich contextual information and realistic emotional stimuli, aligning with the core tenets of the Theory of Constructed Emotion (Barrett, 2017a, 2017b; Barrett & Westlin, 2021). This approach stands in stark contrast to the prevalent reliance on multiple-choice and other decontextualised, receptive formats in vocabulary testing, which often fail to capture the nuanced and dynamic nature of emotional experiences and language use in real-world contexts. For instance, Goetze (2023a) used anxiety-provoking vignettes to explore teachers’ emotional experiences, revealing the complex nuances of emotional landscapes that language teachers navigate in scenarios traditionally thought to evoke singular emotions (e.g., dealing with disruptive students). Similarly, Bielak and Mystkowska-Wiertelak (2020) employed vignettes and interviews to investigate emotion regulation strategies among Polish university-level English learners, highlighting the benefits of using vignettes to recreate emotional scenarios without waiting for them to occur naturally. While acknowledging that vignettes may not fully replicate the same emotions and intensity as real-life events (Bielak & Mystkowska-Wiertelak, 2020), these studies collectively provide a robust foundation to advocate the use of vignette methodology as an innovative tool for measuring emotion vocabulary.
The Children’s Emotion-Specific Vocabulary Vignettes Test (CEVVT), developed by Streubel et al. (2020), effectively addresses the issue of decontextualisation in measuring emotion vocabulary through vignettes. In this test, children describe the emotions experienced by a protagonist in 20 vignettes, covering basic emotions (e.g., “joy”, “fear”) and 14 complex emotions categorised as subsets of these basics (e.g., “safe/secure”). An interview component from the test allowed children to elaborate on their responses, enhancing contextual richness. However, the CEVVT was originally designed and validated in the German context, without considering lexical characteristics such as valence, arousal, or word frequency. Recognising the need for socio-cultural and linguistic adaptation, Szabo et al. (forthcoming) adapted the CEVVT for an English L2 context, creating the Productive Emotion Vocabulary Test (PEVT). This adaptation was tailored for adult L1 and L2 English speakers, redesigning vignettes and illustrations for broader applicability (Figure 1). Notably, 18 out of 22 vignettes (82%) in the PEVT successfully elicited the target emotion or a close synonym (e.g., regret instead of remorse/repentance), demonstrating that the vignettes perform as designed. However, from the analysis, it transpired that the large majority of the responses (75%) are from the 2000 most frequent English words. Thus, the possibility of eliciting lower frequency words with the validated vignettes is limited. While the findings align with previous studies indicating that high-frequency words are easier to recall and use (Chen & Truscott, 2010; Schmitt & Schmitt, 2014), this may hinder the test’s potential to discriminate reliably between higher proficiency test takers. Typically, mid- and low-frequency emotion words tend to denote more specific, nuanced emotions that are currently under-represented in the PEVT. 
Therefore, it remains to be seen whether vignettes built around these less common words will be just as effective at eliciting the intended emotion.
While the PEVT addresses linguistic adaptations, it overlooks the effect of word frequency, which is pivotal for language learning. Frequency is a prevalent factor in vocabulary acquisition, particularly in ESL/EFL contexts (Crossley et al., 2014; Wilkens et al., 2014; Zhao & Huang, 2023). High-frequency words, comprising the most frequent 3000 word families, form a foundational stage for ESL/EFL learners, providing the lexical resources necessary for understanding and producing conversational English (Schmitt & Schmitt, 2014). Mastery of mid- and low-frequency words (the 3000–14,000 word-family range) enables learners to use English authentically across diverse topics (Schmitt & Schmitt, 2014). However, English L2 learners tend to learn and produce higher frequency words (Masrai, 2019; Takizawa, 2024). This finding (Szabo et al., forthcoming) aligns with Zipf’s Principle of Least Effort, which posits that individuals are inclined to opt for high-frequency words as responses due to their minimal cognitive effort (Zhu et al., 2018). Taken together, this underscores the importance of accounting for the effect of word frequency when designing emotion vocabulary tests, particularly with mid- and low-frequency words.
While the importance of emotion vocabulary in language learning is increasingly acknowledged, there is a notable dearth of validated and reliable measures to assess learners’ knowledge of emotion words. Existing measures lack consistency in scoring methods and often disregard critical lexical dimensions such as word frequency, which significantly influences language acquisition and recall (Crossley et al., 2014; Schmitt & Schmitt, 2014, 2020; Wilkens et al., 2014; Zhao & Huang, 2023). Furthermore, existing instruments fail to provide contextualised assessments, making it difficult to gauge learners’ ability to use emotion vocabulary in authentic situations. Additionally, while the CEVVT and PEVT addressed the problem of decontextualised assessments, all the target emotions are from high-frequency words. This limits the range of emotion words tested for ESL/EFL contexts and makes it challenging to assess the extent to which lower frequency and more precise emotions are recognised and produced by English L2 adults. Therefore, this study builds upon the PEVT (Szabo et al., forthcoming) by integrating a systematic approach to word frequency selection, expanding the item bank to include more mid- and low-frequency words.
Therefore, this study aims to develop a productive emotion vocabulary size test targeting low-, mid-, and high-frequency words, guided by the following research question: To what extent do scoring and item performance support the validity of the PEVST in measuring emotion vocabulary production? To address this question, we employed a dual-scoring approach alongside Rasch analyses. Dominant scoring examines whether participants converge on the intended target emotion by awarding points for the most frequent responses, while accuracy scoring evaluates each response against dictionary definitions and our emotion word list (Szabo et al., forthcoming) to ensure consistent and semantically correct responses, thereby demonstrating content validity. Complementing these scoring metrics, both dichotomous and polytomous Rasch models were applied to assess unidimensionality and item fit, thereby confirming that all vignettes measure a single latent construct, namely emotion vocabulary production. Reliability indices were also calculated to ensure that the test is sufficiently sensitive to distinguish between high- and low-ability participants.

3. Methodology

3.1. Participants

To effectively measure emotion vocabulary in English L2, the performance of the vignettes must first be established. Therefore, L1 and L2 speakers of varying language proficiency were recruited. Only participants aged 18 and above were eligible for this study. Using snowball sampling, a total of 183 adult English speakers at various proficiency levels were recruited. However, 27 participants did not complete the full study and were omitted from the analysis, resulting in a total of 156 participants. Table 1 shows a concise overview of the demographic characteristics of the participants.

3.2. Designing the Productive Emotion Vocabulary Size Test (PEVST) (Appendix A)

The PEVST was designed as a measure of emotion vocabulary in English L2, extending the items of the PEVT (Szabo et al., forthcoming) to include mid- and low-frequency words. The item sampling procedure was conducted with systematic controls for various linguistic and emotional parameters. Controlling for word frequency (Schmitt & Schmitt, 2014; van Heuven et al., 2014) and valence and arousal ratings (Warriner et al., 2013), 512 of the 1607 words in the emotion word list developed by Szabo et al. (forthcoming) were randomly sampled. These words abide by Pavlenko’s categorisation of emotion words, completing the frame “I/(S)he feel(s)…”. The words were selected across four combinations: (1) low arousal, low valence; (2) low arousal, high valence; (3) high arousal, low valence; and (4) high arousal, high valence. Words with valence and arousal ratings between 1 and 4.99 were classified as low arousal/valence, while those with ratings between 5 and 9 were categorised as high arousal/valence (Warriner et al., 2013). Sampling was performed using the Tidyverse (Wickham et al., 2019) and LexOPS (Taylor et al., 2020) packages, controlling for an even distribution across frequency levels. From those randomly sampled words, 51 were selected based on the distinctiveness of emotions, avoiding close synonyms (e.g., “hungry” vs. “famished”); the occurrence of emotion words across different emotion word lists, in recognition of the prevalence of emotion categories; and a balanced distribution of low- and high-frequency emotion words using the Schmitt and Schmitt (2014) framework for high-, mid- and low-frequency words (Table A1, Appendix A). Of the 51 words, 17 were high-frequency, 22 were mid-frequency, and 12 were low-frequency. Vignettes were created for those 51 words according to the following criteria (Goetze, 2023b; Streubel et al., 2020; Szabo et al., forthcoming).
Vignette creation criteria:
  • The emotion word was not explicitly mentioned in the context, but was conveyed through the protagonist’s actions or thoughts to prevent informing participants of the target emotion.
  • Each character’s thoughts were presented in the first person to simulate emotional experiences that participants might feel.
  • Each character’s thoughts did not exceed 2 sentences to keep the vignettes as brief as possible.
Each emotion word was then illustrated to depict a protagonist within a situated context to mimic an emotional episode, accompanied by a short sentence of the protagonist’s thoughts (Figure 2). Each item was presented in a random order on Qualtrics with the following instructions: “Write 2 words that you think the main character might feel”.
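The stratified sampling logic described above (quadrant classification by the Warriner et al. (2013) cut-offs, then random selection within each frequency band and quadrant) can be sketched as follows. This is an illustrative Python reimplementation, not the authors' actual R/LexOPS pipeline; the data structure and function names are hypothetical.

```python
import random

def quadrant(valence, arousal):
    """Classify a word by Warriner et al. (2013) ratings:
    1-4.99 counts as low, 5-9 as high, on both dimensions."""
    v = "high" if valence >= 5 else "low"
    a = "high" if arousal >= 5 else "low"
    return (v, a)

def stratified_sample(words, n_per_cell, seed=42):
    """words: list of dicts with 'word', 'valence', 'arousal', and 'band'
    ('high'/'mid'/'low' frequency). Samples up to n_per_cell words from
    each valence x arousal quadrant within each frequency band, so the
    final item pool is balanced across cells."""
    rng = random.Random(seed)
    cells = {}
    for w in words:
        key = (w["band"],) + quadrant(w["valence"], w["arousal"])
        cells.setdefault(key, []).append(w)
    sample = []
    for key, pool in sorted(cells.items()):
        rng.shuffle(pool)
        sample.extend(pool[:n_per_cell])
    return sample
```

In LexOPS terms, this corresponds to splitting the candidate pool by valence, arousal, and frequency and generating matched subsets; the sketch keeps only the stratification idea.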

3.3. Supplementary Instruments

Lexical Test for Advanced Learners of English (LexTALE)

LexTALE (Lemhöfer & Broersma, 2012) was used to control for participants’ language proficiency. Participants responded “Yes” or “No” to 60 lexical items, consisting of 40 words and 20 non-words, to indicate whether each item presented was a real word. The LexTALE score is calculated as the average percentage of correct responses (averaged % correct), corrected for the unequal proportion of words and non-words in the test (Lemhöfer & Broersma, 2012). In this study, the test yielded an acceptable reliability coefficient, with a Cronbach’s alpha of 0.79. Table 2 shows a summary of participants’ language proficiency.
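For illustration, the averaged-%-correct correction can be sketched in Python as follows, assuming the standard 40-word/20-non-word item split; the function name is ours, not part of the published instrument.

```python
def lextale_score(correct_words, correct_nonwords):
    """LexTALE averaged % correct: percentage correct on the 40 real
    words and on the 20 non-words is computed separately and then
    averaged, which corrects for the 2:1 ratio of words to non-words
    (Lemhofer & Broersma, 2012)."""
    pct_words = correct_words / 40 * 100
    pct_nonwords = correct_nonwords / 20 * 100
    return (pct_words + pct_nonwords) / 2
```

Averaging the two percentages, rather than pooling all 60 items, means that saying “Yes” to everything scores 50 rather than 67, which is why the correction matters.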

3.4. Data Analysis

3.4.1. Scoring the PEVST

The PEVST assesses written emotion word production using two scoring methods: dominant and accuracy. Since the test does not focus on assessing spelling or writing skills, minor and obvious errors in participants’ responses were corrected.
Dominant scoring represents the top three most frequent responses from all participants for each item. Since participants were not restricted to any specific part of speech, all inflected forms of a word were grouped under their headword for dominant scoring (e.g., responses such as “sadness”, “saddened”, and “sadder” were grouped under the headword “sad”). For each vignette, the frequency of all headword responses was calculated, and the three most frequently produced headwords were identified as “dominant”. In the event of a tie (i.e., more than one headword shared the frequency of the third-ranked response), all tied headwords were treated as dominant, yielding a top four or five. One point was awarded for responses whose headword was among the dominant responses for that item; no additional points were awarded for listing more than one dominant response. Non-dominant responses were scored as 0. Consequently, with 51 vignettes in total, the maximum possible dominant score was 51 points.
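A minimal Python sketch of this dominant-scoring procedure, written for illustration (the headword mapping and function names are hypothetical; the authors' actual coding may differ in detail):

```python
from collections import Counter

def dominant_set(responses, headword_of, top_n=3):
    """responses: all participants' responses to one vignette.
    headword_of: mapping from inflected forms to headwords
    (e.g., 'sadness' -> 'sad'). Returns the dominant headwords: the
    top three by frequency, extended to include any headword tied
    with the third-ranked frequency (yielding a top four or five)."""
    counts = Counter(headword_of.get(r, r) for r in responses)
    ranked = counts.most_common()
    if len(ranked) <= top_n:
        return {w for w, _ in ranked}
    cutoff = ranked[top_n - 1][1]  # frequency of the third-ranked headword
    return {w for w, c in ranked if c >= cutoff}

def dominant_score(participant_responses, dominants, headword_of):
    """One point per vignette if any of the participant's responses maps
    into the dominant set; listing several dominant responses still
    earns a single point."""
    return 1 if any(headword_of.get(r, r) in dominants
                    for r in participant_responses) else 0
```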
Unlike dominant scoring, accuracy scoring uses the actual response provided by the participants, not the headword form. Definitions of responses were checked by the first two authors using two online dictionaries, the Britannica Dictionary (Encyclopædia Britannica, 2024) and the Merriam-Webster Dictionary (Merriam-Webster, 2024). Responses were coded as correct if they were synonyms of the target emotion or other acceptable emotion words (e.g., “dirty: disgusted”) that described the vignette and appeared in the emotion word list (Szabo et al., forthcoming). Responses were coded as incorrect if a non-obvious spelling error produced a completely different word, if the response was irrelevant, or if it was absent from the emotion word list. The sum of accurate responses was calculated for each participant. With each of the 51 items allowing up to two responses per vignette, the maximum possible accuracy score was 102.
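Accuracy scoring thus reduces to a membership check against the vignette's accepted words and the emotion word list. A hypothetical Python sketch (the sets and function name are ours, for illustration only):

```python
def accuracy_score(responses, accepted, emotion_word_list):
    """accepted: the target emotion plus its dictionary-verified synonyms
    and other acceptable emotion words for one vignette.
    emotion_word_list: the Szabo et al. (forthcoming) list of emotion and
    emotion-related words. Each of the (up to two) responses earns one
    point only if it is accepted AND appears in the list, so the
    per-vignette maximum is 2 and the test maximum is 51 x 2 = 102."""
    return sum(1 for r in responses[:2]
               if r in accepted and r in emotion_word_list)
```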

3.4.2. Rasch Analysis

Rasch modelling is fundamental to test development, as it acts as a diagnostic tool for test quality and validity (Aryadoust et al., 2021; Boone, 2016). The Rasch model was chosen because it rests on the assumption of unidimensionality, which posits that all items should measure a single latent variable (Bond, 2015; Boone, 2016), in this case, emotion vocabulary production. If multiple variables are measured at the same time, the resulting scores may become ambiguous and less valid. Therefore, by ensuring that items conform to the unidimensionality requirement, we increase the likelihood that the PEVST provides a meaningful measure of emotion vocabulary. Technical quality was evaluated through statistics such as the mean-square fit statistics (MSQ) to verify that all items measure a single latent variable (Boone, 2016). On the one hand, MSQ values under the threshold indicate overfit, meaning that the item behaves too predictably and may be redundant, as it contributes no useful information towards interpreting the results. On the other hand, values above the threshold indicate underfit, suggesting that an item has more randomness than the model predicts, thus jeopardising the validity of the test. Items with infit and outfit MSQ beyond the acceptable threshold of 0.7–1.3 (Boone, 2016; E. V. Smith, 2002; Wright, 1994) can be considered not to conform to the unidimensionality requirements of the Rasch model and should be subjected to further scrutiny or considered for removal.
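As an illustration of how infit and outfit MSQ flag misfit, the following Python sketch computes both statistics for a single dichotomous item from estimated person abilities and an item difficulty. The authors used the eRm package in R; this is only a conceptual reimplementation of the standard formulas, with hypothetical function names.

```python
import math

def rasch_p(theta, b):
    """Probability of a correct response under the dichotomous Rasch
    model, given person ability theta and item difficulty b."""
    return 1 / (1 + math.exp(-(theta - b)))

def item_fit(responses, thetas, b):
    """responses: 0/1 answers of each person to one item; thetas: their
    ability estimates; b: the item's difficulty. Returns (infit, outfit)
    mean-square statistics. Values outside roughly 0.7-1.3 flag misfit:
    below the range suggests overfit (too predictable), above suggests
    underfit (more randomness than the model predicts)."""
    ps = [rasch_p(t, b) for t in thetas]
    variances = [p * (1 - p) for p in ps]
    sq_resid = [(x - p) ** 2 for x, p in zip(responses, ps)]
    # Outfit: unweighted mean of squared standardised residuals,
    # sensitive to unexpected responses from off-target persons
    outfit = sum(r / v for r, v in zip(sq_resid, variances)) / len(responses)
    # Infit: information-weighted, less sensitive to such outliers
    infit = sum(sq_resid) / sum(variances)
    return infit, outfit
```

When responses behave exactly as the model expects, both statistics sit near 1; three able respondents all failing an easy item, for example, drives the outfit far above 1.3.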
In addition to fit statistics, various reliability indices were calculated to further evaluate the test. Person Separation Reliability (PSR) assesses the test’s ability to distinguish between respondents with different levels of the latent trait. A PSR of 0.7 or above is considered the minimum requirement for reliably distinguishing between low- and high-ability individuals, whereas a PSR of 0.8 or above is indicative of a good level of discrimination (Fisher, 1992; Linacre, 2024). Similarly, Item Separation Reliability (ISR) evaluates how well the items are spread along the latent continuum. High ISR values indicate that the items form a stable hierarchy of difficulty, ensuring that the test covers a wide range of ability levels without clustering too narrowly. Both PSR and ISR are equally important because they provide evidence on whether the test is sensitive enough to detect meaningful differences among individuals and whether the items cover the range of abilities effectively. Without acceptable separation reliability, the test may fail to provide a useful or precise measurement. Additionally, to investigate the representativeness of the test, the difficulty of each item was calculated and mapped onto the Wright map, which displays the distribution of item difficulties and person abilities. The Wright map shows whether the test has any floor or ceiling effect: a floor effect means that there are no items targeting low-ability respondents, whereas a ceiling effect means that there are no items targeting high-ability respondents (Boone, 2016). This ensures that the items are neither too easy nor too difficult for the respondents (Boone et al., 2014). Wald’s test was conducted to ensure that the items function consistently across the range of the latent variable and across different respondent groups (Glas & Verhelst, 1995).
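Separation reliability, whether for persons (PSR) or items (ISR), is conventionally the share of observed variance in the estimated measures that is not measurement error. A minimal sketch with hypothetical numbers (not the study's actual estimates):

```python
import statistics

def separation_reliability(measures, std_errors):
    """Separation reliability as typically defined in Rasch measurement:
    (observed variance - mean error variance) / observed variance.
    Applied to person measures this yields PSR; to item measures, ISR."""
    obs_var = statistics.variance(measures)                         # observed variance
    err_var = sum(se ** 2 for se in std_errors) / len(std_errors)   # mean error variance
    return (obs_var - err_var) / obs_var

# Hypothetical person measures (logits) and their standard errors:
r = separation_reliability([-2.0, -1.0, 0.0, 1.0, 2.0], [0.5] * 5)
print(round(r, 2))  # 0.9
```

A value of 0.9 as in this toy example would exceed the 0.8 "good discrimination" benchmark cited above; values below 0.7 would signal that the test cannot reliably separate respondents (or items).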
Both dichotomous and polytomous Rasch modelling were used for the dominant and accuracy scoring, respectively. The dichotomous Rasch model assesses items with binary outcomes (i.e., correct or incorrect responses), where only the most dominant responses were used. The polytomous Rasch model handles items with multiple outcomes for the accuracy scoring, where multiple responses are acceptable. Data were analysed in R v4.3.3 via RStudio (RStudio Team, 2020) using the apaTables (Stanley, 2021), psych (Revelle, 2023), tidyverse (Wickham et al., 2019), and eRm (Mair et al., 2016) packages.
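The two model families differ in their response probability functions. As a minimal illustration (a Python sketch of the standard formulas, not the eRm implementation used in the analysis), the dichotomous model gives the probability of a correct response, while the partial credit model gives the probability of each score category from an item's step parameters:

```python
import math

def p_dichotomous(theta, b):
    """P(correct response) under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def p_partial_credit(theta, deltas, k):
    """P(score = k) under the partial credit model, where `deltas` holds the
    step (threshold) parameters of an item with len(deltas) + 1 categories."""
    numerators = [math.exp(sum(theta - d for d in deltas[:cat]))
                  for cat in range(len(deltas) + 1)]
    return numerators[k] / sum(numerators)

print(p_dichotomous(0.0, 0.0))          # 0.5 when ability equals difficulty
print(p_partial_credit(0.0, [0.0], 1))  # 0.5 for a two-category item
```

With a single step parameter the partial credit model reduces to the dichotomous case, which is why the two scorings can be analysed within the same Rasch framework.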

4. Results

4.1. By-Item Analysis

In total, the PEVST elicited 15,912 responses. Using dominant scoring, a total of 6798 responses (43%) were marked as correct and matched the target emotion word (Table A2, Appendix B). A total of 9114 responses (57%) were non-dominant responses (did not match the top three dominant responses) and were therefore scored as incorrect (Table A3, Appendix C). Using accuracy scoring, a total of 9868 responses (62%) were marked as correct, matching either the target emotion word or an acceptable synonym (Table A4, Appendix D). A total of 6044 responses (38%) were marked as incorrect.
To examine whether the dominant responses matched the target emotion words, the top three most dominant responses for each of the 51 items were inspected: 21 items directly matched their target emotion word (Table A2, Appendix B). Among the remaining 30 items, 13 included at least one close synonym among the top three responses, typically higher frequency words replacing lower frequency target words (e.g., target word: “famished”; top three responses: “hungry,” “tired,” “pain”) (Table A3, Appendix C). Of these 13 items, 7 were mid-frequency words and 6 were low-frequency words that elicited at least one high-frequency synonym (e.g., “daze” (7k)—“confuse” (2k); “euphoric” (11k)—“happy” (1k); “famished” (14k)—“hungry” (1k)). In total, 34 out of 51 items (67%) had responses that were either exact matches or close synonyms of the intended target emotion. This highlights the impact of mid- and low-frequency target emotion words on response accuracy, as discussed below.
Given the fairly even distribution of L1 and L2 participants in this study, we explored whether the two groups produced different response patterns. Table 3 displays the degree of consistency across the first, second, and third most frequent responses (Dom 1, Dom 2, and Dom 3) provided by both groups. Results show a high degree of response consistency between L1 and L2 participants. For 36 of the 51 items (71%), the dominant responses across all three categories (Dom 1, Dom 2, and Dom 3) were identical for both groups, a 100% match rate. For the remaining 15 items (29%), the first and second most dominant responses were the same, but the third differed slightly, yielding a match rate of 66.67%. More importantly, none of the items showed a completely divergent response pattern between the L1 and L2 groups, indicating that even when slight variations occurred, the dominant responses remained consistent.
To improve the construct validity of this test, the flowchart (Figure 3) illustrates the process used to remove underperforming items before proceeding to the next phase of analysis. Setting a naming agreement threshold at 25% (i.e., the proportion of participants providing the same headword as the dominant response) ensures that the vocabulary items included in the assessment are representative enough of the population’s knowledge. Lower frequency words that fall below this threshold may not be sufficiently known or commonly used among a diverse group of learners. The threshold helps filter out obscure items that might otherwise skew the results of the test. A 25% threshold strikes a balance between challenge and accessibility: the test is neither too easy (where nearly everyone knows all the words) nor too difficult (where most words are too obscure). An achievable but non-trivial threshold helps maintain the test’s integrity as an appropriate measure of vocabulary knowledge across a variety of proficiency levels. Additionally, by setting this cut-off, we improve content validity by ensuring that the words assessed are genuinely within the emotional range and understanding of a significant portion of the demographic. The threshold suggests a certain level of commonality in vocabulary recognition, which could contribute to test reliability by including words commonly recognised, understood, and used amongst varied test groups. Following these criteria, 27 items were selected for the next phase of analysis.
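A minimal sketch of this filtering step is shown below. The tallies are hypothetical examples, not the actual item statistics, and the real procedure followed the flowchart in Figure 3:

```python
def retain_items(items, n_participants, threshold=0.25):
    """Keep items whose dominant response matches the target headword and
    reaches the naming-agreement threshold (default: 25% of participants)."""
    kept = []
    for item_id, target, dominant, count in items:
        if dominant == target and count / n_participants >= threshold:
            kept.append(item_id)
    return kept

# Hypothetical tallies: (item, target headword, dominant response, count)
items = [
    ("Q25", "lazy", "lazy", 111),   # matches target; 111/156 = 71% agreement
    ("Q26", "defeat", "sad", 83),   # dominant response is not the headword
    ("Q33", "phony", "phony", 20),  # matches, but 20/156 = 13% < 25%
]
print(retain_items(items, 156))  # ['Q25']
```

Applying this kind of criterion across all 51 items is what reduced the pool to the 27 items carried into the Rasch analysis.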

4.2. Rasch Analysis

4.2.1. Dichotomous Rasch Model

Dominant scoring was used for the dichotomous Rasch model. When conducting dichotomous Rasch modelling, only the most dominant response for each item was considered correct, whereas non-dominant responses were considered incorrect. Table 4 shows the fit statistics of the dichotomous Rasch model using only the most dominant response. Firstly, the technical quality of the PEVST using dominant scoring was inspected for underfitting and overfitting items with infit and outfit MSQ between 0.7 and 1.3 (Boone, 2016; A. B. Smith et al., 2008; Wright, 1994). A t-statistic was also reported for infit and outfit; ideally, the t-value should fall within the range of −1.96 to 1.96 (Wright, 1994).
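The flagging rule just described can be sketched as follows (the fit values are hypothetical illustrations, not the actual Table 4 statistics):

```python
def flag_misfits(fit_stats, msq_range=(0.7, 1.3), t_crit=1.96):
    """Return item IDs whose infit/outfit MSQ fall outside the acceptable
    range, or whose infit/outfit t-values exceed the +/-1.96 criterion."""
    lo, hi = msq_range
    flagged = []
    for item, s in fit_stats.items():
        msq_ok = all(lo <= s[k] <= hi for k in ("infit_msq", "outfit_msq"))
        t_ok = all(abs(s[k]) <= t_crit for k in ("infit_t", "outfit_t"))
        if not (msq_ok and t_ok):
            flagged.append(item)
    return flagged

# Hypothetical fit statistics for two items:
stats = {
    "Q26": {"infit_msq": 1.1, "outfit_msq": 1.2, "infit_t": -2.4, "outfit_t": -2.1},
    "Q27": {"infit_msq": 1.0, "outfit_msq": 0.9, "infit_t": 0.5, "outfit_t": 0.3},
}
print(flag_misfits(stats))  # ['Q26']
```

In this toy example the first item is flagged on the t-value criterion alone, even though its MSQ values are acceptable, mirroring how an item can misfit on one indicator but not the other.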
Based on these criteria, two items were flagged as misfitting for the following reasons:
  • Q26: infit and outfit t out of range (beyond −2);
  • Q71: negative discrimination value.
Item separation reliability (ISR = 0.95) shows that this model is adequate in terms of size and difficulty to assess respondents’ emotion vocabulary production. However, person separation reliability (PSR = 0.46) indicates low discriminative power (Fisher, 1992; Linacre, 2024); the model is therefore unable to properly discern between low- and high-ability respondents.
Figure 4 presents a Wright map of the dichotomous model. The majority of the items are situated between 0 and 1 on the latent dimension. Floor and ceiling effects were noticeable, as there were few to no items targeting participants at either extreme.
Table 5 shows a summary of Wald’s test for the dichotomous Rasch model. According to Wald’s test, Q26 was identified as misfitting (p < 0.05), suggesting potential irregularities within the item.

4.2.2. Polytomous Rasch Model

Accuracy scoring was adopted for the polytomous Rasch model, using the partial credit model. Table 6 shows the fit statistics of the polytomous Rasch model. All items appear to fall within the reasonable range.
Item separation reliability (ISR = 0.93) shows that this model is adequate in size and difficulty to assess respondents’ emotion vocabulary production. Although person separation reliability increased notably (PSR = 0.60) compared with the dichotomous model, it still fell short of the minimum required to distinguish between low- and high-ability participants.
In Figure 5, all items appeared to be relatively evenly distributed across the latent dimension, ranging between −2 and +2. The majority of participants in this study had above-average ability, centred just above the 0 mark. There were no noticeable floor or ceiling effects in this model, indicating that the items are neither too easy nor too difficult for the respondents.
Table 7 shows a summary of the Wald’s test for all 27 items. Wald’s test flagged the following misfitting items (p < 0.05):
  • Q30
    Q30.c1: p < 0.05, z > +1.96;
    Q30.c2: p = 0.065; not significant, but close to the threshold.
  • Q41
    Q41.c2: p < 0.05, z > +1.96;
    Q41.c1: p = 0.054; not significant, but close to the threshold.
Item Q73 was excluded from this analysis due to inappropriate response patterns.

5. Discussion

This study presents the initial development and validation of the Productive Emotion Vocabulary Size Test (PEVST), an assessment that incorporates word frequency, valence, and arousal as key lexical and affective characteristics. The test was designed to include target emotions across low-, mid-, and high-frequency words to provide a comprehensive gauge of vocabulary knowledge. This study sets out to investigate the extent to which scoring and item performance can support the validity of the PEVST as an emotion vocabulary production test.
By-item analysis provided revealing insights into how each item performed, based on the most dominant responses elicited. The majority of the items had responses matching the target emotions, indicating that the vignettes were well constructed to elicit the intended emotion word, thereby reinforcing the test’s construct validity. Out of 51 items, 21 elicited a response that directly matched the target emotion. While 13 mid- and low-frequency items (26%) did not have dominant responses that directly matched the target emotions, they elicited responses from a higher frequency band that were close and/or direct synonyms. This might be due to the easier retrieval of higher frequency words, which requires less effort, in accordance with Zipf’s Principle of Least Effort (Zipf, 2016). Given that a large majority of active vocabularies comprise high-frequency words (Nation, 2004; Schmitt & Schmitt, 2014), participants are more likely to provide responses that they frequently use (Vine et al., 2020; Zhu et al., 2018). As evidenced through by-item analysis, some of the low-frequency target items indeed produced dominant responses that were higher frequency synonyms (e.g., “euphoric”—“happy”, “famished”—“hungry”). This finding emphasises the importance of using a scoring mechanism that accommodates semantically equivalent responses, such as synonyms, to capture the depth of participants’ emotional vocabularies accurately.
Additionally, the ability of both L1 and L2 participants to produce comparable emotion words in response to the same target emotion suggests that the vignettes effectively elicit shared conceptual and linguistic representations of emotional content. This alignment reinforces the validity of the PEVST as a measure of productive emotion vocabulary knowledge that is not biased toward a particular linguistic background. While there were some variations in the third most dominant response between language groups, the present results suggest that the emotion vocabulary, as measured by the PEVST, taps into a core set of concepts that are understood and expressed similarly by both L1 and L2 English speakers. Moreover, the observed response alignment contributes to the argument that differences in emotion vocabulary performance between L1 and L2 users—if any—are more likely attributable to individual differences in language proficiency rather than their linguistic background. This finding is particularly relevant for applied contexts, such as language education, where accurate assessment of learners’ emotional lexicons is essential for supporting emotional expression, communication, and well-being.
To improve the construct validity of the PEVST, items that did not match in the dominant scoring were removed before proceeding with the Rasch analysis, leaving 27 of the 51 items. Although the PSR for the polytomous model fell short of the acceptable benchmark, it was still a notable increase from the dichotomous model, showing that the accuracy scoring was better at distinguishing between high- and low-ability participants. The polytomous model also demonstrated an improved distribution of items across logits (−2 to +2) and aligned item thresholds with person parameters, whereas the item distribution in the dichotomous model was visibly left-skewed, with items spanning only −2 to +1 and observable floor and ceiling effects. Ideally, all items should be evenly distributed across the whole scale to provide a meaningful measure of the latent variable with varying difficulty (Boone, 2016). This means that collapsing responses into a dichotomous scale, using only the most dominant response, was insufficient for a nuanced understanding of the test. Because only the most dominant response was used, a large number of responses were omitted, leaving many potentially correct responses (often lower frequency words) unaccounted for. Furthermore, considering that this is a productive test, discounting low-frequency and unique exemplar responses that still accurately reflect the vignette may be counterproductive. Therefore, the accuracy scoring was more representative of participants’ ability in emotion recognition and production.
However, the dominant scoring was still informative in this study to identify items that perform according to expectations, achieving at least 25% of naming agreement (see Streubel et al., 2020). Undeniably, the PSR coefficient in the polytomous model (PSR < 0.7) still fell short of the minimum benchmark. However, this could be attributed to the limited variance in language proficiency of this sample, as measured by LexTALE, where only 10 participants (6.4%) were from the lower intermediate bracket.
The presence of misfitting items in both Rasch models of the PEVST (dominant and accuracy) raises complex challenges concerning item design, test validity, and reliability. Items Q26 and Q71 were flagged as misfitting in the dichotomous model, but not in the polytomous model. This could be due to the restrictive nature of dominant scoring, which limits the range of responses, thus leading to less accurate estimates of ability. However, in the polytomous model, different items were flagged as misfits by Wald’s test: Q30, Q41, and Q73. Misfitting items can be indicative of several issues, such as dimensional inadequacies (items that do not conform to the unidimensional scale assumed by Rasch models) (Aryadoust et al., 2021; Bond, 2015; Boone et al., 2014), random or careless responding (E. V. Smith, 2002), or differing interpretations of language among test-takers (Brandenburger & Schwichow, 2023; Embretson & Reise, 2013; Hoemann et al., 2019; Vine et al., 2020). Additionally, this may indicate potential ambiguity in emotion-related language processing, as individual differences such as emotion perception and understanding can lead to significant differences in how individuals respond to emotion-related items (Mayer et al., 2012, 2016). However, it could also be attributed to the limited variance in language proficiency, which could distort item fit and restrict item response range. Acknowledging sample limitations, these findings still highlight two key takeaways for this test: the importance of meaningful scoring criteria and item design. Well-developed scoring criteria would allow the test to paint a more accurate picture of participants’ ability (DeVellis & Thorpe, 2021). Additionally, good item design would reduce ambiguity and validity issues (Goetze, 2023b), thus reducing the probability of random responses (E. V. Smith, 2002).
Ultimately, these misfitting items can be subjected to removal as there are other items with similar item difficulties within those logits (see Figure 5). Alternatively, these items could be re-piloted with a different sample, possibly with more variance.
Rasch analysis offers a robust framework for validating the PEVST by addressing key aspects of construct validity, including dimensionality, scoring methodology, and item reliability. The integration of polytomous scoring and vignette-based responses ensures that the PEVST captures the complexity of emotion vocabulary use, offering a comprehensive measure of this critical domain. The iterative development and validation process detailed here demonstrates the PEVST’s potential as a tool for assessing productive emotion vocabulary knowledge, providing valuable insights into the intersection of language, emotion, and cognition. This can inform targeted instructional strategies, such as explicit teaching of low-frequency emotion words and incorporation of contextualised emotion-based activities that bridge lexical gaps in ESL/EFL learners (Dewaele, 2015; Masrai, 2019; Pavlenko, 2012).

Limitations and Future Directions

Readers should note that the findings from this study represent only individuals with upper-intermediate and advanced levels of English proficiency. Future research could extend this work by assessing ESL learners with lower proficiency levels to better evaluate the test’s effectiveness across a broader range of language abilities. While the PEVST extends the PEVT (Szabo et al., forthcoming) by incorporating vignette methodology, cultural relevance and gender differences remain important considerations when using this test. As previous studies found that gender influences emotion vocabulary production (e.g., Bazhydai et al., 2019; Dylman et al., 2020), future studies could explore whether gender differences contribute to how differently these items perform. Additionally, given that emotions are experienced, recognised, and expressed differently across cultures (see Immordino-Yang et al., 2016; Lange et al., 2022; Laukka & Elfenbein, 2020), the vignettes may elicit different emotional responses from culturally diverse populations. Future studies can explore cultural adaptations and cross-cultural validation by adapting the stimuli and linguistic nuances to cater to cultural specificity for a global population. Building on the in-depth item analysis in this study, our next steps include cross-cultural validation beyond the predominantly Malaysian sample, as well as examining a wider range of language proficiencies. We are also currently investigating the extent to which our findings are influenced by language proficiency, emotional intelligence, and word characteristics.

6. Conclusions

The development of the Productive Emotion Vocabulary Size Test (PEVST) marks a significant step forward in measuring emotion vocabulary production among adults. This novel study addresses a gap in emotion vocabulary assessments by incorporating low-, mid-, and high-frequency emotion words to gauge a comprehensive range of emotion vocabulary knowledge. This study examined the extent to which scoring and item performance support the validity of the PEVST in measuring emotion vocabulary production. Of the 51 items, 27 were finalised and included in the Rasch analysis. By-item analysis revealed that low-frequency target emotion words elicited high-frequency responses. Dominant scoring provided an automated scoring system, while accuracy scoring revealed more nuanced responses, particularly lower frequency emotion vocabulary. Rasch analysis offered valuable insights and considerations for item design and test scoring. Findings revealed that accuracy scoring was more thorough than dominant scoring in capturing a comprehensive representation of participants’ emotion vocabulary knowledge. Misfitting items suggested areas for improvement, particularly in item design for response consistency and in the need to account for individual differences in emotional understanding and processing. Homogeneity in language proficiency may also have contributed to these discrepancies, limiting generalisability to high-proficiency individuals. Addressing these challenges in future test iterations could improve the test’s construct validity. In conclusion, the PEVST demonstrates potential as a tool for assessing emotion vocabulary, particularly in an English L2 setting; however, further refinements are needed to address its current limitations.
By integrating other factors such as gender differences, cultural nuances, and language proficiency, future iterations of the PEVST can offer a more comprehensive and accurate measure of emotion vocabulary, thereby contributing valuable insights into the intricate interplay between language, emotions, and cognition.

Author Contributions

Conceptualisation, A.J.E.C., C.Z.S. and S.A.; methodology, A.J.E.C. and C.Z.S.; formal analysis, A.J.E.C. and C.Z.S.; investigation, A.J.E.C.; writing—original draft, A.J.E.C.; writing—review & editing, A.J.E.C., C.Z.S. and S.A.; visualisation, A.J.E.C.; supervision, C.Z.S. and S.A.; project lead and administration, C.Z.S.; funding acquisition, C.Z.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded internally by the Faculty of Arts and Social Science at the University of Nottingham Malaysia.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Faculty of Arts and Social Sciences Research Committee at the University of Nottingham Malaysia (FASS2022-0016/SoEd/ACJ18823555 (Revision 2), 21 December 2023).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Acknowledgments

The authors would like to express their sincerest gratitude to Shanna Lee Zai Dee, Toh Jia Qian, Bong Wei Le Jeremy for their assistance in designing and developing the vignettes; and Lee Soon Tat for reviewing the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Glossary

L1/L2: First language/second language. English L2 refers here to English learned as an additional language in ESL/EFL contexts, but does not denote chronology, dominance, proficiency, or order of acquisition in the case of multilinguals.
ESL/EFL: English as a Second Language refers to learners acquiring English in a country where English is the dominant or official language and is used naturally outside the classroom. English as a Foreign Language refers to learners learning English in a non-English-speaking country, where English is typically used only in the classroom, so exposure is more limited and structured.
BNC/COCA: A combined word frequency list derived from the British National Corpus (BNC) and the Corpus of Contemporary American English (COCA). The combined BNC/COCA list provides frequency rankings that reflect usage in both British and American English.
SUBTLEX-UK: A word frequency database based on over 200 million words from British television and film subtitles. As subtitles closely match everyday spoken language, SUBTLEX-UK provides frequency data that are often more representative of colloquial usage than those based on traditional written corpora.
Word families: A base word and its inflected or derived forms (e.g., teach, teaches, teacher, teaching). Word-family counting assumes learners can recognise related forms once the base word is known.
High-frequency words: The most commonly used words in a language. They cover a large proportion of everyday texts and are essential for basic comprehension.
Mid-frequency words: Words that occur less frequently than high-frequency items but are still common in academic, literary, and general texts.
Low-frequency words: Rare words, often domain-specific, which appear infrequently and are usually acquired incidentally or through specialised reading.
Frequency bands: Groupings of words based on how often they occur in large language corpora (high-frequency: 1–3k; mid-frequency: 4–8k; low-frequency: 9k and above) (Schmitt & Schmitt, 2014).

Appendix A

Table A1. PEVST items.
Item | Target Emotion | Frequency | Valence | Arousal | Vignette
Q23 | dirty | 1k | 4.50 | 3.44 | “Oh no, I need a shower!”
Q24 | doubt | 1k | 3.28 | 4.33 | “Can I really figure this out?”
Q25 | lazy | 1k | 3.90 | 2.76 | “I should really get up and start taking care of these tasks but I don’t want to.”
Q26 | defeat | 3k | 3.74 | 4.14 | “I lost the competition.”
Q27 | obsess | 4k | 3.23 | 4.95 | “This movie is so good, I’m going to get their merchandise!”
Q28 | bewilder | 5k | 4.32 | 4.57 | “What is going on? Why is our boss wearing a unicorn outfit to work?”
Q29 | dismay | 5k | 3.10 | 2.85 | “I can’t believe my job application was rejected again.”
Q30 | dizzy | 5k | 3.36 | 4.95 | “Oh no, I need to lie down.”
Q31 | inferior | 5k | 3.43 | 4.55 | “Everyone else seems to be doing so much better than me.”
Q32 | daze | 7k | 4.14 | 4.5 | “I can’t think clearly”
Q33 | phony | 7k | 2.52 | 4.40 | “I really hope they don’t notice that my bag is fake”
Q34 | drowsy | 9k | 4.25 | 2.83 | “I need to pull over. I can’t keep driving like this”
Q35 | vindictive | 10k | 3.24 | 4.64 | “I am going to hurt you just like how you hurt me”
Q36 | luck | 1k | 6.73 | 4.57 | “I feel like I will win this!”
Q37 | support | 1k | 6.89 | 3.05 | “I have a safety net. I always have people that I can rely on.”
Q38 | secret | 2k | 5.33 | 4.14 | “I need to watch what I say. I do not want them to find out about my partner yet.”
Q39 | relieve | 3k | 7.25 | 3.9 | “Thank goodness! I thought it might be something serious.”
Q40 | sympathy | 3k | 6.67 | 3.29 | “Oh dear, I can’t imagine what it feels like to be cheated on.”
Q41 | nostalgia | 6k | 6.65 | 4.38 | “I miss those good old days.”
Q42 | trustworthy | 8k | 7.25 | 4.22 | “I’m glad my friend shares his secrets with me”.
Q43 | sociable | 9k | 6.43 | 4.35 | “Aww I love having all these people around and making new friends.”
Q44 | bashful | 13k | 5.55 | 4.36 | “Oh? Who is this? I’ve never met her before!”
Q45 | homy | 15k | 5.68 | 3.41 | “It is so nice here, I want to live here forever”
Q46 | hate | 1k | 1.96 | 6.26 | “Urgh! I really can’t stand him.”
Q47 | responsible | 1k | 2.50 | 5.00 | “This is my fault! I will pay the owner for the repairs”
Q48 | accuse | 2k | 3.38 | 5.48 | “I haven’t done anything but she is scolding me.”
Q49 | hostile | 3k | 2.35 | 5.39 | “I can’t stand noisy children! Get out of my restaurant!”
Q50 | curse | 4k | 2.90 | 5.20 | “This is the sixth time I’ve gotten a punctured tyre this month.”
Q51 | greed | 4k | 2.48 | 4.45 | “Yum! This ice-cream is so delicious, I want more!”
Q52 | insult | 4k | 2.62 | 5.3 | “Why do they have to say such nasty things to me?”
Q53 | intimidate | 4k | 2.84 | 5.27 | “He always shouts at me and says I would be fired if I didn’t do well.”
Q54 | frantic | 5k | 3.79 | 5.39 | “I have to find it quickly before they realize it’s missing.”
Q55 | horrified | 5k | 2.68 | 6.29 | “How could something so awful happen?!”
Q56 | reckless | 5k | 3.09 | 5.18 | “I don’t care what happens.”
Q57 | squirm | 7k | 3.86 | 5.29 | “Ugh! I can’t stand the sight of blood.”
Q58 | foreboding | 10k | 3.53 | 5.30 | “I think something bad is going to happen.”
Q59 | grumpy | 10k | 2.81 | 5.05 | “Why is everyone bugging me today?! I wish they would leave me alone.”
Q60 | disbelief | 12k | 4.21 | 5.58 | “There is no way this is actually true.”
Q61 | famished | 14k | 4.47 | 6.43 | “I haven’t eaten all day and now my stomach is hurting. I really need to eat something!”
Q62 | silly | 1k | 6.27 | 5.13 | “I’m acting so childish.”
Q63 | adventure | 2k | 7.40 | 6.36 | “I live for new experiences.”
Q64 | curious | 2k | 6.37 | 5.90 | “I really want to understand how it works.”
Q65 | defence | 2k | 5.36 | 5.11 | “I need to justify myself. I will not let her criticise me like this.”
Q66 | impulse | 4k | 5.16 | 5.33 | “I just had to have it.”
Q67 | gratitude | 5k | 6.67 | 5.09 | “My mother is so thoughtful and kind.”
Q68 | enchant | 6k | 7.16 | 5.27 | “I feel like I’ve stepped into a magical world.”
Q69 | flirt | 6k | 6.73 | 5.93 | “You’re so beautiful. You made me forget my pickup line”
Q70 | tickle | 6k | 6.14 | 5.86 | “I can’t stop laughing! I can’t stand it anymore.”
Q71 | hilarious | 7k | 7.80 | 6.11 | “I killed that joke. Everyone is laughing hysterically!”
Q72 | inquisitive | 9k | 6.00 | 5.33 | “I wonder how many planes, pilots, passengers, and bags are at this airport? I have so many questions!”
Q73 | euphoric | 11k | 7.80 | 5.25 | “We’re finally having a baby after trying for five years! This is amazing!”

Appendix B

Table A2. Summary of the number of items that matched their target emotion word using dominant scoring.
Target Emotion | Freq | Dom 1 | Freq | Count 1 | NA1 | Dom 2 | Freq | Count 2 | NA2 | Dom 3 | Freq | Count 3 | NA3
Dirty | 1k | Dirty * | 1k | 48 | 30.77 | Disgust | 2k | 35 | 22.43 | Un(comfort)able | 1k | 18 | 11.54
Doubt | 1k | Doubt * | 1k | 53 | 33.97 | Confuse | 2k | 33 | 21.15 | Worry | 1k | 18 | 11.54
Lazy | 1k | Lazy * | 1k | 111 | 71.15 | Relax | 2k | 24 | 15.38 | Tire | 1k | 19 | 12.18
Defeat | 3k | Sad | 1k | 83 | 53.20 | Disappoint | 2k | 71 | 45.51 | Defeat * | 3k | 27 | 17.31
Obsess | 4k | Excite | 1k | 50 | 32.05 | Happy | 1k | 34 | 21.79 | Obsess * | 4k | 16 | 10.26
Dizzy | 5k | Dizzy * | 5k | 99 | 63.46 | Tire | 1k | 42 | 26.92 | Sick | 1k | 25 | 16.03
Secret | 2k | Care | 1k | 31 | 19.87 | Cautious | 4k | 30 | 19.23 | Secret * | 2k | 30 | 19.23
Relieve | 3k | Relieve * | 3k | 85 | 54.48 | Relief | 2k | 50 | 32.05 | Happy | 1k | 43 | 27.56
Sympathy | 3k | Sad | 1k | 70 | 44.87 | Empathy | 6k | 41 | 26.28 | Sympathy * | 3k | 37 | 23.72
Nostalgia | 6k | Nostalgic * | 7k | 63 | 40.38 | Happy | 1k | 59 | 37.82 | Sad | 1k | 27 | 17.31
Responsible | 1k | Guilty | 2k | 66 | 42.30 | Responsible * | 1k | 40 | 25.64 | Worry | 1k | 19 | 12.18
Greed | 4k | Happy | 1k | 72 | 46.15 | Satisfy | 2k | 27 | 17.30 | Greed * | 4k | 23 | 14.74
Reckless | 5k | Reckless * | 5k | 36 | 23.07 | Care | 1k | 14 | 8.97 | Apathy | 7k | 9 | 5.77
Disbelief | 12k | Doubt | 1k | 45 | 28.84 | Disbelief * | 12k | 32 | 20.51 | Sceptic | 4k | 26 | 16.67
Silly | 1k | Happy | 1k | 56 | 35.89 | Fun | 1k | 36 | 23.07 | Silly * | 1k | 30 | 19.23
Adventure | 2k | Excite | 1k | 81 | 51.92 | Adventure * | 2k | 69 | 44.23 | Happy | 1k | 31 | 19.87
Curious | 2k | Curious * | 2k | 121 | 77.56 | Determine | 2k | 19 | 12.17 | Interest | 1k | 14 | 8.97
Defence | 2k | Angry | 1k | 44 | 28.21 | Confident | 3k | 28 | 17.94 | Defence * | 2k | 26 | 16.67
Impulse | 4k | Excite | 1k | 24 | 15.38 | Happy | 1k | 23 | 14.74 | Impulse * | 4k | 22 | 14.10
Flirt | 6k | Flirt * | 6k | 42 | 26.92 | Confident | 3k | 24 | 15.38 | Happy | 1k | 17 | 10.90
Inquisitive | 9k | Curious | 2k | 124 | 79.49 | Excite | 1k | 33 | 21.15 | Inquisitive * | 9k | 17 | 10.90
Note. Freq: Frequency bands (Schmitt & Schmitt, 2014); Count 1, 2, and 3: Frequency of response occurrence; Dom 1: Most dominant response; Dom 2: Second most dominant response; Dom 3: Third most dominant response; NA 1, 2, and 3: Naming agreement for Dom 1, 2, and 3, respectively; Items with “*” indicate a direct match between target items and responses.

Appendix C

Table A3. Summary of the number of items that did not match their target emotion word using dominant scoring.
Target Emotion | Freq | Dom 1 | Freq | Count 1 | NA1 | Dom 2 | Freq | Count 2 | NA2 | Dom 3 | Freq | Count 3 | NA3
bewilder | 5k | confuse * | 2k | 70 | 44.87 | curious | 2k | 58 | 37.18 | shock | 2k | 28 | 17.95
dismay | 5k | sad | 1k | 54 | 34.61 | disappoint * | 2k | 47 | 30.13 | frustrate | 2k | 42 | 26.92
inferior | 5k | sad | 1k | 56 | 35.90 | disappoint | 2k | 35 | 22.44 | depress | 2k | 13 | 8.33
daze | 7k | confuse * | 2k | 35 | 22.44 | shock | 2k | 27 | 17.31 | distract | 4k | 12 | 7.69
phony | 7k | worry | 1k | 50 | 32.05 | embarrass | 2k | 25 | 16.03 | anxious | 2k | 25 | 16.03
drowsy | 9k | tire | 1k | 40 | 25.64 | dizzy | 5k | 36 | 23.08 | sick | 1k | 32 | 20.51
vindictive | 10k | angry | 1k | 71 | 45.51 | vengeful * | 11k | 58 | 37.18 | revenge | 5k | 26 | 16.67
luck | 1k | confident | 3k | 61 | 39.10 | hope | 1k | 56 | 35.90 | excite | 1k | 37 | 23.72
support | 1k | happy | 1k | 53 | 33.97 | safe | 1k | 46 | 29.49 | grateful | 3k | 32 | 20.51
trustworthy | 8k | happy | 1k | 52 | 33.33 | trust * | 1k | 40 | 25.64 | grateful | 3k | 31 | 19.87
sociable | 9k | happy | 1k | 96 | 61.54 | joy | 2k | 22 | 14.10 | grateful | 3k | 18 | 11.54
bashful | 13k | shy * | 1k | 77 | 49.36 | curious | 2k | 55 | 35.26 | scare | 1k | 24 | 15.38
homy | 15k | comfort * | 1k | 58 | 37.18 | relax | 2k | 42 | 26.92 | happy | 1k | 39 | 25.00
hate | 1k | annoy | 2k | 84 | 53.85 | angry | 1k | 55 | 35.26 | frustrate | 2k | 24 | 15.38
accuse | 2k | confuse | 2k | 68 | 43.59 | sad | 1k | 23 | 14.74 | annoy | 2k | 15 | 9.62
hostile | 3k | annoy | 2k | 90 | 57.69 | angry | 1k | 82 | 52.56 | frustrate | 2k | 24 | 15.38
curse | 4k | frustrate | 2k | 37 | 23.72 | annoy | 2k | 30 | 19.23 | (un)luck(y) | 1k | 38 | 24.36
insult | 4k | angry | 1k | 71 | 45.51 | sad | 1k | 28 | 17.95 | annoy | 2k | 20 | 12.82
intimidate | 4k | sad | 1k | 52 | 33.33 | scare | 1k | 15 | 9.62 | worry | 1k | 14 | 8.97
frantic | 5k | worry | 1k | 51 | 32.69 | anxious | 2k | 45 | 28.85 | panic * | 2k | 36 | 23.08
horrified | 5k | sad | 1k | 62 | 39.74 | shock * | 2k | 39 | 25.00 | disbelief | 12k | 18 | 11.54
squirm | 7k | scare | 1k | 77 | 49.36 | disgust | 2k | 56 | 35.90 | fear | 1k | 26 | 16.67
foreboding | 10k | worry | 1k | 52 | 33.33 | anxious | 2k | 50 | 32.05 | scare | 1k | 34 | 21.79
grumpy | 10k | annoy * | 2k | 105 | 67.31 | frustrate | 2k | 31 | 19.87 | irritate | 4k | 29 | 18.59
famished | 14k | hungry * | 1k | 119 | 76.28 | tire | 1k | 22 | 14.10 | pain | 1k | 20 | 12.82
gratitude | 5k | grateful * | 3k | 64 | 41.03 | love | 1k | 58 | 37.18 | happy | 1k | 51 | 32.69
enchant | 6k | amaze | 2k | 46 | 29.49 | excite | 1k | 33 | 21.15 | happy | 1k | 28 | 17.95
tickle | 6k | happy | 1k | 95 | 60.90 | fun | 1k | 30 | 19.23 | joy | 2k | 28 | 17.95
hilarious | 7k | proud | 2k | 64 | 41.03 | happy | 1k | 62 | 39.74 | confident | 3k | 25 | 16.03
euphoric | 11k | happy * | 1k | 67 | 42.95 | excite | 1k | 52 | 33.33 | joy * | 2k | 21 | 13.46
Note. Freq: Frequency bands (Schmitt & Schmitt, 2014); Count 1, 2, and 3: Frequency of response occurrence; Dom 1: Most dominant response; Dom 2: Second most dominant response; Dom 3: Third most dominant response; NA 1, 2, and 3: Naming agreement for Dom 1, 2, and 3, respectively; Items with “*” indicate mid- and low-frequency items that elicited at least one high-frequency synonymous response.

Appendix D

Table A4. Summary of accuracy scoring for all 51 items.
| Target Emotion | Freq | Examples | Sum of Accurate Responses |
| --- | --- | --- | --- |
| dirty | 1k | Frustrated, gross, annoyed | 35 |
| doubt | 1k | Anxious, unsure, uncertain | 66 |
| lazy | 1k | Unmotivated, overwhelmed, sluggish | 30 |
| defeat | 3k | Frustrated, dejected, depressed | 69 |
| obsess | 4k | Attracted, addicted, mesmerised | 35 |
| bewilder | 5k | Surprised, funny, puzzled | 62 |
| dismay | 5k | Dejected, hopeless, depressed | 90 |
| dizzy | 5k | Unwell, nauseous, weak | 35 |
| inferior | 5k | Jealous, insecure, worried | 66 |
| daze | 7k | Lost, distraught, disoriented | 40 |
| phony | 7k | Scared, fearful, ashamed | 63 |
| drowsy | 9k | Unwell, fatigue, nauseous | 30 |
| vindictive | 10k | Angry, hateful, resentful | 54 |
| luck | 1k | Optimistic, determined, positive | 33 |
| support | 1k | Loved, secure, comfortable | 55 |
| secret | 2k | Worried, nervous, scared | 54 |
| relieve | 3k | Grateful, thankful, reassured | 35 |
| sympathy | 3k | Pity, sorry, compassion | 35 |
| nostalgia | 6k | Longing, melancholic, wistful | 37 |
| trustworthy | 8k | Glad, appreciated, touched | 45 |
| sociable | 9k | Friendly, welcomed, popular | 30 |
| bashful | 13k | Nervous, confused, anxious | 46 |
| homy | 15k | Content, calm, cosy | 70 |
| hate | 1k | Irritated, disgusted, angry | 75 |
| responsible | 1k | Regret, remorse, sorry | 53 |
| accuse | 2k | Frustrated, misunderstood, wronged | 67 |
| hostile | 3k | Irritated, impatient, angry | 62 |
| curse | 4k | Sad, tired, disappointed | 41 |
| greed | 4k | Hungry, addicted, obsessed | 43 |
| insult | 4k | Hurt, upset, furious | 62 |
| intimidate | 4k | Stressed, hurt, discouraged | 56 |
| frantic | 5k | Scared, nervous, anxious | 52 |
| horrified | 5k | Worried, sorrowful, terrified | 55 |
| reckless | 5k | Fearless, impulsive, free | 30 |
| squirm | 7k | Nauseous, uncomfortable, terrified | 66 |
| foreboding | 10k | Fearful, paranoid, anxious | 47 |
| grumpy | 10k | Angry, impatient, agitated | 60 |
| disbelief | 12k | Unbelievable, suspicious, unimpressed | 33 |
| famished | 14k | Desperate, weak, suffering | 35 |
| silly | 1k | Playful, foolish, comical | 22 |
| adventure | 2k | Adventurous, happy, courageous | 29 |
| curious | 2k | Eager, intrigued, focused | 50 |
| defence | 2k | Determined, annoyed, indignant | 79 |
| impulse | 4k | Greedy, obsessed, tempted | 59 |
| gratitude | 5k | Appreciated, touched, thankful | 61 |
| enchant | 6k | Wonderous, awe, mesmerised | 65 |
| flirt | 6k | Attracted, lustful, infatuated | 55 |
| tickle | 6k | Amused, entertained, giggly | 46 |
| hilarious | 7k | Satisfied, accomplished, funny | 51 |
| inquisitive | 9k | Interested, intrigued, wonder | 45 |
| euphoric | 11k | Ecstatic, elated, overjoyed | 72 |

Note

1. English L2 refers here to English learned as an additional language in ESL/EFL contexts; it does not denote chronological order of acquisition, language dominance, or proficiency in the case of multilinguals.

References

1. Aryadoust, V., Ng, L. Y., & Sayama, H. (2021). A comprehensive review of Rasch measurement in language assessment: Recommendations and guidelines for research. Language Testing, 38(1), 6–40.
2. Baron-Cohen, S., Wheelwright, S., Hill, J., Raste, Y., & Plumb, I. (2001). The “Reading the Mind in the Eyes” Test revised version: A study with normal adults, and adults with Asperger syndrome or high-functioning autism. The Journal of Child Psychology and Psychiatry and Allied Disciplines, 42(2), 241–251.
3. Barrett, L. F. (2017a). How emotions are made: The secret life of the brain. Pan Macmillan.
4. Barrett, L. F. (2017b). The theory of constructed emotion: An active inference account of interoception and categorization. Social Cognitive and Affective Neuroscience, 12(1), 1–23.
5. Barrett, L. F., & Westlin, C. (2021). Navigating the science of emotion. Emotion Measurement, 39–84.
6. Baumann, J. F., & Graves, M. F. (2010). What is academic vocabulary? Journal of Adolescent & Adult Literacy, 54(1), 4–12.
7. Bazhydai, M., Ivcevic, Z., Brackett, M. A., & Widen, S. C. (2019). Breadth of emotion vocabulary in early adolescence. Imagination, Cognition and Personality, 38(4), 378–404.
8. Berscheid, E., Moore, B. S., & Isen, A. M. (1990). Contemporary vocabularies of emotion. In Affect in social behavior (pp. 22–38). Cambridge University Press.
9. Bielak, J., & Mystkowska-Wiertelak, A. (2020). Investigating language learners’ emotion-regulation strategies with the help of the vignette methodology. System, 90, 102208.
10. Bond, T. (2015). Applying the Rasch model: Fundamental measurement in the human sciences (3rd ed.). Routledge/Taylor & Francis Group.
11. Boone, W. J. (2016). Rasch analysis for instrument development: Why, when, and how? CBE Life Sciences Education, 15(4), rm4.
12. Boone, W. J., Yale, M. S., & Staver, J. R. (2014). Rasch analysis in the human sciences. Springer.
13. Boyd, R. L., Ashokkumar, A., Seraj, S., & Pennebaker, J. W. (2022). The development and psychometric properties of LIWC-22. University of Texas at Austin.
14. Bradley, M. M., & Lang, P. J. (1999). Affective norms for English words (ANEW): Instruction manual and affective ratings (Technical report C-1). The Center for Research in Psychophysiology.
15. Brandenburger, M., & Schwichow, M. (2023). Utilizing Latent Class Analysis (LCA) to analyze response patterns in categorical data. In Advances in applications of Rasch measurement in science education (pp. 123–156). Springer.
16. Chen, C., & Truscott, J. (2010). The effects of repetition and L1 lexicalization on incidental vocabulary acquisition. Applied Linguistics, 31(5), 693–713.
17. Council of Europe. (2020). Common European framework of reference for languages: Learning, teaching, assessment companion volume. Council of Europe Publishing. Available online: www.coe.int/lang-cefr (accessed on 1 April 2023).
18. Crossley, S., Salsbury, T., Titak, A., & McNamara, D. (2014). Frequency effects and second language lexical acquisition: Word types, word tokens, and word production. International Journal of Corpus Linguistics, 19, 301–332.
19. Denham, S. A., Bassett, H. H., Brown, C., Way, E., & Steed, J. (2015). “I know how you feel”: Preschoolers’ emotion knowledge contributes to early school success. Journal of Early Childhood Research, 13(3), 252–262.
20. DeVellis, R. F., & Thorpe, C. T. (2021). Scale development: Theory and applications. Sage Publications.
21. Dewaele, J.-M. (2008). Dynamic emotion concepts of L2 learners and L2 users: A second language acquisition perspective. Bilingualism: Language and Cognition, 11(2), 173–175.
22. Dewaele, J.-M. (2015). On emotions in foreign language learning and use. The Language Teacher, 39(3), 13–15.
23. Dylman, A. S., Blomqvist, E., & Champoux-Larsson, M. F. (2020). Reading habits and emotional vocabulary in adolescents. Educational Psychology, 40(6), 681–694.
24. Ebert, M., Ivcevic, Z., Widen, S. S., Linke, L., & Brackett, M. (2014). Breadth of emotion vocabulary in middle schoolers. Available online: https://elischolar.library.yale.edu/cgi/viewcontent.cgi?article=1038&context=dayofdata (accessed on 24 August 2021).
25. Ekman, P. (1999). Facial expressions. In Handbook of cognition and emotion (pp. 301–320). John Wiley & Sons Ltd.
26. Embretson, S. E., & Reise, S. P. (2013). Item response theory. Psychology Press.
27. Encyclopaedia Britannica. (2022). Words for emotions vocabulary word list. Available online: https://www.britannica.com/dictionary/eb/3000-words/topic/emotions-vocabulary-english (accessed on 21 November 2022).
28. Encyclopaedia Britannica. (2024). Find definitions & meanings of words | Britannica Dictionary. Available online: https://www.britannica.com/dictionary (accessed on 21 November 2022).
29. Fabes, R. A., Eisenberg, N., Hanish, L. D., & Spinrad, T. L. (2001). Preschoolers’ spontaneous emotion vocabulary: Relations to likability. Early Education & Development, 12(1), 11–27.
30. Ferré, P., Guasch, M., Stadthagen-Gonzalez, H., & Comesaña, M. (2022). Love me in L1, but hate me in L2: How native speakers and bilinguals rate the affectivity of words when feeling or thinking about them. Bilingualism: Language and Cognition, 25(5), 786–800.
31. Fisher, W. P. J. (1992). Reliability statistics. Rasch Measurement Transactions, 6, 238.
32. Glas, C. A. W., & Verhelst, N. D. (1995). Testing the Rasch Model. In G. H. Fischer, & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications (pp. 69–95). Springer.
33. Goetze, J. (2023a). An appraisal-based examination of language teacher emotions in anxiety-provoking classroom situations using vignette methodology. The Modern Language Journal, 107(1), 328–352.
34. Goetze, J. (2023b). Vignette methodology in applied linguistics. Research Methods in Applied Linguistics, 2(3), 100078.
35. Grosse, G., Streubel, B., Gunzenhauser, C., & Saalbach, H. (2021). Let’s talk about emotions: The development of children’s emotion vocabulary from 4 to 11 years of age. Affective Science, 2(2), 150–162.
36. Ha, A. Y. H., & Hyland, K. (2017). What is technicality? A technicality analysis model for EAP vocabulary. Journal of English for Academic Purposes, 28, 35–49.
37. Hoemann, K., Xu, F., & Barrett, L. F. (2019). Emotion words, emotion concepts, and emotional development in children: A constructionist hypothesis. Developmental Psychology, 55(9), 1830–1849.
38. Immordino-Yang, M. H., Yang, X. F., & Damasio, H. (2016). Cultural modes of expressing emotions influence how emotions are experienced. Emotion, 16(7), 1033.
39. Israelashvili, J., Oosterwijk, S., Sauter, D., & Fischer, A. (2019). Knowing me, knowing you: Emotion differentiation in oneself is associated with recognition of others’ emotions. Cognition and Emotion, 33(7), 1461–1471.
40. Kopp, C. B. (1989). Regulation of distress and negative emotions: A developmental view. Developmental Psychology, 25(3), 343–354.
41. Lakoff, G. (2016). Language and emotion. Emotion Review, 8(3), 269–273.
42. Lange, J., Heerdink, M. W., & van Kleef, G. A. (2022). Reading emotions, reading people: Emotion perception and inferences drawn from perceived emotions. Current Opinion in Psychology, 43, 85–90.
43. Laukka, P., & Elfenbein, H. A. (2020). Cross-cultural emotion recognition and in-group advantage in vocal expression: A meta-analysis. Emotion Review, 13(1), 3–11.
44. Lemhöfer, K., & Broersma, M. (2012). Introducing LexTALE: A quick and valid lexical test for advanced learners of English. Behavior Research Methods, 44(2), 325.
45. Linacre, J. M. (2024). A user’s guide to WINSTEPS® MINISTEP: Rasch-model computer programs (Program Manual 5.8.0). WINSTEPS.
46. Lindquist, K. A. (2021). Language and emotion: Introduction to the special issue. Affective Science, 2(2), 91–98.
47. Mair, P., Hatzinger, R., Maier, M. J., Rusch, T., & Mair, M. P. (2016). Package ‘eRm’. R Foundation.
48. Masrai, A. (2019). Vocabulary and reading comprehension revisited: Evidence for high-, mid-, and low-frequency vocabulary knowledge. Sage Open, 9(2), 2158244019845182.
49. Mavrou, I. (2021). Emotional intelligence, working memory, and emotional vocabulary in L1 and L2: Interactions and dissociations. Lingua, 257, 103083.
50. Mayer, J. D., Caruso, D. R., & Salovey, P. (2016). The ability model of emotional intelligence: Principles and updates. Emotion Review, 8(4), 290–300.
51. Mayer, J. D., Salovey, P., & Caruso, D. (2012). Models of emotional intelligence. In Handbook of intelligence (pp. 396–420). Cambridge University Press.
52. Merriam-Webster. (2024). Merriam-Webster: America’s most trusted dictionary. Available online: https://www.merriam-webster.com/ (accessed on 21 November 2022).
53. Milton, J. (2013). Measuring the contribution of vocabulary knowledge to proficiency in the four skills. In C. Bardel, C. Lindqvist, & B. Laufer (Eds.), L2 vocabulary acquisition, knowledge and use: New perspectives on assessment and corpus analysis (pp. 57–78). Amsterdam: Eurosla.
54. Nation, P. (2004). A study of the most frequent word families in the British National Corpus. In Vocabulary in a second language (pp. 3–13). John Benjamins Publishing Company.
55. Nation, P., & Beglar, D. (2007). A vocabulary size test. The Language Teacher, 31(7), 9–13.
56. Ng, B. C., Cui, C., & Cavallaro, F. (2019). The annotated lexicon of Chinese emotion words. WORD, 65(2), 73–92.
57. Nook, E. C., Sasse, S. F., Lambert, H. K., McLaughlin, K. A., & Somerville, L. H. (2017). Increasing verbal knowledge mediates development of multi-dimensional emotion representations. Nature Human Behaviour, 1(12), 881–889.
58. Nook, E. C., Stavish, C. M., Sasse, S. F., Lambert, H. K., Mair, P., McLaughlin, K. A., & Somerville, L. H. (2020). Charting the development of emotion comprehension and abstraction from childhood to adulthood using observer-rated and linguistic measures. Emotion, 20(5), 773–792.
59. Pavlenko, A. (2008). Emotion and emotion-laden words in the bilingual lexicon. Bilingualism, 11(2), 147–164.
60. Pavlenko, A. (2012). Affective processing in bilingual speakers: Disembodied cognition? International Journal of Psychology, 47(6), 405–428.
61. Pentón Herrera, L. J., & Darragh, J. J. (2024). Social-emotional learning in English language teaching. University of Michigan Press.
62. Pérez-García, E., & Sánchez, M. J. (2020). Emotions as a linguistic category: Perception and expression of emotions by Spanish EFL students. Language, Culture and Curriculum, 33(3), 274–289.
63. Qian, D. D., & Lin, L. H. F. (2019). The relationship between vocabulary knowledge and language proficiency. In The Routledge handbook of vocabulary studies (pp. 66–80). Routledge.
64. Revelle, W. (2023). Procedures for psychological, psychometric, and personality research [R package psych version 2.3.12]. Available online: https://CRAN.R-project.org/package=psych (accessed on 11 November 2022).
65. RStudio Team. (2020). RStudio: Integrated development environment for R (4.3.3). RStudio, PBC.
66. Sánchez, M. J., & Pérez-García, E. (2020). Emotion(less) textbooks? An investigation into the affective lexical content of EFL textbooks. System, 93, 102299.
67. Schmitt, N., & Schmitt, D. (2014). A reassessment of frequency and vocabulary size in L2 vocabulary teaching. Language Teaching, 47(4), 484–503.
68. Schmitt, N., & Schmitt, D. (2020). Vocabulary in language teaching. Cambridge University Press.
69. Sette, S., Spinrad, T. L., & Baumgartner, E. (2017). The relations of preschool children’s emotion knowledge and socially appropriate behaviors to peer likability. International Journal of Behavioral Development, 41(4), 532–541.
70. Smith, A. B., Rush, R., Fallowfield, L. J., Velikova, G., & Sharpe, M. (2008). Rasch fit statistics and sample size considerations for polytomous data. BMC Medical Research Methodology, 8, 33.
71. Smith, E. V., Jr. (2002). Detecting and evaluating the impact of multidimensionality using item fit statistics and principal component analysis of residuals. Journal of Applied Measurement, 3(2), 205–231.
72. Stanley, D. (2021). Create American Psychological Association (APA) style tables [R package apaTables version 2.0.8]. Available online: https://CRAN.R-project.org/package=apaTables (accessed on 2 November 2022).
73. Streubel, B., Gunzenhauser, C., Grosse, G., & Saalbach, H. (2020). Emotion-specific vocabulary and its contribution to emotion understanding in 4- to 9-year-old children. Journal of Experimental Child Psychology, 193, 104790.
74. Szabo, C. Z., Bong, W. L. J., Wang, Y., Chee, A., & Lee, S. T. (forthcoming). Developing a productive emotion vocabulary test for adult speakers of English as an additional language [Manuscript in preparation].
75. Takizawa, K. (2024). What contributes to fluent L2 speech? Examining cognitive and utterance fluency link with underlying L2 collocational processing speed and accuracy. Applied Psycholinguistics, 45(3), 516–541.
76. Taylor, J. E., Beith, A., & Sereno, S. C. (2020). LexOPS: An R package and user interface for the controlled generation of word stimuli. Behavior Research Methods, 52(6), 2372–2382.
77. van Heuven, W. J. B., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). SUBTLEX-UK: A new and improved word frequency database for British English. Quarterly Journal of Experimental Psychology, 67(6), 1176–1190.
78. Vine, V., Boyd, R. L., & Pennebaker, J. W. (2020). Natural emotion vocabularies as windows on distress and well-being. Nature Communications, 11(1), 1–9.
79. Warriner, A. B., Kuperman, V., & Brysbaert, M. (2013). Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45(4), 1191–1207.
80. Webb, S., Sasao, Y., & Ballance, O. (2017). The updated Vocabulary Levels Test: Developing and validating two new forms of the VLT. ITL-International Journal of Applied Linguistics, 168(1), 33–69.
81. Weidman, A. C., Steckler, C. M., & Tracy, J. L. (2017). The jingle and jangle of emotion assessment: Imprecise measurement, casual scale usage, and conceptual fuzziness in emotion research. Emotion, 17(2), 267–295.
82. Whissell, C. M. (1989). The Dictionary of Affect in Language. In The measurement of emotions (pp. 113–131). Academic Press.
83. Wickham, H., Averick, M., Bryan, J., Chang, W., Mcgowan, L. D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Lin Pedersen, T., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the Tidyverse. Journal of Open Source Software, 4(43), 1686.
84. Wilkens, R., Dalla Vecchia, A., Boito, M. Z., Padró, M., & Villavicencio, A. (2014). Size does not matter. Frequency does. A study of features for measuring lexical complexity. In Advances in artificial intelligence—IBERAMIA 2014 (Lecture Notes in Computer Science, Vol. 8864, pp. 129–140). Springer.
85. Wright, B. D. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8, 370.
86. Zhao, J., & Huang, J. (2023). A comparative study of frequency effect on acquisition of grammar and meaning of words between Chinese and foreign learners of English language. Frontiers in Psychology, 14, 1125483.
87. Zhu, Y., Zhang, B., Wang, Q. A., Li, W., & Cai, X. (2018). The principle of least effort and Zipf distribution. Journal of Physics: Conference Series, 1113(1), 012007.
88. Zipf, G. K. (2016). Human behavior and the principle of least effort: An introduction to human ecology. Ravenio Books.
Figure 1. Illustration from the PEVT, an adapted item from Streubel et al. (2020), depicting the emotion “Joy”, supplemented by a short excerpt: ‘Wow! That’s a great present!’ (Szabo et al., forthcoming).
Figure 2. Sample item from the PEVST: “famished”.
Figure 3. Flowchart for finalising items for Rasch analysis.
Figure 4. Wright’s map of person ability and item map for dichotomous Rasch model.
Figure 5. Wright's map of person ability and item map for polytomous Rasch model.
Table 1. Summary of participant demographics.
| Demographics | N (%) |
| --- | --- |
| Gender | |
| Male | 64 (41%) |
| Female | 92 (59%) |
| Language Background | |
| L1 English | 82 (53%) |
| L2 English | 74 (47%) |
| Nationality | |
| Malaysian | 126 (81%) |
| Non-Malaysian | 30 (19%) |
Note. L1 English denotes participants with English as their first language; L2 English denotes participants with English as their second or third language; non-Malaysian participants come from 16 countries, including China, Romania, and Seychelles.
Table 2. Summary of participants’ language proficiency as measured by LexTALE.
| Level | Range | N | Min | Max | Mean | SD |
| --- | --- | --- | --- | --- | --- | --- |
| Advanced | 80–100 | 103 | 80.00 | 100.00 | 91.74 | 5.75 |
| Upper intermediate | 60–79 | 24 | 61.25 | 78.75 | 71.04 | 5.83 |
| Lower intermediate | <60 | 10 | 40.00 | 55.00 | 48.38 | 4.88 |
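The proficiency bands in Table 2 follow the LexTALE cut-offs (scores of 80 and above as advanced, 60–79 as upper intermediate, below 60 as lower intermediate); a minimal sketch of that banding, where the helper `band` is illustrative rather than from the LexTALE materials:

```python
# Map a LexTALE score (0-100 scale) to the proficiency band used in
# Table 2 (illustrative helper; cut-offs as shown in the table).
def band(score: float) -> str:
    if score >= 80:
        return "Advanced"
    if score >= 60:
        return "Upper intermediate"
    return "Lower intermediate"

# Group means from Table 2 fall in their expected bands.
assert band(91.74) == "Advanced"
assert band(71.04) == "Upper intermediate"
assert band(48.38) == "Lower intermediate"
```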
Table 3. Dominant response consistency between L1 and L2 English participants.
| Item | Target Emotion | Dom 1 | Dom 2 | Dom 3 | Match % |
| --- | --- | --- | --- | --- | --- |
| Q23 | dirty | Yes | Yes | Yes | 100 |
| Q24 | doubt | Yes | Yes | Yes | 100 |
| Q25 | lazy | Yes | Yes | Yes | 100 |
| Q26 | defeat | Yes | Yes | Yes | 100 |
| Q27 | obsess | Yes | Yes | Yes | 100 |
| Q28 | bewilder | Yes | Yes | Yes | 100 |
| Q29 | dismay | Yes | Yes | Yes | 100 |
| Q30 | dizzy | Yes | Yes | Yes | 100 |
| Q31 | inferior | Yes | Yes | Yes | 100 |
| Q32 | daze | Yes | Yes | No | 66.66 |
| Q33 | phony | Yes | Yes | No | 66.66 |
| Q34 | drowsy | Yes | Yes | Yes | 100 |
| Q35 | vindictive | Yes | Yes | Yes | 100 |
| Q36 | luck | Yes | Yes | Yes | 100 |
| Q37 | support | Yes | Yes | No | 66.66 |
| Q38 | secret | Yes | Yes | Yes | 100 |
| Q39 | relieve | Yes | Yes | Yes | 100 |
| Q40 | sympathy | Yes | Yes | Yes | 100 |
| Q41 | nostalgia | Yes | Yes | No | 66.66 |
| Q42 | trustworthy | Yes | Yes | Yes | 100 |
| Q43 | sociable | Yes | Yes | No | 66.66 |
| Q44 | bashful | Yes | Yes | Yes | 100 |
| Q45 | homy | Yes | Yes | Yes | 100 |
| Q46 | hate | Yes | Yes | No | 66.66 |
| Q47 | responsible | Yes | Yes | Yes | 100 |
| Q48 | accuse | Yes | Yes | No | 66.66 |
| Q49 | hostile | Yes | Yes | Yes | 100 |
| Q50 | curse | Yes | Yes | Yes | 100 |
| Q51 | greed | Yes | Yes | No | 66.66 |
| Q52 | insult | Yes | Yes | Yes | 100 |
| Q53 | intimidate | Yes | Yes | No | 66.66 |
| Q54 | frantic | Yes | Yes | Yes | 100 |
| Q55 | horrified | Yes | Yes | Yes | 100 |
| Q56 | reckless | Yes | Yes | No | 66.66 |
| Q57 | squirm | Yes | Yes | No | 66.66 |
| Q58 | foreboding | Yes | Yes | Yes | 100 |
| Q59 | grumpy | Yes | Yes | No | 66.66 |
| Q60 | disbelief | Yes | Yes | Yes | 100 |
| Q61 | famished | Yes | Yes | No | 66.66 |
| Q62 | silly | Yes | Yes | Yes | 100 |
| Q63 | adventure | Yes | Yes | Yes | 100 |
| Q64 | curious | Yes | Yes | Yes | 100 |
| Q65 | defence | Yes | Yes | Yes | 100 |
| Q66 | impulse | Yes | Yes | Yes | 100 |
| Q67 | gratitude | Yes | Yes | Yes | 100 |
| Q68 | enchant | Yes | Yes | Yes | 100 |
| Q69 | flirt | Yes | Yes | Yes | 100 |
| Q70 | tickle | Yes | Yes | Yes | 100 |
| Q71 | hilarious | Yes | Yes | No | 66.66 |
| Q72 | inquisitive | Yes | Yes | No | 66.66 |
| Q73 | euphoric | Yes | Yes | Yes | 100 |
Table 4. Fit statistics of dichotomous Rasch model using only the most dominant response.
| Item | Chisq | df | p-Value | Outfit MSQ | Infit MSQ | Outfit t | Infit t | Discrim |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Q23 | 152.294 | 155 | 0.546 | 0.976 | 0.974 | −0.239 | −0.350 | 0.218 |
| Q24 | 171.465 | 155 | 0.173 | 1.099 | 1.060 | 1.262 | 0.966 | 0.021 |
| Q25 | 157.887 | 155 | 0.420 | 1.012 | 0.996 | 0.159 | −0.031 | 0.149 |
| Q26 | 138.577 | 155 | 0.824 | 0.888 | 0.899 | −2.401 | −2.458 | 0.447 |
| Q27 | 144.565 | 155 | 0.715 | 0.927 | 0.972 | −0.862 | −0.402 | 0.246 |
| Q28 | 151.313 | 155 | 0.569 | 0.970 | 0.978 | −0.579 | −0.491 | 0.237 |
| Q29 | 145.396 | 155 | 0.698 | 0.932 | 0.941 | −0.896 | −0.974 | 0.333 |
| Q30 | 161.952 | 155 | 0.335 | 1.038 | 1.026 | 0.579 | 0.470 | 0.097 |
| Q35 | 145.824 | 155 | 0.689 | 0.935 | 0.944 | −1.316 | −1.320 | 0.334 |
| Q39 | 162.260 | 155 | 0.329 | 1.040 | 1.028 | 0.821 | 0.659 | 0.100 |
| Q41 | 167.880 | 155 | 0.227 | 1.076 | 1.060 | 1.271 | 1.202 | 0.040 |
| Q42 | 142.110 | 155 | 0.763 | 0.911 | 0.929 | −1.091 | −1.080 | 0.356 |
| Q44 | 162.648 | 155 | 0.321 | 1.043 | 1.047 | 0.908 | 1.126 | 0.086 |
| Q45 | 166.265 | 155 | 0.254 | 1.066 | 1.068 | 0.974 | 1.218 | 0.022 |
| Q47 | 155.877 | 155 | 0.465 | 0.999 | 0.995 | 0.004 | −0.101 | 0.200 |
| Q54 | 147.651 | 155 | 0.650 | 0.946 | 0.969 | −0.636 | −0.459 | 0.238 |
| Q55 | 152.778 | 155 | 0.535 | 0.979 | 0.983 | −0.330 | −0.338 | 0.234 |
| Q58 | 168.624 | 155 | 0.215 | 1.081 | 1.053 | 1.014 | 0.833 | 0.051 |
| Q59 | 155.565 | 155 | 0.472 | 0.997 | 0.992 | −0.010 | −0.098 | 0.175 |
| Q60 | 160.045 | 155 | 0.374 | 1.026 | 1.024 | 0.297 | 0.336 | 0.093 |
| Q61 | 149.983 | 155 | 0.599 | 0.961 | 0.976 | −0.311 | −0.235 | 0.196 |
| Q63 | 146.094 | 155 | 0.684 | 0.937 | 0.947 | −1.357 | −1.290 | 0.325 |
| Q64 | 135.482 | 155 | 0.869 | 0.868 | 0.934 | −1.053 | −0.636 | 0.291 |
| Q67 | 155.872 | 155 | 0.465 | 0.999 | 1.014 | 0.005 | 0.312 | 0.145 |
| Q69 | 153.436 | 155 | 0.520 | 0.984 | 0.979 | −0.121 | −0.221 | 0.181 |
| Q71 | 171.739 | 155 | 0.170 | 1.101 | 1.074 | 1.705 | 1.513 | −0.012 |
| Q73 | 151.808 | 155 | 0.557 | 0.973 | 0.961 | −0.472 | −0.826 | 0.315 |
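A conventional way to read the mean-square columns in Tables 4 and 6 is to flag items whose infit/outfit MSQ falls outside a "reasonable" range, e.g. 0.5–1.5 (cf. Wright, 1994). A minimal sketch over three rows of Table 4; the helper `flag_misfits` and the chosen bounds are illustrative assumptions, not the authors' stated criterion:

```python
# Flag items whose outfit/infit mean-squares fall outside a chosen
# range; the bounds are an assumption (cf. Wright, 1994), not the
# criterion applied in the study.
def flag_misfits(items, lo=0.5, hi=1.5):
    return [name for name, outfit, infit in items
            if not (lo <= outfit <= hi and lo <= infit <= hi)]

# (item, outfit MSQ, infit MSQ) taken from Table 4.
rows = [("Q23", 0.976, 0.974), ("Q26", 0.888, 0.899), ("Q71", 1.101, 1.074)]
assert flag_misfits(rows) == []                    # all within 0.5-1.5
assert flag_misfits(rows, 0.9, 1.1) == ["Q26", "Q71"]  # stricter bounds
```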
Table 5. Wald’s test for dichotomous Rasch model.
| Item | z-Statistic | p-Value |
| --- | --- | --- |
| Q23 | 1.056 | 0.291 |
| Q24 | 0.865 | 0.387 |
| Q25 | 0.945 | 0.344 |
| Q26 | −1.987 | 0.047 |
| Q27 | −1.072 | 0.284 |
| Q28 | −0.494 | 0.621 |
| Q29 | −1.050 | 0.294 |
| Q30 | −0.262 | 0.793 |
| Q35 | −1.679 | 0.093 |
| Q39 | 0.643 | 0.520 |
| Q41 | 1.118 | 0.264 |
| Q42 | −1.590 | 0.112 |
| Q44 | 1.858 | 0.063 |
| Q45 | 1.683 | 0.092 |
| Q47 | −0.098 | 0.922 |
| Q54 | 0.521 | 0.602 |
| Q55 | −0.906 | 0.365 |
| Q58 | 0.695 | 0.487 |
| Q59 | 0.755 | 0.450 |
| Q60 | 1.599 | 0.110 |
| Q61 | −1.073 | 0.283 |
| Q63 | −1.589 | 0.112 |
| Q64 | −1.017 | 0.309 |
| Q67 | 0.601 | 0.548 |
| Q69 | 0.305 | 0.760 |
| Q71 | 1.939 | 0.052 |
Table 6. Fit statistics of polytomous Rasch model using accuracy scoring.
| Item | Chisq | df | p-Value | Outfit MSQ | Infit MSQ | Outfit t | Infit t | Discrim |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Q23 | 161.713 | 155 | 0.340 | 1.037 | 1.033 | 0.462 | 0.435 | 0.133 |
| Q24 | 149.850 | 155 | 0.602 | 0.961 | 0.978 | −0.386 | −0.205 | 0.255 |
| Q25 | 163.957 | 155 | 0.296 | 1.051 | 1.050 | 0.505 | 0.498 | 0.037 |
| Q26 | 155.784 | 155 | 0.467 | 0.999 | 0.999 | 0.028 | 0.032 | 0.192 |
| Q27 | 161.377 | 155 | 0.346 | 1.034 | 1.013 | 0.410 | 0.175 | 0.133 |
| Q28 | 139.445 | 155 | 0.810 | 0.894 | 0.914 | −0.878 | −0.746 | 0.349 |
| Q29 | 132.924 | 155 | 0.900 | 0.852 | 0.896 | −0.940 | −0.723 | 0.385 |
| Q30 | 152.918 | 155 | 0.532 | 0.980 | 0.982 | −0.177 | −0.162 | 0.257 |
| Q35 | 159.021 | 155 | 0.396 | 1.019 | 1.018 | 0.226 | 0.213 | 0.166 |
| Q39 | 151.046 | 155 | 0.575 | 0.968 | 0.980 | −0.313 | −0.182 | 0.231 |
| Q41 | 163.181 | 155 | 0.311 | 1.046 | 1.037 | 0.522 | 0.430 | 0.134 |
| Q42 | 156.556 | 155 | 0.450 | 1.004 | 1.001 | 0.070 | 0.042 | 0.220 |
| Q44 | 142.253 | 155 | 0.760 | 0.912 | 0.927 | −0.891 | −0.758 | 0.337 |
| Q45 | 183.286 | 155 | 0.060 | 1.175 | 1.096 | 1.566 | 0.945 | 0.021 |
| Q47 | 139.361 | 155 | 0.811 | 0.893 | 0.899 | −1.185 | −1.130 | 0.391 |
| Q54 | 142.180 | 155 | 0.761 | 0.911 | 0.907 | −0.853 | −0.935 | 0.379 |
| Q55 | 147.801 | 155 | 0.647 | 0.947 | 0.948 | −0.560 | −0.558 | 0.312 |
| Q58 | 153.714 | 155 | 0.514 | 0.985 | 0.993 | −0.127 | −0.044 | 0.210 |
| Q59 | 148.962 | 155 | 0.622 | 0.955 | 0.967 | −0.379 | −0.278 | 0.268 |
| Q60 | 140.637 | 155 | 0.789 | 0.902 | 0.915 | −1.115 | −0.978 | 0.348 |
| Q61 | 149.637 | 155 | 0.606 | 0.959 | 0.960 | −0.369 | −0.360 | 0.243 |
| Q63 | 149.346 | 155 | 0.613 | 0.957 | 0.957 | −0.447 | −0.452 | 0.274 |
| Q64 | 146.730 | 155 | 0.670 | 0.941 | 0.937 | −0.617 | −0.652 | 0.328 |
| Q67 | 150.628 | 155 | 0.584 | 0.966 | 0.951 | −0.199 | −0.341 | 0.274 |
| Q69 | 151.652 | 155 | 0.561 | 0.972 | 0.964 | −0.288 | −0.385 | 0.239 |
| Q71 | 158.685 | 155 | 0.403 | 1.017 | 1.001 | 0.195 | 0.039 | 0.210 |
| Q73 | 147.618 | 155 | 0.651 | 0.946 | 0.963 | −0.560 | −0.375 | 0.285 |
Table 7. Wald’s test for polytomous Rasch model.
| Item | z-Statistic | p-Value |
| --- | --- | --- |
| Q23.c1 | 0.567 | 0.570 |
| Q23.c2 | 0.467 | 0.640 |
| Q24.c1 | −0.527 | 0.598 |
| Q24.c2 | −0.474 | 0.635 |
| Q25.c1 | 1.810 | 0.070 |
| Q25.c2 | 1.616 | 0.106 |
| Q26.c1 | −0.997 | 0.319 |
| Q26.c2 | −0.416 | 0.677 |
| Q27.c1 | 0.257 | 0.797 |
| Q27.c2 | −0.775 | 0.439 |
| Q28.c1 | −0.978 | 0.328 |
| Q28.c2 | −0.695 | 0.487 |
| Q29.c1 | 0.029 | 0.977 |
| Q29.c2 | −0.465 | 0.642 |
| Q30.c1 | 2.294 | 0.022 |
| Q30.c2 | 1.845 | 0.065 |
| Q35.c1 | 0.309 | 0.757 |
| Q35.c2 | 1.409 | 0.159 |
| Q39.c1 | −1.054 | 0.292 |
| Q39.c2 | −0.828 | 0.408 |
| Q41.c1 | 1.931 | 0.054 |
| Q41.c2 | 2.202 | 0.028 |
| Q42.c1 | 0.292 | 0.770 |
| Q42.c2 | 0.626 | 0.531 |
| Q44.c1 | −0.969 | 0.333 |
| Q44.c2 | −1.025 | 0.305 |
| Q45.c1 | −0.010 | 0.992 |
| Q45.c2 | 0.621 | 0.534 |
| Q47.c1 | −0.378 | 0.705 |
| Q47.c2 | −0.855 | 0.392 |
| Q54.c1 | 0.740 | 0.459 |
| Q54.c2 | −0.089 | 0.929 |
| Q55.c1 | −0.684 | 0.494 |
| Q55.c2 | 0.163 | 0.87 |
| Q58.c1 | 1.137 | 0.256 |
| Q58.c2 | 1.041 | 0.298 |
| Q59.c1 | −0.113 | 0.910 |
| Q59.c2 | −0.312 | 0.755 |
| Q60.c1 | −0.784 | 0.433 |
| Q60.c2 | −1.255 | 0.209 |
| Q61.c1 | 0.553 | 0.58 |
| Q61.c2 | −0.124 | 0.901 |
| Q63.c1 | 1.047 | 0.295 |
| Q63.c2 | 0.467 | 0.641 |
| Q64.c1 | −0.267 | 0.790 |
| Q64.c2 | −0.647 | 0.517 |
| Q67.c1 | −0.251 | 0.802 |
| Q67.c2 | −0.093 | 0.926 |
| Q69.c1 | −0.217 | 0.828 |
| Q69.c2 | −0.301 | 0.763 |
| Q71.c1 | 0.701 | 0.483 |
| Q71.c2 | 0.457 | 0.648 |

Share and Cite

MDPI and ACS Style

Chee, A.J.E.; Szabo, C.Z.; Ambrose, S. Measuring Emotion Recognition Through Language: The Development and Validation of an English Productive Emotion Vocabulary Size Test. Languages 2025, 10, 204. https://doi.org/10.3390/languages10090204
