Article

Animate, or Inanimate, That Is the Question for Large Language Models

by Giulia Pucci 1, Fabio Massimo Zanzotto 2 and Leonardo Ranaldi 1,2,*
1 Department of Computing Science, University of Aberdeen, Aberdeen AB24 3UE, UK
2 Human-Centric ART, University of Rome Tor Vergata, Viale del Politecnico, 1, 00133 Rome, Italy
* Author to whom correspondence should be addressed.
Information 2025, 16(6), 493; https://doi.org/10.3390/info16060493
Submission received: 15 April 2025 / Revised: 31 May 2025 / Accepted: 3 June 2025 / Published: 13 June 2025

Abstract

The cognitive core of human beings is closely connected to the concept of animacy, which significantly influences their memory, vision, and complex language comprehension. While animacy is reflected in language through subtle constraints on verbs and adjectives, it is also acquired and honed through non-linguistic experiences. In the same vein, we suggest that the limited capacity of LLMs to grasp natural language, particularly in relation to animacy, stems from the fact that these models are trained solely on textual data. Hence, the question this paper aims to answer arises: can LLMs, in their digital wisdom, process animacy in a way similar to humans? We propose a systematic analysis via prompting approaches. In particular, we probe different LLMs using controlled lexical contrasts (animate vs. inanimate nouns) and narrative contexts in which typically inanimate entities behave as animate. We evaluate seven LLMs selected from three major families—OpenAI (GPT-3.5, GPT-4), Meta (Llama2 7B, 13B, 70B), and Mistral (Mistral-7B, Mixtral). Results reveal that, although these models are trained predominantly on textual data, they exhibit human-like behavior when faced with typical animate and inanimate entities, in alignment with earlier studies. GPT models generally achieve the most consistent and human-like performance and, in some tasks, such as sentence plausibility and acceptability judgments, even surpass human baselines. The other models, although to a lesser degree, attain comparable results. Hence, LLMs can adapt to unconventional situations, recognising atypical entities as animate without relying on the non-linguistic cognitive cues humans use to process animacy.

1. Introduction

The mnemonic mechanisms underlying cognitive processing appear to make animate entities and concepts easier to memorise, which highlights the importance of the animacy effect in human cognition [1,2,3]. Animacy is expressed in language through humans’ ability to use specific verbs and adjectives with animate and inanimate entities, enabling them to infer and reason about the mental states, intentions, and reactions of others. This ability allows humans to navigate and understand social interactions. Therefore, employing NLP models in increasingly complex social contexts requires a similar capacity to capture these socio-cognitive dynamics [4,5,6,7].
Current large language models (LLMs), such as the GPTs [8], PaLM [9], and Llama [10], are trained merely on textual data and, unlike humans, cannot access non-verbal information. When faced with discerning animacy, these systems must infer it from its linguistic manifestations, diverging from humans, who also benefit from visual and physical stimuli. Thus, a fundamental question arises: do LLMs perceive and respond to animacy cues in language in a way that approximates human comprehension?
This work investigates whether LLMs exhibit human-like behavior when processing animacy. We conduct extensive studies using LLMs as participants in psycholinguistic experiments originally designed for humans. Through this approach, we explore how LLMs respond to violations of selective constraints related to animacy in typical and atypical contexts. Complementing the foundation work of Warstadt et al. [11], Spiliopoulou et al. [12], and Hanna et al. [13], we study the animacy effect by proposing systematic prompting approaches for eliciting different LLMs to understand scenarios and situations that demand intricate answers.
Across a comprehensive set of experiments, we find that, like humans, LLMs strongly prefer sentences adhering to animacy-related constraints. These similarities are not limited to typical animacy: the behavior of LLMs encountering atypically animate entities also aligns with that of humans, both in the surprise shown on first exposure and in the significantly reduced surprise after adaptation.
Our findings can be summarized as follows:
  • By proposing a systematic analysis based on the prompting of LLMs, we analyze the animacy effect, extending and confirming the results obtained from previous contributions.
  • In particular, using psychological tests designed for humans, we observe that the LLMs not only prefer sentences that adhere to animacy constraints, as shown in [11], but also adapt to atypical scenarios just as humans do [14].
  • Finally, we demonstrate that prompted LLMs not only exhibit robust behavior comparable with that of humans but also deliver answers that closely match the expected response format.

2. Related Work

The intricate relationship between animacy and language has long been a subject of interest in cognitive linguistics. Animacy, often visualised as a spectrum, not only influences linguistic structures but also shapes our cognitive interpretations of entities based on their perceived liveliness. There is a long line of research on animacy in natural language, as discussed in Section 2.1. Parallel to these linguistic explorations, the NLP community has witnessed the rise of LLMs and, in particular, instruction-tuned LLMs (Section 2.2). These models have showcased emergent abilities, revolutionising the approach to NLP challenges.
Nevertheless, how well do these state-of-the-art models handle such linguistic properties? In the analytic scenario introduced in Section 2.3, we adopt a psycholinguistic lens, treating these models as subjects to evaluate their understanding and processing of animacy. This methodology, while novel, draws inspiration from traditional human cognitive tests, aiming to bridge the gap between human cognition and machine understanding. We further reflect upon previous works that have touched upon this intersection, aiming to push the boundaries of our understanding. By juxtaposing the abilities of LLMs with human cognitive processes, this paper seeks to provide a comprehensive overview of how modern models interpret, process, and respond to varying degrees of animacy in language.

2.1. Animacy in Natural Language

The role of animacy within cognitive processes is defined as a spectrum or gradient [3,15]. Linguistically, this is represented as either a three-tiered hierarchy (humans > animals > objects) or a binary distinction (humans and animals > objects). Entities are differentiated based on their position in this hierarchy, either syntactically or morphologically [16].
Animacy can be identified at both the broad category and the specific instance levels. Similarly, linguistic animacy is not solely grounded in biological factors but also hinges on the speaker’s emotional connection and empathy towards the specific entity [17].
The impact of animacy in language is not uniform across different languages; it can range from explicit markers of animacy to more subtle influences, as seen in English. Such subtleties encompass strict constraints based on animacy [18,19] as well as nuanced grammatical impacts [20,21]. For instance, sentences more frequently begin with animate entities, even if this results in less conventional structures like the passive voice [22,23]. In this study, we emphasize the distinction between humans and inanimate objects and the ensuing constraints based on animacy. Such a pronounced differentiation is anticipated to yield more discernible effects in LLMs.

2.2. Large Language Models

GPT-3 [24] was a major milestone in the development of LLMs, building on earlier transformer-based architectures such as GPT-2 and BERT. Since then, numerous LLMs have been proposed, including both open-source models—such as Llama [10]—and proprietary ones like GPT-4 [8]. Compared with smaller-scale language models, contemporary LLMs have demonstrated a range of emergent abilities [25], including zero-shot multi-task learning and few-shot in-context learning with chain-of-thought reasoning [26].
At the same time, the standard fine-tuning pipelines changed. Ouyang et al. [27] trained GPT-3 with corpora composed of instruction-following demonstrations to make LLMs more scalable and improve zero-shot performance. As a result, InstructGPT, GPT-3.5, and GPT-4 [8] were each trained with particular techniques, though all build on the idea of Ouyang et al. [27]. These newly tuned models, called instruction-tuned LLMs (It-LLMs), are achieving outstanding performances in tasks related to logical reasoning [28,29], common sense [30,31], and social knowledge [32,33]. We therefore investigate whether their abilities could be compared with humans’ in detecting animacy.

2.3. Large Language Models as Test Subjects

In our research, we analyze the performance of different It-LLMs (which, for brevity, we will simply call LLMs) by employing them as subjects within a psycholinguistic framework, an increasingly adopted methodology in the field. This approach evaluates the LLMs by analyzing the responses produced when asking several questions for each sentence, just as would be done with human participants. The resulting responses are then used to analyze acceptability judgments. Earlier work in this line presents (non-large, non-instruction-tuned) language models with pairs of sentences, expecting the models to assign a higher probability to the linguistically more plausible one. Such techniques have previously been used to probe LMs’ understanding of constructs such as negation, subject–verb agreement, and others, as shown by works such as Ettinger [34], Sinclair et al. [35], and Warstadt et al. [11].
Recent studies have compared the performance of large language models (LLMs) with human cognition by employing surprisal—the negative log probability of a sequence—as a proxy for cognitive effort during text comprehension [36]. Notably, LLMs have demonstrated considerable versatility, exhibiting strong correlations with measures such as reading time, eye-movement patterns, and EEG responses [37,38,39]. Moreover, more robust language models show enhanced predictive accuracy with respect to surprisal [40,41].
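As a minimal, self-contained illustration of the surprisal measure mentioned above (not code from the cited studies), surprisal can be computed directly from per-token conditional probabilities; the probability values below are invented placeholders.

```python
import math

def surprisal(token_probs):
    """Surprisal of each token: -log2 P(token | preceding context).

    token_probs: conditional probabilities, one per token, as returned by
    any autoregressive language model (placeholder values are used below).
    """
    return [-math.log2(p) for p in token_probs]

# Hypothetical probabilities for the tokens of a short sentence.
probs = [0.21, 0.35, 0.08, 0.002, 0.6]
print(surprisal(probs))  # the lowest-probability token yields the highest surprisal
```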

2.4. Animacy in LLMs

Previous works investigated the abilities of language models (LMs) in processing animacy, but the emphasis has predominantly been on conventional animacy. Warstadt et al. [11] explored the phenomenon of animacy within the BLiMP framework. Later, Kauf et al. [42] investigated overall event knowledge in LLMs, concluding that LMs are adept at discerning animacy with respect to selective constraints. Our study explores atypical animacy, introducing a surprise score to emulate the pioneering studies of Nieuwland and Van Berkum [14], who concentrated on human N400 responses to non-traditional animacy. Simultaneous research by Michaelov and Bergen [39] and Michaelov et al. [36] revisited one of these experiments in its original Dutch context. In contrast, our effort encompasses a comprehensive replication of experiments from [43]. Such endeavours underscore scenarios where models efficiently grasp overarching trends but stumble upon intricate details. Furthermore, we aim to analyze the correlation between an LM’s robustness and predictive accuracy by evaluating a broad spectrum of LLMs.

2.5. Our Contribution

The rise of LLMs and their use in a vast range of tasks has reshaped the analysis pipeline. Complementing the earlier foundational work of Michaelov et al. [36], Truong et al. [41], and Hanna et al. [13], our work goes beyond the state of the art in the following ways:
  • We propose a systematic prompting pattern and analyze natural language responses as one would with human participants.
  • In particular, we propose a prompting approach for estimating the LLMs’ understanding of the acceptability and plausibility of concepts related to animate and inanimate entities. We then analyze and compare the results with human baselines from previous contributions.
  • Furthermore, we propose an approach based on a series of progressive prompts to simulate the estimation of the N400 neurological response. Here, by placing LLMs in atypical contexts with animated entities, we show similarities to the results of tests performed on humans.
  • Finally, we conclude the contribution by showing that the prompting approach is affected by only a minor bias, which allows fair analogies between the results obtained by LLMs and previous findings.

3. Models and Methods

In order to investigate whether large language models (LLMs) are able to understand and generate language in a way that reflects human expectations, we need to understand whether they can approximate human knowledge of words and the cognitive processes behind them. Hence, using LLMs as subjects (Section 3.1), we study whether such behaviors are manifested in experiments originally conceived for studying animacy in humans (Section 3.2). We then propose a systematic prompt-based approach for LLMs, whose results we discuss in Section 3.3. Finally, we outline a general discussion of the findings in Section 3.4.

3.1. The Subjects in Three Families of Language Models

The animacy effects behind state-of-the-art large language models are analyzed by systematically prompting three groups of models:
  • Two subjects from the OpenAI family [8]: GPT-3.5 and GPT-4;
  • Three subjects from the Meta family [10]: Llama2-chat-7b, -13b, and -70b;
  • Two subjects from the Mistral family [44,45]: Mixtral8x7b and Mistral-7b.
To simplify the discussion, we omit “chat” for the Meta models and “b” for the Mistral models. The resulting names are Llama2-7, -13, and -70, and Mixtral and Mistral-7. We use both open-source models—the Meta family—to make our work more reproducible and closed-source models—the OpenAI family—because they demonstrate outstanding performance in many NLP tasks.
Finally, as we describe in each experiment, we evaluate performance using accuracy, following Wei et al. [25] and Kojima et al. [46]. In particular, we compute string matching between the final answers and the target values. The top-p parameter is set to 1 in all processes. We select the sampling temperature in [0, 1] and repeat each experiment three times.
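The sketch below illustrates this scoring procedure under the assumption that generations have already been collected; the helper names and toy answers are ours, not the authors’ code.

```python
from statistics import mean, stdev

def string_match(generation: str, target: str) -> bool:
    """Heuristic match: the target label must appear in the generated answer."""
    return target.lower() in generation.lower()

def accuracy(generations, targets):
    """Percentage of generations whose final answer matches the target value."""
    hits = sum(string_match(g, t) for g, t in zip(generations, targets))
    return 100.0 * hits / len(targets)

# Toy illustration of three repeated runs (placeholders for real model outputs).
targets = ["(A)", "(B)", "(A)"]
runs = [
    ["The acceptable example is (A).", "(B)", "I would choose (A)."],
    ["(A)", "(B) is acceptable.", "(B)"],
    ["(A)", "(B)", "(A)"],
]
scores = [accuracy(run, targets) for run in runs]
print(f"accuracy: {mean(scores):.1f} +/- {stdev(scores):.1f}")
```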

3.2. Selected Experimental Settings

In order to adapt our “subjects” to the experimental settings proposed for humans, we discern between two different types of experiments: (1) typical animacy (Section 3.2.1) and (2) atypical animacy (Section 3.2.2). The two kinds of experiments are needed because, from the point of view of LLMs, typical animacy is more a lexical task, whereas atypical animacy is more a contextual task.

3.2.1. Typical Animacy

In typical animacy experiments, subjects are prompted to determine which word in a pair is animate and which is not (e.g., whether “frogs” are animate and “mountains” are not). Hence, we propose two different settings:
In BLiMP [11], we select two sub-tasks: transitive-animate and passive-animate. Each sub-task has 1000 pairs of synthetic English sentences that are very similar but differ by only one or two words (Table 1).
Meanwhile, in the Benchmark of Sentence Plausibility (BSP), we use sentences containing plausible and implausible words with different nuances. The resource contains 1500 synthetic sentences in English. Each sentence has a fixed initial part and an interchangeable final part chosen among animate plausible, animate implausible (inherent and non-related), and inanimate implausible (inherent and non-related) words (Table 2).
From the point of view of our subjects, that is, LLMs, this psychological experiment is translated into a lexical task.

3.2.2. Atypical Animacy

In contrast to Section 3.2.1, to investigate if the treated subjects are able to detect animacy without relying on lexical information of the target word, we use two studies—a repetition study and a contextual study—where inanimate entities are treated as animated entities [14]. This shifts the focus from the lexical knowledge of the target word to the contextual knowledge. The human experiments are based on N400, a brain response measured by EEG that rises when processing semantically anomalous input.
The repetition study measured participants’ N400 responses while reading cartoon-like stories in which a typically inanimate entity behaved as animate (Table 3). Nieuwland and Van Berkum [14] found that although initially surprised by the atypically animated entity, participants quickly adapted, producing increasingly lower N400 responses.
The contextual study takes its measurements only after a contextualisation phase, since the repetition experiment shows similarities with the work of Caramazza and Shelton [43]. These contexts are given as in Table 4, where people are asked to read the story with one of the targets alternatively.
Hence, these experiments are useful for investigating the following questions: Can LLMs adapt to animated entities at the token level despite being typically inanimate? Or is animate processing limited to a simple type-level understanding? We replicate these studies with LLMs to answer this question, using their surprise to model N400 responses.
Then, we conduct two different experiments. In the first experiment presented in Section 3.3.3, we reproduce the repetition and context introduced in [14]. In a second experiment (Section 3.3.4), we analyze the impact of context adaptation as proposed in [48]. For each experiment, we introduce the original study and the methods we used for the context adaptation of LLMs. Finally, we report our empirical results and compare them with those of the original study.

3.3. Experimenting with LLM Subjects

3.3.1. Experiment 1: Typical Animacy on BLiMP

Prompt Definition
By constructing a series of prompts over the datasets presented in Section 3.2.1, we systematically test the models’ answers to animacy in situations where the animacy of an instance aligns with its more general type.
Sentence pairs in BLiMP [11] are built as follows: one sentence respects the animacy constraints, and the other violates them. Hence, there is a straightforward way to evaluate the LLM’s ability to pass the animacy test. We ask them to answer the following prompt:
Choose which example is acceptable between A and B.
(A) Galileo is concealed by the woman.
(B) Galileo is concealed by the horse.
Answer:
A model scores a correct answer if it chooses the sentence that respects the animacy constraint.
Following this approach, we evaluate the accuracies by performing a string matching between the generated answers and the target values on both sub-tasks.
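The following sketch shows how such a prompt can be assembled from a BLiMP minimal pair and how the string-matching check can be applied to a generated answer; the function names and the exact matching rule are illustrative assumptions, not the authors’ implementation.

```python
def build_blimp_prompt(sentence_a: str, sentence_b: str) -> str:
    """Assemble the A/B acceptability prompt used in this sketch."""
    return (
        "Choose which example is acceptable between A and B.\n"
        f"(A) {sentence_a}\n"
        f"(B) {sentence_b}\n"
        "Answer:"
    )

def score_choice(generation: str, gold_label: str) -> bool:
    """Correct if the answer mentions the gold option and not the other one."""
    other = "(B)" if gold_label == "(A)" else "(A)"
    return gold_label in generation and other not in generation

prompt = build_blimp_prompt(
    "Galileo is concealed by the woman.",  # respects the animacy constraint
    "Galileo is concealed by the horse.",  # violates it
)
print(prompt)
print(score_choice("The acceptable example is (A).", "(A)"))  # True
```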
Results
The results of this first experiment set are intriguing. The OpenAI family seems to behave on par with humans, and the Meta and Mistral families are catching up. Figure 1 shows the results of each model (vertical bars) and the results obtained by humans (horizontal dashed lines), as presented in [11]. The accuracy metric is the percentage of examples in which human or artificial subjects preferred the acceptable sentence of the given pair.
Human transitive accuracy is reached by GPT-3.5 and topped by GPT-4. This seems to suggest that these models can handle the lexicon to determine typical animacy. Llama lags behind, but it is reaching the human level.
Similarly, in the passive scenario, GPT-4 performs very close to humans. Regarding the other models, GPT-3.5, Llama2-70, and Mixtral have comparable, slightly lower performance in the transitive scenario and significantly lower performance in the passive scenario. Finally, the smaller models, i.e., those with fewer parameters, underperform compared to humans, with average gaps of 20 points.
This difference between the transitive and passive case may be due more to differences in setting than to different animacy processing in the two scenarios. However, it should be noted that the composition of the choices is strongly class-related. Indeed, in the passive case, the target word, i.e., the most influential one, is always in the last position. In contrast, the target word is not the last token in the transitive case. Thus, heuristics related to the models’ sensitivity to the structure of the input prompts may be present.

3.3.2. Experiment 2: Typical Animacy on BSP

Prompt Definition
In the second experiment, we structure the prompting phase similarly. Hence, by constructing a series of prompts over the second benchmark, i.e., the BSP, presented in Section 3.2.1, we systematically test the model’s responses in situations in which plausible and implausible sentences with animated and non-animated components were provided.
Following the approach of Vega-Mendoza et al. [47], we analyze the model’s answers to the plausibility question on five different inputs constructed from a sentence and completed with different types of words. This gives a straightforward way to evaluate our subjects’ ability to pass the plausibility test. Consider the following prompt:
Is the following sentence plausible? Answer by choosing (Yes) or (No).
Sentence: In ancient Egypt the people were governed by the pyramid.
Answer:
A model scores a correct answer if it responds with Yes or No in accordance with the plausibility of the sentence.
The accuracy of the LLMs is computed in this way: for the plausible control word, the accuracy counts the percentage of Yes, and for all the implausible words, it counts the percentage of No.
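A possible implementation of this per-condition scoring rule is sketched below; the condition labels and example records are hypothetical placeholders.

```python
def expected_answer(condition: str) -> str:
    """Plausible control sentences expect 'Yes'; all implausible variants expect 'No'."""
    return "Yes" if condition == "control" else "No"

def bsp_accuracy(records):
    """records: (condition, generated_answer) pairs for one model."""
    per_condition = {}
    for condition, answer in records:
        hit = expected_answer(condition).lower() in answer.lower()
        per_condition.setdefault(condition, []).append(hit)
    return {c: 100.0 * sum(v) / len(v) for c, v in per_condition.items()}

# Toy records (placeholders for real generations).
records = [
    ("control", "Yes, the sentence is plausible."),
    ("animate-related", "No."),
    ("inanimate-related", "Yes, it is plausible."),  # counted as an error
]
print(bsp_accuracy(records))
```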
Results
In this second experiment set, LLMs of the OpenAI family behave similarly to humans as in the previous one. Figure 2 shows the accuracy results of each model and the results obtained by humans, as presented in [47].
Dealing with animated words (see Figure 2), humans and LLMs behave similarly. Indeed, humans perform on animate-unrelated similarly to how they perform on control words. Instead, they are less able to recognize animate-related as making target sentences implausible. The same happens for all the LLM subjects. GPT-4 performs better than humans, and it keeps the difference in recognizing the implausibility of sentences built with animated-unrelated and with animated-related words.
Moreover, when dealing with unanimated words (see the right plot in Figure 2), humans and LLMs behave similarly. Humans recognize the implausibility of unanimated-unrelated words but have a slight decrease in recognizing unanimated-related ones. The same trend happens for all the LLM subjects and, consistently in other experiments, the OpenAI family performs better than humans.
In humans, these differences in the plausibility of animate and non-animate cases arise from a combination of cognitive factors, as explained by Vega-Mendoza et al. [47]. Consequently, as in the experiments in Section 3.3.1, GPTs perform comparably to humans and sometimes outperform them. However, even in this task, there is a strong structural confound: the target words, i.e., those that determine the final decision, are always in the last position. Therefore, there may be a heuristic related to the models’ sensitivity to the structure of the choices in the input prompts.

3.3.3. Experiment 3: Atypical Animacy—Repetition

Human Experiment and Its Results
The repetition experiment on atypical animacy [14] measures the N400 responses of participants who listened to Dutch stories containing a typically animate or an inanimate entity behaving as if it were a human being. The N400 values are measured at three stages: the first ( T 1 ), the third ( T 3 ), and the fifth ( T 5 ) mention of the entity (see, for example, Table 3 with confectioner and apple pie). Nieuwland and Van Berkum [14] discovered the following:
  • In the case of animate entities, participants have a moderate N400 response to the first mention ( T 1 ) and a low response to subsequent mentions ( T 3 and T 5 );
  • In the case of inanimate entities, participants initially ( T 1 ) have a high N400 response to the atypically animated entity, and, as the mentions progress ( T 3 and T 5 ), their N400 responses become very close to the responses for the animate entity.
Thus, while humans are initially surprised by the atypically animated entity, they quickly adapt to the situation and no longer find it surprising. Moreover, the authors show that these responses do not derive from lexical repetition but from context: in the contextual experiment, a full context is provided first, and the participants’ N400 responses are estimated only at the end, yielding low responses for inanimate entities in atypically animate contexts.
Prompt Definition
In order to estimate a surprise value analogous to the N400, state-of-the-art studies examine token probability values. However, some of the models used in our study do not provide access to probability values, prompting us to define a series of prompts to query the model systematically about its level of surprise. In particular, for each of the 60 examples, we estimate the surprise for the animate and the inanimate entity given the context at each time-step, denoted as T n (which, in our case, refers to an input-prompt). For instance, to model the inanimate N400 response at T 1 in the example from Table 3, we construct the following input-prompt:
Choose a surprising value from 0 to 30 on the following story:
A granny met the apple pie at the market with whom she started a pleasant conversation about recipes.
Answer:[num]
Following the time-steps, we introduce additional prompts by contextualizing the preceding story, whether animate or inanimate. For example, for the inanimate scenario:
Given the following story:
A granny met the apple pie at the
market with whom she started
a pleasant conversation about recipes.
The apple pie confided to.............
Choose a surprising value from 0 to 30 on the following story:
The apple pie that this was the ultimate recipe and apologized for the misplaced
distrust.
Answer:[num]
Hence, we compute the average surprise value of examples containing animate and inanimate entities separately at each time-step.
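A small sketch of this aggregation step, assuming the surprise values have already been parsed from the generations; the triples below are invented placeholders.

```python
from collections import defaultdict
from statistics import mean

def average_surprise(responses):
    """responses: (time_step, animacy, surprise_value) triples, e.g. ("T1", "inanimate", 27).

    Returns the mean surprise per (time_step, animacy) pair.
    """
    buckets = defaultdict(list)
    for time_step, animacy, value in responses:
        buckets[(time_step, animacy)].append(value)
    return {key: mean(values) for key, values in buckets.items()}

# Toy values (placeholders for parsed outputs over the 60 stories).
responses = [
    ("T1", "inanimate", 27), ("T1", "animate", 9),
    ("T5", "inanimate", 8),  ("T5", "animate", 7),
]
print(average_surprise(responses))
```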
Results
The LLMs follow the general trends of human N400 responses, as shown in Figure 3. Indeed, as reported in Section 3.3.3, human N400 responses for animate and inanimate critical words diverge at T 1 and come closer at T 3 and T 5 . LLM subjects behave similarly. In fact, at T 1 , models are surprised by the inanimate entity but not by the animate one. At later steps ( T 3 and T 5 ), the surprise for inanimate entities decreases until it reaches levels similar to those for animate entities. LLMs seem to adapt, just as humans do. However, the raw results do not show that the models adapt to the same extent as humans.
We use the Wilcoxon signed-rank test for non-normally distributed data to make the experiments robust, as performed in [14]. We then compare the surprise values at each time step. As with humans, LLMs show a statistically significant difference between the surprise for animate and inanimate entities at T 1 . However, while for humans there is no difference at T 3 , differences (p < 0.01) remain in most models; only the largest do not show any. At T 5 , differences disappear only in the large models. Although the models can generally approximate human N400 responses to atypical animacy trends, only the largest and most capable fully replicate human adaptation.
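For reference, a paired Wilcoxon signed-rank test of this kind can be run with SciPy as sketched below; the surprise values are invented placeholders, not the values underlying Figure 3.

```python
from scipy.stats import wilcoxon

# Paired surprise values per story at T1 (toy placeholders): one value for the
# animate version and one for the inanimate version of the same story.
animate_t1 = [9, 8, 11, 7, 10, 9, 12, 8]
inanimate_t1 = [26, 24, 29, 22, 27, 25, 30, 23]

stat, p_value = wilcoxon(animate_t1, inanimate_t1)
print(f"T1: W={stat:.1f}, p={p_value:.4f}")  # a small p indicates a reliable difference
```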

3.3.4. Experiment 4: Atypical Animacy—Context Experiment

Human Experiments and Results
In the context experiments, Nieuwland and Van Berkum [14] discover that contextual appropriateness seems to neutralize animacy violations; that is, non-appropriate adjectives (such as “worried” in Table 4) do not generate much surprise if the context suggests them. Moreover, context can even make an animacy-violating predicate more preferred than an animacy-obeying canonical predicate if the context justifies it.
Prompt Definition
By using examples from the context experiment (as in Table 4), we ask for a surprise value for the animate and inanimate adjectives in each of the 60 stories proposed by Nieuwland and Van Berkum [14]. In particular, we use input-prompt structures similar to the previous ones:
Given the following context:
+Context: A girl told a sandwich that an attack was imminent. The sandwich wailed that his family was in danger. The girl told the sandwich that public places were the most dangerous. The sandwich immediately started calling everyone he knew.
Baseline: Choose a surprising value from 0 to 30 on the following story:
The sandwich was delicious and wanted to make sure none of his loved ones were in danger.
Answer:[num]
To estimate absolute values, we also ask for baseline surprises, that is, those of the inanimate adjective without the context of the whole story.
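A minimal sketch of how baseline and in-context surprise values can then be compared; the numbers are illustrative placeholders that merely mirror the trends reported in the Results below.

```python
def surprise_drop(baseline: float, in_context: float) -> float:
    """Reduction in surprise once the full story context is provided."""
    return baseline - in_context

# Toy averages (placeholders for model outputs over the 60 stories).
conditions = {
    "animate adjectives": {"baseline": 20.0, "in_context": 16.0},
    "inanimate adjectives": {"baseline": 15.0, "in_context": 6.0},
}
for name, values in conditions.items():
    print(name, "drop:", surprise_drop(values["baseline"], values["in_context"]))
```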
Results
Even in this experiment (see Figure 4), LLM subjects behave similarly to human subjects. The animate baseline is larger for all four subjects than the inanimate baseline. Even in the baseline, there is conflicting information as to the presence or absence of animacy of the selected subject. Moreover, the surprise drop with the context is more significant with unanimated than animated adjectives. This is in line with the human experiments.

3.4. General Discussion

The four experiments deliver a coherent message: despite lacking embodiment and sensory experience, LLMs exhibit human-like patterns in their processing of animacy. LLMs demonstrate sensitivity to animacy constraints in the two typical animacy tasks (Experiments 1 and 2), showing surprise in response to violations involving both animate and inanimate entities—closely mirroring human behavior. Notably, models from the OpenAI family display the most consistent alignment with human responses. The atypical animacy tasks (Experiments 3 and 4) further challenge the assumption that LLMs rely solely on lexical associations. In these settings, LLMs are tested on their degree of surprise when inanimate entities are depicted as behaving like animate beings. These experiments are designed to assess whether models adapt their responses when contextual information is progressively introduced. Crucially, the observed reduction in surprise across successive prompts suggests that this is not merely a lexical phenomenon. As in the typical tasks, LLMs exhibit a human-like trend: their surprise diminishes more markedly for inanimate subjects, indicating an ability to integrate context in ways that approximate human adaptation.

3.5. Error Analysis

Although we observed human-like behaviors, as detailed in the preceding sections, the results of our experiments ultimately depend on the model generations produced by the LLMs introduced in Section 2.3. To ensure the stability of our findings, we report standard deviations computed across multiple generations. In what follows, we illustrate the evaluation procedures adopted for each experiment, focusing on the error analysis to provide a transparent account of the robustness and reliability of the reported outcomes.

3.5.1. Multiple Choice Question

Experiment 1 and Experiment 2, presented in Section 3.2.1, are based on a robust evaluation pipeline. In the first case, the task is framed as a multiple-choice question, and the evaluation relies on a heuristic based on string matching between the generated response and the correct option, following the approach proposed in [26,49]. A similar method is applied in Experiment 2, where the models are required to respond with a strict Yes or No. Here, similarly, evaluation is carried out through string matching between the model’s output and the expected response. Hence, the LLMs were stimulated to generate well-formed responses. In Appendix A, we show that the total percentage of responses that do not match the defined string matching heuristics is negligible, which confirms the robustness of the results obtained. In particular, we estimated a maximum misleading response rate of about 0.5% and 0.6% (see Table A5) and 2.5–3% (see Table A6), which does not affect the final results. Examples of generation can be seen in Appendix C and Appendix D.

3.5.2. Number Generation

Prompts based on multiple-choice questions or strict answers such as “Yes” or “No” are easier to control and analyze. However, Experiment 3 and Experiment 4 involve numbers. To manage and control the sensitivity of the prompts, as proposed in Experiments 3 and 4, we added the keyword “[num]” (see Section 3.3.3 and Section 3.3.4). Similarly, in order to produce a complete and robust analysis, we estimated the final values by carefully analysing both numerical and non-numerical outputs. We used the library word2number to convert spelled-out numbers in the generations into integer values. As displayed in Appendix B, the answers containing spelled-out numbers are a small minority and do not affect the final evaluations. Finally, the [num] keyword seems to have directed the generation correctly, as reported in the examples shown in Table A7 and Table A8.
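The parsing step described above can be sketched as follows; the digit-first rule and the punctuation cleanup are our assumptions, while word_to_num is the entry point of the word2number library used in the paper.

```python
import re
from word2number import w2n

def parse_surprise(generation: str):
    """Extract the surprise value from a generation: prefer digits, fall back to
    spelled-out numbers, and return None when nothing can be parsed."""
    digits = re.search(r"\d+", generation)
    if digits:
        return int(digits.group())
    cleaned = re.sub(r"[^a-zA-Z\s]", " ", generation)  # drop punctuation before word2number
    try:
        return w2n.word_to_num(cleaned)
    except ValueError:
        return None

print(parse_surprise("Answer: 22"))                       # 22
print(parse_surprise("I would say twenty two"))           # 22
print(parse_surprise("This story is quite surprising."))  # None
```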

4. Future Works

In future work, we intend to investigate how animacy is represented and reasoned about in smaller language models, particularly those that have undergone teacher–student alignment processes [30,50]. By examining models trained through distillation or alignment from larger, more capable teacher models, we aim to assess the extent to which nuanced conceptual features, such as animacy, are retained or transformed. Beyond English, we plan to extend our experimental framework to include models that have been subjected to explicit multilingual alignment methodologies [51,52]. This will enable us to assess the effectiveness of alignment strategies in improving reasoning performance and cross-lingual generalisation across a broader range of languages. Finally, another promising line of research is exploring the effects of animacy in tasks related to social contexts [33,53].

5. Conclusions

Large language models (LLMs) have demonstrated impressive capabilities in recognising and exploiting cognitive patterns, often outperforming humans in tasks that rely on consistency and repetition. However, determining whether these systems can transcend pattern recognition to engage in more nuanced, human-like reasoning remains a central and compelling question. In this work, we approached LLMs as subjects in psycholinguistic experiments, probing their ability to process the concept of animacy, an inherently cognitive and socio-linguistic construct. Strikingly, our results reveal that LLMs not only approximate human behaviour in contexts governed by clear lexical constraints but also exhibit adaptive responses in more complex, context-rich scenarios where lexical cues alone are insufficient. This suggests a form of flexibility that, while not equivalent to human cognition, points toward emerging parallels. Despite their exclusive training on textual data, the models tested consistently aligned with human responses when confronted with canonical and unconventional animacy uses. This capacity for contextual adaptation underscores the potential of instruction-tuned LLMs to engage with higher-order linguistic and cognitive phenomena. Nonetheless, while these findings are promising, they also highlight the limitations of current models. LLMs are still far from fully understanding the social and cognitive dimensions underpinning human language use. Bridging this gap requires continued empirical scrutiny and the integration of broader cognitive grounding, beyond text alone. Ultimately, achieving models that truly mirror human-like reasoning will depend on combining their extensive linguistic competence with a deeper, more embodied understanding of the social world they aim to reflect.

Author Contributions

Conceptualization, L.R. and G.P.; methodology, L.R. and G.P.; software, L.R.; validation, L.R., G.P., and F.M.Z.; formal analysis, L.R. and G.P.; investigation, L.R., G.P., and F.M.Z.; resources, L.R. and G.P.; writing—original draft preparation, L.R. and G.P.; supervision, L.R. and F.M.Z.; project administration, L.R. and G.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable. We reported comparative results from previous studies and did not do any kind of analysis on humans.

Data Availability Statement

The data and models used are all available and can be supplied free of charge on request. All techniques have been reported in Section 3 and the appendices.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Error Analysis Strict Answers

Table A1. Percentage over 1000 instances for the transitive and passive sub-task (Section 3.3.1) of generations that do not contain one of the prompted choices. Table A5 shows two examples of outputs.
Type | GPT-4 | GPT-3.5 | Llama2-70 | Llama2-13 | Llama2-7 | Mixtral | Mistral-7
Transitive | 0.1% | 0.1% | 0.2% | 0.3% | 0.4% | 0.3% | 0.4%
Passive | 0.1% | 0.1% | 0.2% | 0.3% | 0.5% | 0.2% | 0.5%
Table A2. Percentage over 4500 instances for the animated and unanimated sub-task of generations that do not contain (Yes) or (No) as explained in Section 3.3.2. Consequently, it is difficult to assess the answer automatically.
Type | GPT-4 | GPT-3.5 | Llama2-70 | Llama2-13 | Llama2-7 | Mixtral | Mistral-7
Animated | 0.5% | 1% | 2.5% | 2.5% | 3% | 3% | 3.5%
Unanimated | 0.8% | 1.5% | 2% | 2% | 2.5% | 3% | 3%

Appendix B. Appendix Error Analysis Numeric Answers

Table A3. Number of generations that do not contain numerical values and, in brackets, that do not contain words meaning numbers. The total instances (sentences) are 60 for each time-step, as introduced in Section 3.3.3.
Time-step | Type | GPT-4 | GPT-3.5 | Llama2-70 | Llama2-13 | Llama2-7 | Mixtral | Mistral-7
T 1 | Animate | 2 (0) | 1 (0) | 2 (0) | 7 (1) | 8 (1) | 3 (0) | 4 (0)
T 1 | Inanimate | 2 (0) | 4 (0) | 5 (1) | 7 (1) | 8 (1) | 4 (0) | 5 (1)
T 2 | Animate | 0 (0) | 1 (0) | 2 (0) | 3 (0) | 3 (0) | 1 (0) | 3 (0)
T 2 | Inanimate | 1 (0) | 4 (0) | 3 (1) | 2 (1) | 8 (1) | 4 (0) | 5 (1)
T 3 | Animate | 0 (0) | 0 (0) | 1 (0) | 1 (0) | 1 (0) | 1 (0) | 1 (0)
T 3 | Inanimate | 0 (0) | 0 (0) | 0 (0) | 1 (0) | 2 (0) | 2 (0) | 0 (0)
Table A4. Number of generations that do not contain numerical values and, in brackets, that do not contain words meaning numbers. The total number of instances analyzed (sentences) is 60, as introduced in Section 3.3.4.
Type | GPT-4 | GPT-3.5 | Llama2-70 | Llama2-13 | Llama2-7 | Mixtral | Mistral-7
Animate | 2 (0) | 3 (0) | 7 (0) | 8 (1) | 8 (2) | 4 (1) | 5 (1)
Inanimate | 1 (0) | 3 (0) | 6 (1) | 8 (1) | 9 (0) | 2 (0) | 5 (1)

Appendix C. Answers for Experiment 1

Table A5. Examples of answers generated by GPT-3.5 and Llama-2-13-chat. As introduced in Section 3.3.1 and shown in Table A1, the answers contain the choices prompted.
Prompt (given to both models):
Choose which example is acceptable between A and B.
(A) Hospitals aren’t biked to by the man.
(B) Hospitals aren’t biked to by the mouth.
Answer:
Answer generated by GPT-3.5-turbo: The acceptable example is (A) Hospitals aren’t biked to by the man. In English, “biking” typically refers to the activity of riding a bicycle. It is logical to say that a man could ride a bicycle to a hospital. However, it doesn’t make sense to say that “the mouth” could bike to a hospital, as the mouth is not capable of such physical activity. Therefore, option A is the appropriate choice because it aligns with the expected action and capability of the subject (“the man”).
Answer generated by Llama-2-13-chat: Explanation: In this sentence, “mouth” is a noun that refers to the opening through which food and air pass when we breathe. It doesn’t make sense to say that hospitals are biked to by the mouth because hospitals are buildings where medical treatment is provided, and they cannot be reached or accessed through the mouth.

Appendix D. Answers for Experiment 2

Table A6. Examples of answers generated by the prompted models. Also in this case, as described in Section 3.3.2, the prompts are well formed and the outputs are generated with low misleading rates (Table A2).
Prompt (given to both models):
Is the following sentence plausible? Answer by choosing (Yes) or (No).
Sentence: At the club the cocktails are served by the tonic
Answer:
Answer 1 (generated by Llama-2-7-chat): No, the sentence is not plausible. Cocktails are typically served by bartenders or servers, not by tonic, which is a type of mixer. Tonic water is a beverage ingredient used in cocktails, but it does not serve cocktails itself. Therefore, the sentence does not make sense in the context of cocktail service.
Answer 2 (generated by Llama-2-7-chat): Yes, the sentence is plausible. The word “tonic” could refer to a type of alcoholic drink, such as a gin and tonic, which is commonly served at clubs. In this context, the sentence makes sense and is a common way to describe how cocktails are served at a bar or nightclub.

Appendix E. Answers for Experiment 3

Table A7. This example shows the answers provided by GPT-4 and Mixtral8x7 using the prompt defined in Section 3.3.3. The structure of the input and the keyword ([num]) elicit the prompted LLMs to generate a number in digits as output.
Choose a surprising value from 0 to 30 on the following story: A museum visitor asked the map where the small collection of Van Gogh paintings was located.
Answer: 27 (GPT-4)
Answer: 22 (Mixtral8x7)
Given the following story:
A museum visitor asked the map where the small collection of Van Gogh paintings was located. The map was willing to take her there, as there was nothing to do. The museum visitor envied the map because he was allowed to see so much beauty every day. But the map laughingly explained that he always took a Game Boy with him when he was bored. The museum visitor teased the map, joking that he had to be an art barbarian to ever get bored in this museum.
Choose a surprising value from 0 to 30 on the following story: The map joked that he had seen enough art for the next ten years.
Answer:[num]
6. (answer generated by GPT-4)
7. Considering the previous answer, the map’s response is less surprising but still slightly unexpected. A surprising value of 7 out of 30 reflects the diminished unexpectedness of the map’s attitude, while still acknowledging the earlier context. (answer generated by Mixtral8x7)

Appendix F. Answers for Experiment 4

Table A8. This example shows the answers provided by different probed LLMs using the prompt defined in Section 3.3.4. Also in this experiment, the structure of the input and the keyword ([num]) elicit the prompted LLMs to generate a number in digits as output.
Given the following context:
A girl told a sandwich that an attack was imminent. The sandwich wailed that his family was in danger. The girl told the sandwich that public places were the most dangerous. The sandwich immediately started calling everyone he knew.
Choose a surprising value from 0 to 30 on the following story: The sandwich was delicious and wanted to make sure none of his loved ones were in danger.
Answer:[num]
20. (answer generated by GPT-3.5)
18. (answer generated by Mixtral8x7)
15. (answer generated by Llama-2-70-chat)
24. (answer generated by Llama-2-7-chat and Llama-2-13-chat)

References

  1. Mao, R.; Liu, Q.; Li, X.; Cambria, E.; Hussain, A. Guest Editorial: Cognitive Analysis for Humans and AI. Cogn. Comput. 2024, 16, 3316–3318. [Google Scholar] [CrossRef]
  2. New, J.; Cosmides, L.; Tooby, J. Category-specific attention for animals reflects ancestral priorities, not expertise. Proc. Natl. Acad. Sci. USA 2007, 104, 16598–16603. [Google Scholar] [CrossRef] [PubMed]
  3. Nairne, J.S.; VanArsdall, J.E.; Pandeirada, J.N.S.; Cogdill, M.; LeBreton, J.M. Adaptive Memory: The Mnemonic Value of Animacy. Psychol. Sci. 2013, 24, 2099–2105. [Google Scholar] [CrossRef] [PubMed]
  4. Ghisellini, R.; Pareschi, R.; Pedroni, M.; Raggi, G.B. Recommending Actionable Strategies: A Semantic Approach to Integrating Analytical Frameworks with Decision Heuristics. Information 2025, 16, 192. [Google Scholar] [CrossRef]
  5. Bulla, L.; Midolo, A.; Mongiovì, M.; Tramontana, E. EX-CODE: A Robust and Explainable Model to Detect AI-Generated Code. Information 2024, 15, 819. [Google Scholar] [CrossRef]
  6. Ranaldi, L.; Pucci, G. Knowing Knowledge: Epistemological Study of Knowledge in Transformers. Appl. Sci. 2023, 13, 677. [Google Scholar] [CrossRef]
  7. Ranaldi, L.; Fallucchi, F.; Zanzotto, F.M. Dis-Cover AI Minds to Preserve Human Knowledge. Future Internet 2022, 14, 10. [Google Scholar] [CrossRef]
  8. OpenAI. GPT-4 Technical Report. arXiv 2022, arXiv:2303.08774. [Google Scholar]
  9. Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. PaLM: Scaling Language Modeling with Pathways. arXiv 2022, arXiv:2204.02311. [Google Scholar]
  10. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar]
  11. Warstadt, A.; Parrish, A.; Liu, H.; Mohananey, A.; Peng, W.; Wang, S.F.; Bowman, S.R. BLiMP: The Benchmark of Linguistic Minimal Pairs for English. Trans. Assoc. Comput. Linguist. 2020, 8, 377–392. [Google Scholar] [CrossRef]
  12. Spiliopoulou, E.; Pagnoni, A.; Bisk, Y.; Hovy, E. EvEntS ReaLM: Event Reasoning of Entity States via Language Models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 1982–1997. [Google Scholar] [CrossRef]
  13. Hanna, M.; Belinkov, Y.; Pezzelle, S. When Language Models Fall in Love: Animacy Processing in Transformer Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; pp. 12120–12135. [Google Scholar] [CrossRef]
  14. Nieuwland, M.S.; Van Berkum, J.J.A. When Peanuts Fall in Love: N400 Evidence for the Power of Discourse. J. Cogn. Neurosci. 2006, 18, 1098–1111. [Google Scholar] [CrossRef]
  15. García, M.G.; Primus, B.; Himmelmann, N.P. Shifting from animacy to agentivity. Theor. Linguist. 2018, 44, 25–39. [Google Scholar] [CrossRef]
  16. Gass, S.M. A Review of Interlanguage Syntax: Language Transfer and Language Universals. Lang. Learn. 1984, 34, 115–132. [Google Scholar] [CrossRef]
  17. Vihman, V.A.; Nelson, D. Effects of Animacy in Grammar and Cognition: Introduction to Special Issue. Open Linguist. 2019, 5, 260–267. [Google Scholar] [CrossRef]
  18. Caplan, D.; Hildebrandt, N.; Waters, G.S. Interaction of verb selectional restrictions, noun animacy and syntactic form in sentence processing. Lang. Cogn. Process. 1994, 9, 549–585. [Google Scholar] [CrossRef]
  19. Buckle, L.; Lieven, E.; Theakston, A.L. The Effects of Animacy and Syntax on Priming: A Developmental Study. Front. Psychol. 2017, 8, 2246. [Google Scholar] [CrossRef]
  20. Bresnan, J.; Hay, J. Gradient grammar: An effect of animacy on the syntax of give in New Zealand and American English. Lingua 2008, 118, 245–259. [Google Scholar] [CrossRef]
  21. Rosenbach, A. Animacy and grammatical variation—Findings from English genitive variation. Lingua 2008, 118, 151–171. [Google Scholar] [CrossRef]
  22. Ferreira, F. Choice of Passive Voice is Affected by Verb Type and Animacy. J. Mem. Lang. 1994, 33, 715–736. [Google Scholar] [CrossRef]
  23. Fairclough, N. The language of critical discourse analysis: Reply to Michael Billig. Discourse Soc. 2008, 19, 811–819. [Google Scholar] [CrossRef]
  24. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
  25. Wei, J.; Tay, Y.; Bommasani, R.; Raffel, C.; Zoph, B.; Borgeaud, S.; Yogatama, D.; Bosma, M.; Zhou, D.; Metzler, D.; et al. Emergent Abilities of Large Language Models. arXiv 2022, arXiv:2206.07682. [Google Scholar]
  26. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv 2023, arXiv:2201.11903. [Google Scholar]
  27. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.L.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. arXiv 2022, arXiv:2203.02155. [Google Scholar]
  28. Ranaldi, L.; Pucci, G.; Haddow, B.; Birch, A. Empowering Multi-step Reasoning across Languages via Program-Aided Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; Al-Onaizan, Y., Bansal, M., Chen, Y.-N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 12171–12187. [Google Scholar] [CrossRef]
  29. Liu, H.; Ning, R.; Teng, Z.; Liu, J.; Zhou, Q.; Zhang, Y. Evaluating the Logical Reasoning Ability of ChatGPT and GPT-4. arXiv 2023, arXiv:2304.03439. [Google Scholar]
  30. Ranaldi, L.; Freitas, A. Aligning Large and Small Language Models via Chain-of-Thought Reasoning. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), St. Julian’s, Malta, 17–22 March 2024; Graham, Y., Purver, M., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 1812–1827. [Google Scholar]
  31. Bubeck, S.; Chandrasekaran, V.; Eldan, R.; Gehrke, J.; Horvitz, E.; Kamar, E.; Lee, P.; Lee, Y.T.; Li, Y.; Lundberg, S.; et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv 2023, arXiv:2303.12712. [Google Scholar]
  32. Sap, M.; Le Bras, R.; Fried, D.; Choi, Y. Neural Theory-of-Mind? On the Limits of Social Intelligence in Large LMs. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 3762–3780. [Google Scholar] [CrossRef]
  33. Ranaldi, L.; Pucci, G. When Large Language Models Contradict Humans? Large Language Models’ Sycophantic Behaviour. arXiv 2024, arXiv:2311.09410. [Google Scholar]
  34. Ettinger, A. What BERT Is Not: Lessons from a New Suite of Psycholinguistic Diagnostics for Language Models. Trans. Assoc. Comput. Linguist. 2020, 8, 34–48. [Google Scholar] [CrossRef]
  35. Sinclair, A.; Jumelet, J.; Zuidema, W.; Fernández, R. Structural Persistence in Language Models: Priming as a Window into Abstract Language Representations. Trans. Assoc. Comput. Linguist. 2022, 10, 1031–1050. [Google Scholar] [CrossRef]
  36. Michaelov, J.A.; Coulson, S.; Bergen, B.K. Can Peanuts Fall in Love with Distributional Semantics? arXiv 2023, arXiv:2301.08731. [Google Scholar]
  37. Smith, N.J.; Levy, R. The effect of word predictability on reading time is logarithmic. Cognition 2013, 128, 302–319. [Google Scholar] [CrossRef] [PubMed]
  38. Aurnhammer, C.; Frank, S. Comparing Gated and Simple Recurrent Neural Network Architectures as Models of Human Sentence Processing. In Proceedings of the Annual Meeting of the Cognitive Science Society, Madison, WI, USA, 25–28 July 2018. [Google Scholar]
  39. Michaelov, J.; Bergen, B. How well does surprisal explain N400 amplitude under different experimental conditions? In Proceedings of the 24th Conference on Computational Natural Language Learning, Online, 19–20 November 2020; pp. 652–663. [Google Scholar] [CrossRef]
  40. Goodkind, A.; Bicknell, K. Predictive power of word surprisal for reading times is a linear function of language model quality. In Proceedings of the 8th Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2018), Salt Lake City, UT, USA, 13 December 2018; pp. 10–18. [Google Scholar] [CrossRef]
  41. Truong, T.H.; Baldwin, T.; Verspoor, K.; Cohn, T. Language models are not naysayers: An analysis of language models on negation benchmarks. In Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023), Toronto, ON, Canada, 13–14 July 2023; pp. 101–114. [Google Scholar] [CrossRef]
  42. Kauf, C.; Ivanova, A.A.; Rambelli, G.; Chersoni, E.; She, J.S.; Chowdhury, Z.; Fedorenko, E.; Lenci, A. Event knowledge in large language models: The gap between the impossible and the unlikely. arXiv 2023, arXiv:2212.01488. [Google Scholar] [CrossRef] [PubMed]
  43. Caramazza, A.; Shelton, J.R. Domain-Specific Knowledge Systems in the Brain: The Animate-Inanimate Distinction. J. Cogn. Neurosci. 1998, 10, 1–34. [Google Scholar] [CrossRef] [PubMed]
  44. Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar]
  45. Jiang, A.Q.; Sablayrolles, A.; Roux, A.; Mensch, A.; Savary, B.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Hanna, E.B.; Bressand, F.; et al. Mixtral of Experts. arXiv 2024, arXiv:2401.04088. [Google Scholar]
  46. Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large Language Models are Zero-Shot Reasoners. arXiv 2023, arXiv:2205.11916. [Google Scholar]
  47. Vega-Mendoza, M.; Pickering, M.J.; Nieuwland, M.S. Concurrent use of animacy and event-knowledge during comprehension: Evidence from event-related potentials. Neuropsychologia 2021, 152, 107724. [Google Scholar] [CrossRef]
  48. Boudewyn, M.A.; Blalock, A.R.; Long, D.L.; Swaab, T.Y. Adaptation to Animacy Violations during Listening Comprehension. Cogn. Affect. Behav. Neurosci. 2019, 19, 1247–1258. [Google Scholar] [CrossRef]
  49. Zheng, C.; Zhou, H.; Meng, F.; Zhou, J.; Huang, M. Large Language Models Are Not Robust Multiple Choice Selectors. arXiv 2024, arXiv:2309.03882. [Google Scholar]
  50. Ranaldi, L.; Freitas, A. Self-Refine Instruction-Tuning for Aligning Reasoning in Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; Al-Onaizan, Y., Bansal, M., Chen, Y.-N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 2325–2347. [Google Scholar] [CrossRef]
  51. Ranaldi, L.; Pucci, G. Multilingual Reasoning via Self-training. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Albuquerque, NM, USA, 29 April–4 May 2025; Chiruzzo, L., Ritter, A., Wang, L., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 11566–11582. [Google Scholar]
  52. Ranaldi, L.; Haddow, B.; Birch, A. When Natural Language is Not Enough: The Limits of In-Context Learning Demonstrations in Multilingual Reasoning. In Findings of the Association for Computational Linguistics: NAACL 2025; Chiruzzo, L., Ritter, A., Wang, L., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 7369–7396. [Google Scholar]
  53. Ranaldi, L.; Ranaldi, F.; Fallucchi, F.; Zanzotto, F.M. Shedding Light on the Dark Web: Authorship Attribution in Radical Forums. Information 2022, 13, 435. [Google Scholar] [CrossRef]
Figure 1. Large language models’ performances on animate-transitive and -passive sub-tasks of BLiMP benchmark [11].
Figure 2. Large language models’ performances on the plausibility of BSP benchmark [47].
Figure 3. Average surprise values provided by the LLMs at input-prompt T 1 , T 3 , and finally, T 5 .
Figure 4. Average surprise values provided by the LLMs at each prompt.
Table 1. Two examples from the Transitive and Passive datasets. Each is a minimal pair of sentences: one acceptable (Yes) and one not (No).
Acceptable | Example
Sub-task: Passive
Yes | The glove was noticed by some woman.
No | The glove was noticed by some mouse.
Yes | Galileo is concealed by the woman.
No | Galileo is concealed by the horse.
Sub-task: Transitive
Yes | Beth scares Roger.
No | A carriage scares Roger.
Yes | Tanya admires Melanie.
No | Music admires Melanie.
Table 2. Example from the Benchmark of Sentence Plausibility. Each sentence has a plausible and four non-plausible words. As proposed by Vega-Mendoza et al. [47], we use the options as different tasks.
Sentence: At the club the cocktails are served by the _
Plausible
Control | barmaid
Implausible
Animate-Related | drunkard
Animate-Unrelated | queen
Inanimate-Related | tonic
Inanimate-Unrelated | dirt
Table 3. Example from translated version of N400 [14]. The first tokens indicate an acceptable example, and the numbers indicate the sentences given as context.
( T 1 ) A granny met the (confectioner-apple pie) at the market with whom she started a pleasant conversation about recipes. ( T 2 ) The (confectioner-apple pie) confided a secret recipe to the granny. ( T 3 ) But the granny deceived the (confectioner-apple pie) by making off with the recipe herself. ( T 4 ) The (confectioner-apple pie) discovered the deception and wanted to reprimand the granny. But the granny pleased the (confectioner-apple pie) with an even better version of the recipe. ( T 5 ) The (confectioner-apple pie) understood that this was the ultimate recipe and apologized for the misplaced distrust.
Table 4. Example from translated context story of N400 [14].
A girl told a sandwich that an attack was imminent. The sandwich wailed that his family was in danger. The girl told the sandwich that public places were the most dangerous. The sandwich immediately started calling everyone he knew. The sandwich was [targets] and wanted to make sure none of his loved ones were in danger.
targets: delicious, worried
