Next Article in Journal
Development and Validation of Human-Computer Collaborative Classroom Second Language Learning Engagement Scale
Next Article in Special Issue
The Comparison of Human and Machine Performance in Object Recognition
Previous Article in Journal
A One-Year Longitudinal Study Examining the Direct and Indirect Effects of AI Dependence on Work Engagement and Gender Differences
Previous Article in Special Issue
LegalEye: Multimodal Court Deception Detection Across Multiple Languages
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Assessing the Relational Abilities of Large Language Models and Large Reasoning Models

Department of Experimental Clinical and Health Psychology, Ghent University, 9000 Ghent, Belgium
*
Author to whom correspondence should be addressed.
Behav. Sci. 2026, 16(1), 45; https://doi.org/10.3390/bs16010045
Submission received: 30 October 2025 / Revised: 12 December 2025 / Accepted: 18 December 2025 / Published: 25 December 2025
(This article belongs to the Special Issue Advanced Studies in Human-Centred AI)

Abstract

We assessed the relational abilities of two state-of-the-art large language models (LLMs) and two large reasoning models (LRMs) using a new battery of several thousand syllogistic problems, similar to those used in behavior-analytic tasks for relational abilities. To probe the models’ general (as opposed to task- or domain-specific) abilities, the problems involved multiple relations (sameness, difference, comparison, hierarchy, analogy, temporal and deictic), specified between randomly selected nonwords and varied in terms of complexity (number of premises, inclusion of irrelevant premises) and format (valid or invalid conclusion prompted). We also tested transformations of stimulus function. Our results show that the models generally performed well in this new task battery. The models did show some variability across different relations and were to a limited extent affected by task variations. Model performance was, however, robust against the randomization of premise order in a replication study. Our research provides a new framework for testing a core aspect of intellectual (i.e., relational) abilities in artificial systems; we discuss the implications of this and future research directions.

1. Introduction

Large language models (LLMs, e.g., GPT4, Open AI et al., 2023; LlaMa 3, Dubey et al., 2024) and large reasoning models (LRMs, e.g., Open AI o1, Open AI et al., 2024; DeepSeek R1, Guo et al., 2025) have taken the world by storm. Their ability to produce human-like text, answer questions and solve problems (and much more) has inspired much debate about how intelligent such systems are, and to what extent this aligns with human intelligence. While many have taken their performance as an indication that superhuman (or even general) intelligence is within reach (e.g., Bengio et al., 2024), such conclusions are nuanced by reports of those same systems failing on tasks that are trivial for the average human (e.g., Borji, 2023; McCoy et al., 2024; Berglund et al., 2023). The debate about these systems’ level of intelligence is complicated by a range of issues. First, there is little consensus about a definition of human intelligence (e.g., Sternberg & Detterman, 1987; Legg & Hutter, 2007), let alone artificial or general intelligence (e.g., Wang, 2019; Chollet, 2019; Morris et al., 2023), which leads to inconsistencies in how different perspectives interpret the same findings. Second, the fact that these language models are so fluent at producing human-like text, and that they are often tested on tasks designed for humans (e.g., the bar exam, intelligence tests) brings the risk of anthropomorphizing these systems (Crockett & Messeri, 2023; Shiffrin & Mitchell, 2023), attributing human-like understanding, intelligence and even goals, purpose and sentience to them (often referred to as the ELIZA-effect). However, the alignment between human and artificial system’s responses does not a priori reflect alignment of abilities or the mechanism underlying them (McCoy et al., 2024; Lin, 2025). It is important to see these models for what they are—large neural network models that are trained to predict the next token in a text—and evaluate them as such (McCoy et al., 2024). Finally, because these systems are trained on massive datasets, it is not uncommon that some of the test items used to evaluate them were present in the training data and that performance measures are biased as a result. This further highlights the need for systematic and rigorous testing of state-of-the-art LLMs and LRMs (Bowman & Dahl, 2021; Frank, 2023a; Srivastava et al., 2023; Zador et al., 2023; Lewis & Mitchell, 2024; Z. Wu et al., 2024). In this research, we take an approach towards evaluating LLM and LRM capacities inspired by behavior-analytic research on relational abilities. Doing so allows us to avoid getting caught up in philosophical debates about the nature of general intelligence, yet still obtain a measure closely related to it (i.e., do these systems demonstrate a behavior that is believed to be a hallmark of intelligence; Colbert et al., 2018).
Researchers from different fields in psychological research have emphasized the importance of relational abilities as a cornerstone of human cognition (e.g., Hayes et al., 2001; Penn et al., 2008; Halford et al., 2010; Gentner & Smith, 2013; Hughes & Barnes-Holmes, 2015) and intelligence more generally (e.g., Cassidy et al., 2011; Colbert et al., 2018; McLoughlin et al., 2020). Tests of relational abilities are part of many modern intelligence tests (i.e., tasks like matrix reasoning, verbal analogies, etc.; e.g., Raven & Raven, 2003; Kaufman & Lichtenberger, 2006; Wechsler, 2008), and some have been proposed to be a proxy-measure of intelligence (e.g., Colbert et al., 2017; May et al., 2022). Interestingly, different types (or levels of complexity) of relational abilities can be assessed and compared in different types of systems, be that of nonhuman animals, humans or artificially intelligent systems. To see how, we first need to describe the behavior-analytic perspective on relational abilities in more detail.
Relational frame theory (RFT, Hayes et al., 2001) provides a behavior-analytic definition of relational responding—the ability to respond to one event or stimulus in terms of its relation to another. This definition encompasses both responding to formal relations between events (e.g., choose the larger of two stimuli; referred to as non-arbitrarily applicable responding or NAARR) and responding to symbolic relations (referred to as arbitrarily applicable relational responding or AARR). In the latter case, relations can be arbitrarily applied to stimuli (i.e., regardless of formal properties, e.g., selecting a smaller, more valuable €1 coin over the larger, less valuable €0.50 coin) if particular contextual cues are present. Relational contextual cues signal the relationship between the stimuli (sameness, difference, hierarchy, and so on), whereas functional contextual cues indicate what stimulus function is related (e.g., color, value, threat). AARR has three core characteristics: (1) mutual entailment refers to the bidirectionality of relations (if A is related to B, B is also related to A); (2) combinatorial entailment or transitivity (if A is related to B and B is related to C, one can derive the relation between A and C); and (3) the transformation of function (i.e., if a neutral stimulus A is related to a stimulus B with a reinforcement value, the reinforcement value of A will be transformed according to the specified relation and one will respond to A as if it has that function). The definition of relational responding is in line with related constructs studied in cognitive psychology, such as relational reasoning (Alexander et al., 2016) or analogical reasoning (Holyoak, 2012; Gentner & Smith, 2013). We would like to emphasize the fact that the definition encompasses both NAARR (i.e., responding to formal stimulus relations) and AARR (responding to contextually defined stimulus relations, regardless of formal stimulus relations). Its broad scope allows us to compare different systems (nonhuman animals or other organisms, humans, machines, …) on a vast array of behaviors, from animals responding to simple, formal sameness and difference relations (e.g., a relational matching-to-sample task or transitive inference task; Penn et al., 2008) to humans or machines solving complex analogies or metaphors (e.g., Webb et al., 2023; Sourati et al., 2024).
Many tests have been developed to measure relational abilities (e.g., Premack, 1983; Goel & Dolan, 2001; Birney et al., 2006; Alexander et al., 2016; Colbert et al., 2017). In this research, we will use a measure developed within the behavior-analytic literature, referred to as the relational abilities index (RAI, Colbert et al., 2017). The procedure is similar to syllogistic reasoning tasks in that it presents the participant (or language model) with a number of relational premises involving nonwords (e.g., “COR is the same as WUG and WUG is different from LOM”) and a (or in some versions, multiple) conclusion(s) to respond to (e.g., “Is COR the same as LOM”?). Syllogistic reasoning has long been considered a hallmark of human intelligence (e.g., by Aristotle) and is studied in various research areas such as behavioral psychology (e.g., McHugh et al., 2004; Colbert et al., 2017); cognitive psychology (e.g., Goodwin & Johnson-Laird, 2005; Todd et al., 2019) and neurology (e.g., Goel & Dolan, 2001), as well as computer science (e.g., Hummel & Holyoak, 2001; Ando et al., 2023; Eisape et al., 2023; Ozeki et al., 2024). Traditionally, syllogistic reasoning tasks make use of realistic stimuli (e.g., names of people, animals, etc.) and involve either comparative relations (e.g., “Ben is smaller than Tom, Tom is bigger than Lisa”, see Goel & Dolan, 2001, for an example) or (logical) reasoning about object categories and features (e.g., category-based induction problems, e.g., “All birds have wings; Penguin is a bird; Therefore penguins have wings”, see Eisape et al., 2023, for an example). The RAI, on the other hand, involves multiple relations (sameness, difference, opposition, more or less than, hierarchical relations, temporal relations, deictic relations and analogies; see Cummins, 2023, for a detailed discussion) and uses arbitrarily chosen stimuli (i.e., nonwords). As such, it aims to capture the generalized (abstract) ability to respond relationally in an arbitrarily applied fashion (i.e., regardless of formal stimulus properties) and may provide a broader survey of relational reasoning abilities than similar measures do. The procedure is well-suited for testing language models’ relational abilities also. The fact that they are natural language reasoning problems makes them readily applicable in LLMs and LRMs and the use of nonwords reduces the likelihood that the models encountered identical problems in training. While the models may still benefit from structurally similar problems that are almost certainly present in training data, syllogistic reasoning problems can be manipulated in many ways to create different variations in the same problems (e.g., increasing the number of premises, including irrelevant premises, manipulating the order of premises, changing the response format to multiple choice) so as to further reduce the likelihood of contaminated results.
For this study, we created a large battery of several thousand syllogistic reasoning problems inspired by the RAI. We assessed a multitude of relations (sameness, difference, opposition, more than, less than, hierarchical relations, temporal relations, deictic relations and analogies). The task was created making use of relational derivation tables (essentially look-up tables that store the reversal and transitive combinations of different combinations of relations allowing for automated derivation of relations; Raemaekers, in preparation1). It represents a significant increase in size relative to similar tasks, and to our knowledge, this is the largest and most varied RAI-inspired task to this date (previous versions ranged from 55, Colbert et al., 2017, to 67 trials, Colbert et al., 2020, Cummins, 2023). In addition to relations typically tested in the RAI, we also included test trials that specifically assessed transformation of function. RFT considers this to be an important characteristic of relational responding, but it is currently not included in the RAI. We manipulated problem complexity by varying the number of premises (i.e., one to five relevant premises, with an optional irrelevant premise added to the problem). We also created further problem variations by manipulating whether a valid or invalid conclusion was prompted and by generating multiple instances of each unique problem using different nonwords in the premises, to get a reliable estimate of model performance. We describe the task in more detail below.

2. Materials and Methods

2.1. Syllogistic Reasoning Problems

We constructed a large battery of syllogistic relational reasoning problems (N = 1364 unique problems) that included a wide variety of relations and varied in structure and complexity. Similarly to the RAI, for descriptive purposes, we group the trials into blocks based on the relations they involve, the complexity of the relational arrangement, and the requirement to transform stimulus functions: (a) sameness and difference (N = 152), (b) sameness and opposition (N = 488), (c) more than and less than (N = 72), (d) before and after (N = 72), (e) hierarchy (i.e., ‘contains’ and ‘is part of’; N = 72), (f) deictic (I-you, here-there, now-then; N = 48), (g) analogy (N = 52) and (h) transformation of function (N = 408). The different trial numbers for the different relations were the result of the functioning of the generator function. For a number of specified relations, this function finds all valid combinations (i.e., the combination allows for valid derivation of a new relation) of the relations for a given number of premises (e.g., for two premises with sameness or difference relations, it could be sameness twice, or a one of two combinations of sameness and difference). For those combinations of relations, it then creates relational syllogistic problems with one relation per premise. Relations vary in terms of how they can be combined with other relations to allow valid derivations. For instance, two (or more) opposition relations can be specified between three (or more) relata, allowing one to derive novel transitive relations (e.g., A opposite to B and B opposite to C, so A is the same as C), whereas combinations of two or more difference relations does lead to valid derivations (if A is different from B and B is different from C, one cannot derive the relation between A and C with certainty). As a result, for some combinations of relations (e.g., same and different), fewer unique trials could be generated, leading to different numbers of trials between the relations. Each problem consisted of between one and six relational premises (e.g., ‘A is the same as B. B is the same as C’) and then prompted a conclusion, to which a binary ‘yes’ or ‘no’ response was required. We systematically manipulated problem complexity (varying number of premises and inclusion of an irrelevant premise) and the correct response (prompted syllogism conclusion was valid on half of trials, invalid on the other half). For deictic relations, we also manipulated whether the deictic dimension was to be reversed or not (see McHugh et al., 2004). Ten variants (each with different non-words as the relata) of each unique problem were created, adding up to a total of 13,640 problems (full set available on the Open Science Framework2) on which the LLMs and LRMs were assessed. To reduce the likelihood of prior experience with them in this context, the non-words used as relata were randomly selected from a pool of three-letter syllables, chosen to not be included in openly available versions of the RAI and to as much as possible not be parts of common words. Randomization of the non-words in different problems was constant across the different models. Below, we describe the main variables (beyond the relation involved) that we manipulated in the problems. Example problems are illustrated in Figure 1.

2.1.1. Number of Premises and Irrelevant Premises

For all relations except deictic and analogy relations (which always had two premises with an optional irrelevant premise), we manipulated the number of premises in the problems from one to five premises. Problems with only one premise can be thought of as assessing mutual entailment or reversal of a relation (e.g., “A is the same as B, is B the same as A?”, see Figure 1A), whereas problems with two or more premises assessed understanding of transitivity of relations (e.g., “A is opposite to B, B is opposite to C, and C is the same as D. Is D the same as A?”, Figure 1B). For each level of complexity, we also included a variant where an additional, irrelevant premise was added before prompting the conclusion (e.g., Figure 1C,F,I,M). That is, this premise referred to one of the related nonwords, but was not relevant for the prompted derivation (e.g., “A is the same as B, B is opposite of C and C is the same as D. Is A the opposite of C?”).

2.1.2. Conclusion Validity

We also manipulated the validity of the prompted conclusion. That is, half of the problems had an invalid conclusion prompted (e.g., “A is the same as B. Is B different from A?”, e.g., Figure 1B), while the other half had a valid conclusion prompted (e.g., “A is the opposite of B and B is the opposite of C. Is C the same as A?”; e.g., Figure 1D). This manipulation allowed us to investigate whether the models show signs of affirmation bias (i.e., better performance when answering yes). Such biases could, for example, arise due to problems with correct conclusions being more prevalent in training than those with invalid conclusions, or because of the extensive finetuning process to tailor the models’ responses to human preferences.

2.1.3. Deictic Relations

The problems involving deictic relations had a slightly different structure to those involving other relations (see Figure 1G–J). RFT describes three types of deictic relations (McHugh et al., 2004): interpersonal relations (I–you), temporal relations (now–then) and spatial relations (here–there)). RFT-inspired tests of this ability (e.g., McHugh et al., 2004; Cummins et al., 2023) combine these relations (e.g., I am here now, you were there then. Am I here now?’). For the current test battery, we separated the three relations. One set of trials assessed spatial relations (‘A is here. B is there. Is B here?’), another assessed purely temporal relations (e.g., ‘A is now, B is tomorrow. Is B tomorrow?’) and a final set assessed interpersonal relations (e.g., ‘I think A, you think B. Do I think A?’). Instead of different nonwords, different objects, events and thoughts were included in the various iterations of each problem. Similarly to the other relations, we created trial variations where an irrelevant premise was included (Figure 1I), and where the prompted conclusion was invalid (as well as a combination of both). We also included a variant that prompted a reversal of the deictic dimension (e.g., ‘If I were you and you were me, would I …?’ or ‘If here were there and there were here, would A be here?’, Figure 1J).

2.1.4. Analogy

For all relations except deictic relations, we included problems that required analogical reasoning (see Figure 1D–F). These problems provided two premises (e.g., “A is opposite to B, B is opposite to C.”) and then prompted a comparison of the relation in the first premise to that in the second premise (Figure 1D). Here too, a version with an incorrect analogy prompt (e.g., “A same as B and B same as C. Is A to B different from B to C?”, Figure 1E), and a version with a third, irrelevant premise (i.e., not involved in the analogy prompt, Figure 1F) were included.

2.1.5. Transformation of Function

To assess whether LLMs and LRMs demonstrate the ability to transform psychological functions, we also created a set of problems that applied relations to semantic functions of nonwords (see Figure 1K–M). In these problems, a non-word is assigned a meaning, then related to other nonwords in the same relational premises, and then the meaning of one of the related nonwords was queried (e.g., “AGU means ‘yes’. AGU has the same meaning as BUR. BUR has the opposite meaning of LOP. Does LOP have the same meaning as ‘yes’?”, Figure 1K). For sameness, difference and opposition relations, the nonwords were related to existing words (e.g., “ARU has the same meaning as ‘cat’.”), whereas for more than, less than, before and after relations, nonwords were assigned a value (e.g., “ARU is worth €5”) or time (e.g., “ARU is at 4 p.m.”). Variations in these problems again included manipulating the validity of the presented conclusion (Figure 1L) and adding irrelevant premises to the problem (Figure 1M). While this is a limited form of transformation of function, we do consider it to be a transformation of semantic function via sameness, difference, opposition, comparative and temporal relations. In that sense, we go beyond what existing versions of the RAI measure and at least have some measure of a crucial aspect of relational responding abilities.

2.1.6. Replication with Randomized Order of Premises

As a final test of the generality of the models’ ability to solve relational syllogistic problems, we replicated the full task battery as described above (i.e., 13,640 problems) but randomized the order of premises in each problem. That is, for each problem consisting of two or more premises (including irrelevant premises and premises specifying a function), the premises where no longer presented in a linear fashion (i.e., “A related to B, B related to C, C related to …”), but in a randomized order (e.g., A related to B, C related to D, B related to C). The (randomized) premises were still followed by a prompted conclusion requiring a yes or no response from the models. While randomizing the order of premises might also increase the complexity of the problems for humans, in principle, the relational complexity (i.e., the derivations to be made) remains the same. Therefore, reduced performance in this replication would hint at a lack of generalized ability for relational responding.

2.2. Models and Inference

We tested a sample of two state-of-the-art LLMs and two LRMs. This sample was chosen to include LLMs and LRMs varying in size, architecture and training. Although our sample was too limited to make general claims about LLMs, LRMs, or the difference in performance between the two, including a variety of models did allows us to assess relational behavior in both type of systems. Given the seminal nature of our study, we restricted the sample because of financial, environmental and practical concerns3. All models were accessed using the Together AI API service4. We describe them briefly below (summary in Table 1). For more technical descriptions, see the referenced publications.
We tested two models from the Llama 3 collection of LLMs developed by Meta AI (Dubey et al., 2024). Llama 3.1 405B, which has 405 billion parameters and a decoder-only transformer architecture, and Llama 3.3 70B, which has 70B parameters and has received further training (reinforcement learning, RL, from human feedback) to fine-tune it to human preferences. We also tested two recent models of the GPT-model family (Open AI, 2025): GPT OSS 20B and GPT OSS 120B. These are so-called open-weight models that are trained using reinforcement learning from human feedback and other techniques derived from past Open AI models, on only text data (Open AI, 2025). Both have a mixture-of-experts architecture (MoE, Jiang et al., 2024). GPT OSS 20B has 21 billion parameters with 3.6 billion active parameters, GPT OSS 120B has 117 billion parameters with 5.1 billion active parameters. Based on the results of a pilot optimization study (see Appendix A), we chose to test the models in a zero-shot setting (i.e., no examples, information or additional prompts provided in the context; a system prompt was provided: “Answer with yes or no only, no punctuation is needed. Don’t put a space before your response.”), and with the temperature parameter set to 0.75. For all other parameters, the default settings were used (see Together AI documentation).

3. Results

3.1. Overall Task Performance

Figure 2 (and Table 2) shows the models’ overall performance by blocks (i.e., relations, including analogy and transformation of function separately). As per the preregistered analytic strategy, we limit ourselves to descriptive statistics here to keep the results digestible. Binomial Generalized Linear Model analyses are summarized in Appendix C. Our results demonstrated that the LRMs (GPT OSS 20B and GPT OSS 120B) mostly outperform the LLMs (LlaMa 3.1 405B and Llama 3.3 70B; except for deictic relations where LlaMa 3.1 405B outperformed the other models slightly). The larger models (more parameters) also tended to outperform their smaller counterparts. While the models generally reached a high level of performance (GPT models correctly solved 83% or more of the problems and reached maximum or near-maximum scores in most blocks), there was quite some variability in performance across the different blocks (more so in LlaMa models). Performance dropped notably in the analogy block, even for the GPT models (GPT OSS 20B: 83.27%; GPT OSS 120B: 83.85%), but more so for the LLMs (LlaMa 3.1 405B: 55.19%; LlaMa 3.3 70B: 55.31%). For the LLMs, performance also dropped significantly in the same–opposite block (LlaMa 3.1 405B: 61.25%; LlaMa 3.3 70B: 55.18%) and the transformation of function block (LlaMa 3.1 405B: 82.06%; LlaMa 3.3 70B: 69.04%), and to a lesser extent in the same–different block (LlaMa 3.1 405B: 83.16%; LlaMa 3.3 70B: 70.59%).

3.2. Effect of Problem Complexity

For all blocks except the deictic and analogy blocks, we varied the number of relational premises in each problem from one to five. For comparative (more than, less than) and temporal relations (before, after), all models were remarkably robust to this manipulation and performed at near-maximum level for all levels of complexity. For the other relations, however, we did observe that performance dropped increasingly with more premises. In the same–different and same–opposite blocks, this drop in performance was more pronounced for the LlaMa models (which dropped from maximum down to chance-level) than for the GPT models (which dropped by about ten percentage points). For hierarchical relations, we observed a reversed pattern, where the models’ performance dropped by up to 20% for problems with three premises, but then recovered slightly for more complex problems. These results are illustrated in Appendix B.1 Figure A2 and described in Table A5.

3.3. Effect of Prompt Validity

We also manipulated whether a valid or invalid conclusion was prompted and whether an irrelevant premise was included, to test the models’ sensitivity to variations in problem presentation. While the effect of these variations was more difficult to interpret, we can say that GPT models were less affected by them than were the LlaMa models (see Appendix B.2 Figure A3 and Table A6 for detailed results). For the same–different, more than–less than and before–after blocks, all models performed consistently (at a high level) across the problem variations. For same–opposite relations, the GPT models were relatively unperturbed by the problem variations and were even more accurate when an incorrect conclusion was prompted (10–15% more accurate). The LlaMa models showed the reversed patterns, being significantly less accurate when an irrelevant premise was included in the problem and with performance dropping even more when an invalid conclusion was prompted. Finally, for hierarchical relations, both the GPT and LlaMa models’ performance dropped slightly (between ten and thirty percent) when an invalid conclusion was prompted, but the addition of an irrelevant premise did not noticeably affect performance. We discuss the effect of problem variations for deictic relations, analogy and transformation of function separately below.

3.4. Deictic Relations Performance

We describe the results for deictic relations separately, as these problems were slightly different from the other relations. Trials involved three types of deictic relations: interpersonal, temporal and spatial. All models were slightly less accurate on problems involving interpersonal relations than on trials involving spatial and temporal relations, on which they reached near-maximum performance (see Appendix B.2 Figure A3 and Table A7). In addition to manipulating the inclusion of an irrelevant premise and the validity of the prompted conclusion, we also included trials where the deictic dimension was reversed. While the models were only modestly affected by prompting an incorrect conclusion, performance dropped when a reversal was included, and more so when this was combined with an invalid conclusion or irrelevant premise. Most models were affected by trial variations in similar ways, but it is noteworthy that LlaMa 405B appeared to be significantly more robust to these manipulations, with its performance remaining largely stable or even increasing (relative to the regular prompt), and likely as a result outperforming the other models on deictic responding with 92.17% accuracy).

3.5. Analogy Performance

For all relations except deictic relations, we probed analogical reasoning abilities by providing two relational premises and asking whether the relations in the two premises were the same or not. Here too, we included variants with an invalid conclusion and with irrelevant premises. Across problem variants, models were more accurate solving analogies involving sameness, difference and opposition relations than those involving comparative (more than, less than), temporal (before, after) or hierarchical relations (see Appendix B.3 Figure A4 and Table A8). Still, in the same–different and same–opposite blocks, LlaMa 3.1 405B was significantly less accurate on problems where an incorrect conclusion was prompted (performance dropped from around eighty to below fifty percent). The smaller LlaMa 3.3 70B model only showed this sensitivity in the same–opposite block, while the GPT-models were relatively unaffected by problem variations in these blocks. In the blocks involving comparative, temporal and hierarchical relations, performance was markedly lower. The LlaMa models scored below 25% accuracy, while the GPT models reached about 50% accuracy. Performance on problems involving temporal and hierarchical relations was slightly higher, with LlaMa models reaching about 30% accuracy and GPT models reaching up to 75% accuracy. Interestingly, in the latter three blocks, performance appeared to increase when an invalid conclusion was prompted, or when an irrelevant premise was included.

3.6. Transformation of Function Performance

For sameness, difference, opposition, comparison and temporal relations, we also included trials that probed the models’ ability to transform stimulus functions (nonword meaning). These problems again included variants with invalid conclusions and irrelevant premises. The GPT models were both highly accurate for all relations (near-maximum performance, GPT OSS 20B 96.74% on average, GPT OSS 120B 99.24% on average). The LlaMa models were slightly less accurate (LlaMa 3.3 70B was 69.04% accurate on average, LlaMa 3.1 405B was 82.06% accurate on average), but this drop in performance was limited to same–different and same–opposite relations. For those relations, both LlaMa models’ performance decreased with an increasing number of premises and with invalid conclusions prompted (see Appendix B.4 Figure A5 and Table A9 and Table A10).

3.7. Replication with Randomized Order of Relational Premises

As a final test of the robustness of our results, we conducted a replication study in which we recreated the full task battery, but with the order of relational premises randomized for each problem. The results of this replication are illustrated in Figure 3 and summarized in Table 3. As can be seen from comparing Figure 1 and Figure 3, the models generally performed at about the same level in the replication study as in the original task. For some models, in particular blocks, performance dropped significantly, while for others, it even improved slightly. The most notable drop in performance was for GPT OSS 120B. While still performing at a high level, it lost around five percentage points in the same–different (91.84%), same–opposite (88.81%), comparison (94.03%), temporal (94.31%), hierarchy (92.92%) and transformation of function blocks (92.48%), in which it performed at near-maximum level in the original task (see Table 2). The smaller GPT OSS 20B model was more robust to the randomization of premise order. Also notable, the LlaMa 3.1 405B model dropped about ten percentage points in the transformation of function block (72.70%), but gained about seven percentage points in the same–different block (89.34%). Similarly, the smaller LlaMa 3.3 70B model also dropped about five percentage points in the transformation of function block (65.96%) and gained about eight percentage points in the same–different block (78.42%). The effect of increasing the number of relational premises in a problem, including irrelevant premises and manipulating the validity of the prompted conclusions were largely similar to the original results. The effect of increasing number of premises was slightly more pronounced in the replication study for GPT models (Figure A7 and Table A27). Effects of problem variations in deictic (Figure A8 and Table A28), analogy (Figure A9 and Table A29) and transformation of function blocks (Figure A10, Table A30 and Table A31) were largely the same as in the original task.

4. Discussion

Inspired by behavior-analytic research on the human ability for relational responding, we conducted survey of a small sample of LLMs and LRMs on a large battery of relational syllogistic reasoning problems. Our results demonstrated that both the LLMs (i.e., LlaMa 3.1 405B and 3.3 70B) and LRMs (i.e., GPT OSS 20B and 120B) in our sample generally perform well in these tasks. The LRMs reach at least eighty percent accuracy on all types of relating (including analogy and transformation of function) and reach near-maximum performance on same–different, comparison and temporal relations. The LLMs showed more variability in their performance, and while they performed at a similar level to the GPT models in the comparison, temporal, hierarchy and deictic blocks, their performance dropped markedly in the same–different, same–opposite, analogy and transformation of function blocks. This finding appears to show that the extensive post-training that LRMs go through does benefit them when it comes to solving syllogistic reasoning problems, potentially because this training involves finetuning of step-by-step reasoning (e.g., Rafailov et al., 2023; Li et al., 2023; Ouyang et al., 2022; Guo et al., 2025). Also in line with past research on LLMs and LRMs (e.g., Kaplan et al., 2020; Brown et al., 2020; Chowdhery et al., 2023; Guo et al., 2025), we observed that the larger models (i.e., LlaMa 3.1 405B and GPT OSS 120B) generally outperformed their smaller counterparts (i.e., LlaMa 3.3 70B and GPT OSS 20B, respectively).
These results appear to show that state-of-the-art systems have acquired an ability for relational responding, but they do show sensitivities to the type(s) of relations involved in the problem premises, and to variations in the problem presentation (i.e., varying the number of premises, including irrelevant premises and prompting invalid conclusions). While the effects of task variations differed between the models and between types of relations, they may reflect artifacts from the models’ training. The models’ decreased performance when invalid conclusions were prompted (except for GPT models, which were more accurate for invalid conclusions prompted in the same–opposite block) may reflect a bias for affirmative responding. Similarly, decreased performance in problems with more premises and for problems with an irrelevant premise could also reflect a lack of generalization from training. While our use of non-words likely reduces the probability that the models have seen these exact problems in training, they could still take advantage of structural overlap between these problems and those seen in training, and from the presence of cues indicating that overlap (i.e., performance is a function of overlap in problem topography, rather than generalized relational understanding). We were not able to control the training data of the models, but future research could address this. The decrease in performance on these problem variants, however, could also simply reflect the increased difficulty of these problems (i.e., problems with more relational premises have higher relational complexity, Halford et al., 2010), differences in prompt sensitivity or tokenization idiosyncrasies, which could also be elucidated in future research. Results of our replication study showed that the models’ performance was relatively robust against perturbations of the order of relational premises (which affected performance in prior studies, e.g., W. Wu & Deng, 2025), suggesting the ability to solve relational syllogisms is relatively generalized. Future research could investigate the models’ sensitivity to problem variations by other aspects of the task (e.g., multiple choice format, testing other relations) and by assessing human’s sensitivity to those same task variations.
The LLMs’ relatively low performance on analogy problems (LlaMa models performed around chance-level, while GPT models reached about 83% accuracy) somewhat nuances claims of emergent analogical reasoning in LLMs e.g., Webb et al., 2023; but see Lewis & Mitchell, 2024, for a counterexample). Taken together with the fact that we probed analogical reasoning in a somewhat unusual way (e.g., “A is the same as B. B is the same as C. Is A to B the same as B to C?”), one could argue that the models’ performance does not reflect truly generalized, abstract relational understanding. However, we must also note that the decrease in performance for analogy problems was most pronounced for comparative and temporal relations, where technically speaking, the model can be considered correct when it determines two identical relations (e.g., “A more than B” and “B more than C”) as not being identical, because they might not be the exact same instance of the relation (e.g., A may be five more than B, while B could be twenty more than C). Further research investigating the models’ reasoning process is needed to address this question.
A final aspect of our test of relational abilities that deserves highlighting is the transformation of function. The GPT models appeared to have little difficulty with these problems and performed at near-maximum level. LLMs’ lower performance on transformation of function trials was limited to same–different and same–opposite relations. Performance on those problems decreased with an increasing number of premises, analogous to regular same–different and same–opposite trials (no transformation of function) and may therefore reflect the effect of increasing the number of same–different and same–opposite premises, rather than a difficulty with the transformation of stimulus function itself. Both LLMs and LRMs thus appear to be capable of transforming semantic functions in a relational network, which is an important aspect of relational responding (Hayes et al., 2001).
Further research is needed to answer the question how the models’ performance relates to that of humans. Given the limited number of studies using the RAI or its derivatives (and the different versions used therein), normative (block-level) data for human performance is unavailable. Colbert et al. (2020) validated a version of the RAI (69 trials involving sameness, difference, opposition, comparative, temporal and analogy) and reported that adult participants scored between 80% and 90% for same–different, same–opposite, more–less and before–after trials, and that performance dropped to around 65% for analogy trials. Future research needs to collect human data for this test battery to allow for a more direct comparison of human and artificial performance, as well as the effect of problem variations on human performance.
Readers should be careful interpreting these findings, however. As we mentioned in the introduction, there are many pitfalls, such as the risk of anthropomorphizing the models and attributing human abilities and mechanisms (e.g., understanding, reasoning) to them (Lin, 2025). It is important to emphasize that output alignment (the fact that artificial systems produce seemingly human-like responses) does not a priori imply alignment of the underlying mechanisms or processes. Indeed, these models may solve these tasks in entirely different ways than humans do. The current study only investigated overall performance (whether models can solve relational syllogistic problems), but to further investigate similarities between the models’ and human reasoning, future research should study the models’ ‘reasoning’ process in more detail, and compare it with that of humans (e.g., Sourati et al., 2024; Ozeki et al., 2024; Eisape et al., 2023; Bertolazzi et al., 2024). Another limitation of the current work concerns our use of non-words in our test. We chose to do so in line with previous work in behavior-analytic literature (e.g., Colbert et al., 2020), and argue that the use of non-words allows us to assess whether the system (human or artificial) has acquired a general ability for relational responding (i.e., if it can act as if randomly selected non-words are related, it can learn to do so with any stimulus). A downside of using non-words is that it reduces the ecological validity of our task. While we do not wish to downplay this concern, it is reassuring to know that past research has shown that performance in procedures like these (i.e., syllogistic reasoning problems that specify relations between non-words) can be considered a proxy-measure of intellectual abilities (Colbert et al., 2017) and that they can be used to improve relational and intellectual abilities (e.g., Cassidy et al., 2011; Dixon et al., 2022). Furthermore, we would argue that our use of non-words is intended to both (i) reduce training data contamination, and (ii) increase reliance on relational contextual cues (i.e., the relations specified in the relational premises), because the non-words are unlikely to possess functions that might control the relational responding. There are therefore theoretical reasons (see Hayes et al., 2001) to suspect that the use of non-words may not impact ecological validity when the same cues are present in the tested syllogisms and the “real world”. Nevertheless, future research could address the question of ecological validity of these procedures in more detail.
Beyond merely assessing the relational abilities of LLMs and LRMs to get another measure of their competence, we mainly conducted this study to illustrate a broader point about using (a behavior-analytic perspective on) relational responding as a framework to evaluate artificial systems and compare them to other systems (humans or other animals). As we mentioned in the introduction, researchers in different fields of psychological research and computer science agree that relational abilities are a cornerstone of human cognition (McLoughlin et al., 2020) and may even be considered a proxy-measure of general intelligence (Colbert et al., 2017; Raven & Raven, 2003). By studying relational abilities, we can evaluate different systems without getting caught up in philosophical debates about the nature of general intelligence, yet still obtain a measure related to it. Our results showed that state-of-the-art LLMs and LRMs display the ability for relational reasoning, which is a core aspect of intelligence. Furthermore, taking RFT’s definition of relational responding (Hayes et al., 2001), encompassing both relatively simple responding to formal stimulus relations and more complex AARR (symbolic behavior), we can assess different types or levels of relational responding across a wide range of systems (nonhuman animals, humans, artificial systems). Many animals appear to be capable of responding to formal stimulus relations (e.g., Penn et al., 2008), while humans were until recently assumed to be unique in their ability for AARR (e.g., Hayes et al., 2001; Lionello-DeNolf, 2009; Lionello-DeNolf, 2021). Our results show that language models have developed the ability to AARR, at least as far as our test is concerned. Further research in this direction would allow us to make fair comparisons of these systems’ relational abilities and possibly provide clues about the bigger questions surrounding the level of intelligence of current state-of-the-art artificial intelligence.
Finally, RFT also emphasizes transformation of psychological function as a crucial aspect of AARR. We are not aware of any research that has directly assessed whether LLMs and LRMs have this ability. Our results provide evidence that they do. While our current procedure only probed a limited form of function transformation (i.e., transforming the meaning of nonwords), it does go beyond existing measures (e.g., Colbert et al., 2020; Cummins et al., 2023) that currently do not test it at all. Future research could investigate other types of transformations of function (e.g., transforming response functions: “When I say GUK, you respond with ‘yes’. GUK is opposite to TOK. I say TOK! [‘no’]”) or probe it in different ways (e.g., procedures like the recently developed function transformation tasks: Finn & De Houwer, 2021; Finn et al., 2023). However, there is also the broader, more philosophical question of whether artificial systems like LLMs and LRMs can develop this ability in the first place. According to RFT, the ability to AARR (and thus, the ability for transformation of function) develops through a long history of reinforcement in a verbal environment and involves abstraction of generalized patterns of relating away from formal stimulus properties involved in multiple exemplars of relations. Humans experience this learning history grounded in a multimodal sensory environment. It has been argued that experiencing many different examples of relational responding in a long and structured learning history is required for the ability to AARR (and thus, transform of function) to arise. LLMs strictly speaking only experience our world in a linguistic sense (except for more recent multimodal LLMs and LRMs, e.g., GPT o1, Open AI et al., 2024; Gemma 3, Gemma Team et al., 2025) and require orders of magnitude more training than the average human which looks much different from that of humans (e.g., Frank, 2023b). One could therefore argue that they cannot develop the ability for generalized relational understanding altogether (e.g., Bender et al., 2021). We hope that the framework we propose here can foster progress in these debates.

5. Conclusions

We conducted a large-scale assessment of the relational abilities of state-of-the-art LLMs and LRMs. Our results demonstrated that these systems generally perform well in our relational syllogistic reasoning task and appear to have developed the ability for relational responding, a core characteristic of intelligence. We did observe significant differences between the models and differences in performance for different relations. Furthermore, results suggest that model size and training significantly affect performance (bigger, more intensively post-trained models perform better). Finally, sensitivities to task variations hint at the possibility that performance does not fully reflect generalized understanding, so caution is warranted when interpreting these findings.

Author Contributions

Conceptualization, M.R., M.F. and J.D.H.; methodology, M.R., M.F. and J.D.H.; software, M.R.; validation, M.R.; formal analysis, M.R.; investigation, M.R.; resources, M.R.; data curation, M.R.; writing—original draft preparation, M.R.; writing—review and editing, M.R., M.F. and J.D.H.; visualization, M.R.; supervision, M.F. and J.D.H.; project administration, M.R.; funding acquisition, M.R. and J.D.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Flemish Research Foundation grant number 11M0325N (M.R.), Flemish Research Foundation grant number 12A2B26N (M.F.) and Special Research Fund grant BOF22/MET_V/002-01M00209 (J.D.H.).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available on the OSF at https://osf.io/78u36/overview?view_only=8f2df70d8ff845e9ad393d407f4c27c1 (accessed on 29 October 2025).

Acknowledgments

During the preparation of this study, the authors used ChatGPT (version 5, Thinking) for the purposes of code revision. The authors have reviewed and edited the output and take full responsibility for the content of this publication. GenAI was not used in the writing process.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LLMLarge Language Model
LRMLarge Reasoning Model
RFTRelational Frame Theory
RAIRelational Abilities Index
OSFOpen Science Framework

Appendix A

Pilot Optimization Study

The default hyperparameter settings were used for all models, except for the temperature. To determine the optimal temperature settings, we piloted a subset of the test battery to two of the models, GPT OSS 120B and LlaMa 405B IT. We tested four temperature parameter settings, zero, 0.25, 0.5 and 0.75, and chose the one setting that produced the highest performance, or any comparison deemed interesting. Model performance will be compared in a zero-shot reasoning, few-shot (in-context) reasoning, and a ‘chain-of-thought’ reasoning setting. In the zero-shot reasoning setting, the model only receives a system prompt about the task and response format (i.e., “Answer with yes or no only, no capitalization or punctuation needed.”) and is then prompted with the syllogism and a conclusion. For the chain-of-thought (CoT) prompt variant, the model is prompted to reason step by step after being presented with the syllogism (i.e., “Try and think this through step-by-step.”). In the few-shot learning setting, the model is first presented with two example syllogisms with the correct answer (one single premise reversal with an incorrect conclusion and one three-premise problem with a correct conclusion), before being prompted with the syllogism and possible conclusion(s) (again, with the additional prompt to reason step-by-step in the chain-of-thought reasoning setting).
The pilot study results showed overall high performance for both models, with GPT OSS 120B reliably outperforming LLaMa 3.1 405B IT by about 3 percentage points. With respect to the temperature parameter, our results indicated no significant differences between the four settings we tested. With respect to prompting techniques, our results indicated that few-shot prompting (i.e., providing a few example problems with correct responses before prompting the test problem) caused performance to increase by about 4.5 percentage points for both models. Based on these results, we decided to set the temperature parameter for testing models on the full battery to 0.75 (as Together AI recommends). We decided to only test the models in a zero-shot setting (i.e., no examples or additional prompt), seen as performance in that condition was already high, and few-shot prompting only improved it by a couple of percentage points.
Table A1. Pilot Study Results. Model Accuracy (%) at Different Temperature Parameter Settings.
Table A1. Pilot Study Results. Model Accuracy (%) at Different Temperature Parameter Settings.
Model\Temperature00.250.50.75
GPT OSS 120B95.195.996.297.0
LlaMa 3.1 405B IT92.392.392.992.6
Table A2. Binomial GLM Results: Model × Temperature.
Table A2. Binomial GLM Results: Model × Temperature.
Coeff.Std. Err.zp
Intercept2.490.2012.6340
Model (GPT)0.470.251.880.06
Temperature 0.2500.0601
Temperature 0.50.080.081.000.32
Temperature 0.750.040.070.580.57
Model–GPT × Temperature—0.250.190.260.750.45
Model–GPT × Temperature—0.50.180.270.690.49
Model–GPT × Temperature—0.750.470.281.690.09
Table A3. Pilot Study Results. Model Accuracy (%) for Different Prompting Techniques.
Table A3. Pilot Study Results. Model Accuracy (%) for Different Prompting Techniques.
Model\PromptZero-ShotFew-ShotChain-of-Thought
GPT OSS 120B93.998.695.7
LlaMa 3.1 405B IT90.895.391.6
Table A4. Binomial GLM Results: Model × Prompt Technique.
Table A4. Binomial GLM Results: Model × Prompt Technique.
Coeff.Std. Err.zp
Intercept2.390.1614.610
Model (GPT)0.710.233.050.002
Prompt–Few-shot0.620.222.820.005
Prompt–CoT−0.100.10−10.318
Model–GPT × Prompt–Few-shot0.510.550.930.350
Model–GPT × Prompt–CoT−0.270.23−1.200.229

Appendix B

Appendix B.1. Effect of Problem Complexity and Variations

Figure A1. Performance for increasing number of relational premises grouped by relations involved. Error bars are 95% Wilson binomial confidence intervals.
Figure A1. Performance for increasing number of relational premises grouped by relations involved. Error bars are 95% Wilson binomial confidence intervals.
Behavsci 16 00045 g0a1
Table A5. Model Accuracy (%) For Increasing Number of Premises Grouped by Relations Involved.
Table A5. Model Accuracy (%) For Increasing Number of Premises Grouped by Relations Involved.
BlockNumber of PremisesLlaMa 3.3 70BLlaMa 3.3 405BGPT OSS 20BGPT OSS 120B
Same and DifferentOne100.00100.00100.00100.00
Two100.00100.0099.17100.00
Three79.0691.2599.6999.69
Four61.5080.5099.2599.75
Five52.9268.7599.1799.79
Same and OppositeOne100.00100.0098.75100.00
Two82.8185.62592.8198.75
Three59.6966.8889.0690.78
Four53.5261.8092.1193.20
Five50.0455.3199.6693.59
More Than and Less ThanOne100.0096.25100.00100.00
Two100.00100.0098.75100.00
Three100.00100.00100.00100.00
Four100.00100.0099.38100.00
Five100.00100.00100.00100.00
Before and AfterOne96.2592.50100.00100.00
Two99.38100.00100.00100.00
Three99.38100.00100.00100.00
Four100.00100.0098.7593.75
Five100.00100.0099.3898.75
Contains and Is Part OfOne100.00100.0098.75100.00
Two93.13100.0088.13100.00
Three86.3893.7579.3893.75
Four93.13100.0086.25100.00
Five93.75100.0087.50100.00
Figure A2. Performance on Different Trial Variations, grouped by Relations Involved. Error bars are 95% Wilson binomial confidence intervals. IC = incorrect Conclusion, IP = Irrelevant Premise.
Figure A2. Performance on Different Trial Variations, grouped by Relations Involved. Error bars are 95% Wilson binomial confidence intervals. IC = incorrect Conclusion, IP = Irrelevant Premise.
Behavsci 16 00045 g0a2
Table A6. Model Accuracy (%) For Different Problem Variants Grouped by Relations Involved.
Table A6. Model Accuracy (%) For Different Problem Variants Grouped by Relations Involved.
BlockProblem VariantLlaMa 3.3 70BLlaMa 3.3 405BGPT OSS 20BGPT OSS 120B
Same and DifferentRegular69.2186.3298.42100.00
Incorrect Conclusion78.6885.6899.5599.77
Irrelevant Premise68.5780.2999.7199.71
Irrelevant Incorrect67.7179.4399.7199.71
Same and OppositeRegular66.0788.0385.7487.38
Incorrect Conclusion47.3139.0896.8599.31
Irrelevant Premise69.0777.8885.7688.39
Irrelevant Incorrect38.7341.3695.6898.81
More Than and Less ThanRegular100.00100.00100.00100.00
Incorrect Conclusion100.00100.00100.00100.00
Irrelevant Premise100.0098.1399.38100.00
Irrelevant Incorrect100.00100.0098.75100.00
Before and AfterRegular98.8999.44100.00100.00
Incorrect Conclusion100.00100.00100.0099.46
Irrelevant Premise98.1396.8899.3899.38
Irrelevant Incorrect100.00100.0098.7599.38
Contains and Is Part OfRegular100.00100.0099.44100.00
Incorrect Conclusion80.9195.4673.1895.46
Irrelevant Premise100.00100.0098.75100.00
Irrelevant Incorrect93.13100.0079.38100.00

Appendix B.2. Deictic Responding

Figure A3. Performance on different variations in deictic relations trials, grouped by relation involved. Error bars are 95% Wilson binomial confidence intervals. IC = incorrect Conclusion, IP = Irrelevant Premise.
Figure A3. Performance on different variations in deictic relations trials, grouped by relation involved. Error bars are 95% Wilson binomial confidence intervals. IC = incorrect Conclusion, IP = Irrelevant Premise.
Behavsci 16 00045 g0a3
Table A7. Model Accuracy (%) For Different Deictic Problem Variants Grouped by Relations Involved.
Table A7. Model Accuracy (%) For Different Deictic Problem Variants Grouped by Relations Involved.
BlockProblem VariantLlaMa 3.3 70BLlaMa 3.3 405BGPT OSS 20BGPT OSS 120B
InterpersonalRegular95.0075.0095.0090.00
Incorrect Conclusion95.00100.0065.0080.00
Irrelevant Premise100.0090.00100.00100.00
Irrelevant Incorrect95.00100.0075.0090.00
Reversal90.0080.0090.00100.00
Reversal Incorrect65.0095.0035.0040.00
Reversal + Irrelevant95.0060.0080.0095.00
Reversal Inc. + Irrelevant45.0090.0030.0035.00
TemporalRegular95.0075.00100.00100.00
Incorrect Conclusion100.00100.00100.00100.00
Irrelevant Premise100.0080.0090.0095.00
Irrelevant Incorrect100.00100.00100.00100.00
Reversal95.0095.0090.00100.00
Reversal Incorrect35.00100.00100.0095.00
Reversal + Irrelevant95.00100.0095.00100.00
Reversal Inc. + Irrelevant60.00100.0095.0090.00
SpatialRegular100.00100.00100.00100.00
Incorrect Conclusion100.00100.0090.00100.00
Irrelevant Premise100.00100.00100.00100.00
Irrelevant Incorrect100.00100.0090.00100.00
Reversal100.0095.00100.00100.00
Reversal Incorrect70.00100.0090.0080.00
Reversal + Irrelevant100.00100.0095.0095.00
Reversal Inc. + Irrelevant50.0095.0095.0070.00

Appendix B.3. Analogy

Figure A4. Performance on different variations in analogy problems, grouped by relations involved. Error bars are 95% Wilson binomial confidence intervals. IC = incorrect Conclusion, IP = Irrelevant Premise.
Figure A4. Performance on different variations in analogy problems, grouped by relations involved. Error bars are 95% Wilson binomial confidence intervals. IC = incorrect Conclusion, IP = Irrelevant Premise.
Behavsci 16 00045 g0a4
Table A8. Model Accuracy (%) For Different Analogy Problem Variants Grouped by Relations Involved.
Table A8. Model Accuracy (%) For Different Analogy Problem Variants Grouped by Relations Involved.
BlockProblem VariantLlaMa 3.3 70BLlaMa 3.3 405BGPT OSS 20BGPT OSS 120B
Same and DifferentRegular93.33100.00100.00100.00
Incorrect Conclusion86.6740.0093.33100.00
Irrelevant Premise86.6793.33100.00100.00
Irrelevant Incorrect93.3340.0090.00100.00
Same and OppositeRegular75.5100.0097.5097.50
Incorrect Conclusion62.5055.0090.0095.00
Irrelevant Premise82.5085.0097.50100.00
Irrelevant Incorrect57.5055.00100.0095.00
More Than and Less ThanRegular0.000.0020.0010.00
Incorrect Conclusion0.0030.0080.0040.00
Irrelevant Premise5.0015.0055.0025.00
Irrelevant Incorrect0.0015.0055.0080.00
Before and AfterRegular25.0030.0055.0060.00
Incorrect Conclusion5.0075.0070.0070.00
Irrelevant Premise25.0020.0060.0070.00
Irrelevant Incorrect5.0020.0080.0090.00
Contains and Is Part OfRegular35.0050.0085.0095.00
Incorrect Conclusion15.0070.0080.0075.00
Irrelevant Premise15.0075.0085.00100.00
Irrelevant Incorrect0.0035.0095.0090.00

Appendix B.4. Transformation of Function

Figure A5. Performance on transformation of function trials for different problem variation and number of premises, grouped by relations involved. Error bars are 95% Wilson binomial confidence intervals. IC = incorrect Conclusion, IP = Irrelevant Premise.
Figure A5. Performance on transformation of function trials for different problem variation and number of premises, grouped by relations involved. Error bars are 95% Wilson binomial confidence intervals. IC = incorrect Conclusion, IP = Irrelevant Premise.
Behavsci 16 00045 g0a5
Table A9. Model Accuracy (%) For Different Transformation of Function Problem Variants Grouped by Relations Involved.
Table A9. Model Accuracy (%) For Different Transformation of Function Problem Variants Grouped by Relations Involved.
BlockProblem VariantLlaMa 3.3 70BLlaMa 3.3 405BGPT OSS 20BGPT OSS 120B
Same and DifferentRegular76.5098.5097.0099.50
Incorrect Conclusion47.0094.50100.00100.00
Irrelevant Premise75.5093.5099.0099.50
Irrelevant Incorrect69.0088.0096.0098.00
Same and OppositeRegular64.1992.1095.3299.03
Incorrect Conclusion45.1639.1997.9099.68
Irrelevant Premise62.1085.3292.7498.07
Irrelevant Incorrect71.1374.0396.9499.36
More Than and Less ThanRegular100.00100.00100.00100.00
Incorrect Conclusion100.00100.0099.00100.00
Irrelevant Premise100.0098.00100.00100.00
Irrelevant Incorrect86.0099.0094.00100.00
Before and AfterRegular100.00100.00100.00100.00
Incorrect Conclusion100.00100.0099.00100.00
Irrelevant Premise100.00100.0099.0099.00
Irrelevant Incorrect91.00100.0098.00100.00
Table A10. Model Accuracy (%) for Transformation of Function Problems with Increasing Number of Premises Grouped by Relations Involved.
Table A10. Model Accuracy (%) for Transformation of Function Problems with Increasing Number of Premises Grouped by Relations Involved.
BlockNumber of PremisesLlaMa 3.3 70BLlaMa 3.3 405BGPT OSS 20B GPT OSS 120B
Same and DifferentOne76.2595.0098.75100.00
Two89.1797.50100.00100.00
Three77.5095.6398.13100.00
Four41.5094.0099.00100.00
Five67.0889.5895.8397.50
Same and OppositeOne86.2597.50100.00100.00
Two70.6385.00100.0098.13
Three57.8172.8196.2598.44
Four52.1963.5984.4799.22
Five62.7374.0694.9299.14
More Than and Less ThanOne93.7597.50100.00100.00
Two98.75100.00100.00100.00
Three100.00100.00100.00100.00
Four100.00100.0098.75100.00
Five100.0098.7592.50100.00
Before and AfterOne100.00100.00100.00100.00
Two96.25100.00100.00100.00
Three97.50100.00100.00100.00
Four98.75100.0098.7598.75
Five98.75100.0096.25100.00

Appendix C

Appendix C.1. Effect of Model and Block (Relations) on Performance

Table A11. Binomial GLM Results: Model × Block.
Table A11. Binomial GLM Results: Model × Block.
Coeff.Std. Err.zp
Intercept−0.880.34−2.570.010
Model (LlaMa 3.1 405B IT)−0.720.16−4.410.000
Model (GPT OSS 120B)−5.350.59−9.010.000
Model (GPT OSS 20B)−4.140.39−10.520.000
Block (Same–Opposite)0.670.361.900.062
Block (More–Less)−26.69119.33−7.040.823
Block (Before–After)−4.090.58−3.810.000
Block (Hierarchy)−1.660.44−2.760.000
Block (Deictic)−1.000.361.860.006
Block (Analogy)0.980.530.210.062
Block (Transformation)0.070.342.500.831
Model (LlaMa 3.1 405B IT) × Block (Same–Opposite)0.470.192.500.012
Model (GPT OSS 120B) × Block (Same–Opposite)2.880.624.650.000
Model (GPT OSS 20B) × Block (Same–Opposite)2.020.424.870.000
Model (LlaMa 3.1 405B IT) × Block (More–Less)22.81114.670.200.842
Model (GPT OSS 120B) × Block (More–Less)5.35119.990.050.964
Model (GPT OSS 20B) × Block (More–Less)26.23120.640.220.828
Model (LlaMa 3.1 405B IT) × Block (Before–After)0.900.721.260.209
Model (GPT OSS 120B) × Block (Before–After)4.840.994.900.000
Model (GPT OSS 20B) × Block (Before–After)3.631.003.620.000
Model (LlaMa 3.1 405B IT) × Block (Hierarchy)−1.010.89−1.140.256
Model (GPT OSS 120B) × Block (Hierarchy)3.621.063.430.001
Model (GPT OSS 20B) × Block (Hierarchy)4.790.4311.050.000
Model (LlaMa 3.1 405B IT) × Block (Deictic)0.020.400.050.962
Model (GPT OSS 120B) × Block (Deictic)5.050.776.590.000
Model (GPT OSS 20B) × Block (Deictic)4.070.695.910.000
Model (LlaMa 3.1 405B IT) × Block (Analogy)0.410.331.240.215
Model (GPT OSS 120B) × Block (Analogy)3.600.695.250.000
Model (GPT OSS 20B) × Block (Analogy)2.430.445.730.000
Model (LlaMa 3.1 405B IT) × Block (Transformation)0.000.190.020.987
Model (GPT OSS 120B) × Block (Transformation)1.280.632.010.044
Model (GPT OSS 20B) × Block (Transformation)1.550.433.650.000

Appendix C.2. Effect of Model Problem Complexity on Performance

Table A12. Binomial GLM Results: Model × Complexity (same–different block).
Table A12. Binomial GLM Results: Model × Complexity (same–different block).
Coeff.Std. Err.zp
Intercept7.920.1170.770.000
Model (LlaMa 3.1 405B IT)1.120.167.110.000
Model (GPT OSS 120B)4.660.1629.480.000
Model (GPT OSS 20B)2.450.1645.460.000
Complexity (2 Premises)2.070.1316.000.000
Complexity (3 Premises)−6.590.18−37.170.000
Complexity (4 Premises)−7.450.15−49.030.000
Complexity (5 Premises)−7.800.14−53.9810.000
Model (LlaMa 3.1 405B IT) × Complexity (2 Premises)0.130.180.740.463
Model (GPT OSS 120B) × Complexity (2 Premises)0.000.180.001.00
Model (GPT OSS 20B) × Complexity (2 Premises)−7.650.73−10.450.000
Model (LlaMa 3.1 405B IT) × Complexity (3 Premises)−0.110.29−0.380.705
Model (GPT OSS 120B) × Complexity (3 Premises)−0.221.03−0.210.834
Model (GPT OSS 20B) × Complexity (3 Premises)1.981.021.950.051
Model (LlaMa 3.1 405B IT) × Complexity (4 Premises)−0.180.23−0.770.440
Model (GPT OSS 120B) × Complexity (4 Premises)0.851.010.840.400
Model (GPT OSS 20B) × Complexity (4 Premises)1.970.613.320.001
Model (LlaMa 3.1 405B IT) × Complexity (5 Premises)−0.450.21−2.180.029
Model (GPT OSS 120B) × Complexity (5 Premises)1.391.021.370.171
Model (GPT OSS 20B) × Complexity (5 Premises)2.210.534.160.000
Table A13. Binomial GLM Results: Model × Complexity (same–opposite block).
Table A13. Binomial GLM Results: Model × Complexity (same–opposite block).
Coeff.Std. Err.zp
Intercept6.400.1157–1590.000
Model (LlaMa 3.1 405B IT)2.390.1615.130.000
Model (GPT OSS 120B)3.050.1619.300.000
Model (GPT OSS 20B)−2.001.05−1.910.056
Complexity (2 Premises)−4.830.19−26.000.000
Complexity (3 Premises)−6.010.39−43.550.000
Complexity (4 Premises)−6.260.13−49.990.000
Complexity (5 Premises)−6.400.12−53.890.000
Model (LlaMa 3.1 405B IT) × Complexity (2 Premises)−2.180.70−8.110.000
Model (GPT OSS 120B) × Complexity (2 Premises)−0.260.55−0.470.639
Model (GPT OSS 20B) × Complexity (2 Premises)2.981.082.770.006
Model (LlaMa 3.1 405B IT) × Complexity (3 Premises)−2.080.20−10.610.000
Model (GPT OSS 120B) × Complexity (3 Premises)−1.160.22−5.180.000
Model (GPT OSS 20B) × Complexity (3 Premises)3.701.063.510.000
Model (LlaMa 3.1 405B IT) × Complexity (4 Premises)−2.050.17−11.580.000
Model (GPT OSS 120B) × Complexity (4 Premises)−0.580.20−2.860.004
Model (GPT OSS 20B) × Complexity (4 Premises)4.321.054.100.000
Model (LlaMa 3.1 405B IT) × Complexity (5 Premises)−2.180.19−13.000.000
Model (GPT OSS 120B) × Complexity (5 Premises)−0.80.18−2.050.040
Model (GPT OSS 20B) × Complexity (5 Premises)4.271.054.080.000
Table A14. Binomial GLM Results: Model × Complexity (more–less block).
Table A14. Binomial GLM Results: Model × Complexity (more–less block).
Coeff.Std. Err.zp
Intercept8.720.1177.960.000
Model (LlaMa 3.1 405B IT)−5.460.61−8.980.000
Model (GPT OSS 120B)1.840.1611.610.000
Model (GPT OSS 20B)−0.120.16−0.780.437
Complexity (2 Premises)1.330.149.730.000
Complexity (3 Premises)4.980.1436.370.000
Complexity (4 Premises)4.520.1411.100.000
Complexity (5 Premises)4.980.1436.370.000
Model (LlaMa 3.1 405B IT) × Complexity (2 Premises)5.020.628.120.000
Model (GPT OSS 120B) × Complexity (2 Premises)0.400.192.050.041
Model (GPT OSS 20B) × Complexity (2 Premises)−5.550.74−7.520.000
Model (LlaMa 3.1 405B IT) × Complexity (3 Premises)1.470.622.380.017
Model (GPT OSS 120B) × Complexity (3 Premises)0.000.190.001.00
Model (GPT OSS 20B) × Complexity (3 Premises)0.000.190.001.00
Model (LlaMa 3.1 405B IT) × Complexity (4 Premises)5.150.628.330.000
Model (GPT OSS 120B) × Complexity (4 Premises)0.000.190.001.00
Model (GPT OSS 20B) × Complexity (4 Premises)−5.041.03−4.900.000
Model (LlaMa 3.1 405B IT) × Complexity (5 Premises)1.470.622.380.017
Model (GPT OSS 120B) × Complexity (5 Premises)0.000.190.001.00
Model (GPT OSS 20B) × Complexity (5 Premises)0.000.190.001.00
Table A15. Binomial GLM Results: Model × Complexity (before–after block).
Table A15. Binomial GLM Results: Model × Complexity (before–after block).
Coeff.Std. Err.zp
Intercept3.260.605.480.000
Model (LlaMa 3.1 405B IT)−0.740.73−1.010.314
Model (GPT OSS 120B)5.090.618.420.000
Model (GPT OSS 20B)5.130.618.480.000
Complexity (2 Premises)1.831.181.550.121
Complexity (3 Premises)1.831.181.550.121
Complexity (4 Premises)5.280.608.800.000
Complexity (5 Premises)5.340.608.890.000
Model (LlaMa 3.1 405B IT) × Complexity (2 Premises)5.151.264.100.000
Model (GPT OSS 120B) × Complexity (2 Premises)0.001.190.001.00
Model (GPT OSS 20B) × Complexity (2 Premises)0.001.190.001.00
Model (LlaMa 3.1 405B IT) × Complexity (3 Premises)5.151.264.010.000
Model (GPT OSS 120B) × Complexity (3 Premises)0.001.190.001.00
Model (GPT OSS 20B) × Complexity (3 Premises)0.001.190.001.00
Model (LlaMa 3.1 405B IT) × Complexity (4 Premises)3.190.744.310.000
Model (GPT OSS 120B) × Complexity (4 Premises)−8.551.18−7.220.000
Model (GPT OSS 20B) × Complexity (4 Premises)−9.290.94−9.880.000
Model (LlaMa 3.1 405B IT) × Complexity (5 Premises)3.190.744.310.000
Model (GPT OSS 120B) × Complexity (5 Premises)−9.310.94−9.900.000
Model (GPT OSS 20B) × Complexity (5 Premises)8.641.18−7.310.000
Table A16. Binomial GLM Results: Model × Complexity (hierarchy block).
Table A16. Binomial GLM Results: Model × Complexity (hierarchy block).
Coeff.Std. Err.zp
Intercept6.800.1160.740.000
Model (LlaMa 3.1 405B IT)5.340.1633.770.000
Model (GPT OSS 120B)5.260.1633.270.000
Model (GPT OSS 20B)−2.381.06−2.240.025
Complexity (2 Premises)−4.190.33−12.590.000
Complexity (3 Premises)−4.910.26−18.920.000
Complexity (4 Premises)−4.190.33−12.590.000
Complexity (5 Premises)−4.090.35−11.810.000
Model (LlaMa 3.1 405B IT) × Complexity (2 Premises)1.970.365.470.000
Model (GPT OSS 120B) × Complexity (2 Premises)1.920.365.330.000
Model (GPT OSS 20B) × Complexity (2 Premises)1.781.131.570.117
Model (LlaMa 3.1 405B IT) × Complexity (3 Premises)−4.520.43−10.490.000
Model (GPT OSS 120B) × Complexity (3 Premises)−4.440.43−10.280.000
Model (GPT OSS 20B) × Complexity (3 Premises)1.841.101.670.096
Model (LlaMa 3.1 405B IT) × Complexity (4 Premises)1.970.365.470.000
Model (GPT OSS 120B) × Complexity (4 Premises)1.920.365.330.000
Model (GPT OSS 20B) × Complexity (4 Premises)1.611.131.420.155
Model (LlaMa 3.1 405B IT) × Complexity (5 Premises)1.870.375.010.000
Model (GPT OSS 120B) × Complexity (5 Premises)1.810.374.870.000
Model (GPT OSS 20B) × Complexity (5 Premises)1.611.141.420.155

Appendix C.3. Effect of Model Problem Variations on Performance

Table A17. Binomial GLM Results: Model × Problem Variant (same–different block).
Table A17. Binomial GLM Results: Model × Problem Variant (same–different block).
Coeff.Std. Err.zp
Intercept1.140.1110.220.000
Model (LlaMa 3.1 405B IT)0.650.183.720.000
Model (GPT OSS 120B)4.961.024.880.000
Model (GPT OSS 20B)4.260.725.890.000
Variant (Irrelevant)−0.350.16−2.220.027
Variant (Irrelevant Incorrect)−0.390.16−2.470.014
Variant (Regular)−0.320.16−2.070.039
Model (LlaMa 3.1 405B IT) × Variant (Irrelevant)−0.030.25−0.120.906
Model (GPT OSS 120B) × Variant (Irrelevant)0.111.420.080.938
Model (GPT OSS 20B) × Variant (Irrelevant)0.821.240.660.512
Model (LlaMa 3.1 405B IT) × Variant (Irr. Incorrect)−0.040.25−0.180.859
Model (GPT OSS 120B) × Variant (Irr. Incorrect)0.141.420.100.923
Model (GPT OSS 20B) × Variant (Irr. Incorrect)0.831.220.680.498
Model (LlaMa 3.1 405B IT) × Variant (Regular)0.380.261.480.140
Model (GPT OSS 120B) × Variant (Regular)5.181.025.070.000
Model (GPT OSS 20B) × Variant (Regular)−0.940.84−1.120.262
Table A18. Binomial GLM Results: Model × Problem Variant (same–opposite block).
Table A18. Binomial GLM Results: Model × Problem Variant (same–opposite block).
Coeff.Std. Err.zp
Intercept−0.110.06−1.940.052
Model (LlaMa 3.1 405B IT)−0.340.08−4.230.000
Model (GPT OSS 120B)5.070.3414.960.000
Model (GPT OSS 20B)3.530.1721.010.000
Variant (Irrelevant)0.910.0810.850.000
Variant (Irrelevant Incorrect)−0.350.08−4.300.000
Variant (Regular)0.770.089.430.000
Model (LlaMa 3.1 405B IT) × Variant (Irrelevant)0.790.126.420.000
Model (GPT OSS 120B) × Variant (Irrelevant)−3.850.38−10.790.000
Model (GPT OSS 20B) × Variant (Irrelevant)−2.540.20−12.830.000
Model (LlaMa 3.1 405B IT) × Variant (Irr. Incorrect)0.450.123.850.000
Model (GPT OSS 120B) × Variant (Irr. Incorrect)−0.190.44−0.440.659
Model (GPT OSS 20B) × Variant (Irr. Incorrect)0.020.230.100.918
Model (LlaMa 3.1 405B IT) × Variant (Regular)1.670.1312.500.000
Model (GPT OSS 120B) × Variant (Regular)−3.810.36−10.720.000
Model (GPT OSS 20B) × Variant (Regular)−2.410.20−12.240.000
Table A19. Binomial GLM Results: Model × Problem Variant (more–less block).
Table A19. Binomial GLM Results: Model × Problem Variant (more–less block).
Coeff.Std. Err.zp
Intercept11.970.07177.590.000
Model (LlaMa 3.1 405B IT)0.000.100.001.00
Model (GPT OSS 120B)0.790.108.290.000
Model (GPT OSS 20B)−2.300.10−24.120.000
Variant (Irrelevant)−2.640.10−25.450.000
Variant (Irrelevant Incorrect)−0.230.10−2.170.030
Variant (Regular)0.630.106.230.000
Model (LlaMa 3.1 405B IT) × Variant (Irrelevant)−5.370.60−9.000.000
Model (GPT OSS 120B) × Variant (Irrelevant)0.000.150.001.00
Model (GPT OSS 20B) × Variant (Irrelevant)−1.941.03−1.890.059
Model (LlaMa 3.1 405B IT) × Variant (Irr. Incorrect)0.040.150.270.785
Model (GPT OSS 120B) × Variant (Irr. Incorrect)0.000.150.001.00
Model (GPT OSS 20B) × Variant (Irr. Incorrect)−5.070.73−6.980.000
Model (LlaMa 3.1 405B IT) × Variant (Regular)0.000.140.001.00
Model (GPT OSS 120B) × Variant (Regular)0.000.140.001.00
Model (GPT OSS 20B) × Variant (Regular)0.360.142.540.011
Table A20. Binomial GLM Results: Model × Problem Variant (before–after block).
Table A20. Binomial GLM Results: Model × Problem Variant (before–after block).
Coeff.Std. Err.zp
Intercept9.020.07133.820.000
Model (LlaMa 3.1 405B IT)0.700.107.330.000
Model (GPT OSS 120B)−3.621.02−3.560.000
Model (GPT OSS 20B)1.240.1013.050.000
Variant (Irrelevant)−5.060.59−8.600.000
Variant (Irrelevant Incorrect)0.000.100.001.00
Variant (Regular)−4.520.72−6.270.000
Model (LlaMa 3.1 405B IT) × Variant (Irrelevant)−1.220.75−1.640.101
Model (GPT OSS 120B) × Variant (Irrelevant)4.721.543.070.002
Model (GPT OSS 20B) × Variant (Irrelevant)−0.131.17−0.110.909
Model (LlaMa 3.1 405B IT) × Variant (Irr. Incorrect)0.050.150.320.746
Model (GPT OSS 120B) × Variant (Irr. Incorrect)−0.321.44−0.220.827
Model (GPT OSS 20B) × Variant (Irr. Incorrect)−5.890.73−8.110.000
Model (LlaMa 3.1 405B IT) × Variant (Regular)−0.001.240.000.998
Model (GPT OSS 120B) × Variant (Regular)9.401.257.530.000
Model (GPT OSS 20B) × Variant (Regular)4.100.735.630.000
Table A21. Binomial GLM Results: Model × Problem Variant (hierarchy block).
Table A21. Binomial GLM Results: Model × Problem Variant (hierarchy block).
Coeff.Std. Err.Zp
Intercept1.450.178.420.000
Model (LlaMa 3.1 405B IT)1.600.374.370.000
Model (GPT OSS 120B)1.600.374.370.000
Model (GPT OSS 20B)−0.440.23−1.930.054
Variant (Irrelevant)8.040.1942.570.000
Variant (Irrelevant Incorrect)1.160.363.260.001
Variant (Regular)7.570.2940.430.000
Model (LlaMa 3.1 405B IT) × Variant (Irrelevant)0.000.380.001.00
Model (GPT OSS 120B) × Variant (Irrelevant)0.000.380.001.00
Model (GPT OSS 20B) × Variant (Irrelevant)−4.670.76−6.190.000
Model (LlaMa 3.1 405B IT) × Variant (Irr. Incorrect)5.790.4911.870.000
Model (GPT OSS 120B) × Variant (Irr. Incorrect)5.790.4911.870.000
Model (GPT OSS 20B) × Variant (Irr. Incorrect)−0.820.43−1.880.060
Model (LlaMa 3.1 405B IT) × Variant (Regular)0.000.380.001.00
Model (GPT OSS 120B) × Variant (Regular)0.000.380.001.00
Model (GPT OSS 20B) × Variant (Regular)−3.371.05−3.210.001

Appendix C.4. Effect of Model and Problem Variations on Performance in Analogy Blocks (By Relation)

Table A22. Binomial GLM Results: Model × Problem Variant (same–different analogies).
Table A22. Binomial GLM Results: Model × Problem Variant (same–different analogies).
Coeff.Std. Err.zp
Intercept2.640.733.610.000
Model (LlaMa 3.1 405B IT)11.230.7514.880.000
Model (GPT OSS 120B)14.780.7519.600.000
Model (GPT OSS 20B)12.140.7516.090.000
Variant (Irrelevant)−0.770.91−0.850.398
Variant (Incorrect)−0.770.91−0.850.397
Variant (Irrelevant Incorrect)0.001.040.001.00
Model (LlaMa 3.1 405B IT) × Variant (Irrelevant)−10.461.18−8.870.000
Model (GPT OSS 120B) × Variant (Irrelevant)2.590.942.740.006
Model (GPT OSS 20B) × Variant (Irrelevant)8.940.949.470.000
Model (LlaMa 3.1 405B IT) × Variant (Incorrect)−13.501.00−13.530.000
Model (GPT OSS 120B) × Variant (Incorrect)2.600.942.760.006
Model (GPT OSS 20B) × Variant (Incorrect)−11.371.18−9.640.000
Model (LlaMa 3.1 405B IT) × Variant (Irr. Incor.)−14.271.12−12.800.000
Model (GPT OSS 120B) × Variant (Irr. Incorrect)1.221.071.140.253
Model (GPT OSS 20B) × Variant (Irr. Incorrect)−12.581.22−10.360.000
Table A23. Binomial GLM Results: Model × Problem Variant (same–opposite analogies).
Table A23. Binomial GLM Results: Model × Problem Variant (same–opposite analogies).
Coeff.Std. Err.zp
Intercept1.240.383.270.001
Model (LlaMa 3.1 405B IT)11.980.4129.200.000
Model (GPT OSS 120B)2.431.082.240.025
Model (GPT OSS 20B)2.431.082.250.025
Variant (Irrelevant)0.320.560.560.574
Variant (Incorrect)−0.720.50−1.450.147
Variant (Irrelevant Incorrect)−0.930.50−1.890.059
Model (LlaMa 3.1 405B IT) × Variant (Irrelevant)−11.790.73−16.090.000
Model (GPT OSS 120B) × Variant (Irrelevant)10.801.179.220.000
Model (GPT OSS 20B) × Variant (Irrelevant)−0.321.54−0.210.836
Model (LlaMa 3.1 405B IT) × Variant (Incorrect)−12.290.61−20.040.000
Model (GPT OSS 120B) × Variant (Incorrect)0.001.340.001.00
Model (GPT OSS 20B) × Variant (Incorrect)−0.741.24−0.590.553
Model (LlaMa 3.1 405B IT) × Variant (Irr. Incor.)−12.080.61−19.810.000
Model (GPT OSS 120B) × Variant (Irr. Incorrect)0.211.340.160.873
Model (GPT OSS 20B) × Variant (Irr. Incorrect)13.531.1411.910.000
Table A24. Binomial GLM Results: Model × Problem Variant (more–less analogies).
Table A24. Binomial GLM Results: Model × Problem Variant (more–less analogies).
Coeff.Std. Err.zp
Intercept−16.410.18−90.550.000
Model (LlaMa 3.1 405B IT)0.790.292.790.005
Model (GPT OSS 120B)14.210.7718.510.000
Model (GPT OSS 20B)15.020.5925.590.000
Variant (Irrelevant)13.471.0412.920.000
Variant (Incorrect)−2.570.65−3.950.000
Variant (Irrelevant Incorrect)−2.530.000.001.00
Model (LlaMa 3.1 405B IT) × Variant (Irrelevant)0.411.230.340.738
Model (GPT OSS 120B) × Variant (Irrelevant)−12.371.38−8.960.000
Model (GPT OSS 20B) × Variant (Irrelevant)−11.881.26−9.400.000
Model (LlaMa 3.1 405B IT) × Variant (Incorrect)17.340.8221.130.000
Model (GPT OSS 120B) × Variant (Incorrect)4.361.064.120.000
Model (GPT OSS 20B) × Variant (Incorrect)5.341.025.240.000
Model (LlaMa 3.1 405B IT) × Variant (Irr. Incor.)16.410.6226.470.000
Model (GPT OSS 120B) × Variant (Irr. Incorrect)6.110.886.960.000
Model (GPT OSS 20B) × Variant (Irr. Incorrect)4.120.695.980.000
Table A25. Binomial GLM Results: Model × Problem Variant (before–after analogies).
Table A25. Binomial GLM Results: Model × Problem Variant (before–after analogies).
Coeff.Std. Err.zp
Intercept−1.100.52−2.130.033
Model (LlaMa 3.1 405B IT)0.250.710.350.724
Model (GPT OSS 120B)1.500.692.180.029
Model (GPT OSS 20B)1.300.691.900.058
Variant (Irrelevant)0.000.730.001.00
Variant (Incorrect)−1.851.15−1.610.108
Variant (Irrelevant Incorrect)−1.851.15−1.610.108
Model (LlaMa 3.1 405B IT) × Variant (Irrelevant)−0.541.04−0.520.605
Model (GPT OSS 120B) × Variant (Irrelevant)0.440.990.450.655
Model (GPT OSS 20B) × Variant (Irrelevant)0.200.970.210.833
Model (LlaMa 3.1 405B IT) × Variant (Incorrect)3.791.352.810.005
Model (GPT OSS 120B) × Variant (Incorrect)2.291.331.720.085
Model (GPT OSS 20B) × Variant (Incorrect)2.491.331.880.060
Model (LlaMa 3.1 405B IT) × Variant (Irr. Incor.)1.311.370.960.339
Model (GPT OSS 120B) × Variant (Irr. Incorrect)3.641.442.520.012
Model (GPT OSS 20B) × Variant (Irr. Incorrect)3.031.352.240.025
Table A26. Binomial GLM Results: Model × Problem Variant (hierarchy analogies).
Table A26. Binomial GLM Results: Model × Problem Variant (hierarchy analogies).
Coeff.Std. Err.zp
Intercept−0.620.47−1.320.187
Model (LlaMa 3.1 405B IT)0.620.650.960.339
Model (GPT OSS 120B)3.571.133.150.002
Model (GPT OSS 20B)2.350.783.010.003
Variant (Irrelevant)−1.120.78−1.430.154
Variant (Incorrect)−1.110.78−1.430.154
Variant (Irrelevant Incorrect)−13.450.52−25.890.000
Model (LlaMa 3.1 405B IT) × Variant (Irrelevant)2.211.042.130.033
Model (GPT OSS 120B) × Variant (Irrelevant)13.571.3110.350.000
Model (GPT OSS 20B) × Variant (Irrelevant)1.121.180.940.345
Model (LlaMa 3.1 405B IT) × Variant (Incorrect)1.961.021.910.056
Model (GPT OSS 120B) × Variant (Incorrect)−0.741.39−0.530.597
Model (GPT OSS 20B) × Variant (Incorrect)0.771.150.670.504
Model (LlaMa 3.1 405B IT) × Variant (Irr. Incor.)12.830.8315.450.000
Model (GPT OSS 120B) × Variant (Irr. Incorrect)12.701.379.250.000
Model (GPT OSS 20B) × Variant (Irr. Incorrect)14.661.3111.200.000

Appendix D

Appendix D.1. Effect of Problem Complexity and Variations When Premise Order Is Randomized

Figure A6. Replication study performance for increasing number of relational premises grouped by relations involved. Error bars are 95% Wilson confidence intervals.
Figure A6. Replication study performance for increasing number of relational premises grouped by relations involved. Error bars are 95% Wilson confidence intervals.
Behavsci 16 00045 g0a6
Table A27. Model Accuracy (%) For Increasing Number of Premises in the Replication Study, Grouped by Relations Involved.
Table A27. Model Accuracy (%) For Increasing Number of Premises in the Replication Study, Grouped by Relations Involved.
BlockNumber of PremisesLlaMa 3.3 70BLlaMa 3.3 405BGPT OSS 20BGPT OSS 120B
Same and DifferentOne100.0098.75100.0090.00
Two97.50100.0098.7590.83
Three86.5697.8198.7588.13
Four71.5087.50100.0091.00
Five65.6378.3399.5895.83
Same and OppositeOne100.00100.00100.0096.25
Two82.8187.5093.7593.13
Three59.7067.6689.8487.03
Four54.3855.3990.0088.13
Five49.9652.1588.2088.83
More Than and Less ThanOne98.7597.50100.00100.00
Two100.00100.00100.0095.63
Three100.00100.0098.7593.13
Four98.75100.00100.0091.88
Five98.13100.0097.5092.50
Before and AfterOne97.5098.75100.0091.25
Two100.00100.0099.3896.88
Three98.1399.38100.00100.00
Four97.5097.5098.7592.50
Five96.8899.38100.0089.38
Contains and Is Part OfOne98.75100.0098.7588.75
Two96.25100.0087.5092.50
Three89.3892.5084.3888.75
Four89.3898.1385.6393.75
Five86.8893.7583.7598.75
Figure A7. Replication study performance on Different Trial Variations, grouped by Relations Involved. Error bars are 95% Wilson confidence intervals. IC = incorrect Conclusion, IP = Irrelevant Premise.
Figure A7. Replication study performance on Different Trial Variations, grouped by Relations Involved. Error bars are 95% Wilson confidence intervals. IC = incorrect Conclusion, IP = Irrelevant Premise.
Behavsci 16 00045 g0a7
Table A28. Model Accuracy (%) For Different Problem Variants in the Replication Study, Grouped by Relations Involved.
Table A28. Model Accuracy (%) For Different Problem Variants in the Replication Study, Grouped by Relations Involved.
BlockProblem VariantLlaMa 3.3 70BLlaMa 3.3 405BGPT OSS 20BGPT OSS 120B
Same and DifferentRegular86.5891.8499.7491.84
Incorrect Conclusion75.9190.6899.3290.46
Irrelevant Premise88.5791.4399.4392.29
Irrelevant Incorrect62.5782.8699.1493.14
Same and OppositeRegular75.2584.1085.3385.57
Incorrect Conclusion39.4636.4694.4693.00
Irrelevant Premise76.0271.7885.8583.98
Irrelevant Incorrect31.7041.5391.7892.37
More Than and Less ThanRegular100.00100.0099.4495.00
Incorrect Conclusion99.09100.0099.0995.91
Irrelevant Premise98.1398.7599.3892.50
Irrelevant Incorrect99.38100.0098.7591.88
Before and AfterRegular98.8999.44100.0095.56
Incorrect Conclusion97.7399.0999.0994.55
Irrelevant Premise98.1398.75100.0093.13
Irrelevant Incorrect97.5098.7599.3893.75
Contains andIs Part OfRegular95.0095.5696.1194.44
Incorrect Conclusion88.6495.0075.4687.73
Irrelevant Premise88.1396.2596.2595.00
Irrelevant Incorrect94.38100.0082.5096.25

Appendix D.2. Deictic Responding

Figure A8. Replication study performance on different variations in deictic relations trials, grouped by relation involved. Error bars are 95% Wilson confidence intervals. IC = incorrect Conclusion, IP = Irrelevant Premise.
Figure A8. Replication study performance on different variations in deictic relations trials, grouped by relation involved. Error bars are 95% Wilson confidence intervals. IC = incorrect Conclusion, IP = Irrelevant Premise.
Behavsci 16 00045 g0a8
Table A29. Model Accuracy (%) For Different Deictic Problem Variants in the Replication Study, Grouped by Relations Involved.
Table A29. Model Accuracy (%) For Different Deictic Problem Variants in the Replication Study, Grouped by Relations Involved.
BlockProblem Variant LlaMa 3.3 70BLlaMa 3.3 405BGPT OSS 20B GPT OSS 120B
InterpersonalRegular100.0070.00100.0090.00
Incorrect Conclusion95.00100.0070.0095.00
Irrelevant Premise100.0085.00100.00100.00
Irrelevant Incorrect95.00100.0080.0085.00
Reversal80.0085.0085.00100.00
Reversal Incorrect60.0095.0050.0035.00
Reversal + Irrelevant75.0085.0090.0095.00
Reversal Inc. + Irrelevant50.00100.0055.0040.00
TemporalRegular100.0080.00100.00100.00
Incorrect Conclusion100.00100.0095.00100.00
Irrelevant Premise100.0075.0085.0095.00
Irrelevant Incorrect100.00100.00100.00100.00
Reversal100.0095.00100.00100.00
Reversal Incorrect35.0095.0090.00100.00
Reversal + Irrelevant95.00100.00100.00100.00
Reversal Inc. + Irrelevant55.0095.0090.00100.00
SpatialRegular100.00100.00100.0090.00
Incorrect Conclusion100.00100.0090.0095.00
Irrelevant Premise100.00100.00100.0095.00
Irrelevant Incorrect100.00100.0075.0090.00
Reversal100.00100.00100.0090.00
Reversal Incorrect55.00100.00100.0080.00
Reversal + Irrelevant100.00100.00100.00100.00
Reversal Inc. + Irrelevant35.00100.0095.0075.00

Appendix D.3. Analogy

Figure A9. Replication study performance on different variations in analogy problems, grouped by relations involved. Error bars are 95% Wilson confidence intervals. IC = incorrect Conclusion, IP = Irrelevant Premise.
Figure A9. Replication study performance on different variations in analogy problems, grouped by relations involved. Error bars are 95% Wilson confidence intervals. IC = incorrect Conclusion, IP = Irrelevant Premise.
Behavsci 16 00045 g0a9
Table A30. Model Accuracy (%) For Different Analogy Problem Variants in the Replication Study, Grouped by Relations Involved.
Table A30. Model Accuracy (%) For Different Analogy Problem Variants in the Replication Study, Grouped by Relations Involved.
BlockProblem Variant LlaMa 3.3 70BLlaMa 3.3 405BGPT OSS 20B GPT OSS 120B
Same and DifferentRegular86.67100.0096.6793.33
Incorrect Conclusion99.6756.6790.0090.00
Irrelevant Premise90.0093.3383.3393.33
Irrelevant Incorrect96.6750.0096.6790.00
Same and OppositeRegular67.5095.0097.5097.50
Incorrect Conclusion60.0057.5090.0095.00
Irrelevant Premise77.5077.5095.0095.00
Irrelevant Incorrect57.5050.0092.5092.50
More Than and Less ThanRegular5.005.0015.005.00
Incorrect Conclusion0.0020.0065.0035.00
Irrelevant Premise5.0015.0040.0030.00
Irrelevant Incorrect0.000.0075.0080.00
Before and AfterRegular40.0035.0055.0075.00
Incorrect Conclusion0.0060.0035.0065.00
Irrelevant Premise25.0035.0060.0065.00
Irrelevant Incorrect0.0040.0065.0085.00
Contains andIs Part OfRegular40.0085.0070.0070.00
Incorrect Conclusion5.0035.0075.0070.00
Irrelevant Premise55.0080.0085.0090.00
Irrelevant Incorrect0.0020.0075.0095.00

Appendix D.4. Transformation of Function

Figure A10. Replication study performance on transformation of function trials for different problem variation and number of premises, grouped by relations involved. Error bars are 95% Wilson confidence intervals. IC = incorrect Conclusion, IP = Irrelevant Premise.
Figure A10. Replication study performance on transformation of function trials for different problem variation and number of premises, grouped by relations involved. Error bars are 95% Wilson confidence intervals. IC = incorrect Conclusion, IP = Irrelevant Premise.
Behavsci 16 00045 g0a10
Table A31. Model Accuracy (%) For Different Transformation of Function Problem Variants in the Replication Study, Grouped by Relations Involved.
Table A31. Model Accuracy (%) For Different Transformation of Function Problem Variants in the Replication Study, Grouped by Relations Involved.
BlockProblem VariantLlaMa 3.3 70BLlaMa 3.3 405BGPT OSS 20BGPT OSS 120B
Same and DifferentRegular92.5082.5099.5090.00
Incorrect Conclusion52.5079.50100.0091.50
Irrelevant Premise92.0080.0099.5092.00
Irrelevant Incorrect52.5077.0090.0086.00
Same and OppositeRegular70.0074.8490.1693.07
Incorrect Conclusion39.8442.1096.1393.23
Irrelevant Premise70.4870.0086.2991.77
Irrelevant Incorrect43.0762.4287.4291.61
More Than and Less ThanRegular100.00100.0099.0092.00
Incorrect Conclusion94.0099.00100.0095.00
Irrelevant Premise100.0099.0098.0094.00
Irrelevant Incorrect82.0098.0092.0092.00
Before and AfterRegular100.00100.00100.0099.00
Incorrect Conclusion83.0094.0099.0095.00
Irrelevant Premise100.0099.00100.0097.00
Irrelevant Incorrect68.0093.0096.0098.00
Table A32. Model Accuracy (%) for Transformation of Function Problems with Increasing Number of Premises Grouped by Relations Involved.
Table A32. Model Accuracy (%) for Transformation of Function Problems with Increasing Number of Premises Grouped by Relations Involved.
BlockNumber of PremisesLlaMa 3.3 70BLlaMa 3.3 405BGPT OSS 20B GPT OSS 120B
Same and DifferentOne78.7592.50100.0087.50
Two85.8369.67100.0084.17
Three80.6388.75100.0089.38
Four63.500.0068100.0092.50
Five65.4270.8390.8391.67
Same and OppositeOne93.7598.75100.00100.00
Two70.0081.2598.7596.88
Three56.2591.2594.3890.63
Four51.8853.9190.3189.53
Five53.5962.1997.0393.28
More Than andLess ThanOne93.7596.25100.00100.00
Two96.25100.00100.0092.50
Three96.25100.0098.7593.75
Four92.5098.7598.7587.50
Five91.25100.0088.7592.50
Before and AfterOne97.50100.00100.0095.00
Two91.2598.7598.75100.00
Three88.7595.00100.00100.00
Four81.2592.50100.0095.00
Five80.0096.2595.0096.25

Notes

1
More information, as well as the illustrations of the derivation tables and scripts to create or implement them can be found on the OSF: https://osf.io/bjxqg/overview?view_only=76aea06fae764230b8fa5dea3a5c4728 (accessed on 29 October 2025).
2
3
The preregistered (https://osf.io/78u36/overview?view_only=8f2df70d8ff845e9ad393d407f4c27c1, accessed on 29 October 2025) model sample also contained two models from Google’s Gemma 3 model family and from DeepSeek AI. However, for technical (we could not test Gemma models via Together AI’s API-service), practical (querying the DeepSeek models took a very long time compared to other models) and financial (the cost of running DeepSeek models) reasons, we did not test these models.
4

References

  1. Alexander, P. A., Dumas, D., Grossnickle, E. M., List, A., & Firetto, C. M. (2016). Measuring relational reasoning. The Journal of Experimental Education, 84(1), 119–151. [Google Scholar] [CrossRef]
  2. Ando, R., Morishita, T., Abe, H., Mineshima, K., & Okada, M. (2023). Evaluating large language models with NeuBAROCO: Syllogistic reasoning ability and human-like biases. arXiv, arXiv:2306.12567. [Google Scholar]
  3. Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021, March 3–10). On the dangers of stochastic parrots: Can language models be too big? 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 610–623), Virtual. [Google Scholar]
  4. Bengio, Y., Hinton, G., Yao, A., Song, D., Abbeel, P., Darrell, T., Harari, Y. N., Zhang, Y., Xue, L., Shalev-Shwartz, S., Hadfield, G., Clune, J., Maharaj, T., Hutter, F., Baydin, A. G., McIlraith, S., Gao, Q., Acharya, A., Krueger, D., … Anca Dragan, A. (2024). Managing extreme AI risks amid rapid progress. Science, 384(6698), 842–845. [Google Scholar] [CrossRef]
  5. Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A. C., Korbak, T., & Evans, O. (2023). The Reversal Curse: LLMs trained on” A is B” fail to learn” B is A”. arXiv, arXiv:2309.12288. [Google Scholar]
  6. Bertolazzi, L., Gatt, A., & Bernardi, R. (2024). A systematic analysis of large language models as soft reasoners: The case of syllogistic inferences. arXiv, arXiv:2406.11341. [Google Scholar] [CrossRef]
  7. Birney, D. P., Halford, G. S., & Andrews, G. (2006). Measuring the influence of complexity on relational reasoning: The development of the latin square task. Educational and Psychological Measurement, 66(1), 146–171. [Google Scholar] [CrossRef]
  8. Borji, A. (2023). A categorical archive of chatgpt failures. arXiv, arXiv:2302.03494. [Google Scholar] [CrossRef]
  9. Bowman, S. R., & Dahl, G. E. (2021). What will it take to fix benchmarking in natural language understanding? arXiv, arXiv:2104.02145. [Google Scholar] [CrossRef]
  10. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., & Agarwal, S. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901. [Google Scholar]
  11. Cassidy, S., Roche, B., & Hayes, S. C. (2011). A relational frame training intervention to raise intelligence quotients: A pilot study. The Psychological Record, 61(2), 173–198. [Google Scholar] [CrossRef]
  12. Chollet, F. (2019). On the measure of intelligence. arXiv, arXiv:1911.01547. [Google Scholar]
  13. Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., & Schuh, P. (2023). Palm: Scaling language modelling with pathways. Journal of Machine Learning Research, 24(240), 1–113. [Google Scholar]
  14. Colbert, D., Dobutowitsch, M., Roche, B., & Brophy, C. (2017). The proxy-measurement of intelligence quotients using a relational skills abilities index. Learning and Individual Differences, 57, 114–122. [Google Scholar] [CrossRef]
  15. Colbert, D., Malone, A., Barrett, S., & Roche, B. (2020). The relational abilities index+: Initial validation of a functionally understood proxy measure for intelligence. Perspectives on Behavior Science, 43(1), 189–213. [Google Scholar] [CrossRef]
  16. Colbert, D., Tyndall, I., Roche, B., & Cassidy, S. (2018). Can SMART training really increase intelligence? A replication study. Journal of Behavioral Education, 27(4), 509–531. [Google Scholar] [CrossRef]
  17. Crockett, M., & Messeri, L. (2023). Should large language models replace human participants? PsyArXiv preprint. [Google Scholar] [CrossRef]
  18. Cummins, J. (2023). On the measurement of relational responding. Journal of Contextual Behavioral Science, 30, 155–168. [Google Scholar] [CrossRef]
  19. Cummins, J., Nevejans, M., Colbert, D., & De Houwer, J. (2023). On the structure of relational responding. Journal of Contextual Behavioral Science, 27, 16–25. [Google Scholar] [CrossRef]
  20. Dixon, M. R., Yi, Z., & Chastain, A. N. (2022). PEAK relational training system. In Handbook of applied behavior analysis interventions for autism: Integrating research into practice (pp. 341–360). Springer International Publishing. [Google Scholar] [CrossRef]
  21. Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., & Ganapathy, R. (2024). The llama 3 herd of models. arXiv. [Google Scholar] [CrossRef]
  22. Eisape, T., Tessler, M. H., Dasgupta, I., Sha, F., van Steenkiste, S., & Linzen, T. (2023). A systematic comparison of syllogistic reasoning in humans and language models. arXiv, arXiv:2311.00445. [Google Scholar]
  23. Finn, M., & De Houwer, J. (2021). The selective action of Cfunc control. Journal of the Experimental Analysis of Behavior, 116(3), 314–331. [Google Scholar] [CrossRef]
  24. Finn, M., Raemaekers, M., & De Houwer, J. (2023). Instructing via relations: Function transformations of response and consequence functions of upcoming contingencies. Journal of Contextual Behavioral Science, 30, 203–209. [Google Scholar] [CrossRef]
  25. Frank, M. C. (2023a). Baby steps in evaluating the capacities of large language models. Nature Reviews Psychology, 2(8), 451–452. [Google Scholar] [CrossRef]
  26. Frank, M. C. (2023b). Bridging the data gap between children and large language models. Trends in Cognitive Sciences, 27(11), 990–992. [Google Scholar] [CrossRef]
  27. Gemma Team, Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ramé, A., Rivière, M., & Rouillard, L. (2025). Gemma 3 technical report. arXiv, arXiv:2503.19786. [Google Scholar] [CrossRef]
  28. Gentner, D., & Smith, A. L. (2013). Analogical learning and reasoning. In D. Reisberg (Ed.), The oxford handbook of cognitive psychology. Oxford Library of Psychology. [Google Scholar] [CrossRef]
  29. Goel, V., & Dolan, R. J. (2001). Functional neuroanatomy of three-term relational reasoning. Neuropsychologia, 39(9), 901–909. [Google Scholar] [CrossRef] [PubMed]
  30. Goodwin, G. P., & Johnson-Laird, P. N. (2005). Reasoning about relations. Psychological Review, 112(2), 468. [Google Scholar] [CrossRef] [PubMed]
  31. Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., Zhang, X., Yu, X., Wu, Y., Wu, Z. F., Gou, Z., Shao, Z., Li, Z., Gao, Z., Liu, A., & Li, E. (2025). Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv, arXiv:2501.12948. [Google Scholar]
  32. Halford, G. S., Wilson, W. H., & Phillips, S. (2010). Relational knowledge: The foundation of higher cognition. Trends in Cognitive Sciences, 14(11), 497–505. [Google Scholar] [CrossRef]
  33. Hayes, S. C., Barnes-Holmes, D., & Roche, B. (Eds.). (2001). Relational frame theory: A post-Skinnerian account of human language and cognition. Springer Science & Business Media. [Google Scholar]
  34. Holyoak, K. J. (2012). Analogy and relational reasoning. In The oxford handbook of thinking and reasoning (pp. 234–259). Oxford University Press. [Google Scholar] [CrossRef]
  35. Hughes, S., & Barnes-Holmes, D. (2015). Relational frame theory: The basic account. In The Wiley handbook of contextual behavioral science (pp. 129–178). Wiley Online Library. [Google Scholar] [CrossRef]
  36. Hummel, J. E., & Holyoak, K. J. (2001). A process model of human transitive inference. Spatial Schemas and Abstract Thought, 279–306. [Google Scholar] [CrossRef]
  37. Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., de las Casas, D., Hanna, E. B., Bressand, F., Lengyel, G., Bour, G., Lample, G., Lavaud, L. R., Saulnier, L., Lachaux, M.-A., Stock, P., Subramanian, S., Yang, S., & Antoniak, S. (2024). Mixtral of experts. arXiv, arXiv:2401.04088. [Google Scholar] [CrossRef]
  38. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling laws for neural language models. arXiv, arXiv:2001.08361. [Google Scholar] [CrossRef]
  39. Kaufman, A. S., & Lichtenberger, E. O. (2006). Assessing adolescent and adult intelligence (3rd ed.). John Wiley & Sons, Inc. [Google Scholar]
  40. Legg, S., & Hutter, M. (2007). A collection of definitions of intelligence. Frontiers in Artificial Intelligence and applications, 157, 17. [Google Scholar]
  41. Lewis, M., & Mitchell, M. (2024). Using counterfactual tasks to evaluate the generality of analogical reasoning in large language models. arXiv, arXiv:2402.08955. [Google Scholar] [CrossRef]
  42. Li, L. H., Hessel, J., Yu, Y., Ren, X., Chang, K. W., & Choi, Y. (2023). Symbolic chain-of-thought distillation: Small models can also” think” step-by-step. arXiv, arXiv:2306.14050. [Google Scholar]
  43. Lin, Z. (2025). Six fallacies in substituting large language models for human participants. Advances in Methods and Practices in Psychological Science, 8(3), 25152459251357566. [Google Scholar] [CrossRef]
  44. Lionello-DeNolf, K. M. (2009). The search for symmetry: 25 years in review. Learning & Behavior, 37(2), 188–203. [Google Scholar] [CrossRef]
  45. Lionello-DeNolf, K. M. (2021). An update on the search for symmetry in nonhumans. Journal of the Experimental Analysis of Behavior, 115(1), 309–325. [Google Scholar] [CrossRef]
  46. May, R. J., Tyndall, I., McTiernan, A., Roderique-Davies, G., & McLoughlin, S. (2022). The impact of the SMART program on cognitive and academic skills: A systematic review and meta-analysis. British Journal of Educational Technology, 53(5), 1244–1261. [Google Scholar] [CrossRef]
  47. McCoy, R. T., Yao, S., Friedman, D., Hardy, M. D., & Griffiths, T. L. (2024). Embers of autoregression show how large language models are shaped by the problem they are trained to solve. Proceedings of the National Academy of Sciences, 121(41), e2322420121. [Google Scholar] [CrossRef]
  48. McHugh, L., Barnes-Holmes, Y., & Barnes-Holmes, D. (2004). Perspective-taking as relational responding: A developmental profile. The Psychological Record, 54(1), 115–144. [Google Scholar] [CrossRef]
  49. McLoughlin, S., Tyndall, I., & Pereira, A. (2020). Convergence of multiple fields on a relational reasoning approach to cognition. Intelligence, 83, 101491. [Google Scholar] [CrossRef]
  50. Meta AI. (2024, July 23). Introducing Llama 3.1: Our most capable models to date. Available online: https://ai.meta.com/blog/meta-llama-3-1/ (accessed on 29 October 2025).
  51. Morris, M. R., Sohl-dickstein, J., Fiedel, N., Warkentin, T., Dafoe, A., Faust, A., Farabet, C., & Legg, S. (2023). Levels of AGI for operationalizing progress on the path to AGI. arXiv, arXiv:2311.02462. [Google Scholar]
  52. Open AI. (2025, August 5). Introducing gpt-oss. Available online: https://openai.com/index/introducing-gpt-oss/ (accessed on 29 October 2025).
  53. Open AI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Balaji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., … Bello, I. (2023). Gpt-4 technical report. arXiv, arXiv:2303.08774. [Google Scholar] [CrossRef]
  54. Open AI, Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., Iftimie, A., Karpenko, A., Passos, A. T., Neitz, A., Prokofiev, A., Wei, A., Tam, A., Bennett, A., … Kumar, A. (2024). Openai o1 system card. arXiv, arXiv:2412.16720. [Google Scholar] [CrossRef]
  55. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., & Schulman, J. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744. [Google Scholar]
  56. Ozeki, K., Ando, R., Morishita, T., Abe, H., Mineshima, K., & Okada, M. (2024). Exploring reasoning biases in large language models through syllogism: Insights from the NeuBAROCO dataset. arXiv, arXiv:2408.04403. [Google Scholar] [CrossRef]
  57. Penn, D. C., Holyoak, K. J., & Povinelli, D. J. (2008). Darwin’s mistake: Explaining the discontinuity between human and nonhuman minds. Behavioral and Brain Sciences, 31(2), 109–130. [Google Scholar] [CrossRef] [PubMed]
  58. Premack, D. (1983). The codes of man and beasts. Behavioral and Brain Sciences, 6(1), 125–136. [Google Scholar] [CrossRef]
  59. Raemaekers, M. (in preparation). Open-source tools for relational network derivation, visualization and task generation.
  60. Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., & Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. arXiv. Available online: https://arxiv.org/abs/2305.18290 (accessed on 29 October 2025).
  61. Raven, J., & Raven, J. (2003). Raven progressive matrices. In R. S. McCallum (Ed.), Handbook of nonverbal assessment (pp. 223–237). Kluwer Academic/Plenum Publishers. [Google Scholar] [CrossRef]
  62. Shiffrin, R., & Mitchell, M. (2023). Probing the psychology of AI models. Proceedings of the National Academy of Sciences, 120(10), e2300963120. [Google Scholar] [CrossRef]
  63. Sourati, Z., Ilievski, F., Sommerauer, P., & Jiang, Y. (2024). Arn: Analogical reasoning on narratives. Transactions of the Association for Computational Linguistics, 12, 1063–1086. [Google Scholar] [CrossRef]
  64. Srivastava, A., Kleyko, D., & Wu, Z. (2023). Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv. Available online: https://arxiv.org/abs/2206.04615 (accessed on 29 October 2025).
  65. Sternberg, R. J., & Detterman, D. K. (1987). What is intelligence? Contemporary viewpoints on its nature and definition. The American Journal of Psychology, 100(1), 141. [Google Scholar] [CrossRef]
  66. Todd, J. A. M., Andrews, G., & Conlon, E. G. (2019). Relational thinking in later adulthood. Psychology and Aging, 34(4), 486. [Google Scholar] [CrossRef] [PubMed]
  67. Wang, P. (2019). On defining artificial intelligence. Journal of artificial general intelligence, 10(2), 1–37. [Google Scholar] [CrossRef]
  68. Webb, T., Holyoak, K. J., & Lu, H. (2023). Emergent analogical reasoning in large language models. Nature Human Behaviour, 7(9), 1526–1541. [Google Scholar] [CrossRef]
  69. Wechsler, D. (2008). Wechsler adult intelligence scale-fourth edition (WAIS-IV) [Database record]. APA PsycTests. [Google Scholar] [CrossRef]
  70. Wu, W., & Deng, W. (2025, April 6–11). Transitive Inference in Large Language Models and Prompting Intervention. ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1–5), Hyderabad, India. [Google Scholar] [CrossRef]
  71. Wu, Z., Qiu, L., Ross, A., Akyürek, E., Chen, B., Wang, B., Kim, N., Andreas, J., & Kim, Y. (2024). Reasoning or reciting? Exploring the capabilities and limitations of language models through counterfactual tasks. In Proceedings of the 2024 conference of the north american chapter of the association for computational linguistics: Human language technologies (Volume 1: Long Papers) (pp. 1819–1862). Association for Computational Linguistics. [Google Scholar] [CrossRef]
  72. Zador, A., Escola, S., Richards, B., Ölveczky, B., Bengio, Y., Boahen, K., Botvinick, M., Chklovskii, D., Churchland, A., Clopath, C., DiCarlo, J., Ganguli, S., Hawkins, J., Körding, K., Koulakov, A., LeCun, Y., Lillicrap, T., Marblestone, A., Olshausen, B., … Pouget, A. (2023). Catalyzing next-generation artificial intelligence through neuroai. Nature Communications, 14(1), 1597. [Google Scholar] [CrossRef]
Figure 1. Example trials from the full task set. (A): Example of two one-premise problems with the correct reversal prompted. (B): A two-premise problem involving more than relations, with the incorrect conclusion prompted. (C): A four-premise problem with an irrelevant premise, all involving same–opposite relations. (D): An example of an analogy prompt involing temporal relations. (E): An analogy prompt with the incorrect conclusion prompted. (F): An analogy with an irrelevant premise included. (GI): Deictic relations problems with the correct relation prompted, the incorrect relation prompted, and an irrelevant premise included, respectively. (J): A deictic relation problem with a reversal of the deictic dimension. (KM): Examples of problems involving transformations of function, with the correct conclusion prompted, the incorrect conclusion prompted, and an irrelevant premise included, respectively.
Figure 1. Example trials from the full task set. (A): Example of two one-premise problems with the correct reversal prompted. (B): A two-premise problem involving more than relations, with the incorrect conclusion prompted. (C): A four-premise problem with an irrelevant premise, all involving same–opposite relations. (D): An example of an analogy prompt involing temporal relations. (E): An analogy prompt with the incorrect conclusion prompted. (F): An analogy with an irrelevant premise included. (GI): Deictic relations problems with the correct relation prompted, the incorrect relation prompted, and an irrelevant premise included, respectively. (J): A deictic relation problem with a reversal of the deictic dimension. (KM): Examples of problems involving transformations of function, with the correct conclusion prompted, the incorrect conclusion prompted, and an irrelevant premise included, respectively.
Behavsci 16 00045 g001
Figure 2. Model performance (percent accurate) grouped by the relations involved in the problems. Error bars are 95% Wilson binomial confidence intervals across items within each block.
Figure 2. Model performance (percent accurate) grouped by the relations involved in the problems. Error bars are 95% Wilson binomial confidence intervals across items within each block.
Behavsci 16 00045 g002
Figure 3. Model performance (percent accurate) in the replication study with randomized premise order, grouped by the relations involved in the problems. Error bars are 95% Wilson binomial confidence intervals across items within each block.
Figure 3. Model performance (percent accurate) in the replication study with randomized premise order, grouped by the relations involved in the problems. Error bars are 95% Wilson binomial confidence intervals across items within each block.
Behavsci 16 00045 g003
Table 1. Overview of the tested models (and group), their training and parameter size.
Table 1. Overview of the tested models (and group), their training and parameter size.
Model (Group)TrainingParameters
LLaMa 3.1 405B IT
(LLM)
Pre: 15T+ multilingual, open-source text tokens. Post: Supervised Fine-Tuning, Rejection Sampling, and Direct Preference Optimization 1.405B
LLaMa 3.3 70B IT
(LLM)
Pre: 15T+ multilingual, open-source text tokens. Post: Supervised fine-tuning and reinforcement learning with human feedback 1.70B
GPT OSS 20B
(LRM)
Pre: trillions of text tokens, focus on STEM, coding, and general knowledge. Post: supervised fine-tuning and high-compute RL 2.21B, 3.6 active
GPT OSS 120B
(LRM)
Pre: trillions of text tokens, focus on STEM, coding, and general knowledge. Post: supervised fine-tuning and high-compute RL 2.117B, 5.1 active
Note: B = billion, T = trillion. RL = Reinforcement Learning. Model names are those used in the referenced publications. 1 Meta AI (2024). 2 Open AI (2025).
Table 2. Model Accuracy (%) Across Blocks.
Table 2. Model Accuracy (%) Across Blocks.
Block\ModelLlaMa 3.3 70BLlaMa 3.3 405BGPT OSS 20B GPT OSS 120B
Same and Different70.5983.1699.3299.80
Same and Opposite55.1861.2591.1193.56
More Than and Less Than10099.5899.58100
Before and After99.3199.1799.5899.58
Contains and Is Part Of92.6498.6186.8198.61
Deictic86.6792.1787.5089.79
Analogy47.3155.1983.2783.85
Transformation Function69.0482.0696.7499.24
Table 3. Model Accuracy (%) Across Blocks in the Replication with Randomized Premise Order.
Table 3. Model Accuracy (%) Across Blocks in the Replication with Randomized Premise Order.
Block\ModelLlaMa 3.3 70BLlaMa 3.3 405BGPT OSS 20B GPT OSS 120B
Same and Different78.4289.3499.4191.84
Same and Opposite55.3758.1489.4588.81
More Than and Less Than99.1799.7299.1794.03
Before and After98.0699.0399.5894.31
Contains and Is Part Of91.3996.5386.8192.92
Deictic84.5894.1789.5889.58
Analogy48.2755.3977.5079.81
Transformation Function65.9672.7092.9992.48
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Raemaekers, M.; Finn, M.; De Houwer, J. Assessing the Relational Abilities of Large Language Models and Large Reasoning Models. Behav. Sci. 2026, 16, 45. https://doi.org/10.3390/bs16010045

AMA Style

Raemaekers M, Finn M, De Houwer J. Assessing the Relational Abilities of Large Language Models and Large Reasoning Models. Behavioral Sciences. 2026; 16(1):45. https://doi.org/10.3390/bs16010045

Chicago/Turabian Style

Raemaekers, Matthias, Martin Finn, and Jan De Houwer. 2026. "Assessing the Relational Abilities of Large Language Models and Large Reasoning Models" Behavioral Sciences 16, no. 1: 45. https://doi.org/10.3390/bs16010045

APA Style

Raemaekers, M., Finn, M., & De Houwer, J. (2026). Assessing the Relational Abilities of Large Language Models and Large Reasoning Models. Behavioral Sciences, 16(1), 45. https://doi.org/10.3390/bs16010045

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop