Abstract
We assessed the relational abilities of two state-of-the-art large language models (LLMs) and two large reasoning models (LRMs) using a new battery of several thousand syllogistic problems, similar to those used in behavior-analytic assessments of relational abilities. To probe the models’ general (as opposed to task- or domain-specific) abilities, the problems involved multiple relations (sameness, difference, comparison, hierarchy, analogy, and temporal and deictic relations), specified between randomly selected nonwords, and varied in complexity (number of premises, inclusion of irrelevant premises) and format (valid or invalid conclusion prompted). We also tested transformations of stimulus function. Our results show that the models generally performed well on this new task battery. The models did show some variability across relations and were affected, to a limited extent, by task variations. Model performance was, however, robust against randomization of premise order in a replication study. Our research provides a new framework for testing a core aspect of intellectual (i.e., relational) abilities in artificial systems; we discuss the implications of these findings and directions for future research.