Assessing the Relational Abilities of Large Language Models and Large Reasoning Models
Abstract
1. Introduction
2. Materials and Methods
2.1. Syllogistic Reasoning Problems
2.1.1. Number of Premises and Irrelevant Premises
2.1.2. Conclusion Validity
2.1.3. Deictic Relations
2.1.4. Analogy
2.1.5. Transformation of Function
2.1.6. Replication with Randomized Order of Premises
2.2. Models and Inference
3. Results
3.1. Overall Task Performance
3.2. Effect of Problem Complexity
3.3. Effect of Prompt Validity
3.4. Deictic Relations Performance
3.5. Analogy Performance
3.6. Transformation of Function Performance
3.7. Replication with Randomized Order of Relational Premises
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| LLM | Large Language Model |
| LRM | Large Reasoning Model |
| RFT | Relational Frame Theory |
| RAI | Relational Abilities Index |
| OSF | Open Science Framework |
Appendix A
Pilot Optimization Study
| Model\Temperature | 0 | 0.25 | 0.5 | 0.75 |
|---|---|---|---|---|
| GPT OSS 120B | 95.1 | 95.9 | 96.2 | 97.0 |
| LlaMa 3.1 405B IT | 92.3 | 92.3 | 92.9 | 92.6 |

| | Coeff. | Std. Err. | z | p |
|---|---|---|---|---|
| Intercept | 2.49 | 0.20 | 12.634 | 0 |
| Model (GPT) | 0.47 | 0.25 | 1.88 | 0.06 |
| Temperature 0.25 | 0 | 0.06 | 0 | 1 |
| Temperature 0.5 | 0.08 | 0.08 | 1.00 | 0.32 |
| Temperature 0.75 | 0.04 | 0.07 | 0.58 | 0.57 |
| Model (GPT) × Temperature 0.25 | 0.19 | 0.26 | 0.75 | 0.45 |
| Model (GPT) × Temperature 0.5 | 0.18 | 0.27 | 0.69 | 0.49 |
| Model (GPT) × Temperature 0.75 | 0.47 | 0.28 | 1.69 | 0.09 |
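
The coefficients in the table above appear to be on the log-odds scale. This is an assumption, supported by the intercept mapping back to the LlaMa accuracy at temperature 0. Under that assumption, the following minimal sketch shows how a coefficient converts to a predicted accuracy and how the z and p columns follow from a coefficient and its standard error (a Wald test). The function names are illustrative and are not taken from the authors' analysis code.

```python
import math


def inverse_logit(log_odds: float) -> float:
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def wald_z(coef: float, std_err: float) -> float:
    """Wald statistic: the coefficient divided by its standard error."""
    return coef / std_err


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Values copied from the temperature pilot table above.
intercept = 2.49   # assumed reference cell: LlaMa at temperature 0
model_gpt = 0.47   # Model (GPT) coefficient
model_gpt_se = 0.25

llama_acc = inverse_logit(intercept)
gpt_acc = inverse_logit(intercept + model_gpt)
z = wald_z(model_gpt, model_gpt_se)
p = two_sided_p(z)

print(f"LlaMa, temperature 0: {llama_acc * 100:.1f}%")
print(f"GPT, temperature 0:   {gpt_acc * 100:.1f}%")
print(f"Model (GPT): z = {z:.2f}, p = {p:.2f}")
```

With these inputs the sketch gives about 92.3% and 95.1%, which agree with the temperature 0 column of the accuracy table, and z = 1.88 with p = 0.06, which agree with the Model (GPT) row.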
| Model\Prompt | Zero-Shot | Few-Shot | Chain-of-Thought |
|---|---|---|---|
| GPT OSS 120B | 93.9 | 98.6 | 95.7 |
| LlaMa 3.1 405B IT | 90.8 | 95.3 | 91.6 |

| | Coeff. | Std. Err. | z | p |
|---|---|---|---|---|
| Intercept | 2.39 | 0.16 | 14.61 | 0 |
| Model (GPT) | 0.71 | 0.23 | 3.05 | 0.002 |
| Prompt (Few-Shot) | 0.62 | 0.22 | 2.82 | 0.005 |
| Prompt (CoT) | −0.10 | 0.10 | −1 | 0.318 |
| Model (GPT) × Prompt (Few-Shot) | 0.51 | 0.55 | 0.93 | 0.350 |
| Model (GPT) × Prompt (CoT) | −0.27 | 0.23 | −1.20 | 0.229 |
Appendix B
Appendix B.1. Effect of Problem Complexity and Variations

| Block | Number of Premises | LlaMa 3.3 70B | LlaMa 3.3 405B | GPT OSS 20B | GPT OSS 120B |
|---|---|---|---|---|---|
| Same and Different | One | 100.00 | 100.00 | 100.00 | 100.00 |
| | Two | 100.00 | 100.00 | 99.17 | 100.00 |
| | Three | 79.06 | 91.25 | 99.69 | 99.69 |
| | Four | 61.50 | 80.50 | 99.25 | 99.75 |
| | Five | 52.92 | 68.75 | 99.17 | 99.79 |
| Same and Opposite | One | 100.00 | 100.00 | 98.75 | 100.00 |
| | Two | 82.81 | 85.625 | 92.81 | 98.75 |
| | Three | 59.69 | 66.88 | 89.06 | 90.78 |
| | Four | 53.52 | 61.80 | 92.11 | 93.20 |
| | Five | 50.04 | 55.31 | 99.66 | 93.59 |
| More Than and Less Than | One | 100.00 | 96.25 | 100.00 | 100.00 |
| | Two | 100.00 | 100.00 | 98.75 | 100.00 |
| | Three | 100.00 | 100.00 | 100.00 | 100.00 |
| | Four | 100.00 | 100.00 | 99.38 | 100.00 |
| | Five | 100.00 | 100.00 | 100.00 | 100.00 |
| Before and After | One | 96.25 | 92.50 | 100.00 | 100.00 |
| | Two | 99.38 | 100.00 | 100.00 | 100.00 |
| | Three | 99.38 | 100.00 | 100.00 | 100.00 |
| | Four | 100.00 | 100.00 | 98.75 | 93.75 |
| | Five | 100.00 | 100.00 | 99.38 | 98.75 |
| Contains and Is Part Of | One | 100.00 | 100.00 | 98.75 | 100.00 |
| | Two | 93.13 | 100.00 | 88.13 | 100.00 |
| | Three | 86.38 | 93.75 | 79.38 | 93.75 |
| | Four | 93.13 | 100.00 | 86.25 | 100.00 |
| | Five | 93.75 | 100.00 | 87.50 | 100.00 |

| Block | Problem Variant | LlaMa 3.3 70B | LlaMa 3.3 405B | GPT OSS 20B | GPT OSS 120B |
|---|---|---|---|---|---|
| Same and Different | Regular | 69.21 | 86.32 | 98.42 | 100.00 |
| | Incorrect Conclusion | 78.68 | 85.68 | 99.55 | 99.77 |
| | Irrelevant Premise | 68.57 | 80.29 | 99.71 | 99.71 |
| | Irrelevant Incorrect | 67.71 | 79.43 | 99.71 | 99.71 |
| Same and Opposite | Regular | 66.07 | 88.03 | 85.74 | 87.38 |
| | Incorrect Conclusion | 47.31 | 39.08 | 96.85 | 99.31 |
| | Irrelevant Premise | 69.07 | 77.88 | 85.76 | 88.39 |
| | Irrelevant Incorrect | 38.73 | 41.36 | 95.68 | 98.81 |
| More Than and Less Than | Regular | 100.00 | 100.00 | 100.00 | 100.00 |
| | Incorrect Conclusion | 100.00 | 100.00 | 100.00 | 100.00 |
| | Irrelevant Premise | 100.00 | 98.13 | 99.38 | 100.00 |
| | Irrelevant Incorrect | 100.00 | 100.00 | 98.75 | 100.00 |
| Before and After | Regular | 98.89 | 99.44 | 100.00 | 100.00 |
| | Incorrect Conclusion | 100.00 | 100.00 | 100.00 | 99.46 |
| | Irrelevant Premise | 98.13 | 96.88 | 99.38 | 99.38 |
| | Irrelevant Incorrect | 100.00 | 100.00 | 98.75 | 99.38 |
| Contains and Is Part Of | Regular | 100.00 | 100.00 | 99.44 | 100.00 |
| | Incorrect Conclusion | 80.91 | 95.46 | 73.18 | 95.46 |
| | Irrelevant Premise | 100.00 | 100.00 | 98.75 | 100.00 |
| | Irrelevant Incorrect | 93.13 | 100.00 | 79.38 | 100.00 |
Appendix B.2. Deictic Responding

| Block | Problem Variant | LlaMa 3.3 70B | LlaMa 3.3 405B | GPT OSS 20B | GPT OSS 120B |
|---|---|---|---|---|---|
| Interpersonal | Regular | 95.00 | 75.00 | 95.00 | 90.00 |
| | Incorrect Conclusion | 95.00 | 100.00 | 65.00 | 80.00 |
| | Irrelevant Premise | 100.00 | 90.00 | 100.00 | 100.00 |
| | Irrelevant Incorrect | 95.00 | 100.00 | 75.00 | 90.00 |
| | Reversal | 90.00 | 80.00 | 90.00 | 100.00 |
| | Reversal Incorrect | 65.00 | 95.00 | 35.00 | 40.00 |
| | Reversal + Irrelevant | 95.00 | 60.00 | 80.00 | 95.00 |
| | Reversal Inc. + Irrelevant | 45.00 | 90.00 | 30.00 | 35.00 |
| Temporal | Regular | 95.00 | 75.00 | 100.00 | 100.00 |
| | Incorrect Conclusion | 100.00 | 100.00 | 100.00 | 100.00 |
| | Irrelevant Premise | 100.00 | 80.00 | 90.00 | 95.00 |
| | Irrelevant Incorrect | 100.00 | 100.00 | 100.00 | 100.00 |
| | Reversal | 95.00 | 95.00 | 90.00 | 100.00 |
| | Reversal Incorrect | 35.00 | 100.00 | 100.00 | 95.00 |
| | Reversal + Irrelevant | 95.00 | 100.00 | 95.00 | 100.00 |
| | Reversal Inc. + Irrelevant | 60.00 | 100.00 | 95.00 | 90.00 |
| Spatial | Regular | 100.00 | 100.00 | 100.00 | 100.00 |
| | Incorrect Conclusion | 100.00 | 100.00 | 90.00 | 100.00 |
| | Irrelevant Premise | 100.00 | 100.00 | 100.00 | 100.00 |
| | Irrelevant Incorrect | 100.00 | 100.00 | 90.00 | 100.00 |
| | Reversal | 100.00 | 95.00 | 100.00 | 100.00 |
| | Reversal Incorrect | 70.00 | 100.00 | 90.00 | 80.00 |
| | Reversal + Irrelevant | 100.00 | 100.00 | 95.00 | 95.00 |
| | Reversal Inc. + Irrelevant | 50.00 | 95.00 | 95.00 | 70.00 |
Appendix B.3. Analogy

| Block | Problem Variant | LlaMa 3.3 70B | LlaMa 3.3 405B | GPT OSS 20B | GPT OSS 120B |
|---|---|---|---|---|---|
| Same and Different | Regular | 93.33 | 100.00 | 100.00 | 100.00 |
| | Incorrect Conclusion | 86.67 | 40.00 | 93.33 | 100.00 |
| | Irrelevant Premise | 86.67 | 93.33 | 100.00 | 100.00 |
| | Irrelevant Incorrect | 93.33 | 40.00 | 90.00 | 100.00 |
| Same and Opposite | Regular | 75.5 | 100.00 | 97.50 | 97.50 |
| | Incorrect Conclusion | 62.50 | 55.00 | 90.00 | 95.00 |
| | Irrelevant Premise | 82.50 | 85.00 | 97.50 | 100.00 |
| | Irrelevant Incorrect | 57.50 | 55.00 | 100.00 | 95.00 |
| More Than and Less Than | Regular | 0.00 | 0.00 | 20.00 | 10.00 |
| | Incorrect Conclusion | 0.00 | 30.00 | 80.00 | 40.00 |
| | Irrelevant Premise | 5.00 | 15.00 | 55.00 | 25.00 |
| | Irrelevant Incorrect | 0.00 | 15.00 | 55.00 | 80.00 |
| Before and After | Regular | 25.00 | 30.00 | 55.00 | 60.00 |
| | Incorrect Conclusion | 5.00 | 75.00 | 70.00 | 70.00 |
| | Irrelevant Premise | 25.00 | 20.00 | 60.00 | 70.00 |
| | Irrelevant Incorrect | 5.00 | 20.00 | 80.00 | 90.00 |
| Contains and Is Part Of | Regular | 35.00 | 50.00 | 85.00 | 95.00 |
| | Incorrect Conclusion | 15.00 | 70.00 | 80.00 | 75.00 |
| | Irrelevant Premise | 15.00 | 75.00 | 85.00 | 100.00 |
| | Irrelevant Incorrect | 0.00 | 35.00 | 95.00 | 90.00 |
Appendix B.4. Transformation of Function

| Block | Problem Variant | LlaMa 3.3 70B | LlaMa 3.3 405B | GPT OSS 20B | GPT OSS 120B |
|---|---|---|---|---|---|
| Same and Different | Regular | 76.50 | 98.50 | 97.00 | 99.50 |
| | Incorrect Conclusion | 47.00 | 94.50 | 100.00 | 100.00 |
| | Irrelevant Premise | 75.50 | 93.50 | 99.00 | 99.50 |
| | Irrelevant Incorrect | 69.00 | 88.00 | 96.00 | 98.00 |
| Same and Opposite | Regular | 64.19 | 92.10 | 95.32 | 99.03 |
| | Incorrect Conclusion | 45.16 | 39.19 | 97.90 | 99.68 |
| | Irrelevant Premise | 62.10 | 85.32 | 92.74 | 98.07 |
| | Irrelevant Incorrect | 71.13 | 74.03 | 96.94 | 99.36 |
| More Than and Less Than | Regular | 100.00 | 100.00 | 100.00 | 100.00 |
| | Incorrect Conclusion | 100.00 | 100.00 | 99.00 | 100.00 |
| | Irrelevant Premise | 100.00 | 98.00 | 100.00 | 100.00 |
| | Irrelevant Incorrect | 86.00 | 99.00 | 94.00 | 100.00 |
| Before and After | Regular | 100.00 | 100.00 | 100.00 | 100.00 |
| | Incorrect Conclusion | 100.00 | 100.00 | 99.00 | 100.00 |
| | Irrelevant Premise | 100.00 | 100.00 | 99.00 | 99.00 |
| | Irrelevant Incorrect | 91.00 | 100.00 | 98.00 | 100.00 |
| Block | Number of Premises | LlaMa 3.3 70B | LlaMa 3.3 405B | GPT OSS 20B | GPT OSS 120B |
|---|---|---|---|---|---|
| Same and Different | One | 76.25 | 95.00 | 98.75 | 100.00 |
| | Two | 89.17 | 97.50 | 100.00 | 100.00 |
| | Three | 77.50 | 95.63 | 98.13 | 100.00 |
| | Four | 41.50 | 94.00 | 99.00 | 100.00 |
| | Five | 67.08 | 89.58 | 95.83 | 97.50 |
| Same and Opposite | One | 86.25 | 97.50 | 100.00 | 100.00 |
| | Two | 70.63 | 85.00 | 100.00 | 98.13 |
| | Three | 57.81 | 72.81 | 96.25 | 98.44 |
| | Four | 52.19 | 63.59 | 84.47 | 99.22 |
| | Five | 62.73 | 74.06 | 94.92 | 99.14 |
| More Than and Less Than | One | 93.75 | 97.50 | 100.00 | 100.00 |
| | Two | 98.75 | 100.00 | 100.00 | 100.00 |
| | Three | 100.00 | 100.00 | 100.00 | 100.00 |
| | Four | 100.00 | 100.00 | 98.75 | 100.00 |
| | Five | 100.00 | 98.75 | 92.50 | 100.00 |
| Before and After | One | 100.00 | 100.00 | 100.00 | 100.00 |
| | Two | 96.25 | 100.00 | 100.00 | 100.00 |
| | Three | 97.50 | 100.00 | 100.00 | 100.00 |
| | Four | 98.75 | 100.00 | 98.75 | 98.75 |
| | Five | 98.75 | 100.00 | 96.25 | 100.00 |
Appendix C
Appendix C.1. Effect of Model and Block (Relations) on Performance
| | Coeff. | Std. Err. | z | p |
|---|---|---|---|---|
| Intercept | −0.88 | 0.34 | −2.57 | 0.010 |
| Model (LlaMa 3.1 405B IT) | −0.72 | 0.16 | −4.41 | 0.000 |
| Model (GPT OSS 120B) | −5.35 | 0.59 | −9.01 | 0.000 |
| Model (GPT OSS 20B) | −4.14 | 0.39 | −10.52 | 0.000 |
| Block (Same–Opposite) | 0.67 | 0.36 | 1.90 | 0.062 |
| Block (More–Less) | −26.69 | 119.33 | −7.04 | 0.823 |
| Block (Before–After) | −4.09 | 0.58 | −3.81 | 0.000 |
| Block (Hierarchy) | −1.66 | 0.44 | −2.76 | 0.000 |
| Block (Deictic) | −1.00 | 0.36 | 1.86 | 0.006 |
| Block (Analogy) | 0.98 | 0.53 | 0.21 | 0.062 |
| Block (Transformation) | 0.07 | 0.34 | 2.50 | 0.831 |
| Model (LlaMa 3.1 405B IT) × Block (Same–Opposite) | 0.47 | 0.19 | 2.50 | 0.012 |
| Model (GPT OSS 120B) × Block (Same–Opposite) | 2.88 | 0.62 | 4.65 | 0.000 |
| Model (GPT OSS 20B) × Block (Same–Opposite) | 2.02 | 0.42 | 4.87 | 0.000 |
| Model (LlaMa 3.1 405B IT) × Block (More–Less) | 22.81 | 114.67 | 0.20 | 0.842 |
| Model (GPT OSS 120B) × Block (More–Less) | 5.35 | 119.99 | 0.05 | 0.964 |
| Model (GPT OSS 20B) × Block (More–Less) | 26.23 | 120.64 | 0.22 | 0.828 |
| Model (LlaMa 3.1 405B IT) × Block (Before–After) | 0.90 | 0.72 | 1.26 | 0.209 |
| Model (GPT OSS 120B) × Block (Before–After) | 4.84 | 0.99 | 4.90 | 0.000 |
| Model (GPT OSS 20B) × Block (Before–After) | 3.63 | 1.00 | 3.62 | 0.000 |
| Model (LlaMa 3.1 405B IT) × Block (Hierarchy) | −1.01 | 0.89 | −1.14 | 0.256 |
| Model (GPT OSS 120B) × Block (Hierarchy) | 3.62 | 1.06 | 3.43 | 0.001 |
| Model (GPT OSS 20B) × Block (Hierarchy) | 4.79 | 0.43 | 11.05 | 0.000 |
| Model (LlaMa 3.1 405B IT) × Block (Deictic) | 0.02 | 0.40 | 0.05 | 0.962 |
| Model (GPT OSS 120B) × Block (Deictic) | 5.05 | 0.77 | 6.59 | 0.000 |
| Model (GPT OSS 20B) × Block (Deictic) | 4.07 | 0.69 | 5.91 | 0.000 |
| Model (LlaMa 3.1 405B IT) × Block (Analogy) | 0.41 | 0.33 | 1.24 | 0.215 |
| Model (GPT OSS 120B) × Block (Analogy) | 3.60 | 0.69 | 5.25 | 0.000 |
| Model (GPT OSS 20B) × Block (Analogy) | 2.43 | 0.44 | 5.73 | 0.000 |
| Model (LlaMa 3.1 405B IT) × Block (Transformation) | 0.00 | 0.19 | 0.02 | 0.987 |
| Model (GPT OSS 120B) × Block (Transformation) | 1.28 | 0.63 | 2.01 | 0.044 |
| Model (GPT OSS 20B) × Block (Transformation) | 1.55 | 0.43 | 3.65 | 0.000 |
Appendix C.2. Effect of Model and Problem Complexity on Performance
| | Coeff. | Std. Err. | z | p |
|---|---|---|---|---|
| Intercept | 7.92 | 0.11 | 70.77 | 0.000 |
| Model (LlaMa 3.1 405B IT) | 1.12 | 0.16 | 7.11 | 0.000 |
| Model (GPT OSS 120B) | 4.66 | 0.16 | 29.48 | 0.000 |
| Model (GPT OSS 20B) | 2.45 | 0.16 | 45.46 | 0.000 |
| Complexity (2 Premises) | 2.07 | 0.13 | 16.00 | 0.000 |
| Complexity (3 Premises) | −6.59 | 0.18 | −37.17 | 0.000 |
| Complexity (4 Premises) | −7.45 | 0.15 | −49.03 | 0.000 |
| Complexity (5 Premises) | −7.80 | 0.14 | −53.981 | 0.000 |
| Model (LlaMa 3.1 405B IT) × Complexity (2 Premises) | 0.13 | 0.18 | 0.74 | 0.463 |
| Model (GPT OSS 120B) × Complexity (2 Premises) | 0.00 | 0.18 | 0.00 | 1.00 |
| Model (GPT OSS 20B) × Complexity (2 Premises) | −7.65 | 0.73 | −10.45 | 0.000 |
| Model (LlaMa 3.1 405B IT) × Complexity (3 Premises) | −0.11 | 0.29 | −0.38 | 0.705 |
| Model (GPT OSS 120B) × Complexity (3 Premises) | −0.22 | 1.03 | −0.21 | 0.834 |
| Model (GPT OSS 20B) × Complexity (3 Premises) | 1.98 | 1.02 | 1.95 | 0.051 |
| Model (LlaMa 3.1 405B IT) × Complexity (4 Premises) | −0.18 | 0.23 | −0.77 | 0.440 |
| Model (GPT OSS 120B) × Complexity (4 Premises) | 0.85 | 1.01 | 0.84 | 0.400 |
| Model (GPT OSS 20B) × Complexity (4 Premises) | 1.97 | 0.61 | 3.32 | 0.001 |
| Model (LlaMa 3.1 405B IT) × Complexity (5 Premises) | −0.45 | 0.21 | −2.18 | 0.029 |
| Model (GPT OSS 120B) × Complexity (5 Premises) | 1.39 | 1.02 | 1.37 | 0.171 |
| Model (GPT OSS 20B) × Complexity (5 Premises) | 2.21 | 0.53 | 4.16 | 0.000 |

| | Coeff. | Std. Err. | z | p |
|---|---|---|---|---|
| Intercept | 6.40 | 0.11 | 57–159 | 0.000 |
| Model (LlaMa 3.1 405B IT) | 2.39 | 0.16 | 15.13 | 0.000 |
| Model (GPT OSS 120B) | 3.05 | 0.16 | 19.30 | 0.000 |
| Model (GPT OSS 20B) | −2.00 | 1.05 | −1.91 | 0.056 |
| Complexity (2 Premises) | −4.83 | 0.19 | −26.00 | 0.000 |
| Complexity (3 Premises) | −6.01 | 0.39 | −43.55 | 0.000 |
| Complexity (4 Premises) | −6.26 | 0.13 | −49.99 | 0.000 |
| Complexity (5 Premises) | −6.40 | 0.12 | −53.89 | 0.000 |
| Model (LlaMa 3.1 405B IT) × Complexity (2 Premises) | −2.18 | 0.70 | −8.11 | 0.000 |
| Model (GPT OSS 120B) × Complexity (2 Premises) | −0.26 | 0.55 | −0.47 | 0.639 |
| Model (GPT OSS 20B) × Complexity (2 Premises) | 2.98 | 1.08 | 2.77 | 0.006 |
| Model (LlaMa 3.1 405B IT) × Complexity (3 Premises) | −2.08 | 0.20 | −10.61 | 0.000 |
| Model (GPT OSS 120B) × Complexity (3 Premises) | −1.16 | 0.22 | −5.18 | 0.000 |
| Model (GPT OSS 20B) × Complexity (3 Premises) | 3.70 | 1.06 | 3.51 | 0.000 |
| Model (LlaMa 3.1 405B IT) × Complexity (4 Premises) | −2.05 | 0.17 | −11.58 | 0.000 |
| Model (GPT OSS 120B) × Complexity (4 Premises) | −0.58 | 0.20 | −2.86 | 0.004 |
| Model (GPT OSS 20B) × Complexity (4 Premises) | 4.32 | 1.05 | 4.10 | 0.000 |
| Model (LlaMa 3.1 405B IT) × Complexity (5 Premises) | −2.18 | 0.19 | −13.00 | 0.000 |
| Model (GPT OSS 120B) × Complexity (5 Premises) | −0.8 | 0.18 | −2.05 | 0.040 |
| Model (GPT OSS 20B) × Complexity (5 Premises) | 4.27 | 1.05 | 4.08 | 0.000 |

| | Coeff. | Std. Err. | z | p |
|---|---|---|---|---|
| Intercept | 8.72 | 0.11 | 77.96 | 0.000 |
| Model (LlaMa 3.1 405B IT) | −5.46 | 0.61 | −8.98 | 0.000 |
| Model (GPT OSS 120B) | 1.84 | 0.16 | 11.61 | 0.000 |
| Model (GPT OSS 20B) | −0.12 | 0.16 | −0.78 | 0.437 |
| Complexity (2 Premises) | 1.33 | 0.14 | 9.73 | 0.000 |
| Complexity (3 Premises) | 4.98 | 0.14 | 36.37 | 0.000 |
| Complexity (4 Premises) | 4.52 | 0.14 | 11.10 | 0.000 |
| Complexity (5 Premises) | 4.98 | 0.14 | 36.37 | 0.000 |
| Model (LlaMa 3.1 405B IT) × Complexity (2 Premises) | 5.02 | 0.62 | 8.12 | 0.000 |
| Model (GPT OSS 120B) × Complexity (2 Premises) | 0.40 | 0.19 | 2.05 | 0.041 |
| Model (GPT OSS 20B) × Complexity (2 Premises) | −5.55 | 0.74 | −7.52 | 0.000 |
| Model (LlaMa 3.1 405B IT) × Complexity (3 Premises) | 1.47 | 0.62 | 2.38 | 0.017 |
| Model (GPT OSS 120B) × Complexity (3 Premises) | 0.00 | 0.19 | 0.00 | 1.00 |
| Model (GPT OSS 20B) × Complexity (3 Premises) | 0.00 | 0.19 | 0.00 | 1.00 |
| Model (LlaMa 3.1 405B IT) × Complexity (4 Premises) | 5.15 | 0.62 | 8.33 | 0.000 |
| Model (GPT OSS 120B) × Complexity (4 Premises) | 0.00 | 0.19 | 0.00 | 1.00 |
| Model (GPT OSS 20B) × Complexity (4 Premises) | −5.04 | 1.03 | −4.90 | 0.000 |
| Model (LlaMa 3.1 405B IT) × Complexity (5 Premises) | 1.47 | 0.62 | 2.38 | 0.017 |
| Model (GPT OSS 120B) × Complexity (5 Premises) | 0.00 | 0.19 | 0.00 | 1.00 |
| Model (GPT OSS 20B) × Complexity (5 Premises) | 0.00 | 0.19 | 0.00 | 1.00 |

| | Coeff. | Std. Err. | z | p |
|---|---|---|---|---|
| Intercept | 3.26 | 0.60 | 5.48 | 0.000 |
| Model (LlaMa 3.1 405B IT) | −0.74 | 0.73 | −1.01 | 0.314 |
| Model (GPT OSS 120B) | 5.09 | 0.61 | 8.42 | 0.000 |
| Model (GPT OSS 20B) | 5.13 | 0.61 | 8.48 | 0.000 |
| Complexity (2 Premises) | 1.83 | 1.18 | 1.55 | 0.121 |
| Complexity (3 Premises) | 1.83 | 1.18 | 1.55 | 0.121 |
| Complexity (4 Premises) | 5.28 | 0.60 | 8.80 | 0.000 |
| Complexity (5 Premises) | 5.34 | 0.60 | 8.89 | 0.000 |
| Model (LlaMa 3.1 405B IT) × Complexity (2 Premises) | 5.15 | 1.26 | 4.10 | 0.000 |
| Model (GPT OSS 120B) × Complexity (2 Premises) | 0.00 | 1.19 | 0.00 | 1.00 |
| Model (GPT OSS 20B) × Complexity (2 Premises) | 0.00 | 1.19 | 0.00 | 1.00 |
| Model (LlaMa 3.1 405B IT) × Complexity (3 Premises) | 5.15 | 1.26 | 4.01 | 0.000 |
| Model (GPT OSS 120B) × Complexity (3 Premises) | 0.00 | 1.19 | 0.00 | 1.00 |
| Model (GPT OSS 20B) × Complexity (3 Premises) | 0.00 | 1.19 | 0.00 | 1.00 |
| Model (LlaMa 3.1 405B IT) × Complexity (4 Premises) | 3.19 | 0.74 | 4.31 | 0.000 |
| Model (GPT OSS 120B) × Complexity (4 Premises) | −8.55 | 1.18 | −7.22 | 0.000 |
| Model (GPT OSS 20B) × Complexity (4 Premises) | −9.29 | 0.94 | −9.88 | 0.000 |
| Model (LlaMa 3.1 405B IT) × Complexity (5 Premises) | 3.19 | 0.74 | 4.31 | 0.000 |
| Model (GPT OSS 120B) × Complexity (5 Premises) | −9.31 | 0.94 | −9.90 | 0.000 |
| Model (GPT OSS 20B) × Complexity (5 Premises) | 8.64 | 1.18 | −7.31 | 0.000 |

| | Coeff. | Std. Err. | z | p |
|---|---|---|---|---|
| Intercept | 6.80 | 0.11 | 60.74 | 0.000 |
| Model (LlaMa 3.1 405B IT) | 5.34 | 0.16 | 33.77 | 0.000 |
| Model (GPT OSS 120B) | 5.26 | 0.16 | 33.27 | 0.000 |
| Model (GPT OSS 20B) | −2.38 | 1.06 | −2.24 | 0.025 |
| Complexity (2 Premises) | −4.19 | 0.33 | −12.59 | 0.000 |
| Complexity (3 Premises) | −4.91 | 0.26 | −18.92 | 0.000 |
| Complexity (4 Premises) | −4.19 | 0.33 | −12.59 | 0.000 |
| Complexity (5 Premises) | −4.09 | 0.35 | −11.81 | 0.000 |
| Model (LlaMa 3.1 405B IT) × Complexity (2 Premises) | 1.97 | 0.36 | 5.47 | 0.000 |
| Model (GPT OSS 120B) × Complexity (2 Premises) | 1.92 | 0.36 | 5.33 | 0.000 |
| Model (GPT OSS 20B) × Complexity (2 Premises) | 1.78 | 1.13 | 1.57 | 0.117 |
| Model (LlaMa 3.1 405B IT) × Complexity (3 Premises) | −4.52 | 0.43 | −10.49 | 0.000 |
| Model (GPT OSS 120B) × Complexity (3 Premises) | −4.44 | 0.43 | −10.28 | 0.000 |
| Model (GPT OSS 20B) × Complexity (3 Premises) | 1.84 | 1.10 | 1.67 | 0.096 |
| Model (LlaMa 3.1 405B IT) × Complexity (4 Premises) | 1.97 | 0.36 | 5.47 | 0.000 |
| Model (GPT OSS 120B) × Complexity (4 Premises) | 1.92 | 0.36 | 5.33 | 0.000 |
| Model (GPT OSS 20B) × Complexity (4 Premises) | 1.61 | 1.13 | 1.42 | 0.155 |
| Model (LlaMa 3.1 405B IT) × Complexity (5 Premises) | 1.87 | 0.37 | 5.01 | 0.000 |
| Model (GPT OSS 120B) × Complexity (5 Premises) | 1.81 | 0.37 | 4.87 | 0.000 |
| Model (GPT OSS 20B) × Complexity (5 Premises) | 1.61 | 1.14 | 1.42 | 0.155 |
Appendix C.3. Effect of Model and Problem Variations on Performance
| | Coeff. | Std. Err. | z | p |
|---|---|---|---|---|
| Intercept | 1.14 | 0.11 | 10.22 | 0.000 |
| Model (LlaMa 3.1 405B IT) | 0.65 | 0.18 | 3.72 | 0.000 |
| Model (GPT OSS 120B) | 4.96 | 1.02 | 4.88 | 0.000 |
| Model (GPT OSS 20B) | 4.26 | 0.72 | 5.89 | 0.000 |
| Variant (Irrelevant) | −0.35 | 0.16 | −2.22 | 0.027 |
| Variant (Irrelevant Incorrect) | −0.39 | 0.16 | −2.47 | 0.014 |
| Variant (Regular) | −0.32 | 0.16 | −2.07 | 0.039 |
| Model (LlaMa 3.1 405B IT) × Variant (Irrelevant) | −0.03 | 0.25 | −0.12 | 0.906 |
| Model (GPT OSS 120B) × Variant (Irrelevant) | 0.11 | 1.42 | 0.08 | 0.938 |
| Model (GPT OSS 20B) × Variant (Irrelevant) | 0.82 | 1.24 | 0.66 | 0.512 |
| Model (LlaMa 3.1 405B IT) × Variant (Irr. Incorrect) | −0.04 | 0.25 | −0.18 | 0.859 |
| Model (GPT OSS 120B) × Variant (Irr. Incorrect) | 0.14 | 1.42 | 0.10 | 0.923 |
| Model (GPT OSS 20B) × Variant (Irr. Incorrect) | 0.83 | 1.22 | 0.68 | 0.498 |
| Model (LlaMa 3.1 405B IT) × Variant (Regular) | 0.38 | 0.26 | 1.48 | 0.140 |
| Model (GPT OSS 120B) × Variant (Regular) | 5.18 | 1.02 | 5.07 | 0.000 |
| Model (GPT OSS 20B) × Variant (Regular) | −0.94 | 0.84 | −1.12 | 0.262 |

| | Coeff. | Std. Err. | z | p |
|---|---|---|---|---|
| Intercept | −0.11 | 0.06 | −1.94 | 0.052 |
| Model (LlaMa 3.1 405B IT) | −0.34 | 0.08 | −4.23 | 0.000 |
| Model (GPT OSS 120B) | 5.07 | 0.34 | 14.96 | 0.000 |
| Model (GPT OSS 20B) | 3.53 | 0.17 | 21.01 | 0.000 |
| Variant (Irrelevant) | 0.91 | 0.08 | 10.85 | 0.000 |
| Variant (Irrelevant Incorrect) | −0.35 | 0.08 | −4.30 | 0.000 |
| Variant (Regular) | 0.77 | 0.08 | 9.43 | 0.000 |
| Model (LlaMa 3.1 405B IT) × Variant (Irrelevant) | 0.79 | 0.12 | 6.42 | 0.000 |
| Model (GPT OSS 120B) × Variant (Irrelevant) | −3.85 | 0.38 | −10.79 | 0.000 |
| Model (GPT OSS 20B) × Variant (Irrelevant) | −2.54 | 0.20 | −12.83 | 0.000 |
| Model (LlaMa 3.1 405B IT) × Variant (Irr. Incorrect) | 0.45 | 0.12 | 3.85 | 0.000 |
| Model (GPT OSS 120B) × Variant (Irr. Incorrect) | −0.19 | 0.44 | −0.44 | 0.659 |
| Model (GPT OSS 20B) × Variant (Irr. Incorrect) | 0.02 | 0.23 | 0.10 | 0.918 |
| Model (LlaMa 3.1 405B IT) × Variant (Regular) | 1.67 | 0.13 | 12.50 | 0.000 |
| Model (GPT OSS 120B) × Variant (Regular) | −3.81 | 0.36 | −10.72 | 0.000 |
| Model (GPT OSS 20B) × Variant (Regular) | −2.41 | 0.20 | −12.24 | 0.000 |

| | Coeff. | Std. Err. | z | p |
|---|---|---|---|---|
| Intercept | 11.97 | 0.07 | 177.59 | 0.000 |
| Model (LlaMa 3.1 405B IT) | 0.00 | 0.10 | 0.00 | 1.00 |
| Model (GPT OSS 120B) | 0.79 | 0.10 | 8.29 | 0.000 |
| Model (GPT OSS 20B) | −2.30 | 0.10 | −24.12 | 0.000 |
| Variant (Irrelevant) | −2.64 | 0.10 | −25.45 | 0.000 |
| Variant (Irrelevant Incorrect) | −0.23 | 0.10 | −2.17 | 0.030 |
| Variant (Regular) | 0.63 | 0.10 | 6.23 | 0.000 |
| Model (LlaMa 3.1 405B IT) × Variant (Irrelevant) | −5.37 | 0.60 | −9.00 | 0.000 |
| Model (GPT OSS 120B) × Variant (Irrelevant) | 0.00 | 0.15 | 0.00 | 1.00 |
| Model (GPT OSS 20B) × Variant (Irrelevant) | −1.94 | 1.03 | −1.89 | 0.059 |
| Model (LlaMa 3.1 405B IT) × Variant (Irr. Incorrect) | 0.04 | 0.15 | 0.27 | 0.785 |
| Model (GPT OSS 120B) × Variant (Irr. Incorrect) | 0.00 | 0.15 | 0.00 | 1.00 |
| Model (GPT OSS 20B) × Variant (Irr. Incorrect) | −5.07 | 0.73 | −6.98 | 0.000 |
| Model (LlaMa 3.1 405B IT) × Variant (Regular) | 0.00 | 0.14 | 0.00 | 1.00 |
| Model (GPT OSS 120B) × Variant (Regular) | 0.00 | 0.14 | 0.00 | 1.00 |
| Model (GPT OSS 20B) × Variant (Regular) | 0.36 | 0.14 | 2.54 | 0.011 |

| | Coeff. | Std. Err. | z | p |
|---|---|---|---|---|
| Intercept | 9.02 | 0.07 | 133.82 | 0.000 |
| Model (LlaMa 3.1 405B IT) | 0.70 | 0.10 | 7.33 | 0.000 |
| Model (GPT OSS 120B) | −3.62 | 1.02 | −3.56 | 0.000 |
| Model (GPT OSS 20B) | 1.24 | 0.10 | 13.05 | 0.000 |
| Variant (Irrelevant) | −5.06 | 0.59 | −8.60 | 0.000 |
| Variant (Irrelevant Incorrect) | 0.00 | 0.10 | 0.00 | 1.00 |
| Variant (Regular) | −4.52 | 0.72 | −6.27 | 0.000 |
| Model (LlaMa 3.1 405B IT) × Variant (Irrelevant) | −1.22 | 0.75 | −1.64 | 0.101 |
| Model (GPT OSS 120B) × Variant (Irrelevant) | 4.72 | 1.54 | 3.07 | 0.002 |
| Model (GPT OSS 20B) × Variant (Irrelevant) | −0.13 | 1.17 | −0.11 | 0.909 |
| Model (LlaMa 3.1 405B IT) × Variant (Irr. Incorrect) | 0.05 | 0.15 | 0.32 | 0.746 |
| Model (GPT OSS 120B) × Variant (Irr. Incorrect) | −0.32 | 1.44 | −0.22 | 0.827 |
| Model (GPT OSS 20B) × Variant (Irr. Incorrect) | −5.89 | 0.73 | −8.11 | 0.000 |
| Model (LlaMa 3.1 405B IT) × Variant (Regular) | −0.00 | 1.24 | 0.00 | 0.998 |
| Model (GPT OSS 120B) × Variant (Regular) | 9.40 | 1.25 | 7.53 | 0.000 |
| Model (GPT OSS 20B) × Variant (Regular) | 4.10 | 0.73 | 5.63 | 0.000 |

| | Coeff. | Std. Err. | z | p |
|---|---|---|---|---|
| Intercept | 1.45 | 0.17 | 8.42 | 0.000 |
| Model (LlaMa 3.1 405B IT) | 1.60 | 0.37 | 4.37 | 0.000 |
| Model (GPT OSS 120B) | 1.60 | 0.37 | 4.37 | 0.000 |
| Model (GPT OSS 20B) | −0.44 | 0.23 | −1.93 | 0.054 |
| Variant (Irrelevant) | 8.04 | 0.19 | 42.57 | 0.000 |
| Variant (Irrelevant Incorrect) | 1.16 | 0.36 | 3.26 | 0.001 |
| Variant (Regular) | 7.57 | 0.29 | 40.43 | 0.000 |
| Model (LlaMa 3.1 405B IT) × Variant (Irrelevant) | 0.00 | 0.38 | 0.00 | 1.00 |
| Model (GPT OSS 120B) × Variant (Irrelevant) | 0.00 | 0.38 | 0.00 | 1.00 |
| Model (GPT OSS 20B) × Variant (Irrelevant) | −4.67 | 0.76 | −6.19 | 0.000 |
| Model (LlaMa 3.1 405B IT) × Variant (Irr. Incorrect) | 5.79 | 0.49 | 11.87 | 0.000 |
| Model (GPT OSS 120B) × Variant (Irr. Incorrect) | 5.79 | 0.49 | 11.87 | 0.000 |
| Model (GPT OSS 20B) × Variant (Irr. Incorrect) | −0.82 | 0.43 | −1.88 | 0.060 |
| Model (LlaMa 3.1 405B IT) × Variant (Regular) | 0.00 | 0.38 | 0.00 | 1.00 |
| Model (GPT OSS 120B) × Variant (Regular) | 0.00 | 0.38 | 0.00 | 1.00 |
| Model (GPT OSS 20B) × Variant (Regular) | −3.37 | 1.05 | −3.21 | 0.001 |
Appendix C.4. Effect of Model and Problem Variations on Performance in Analogy Blocks (By Relation)
| | Coeff. | Std. Err. | z | p |
|---|---|---|---|---|
| Intercept | 2.64 | 0.73 | 3.61 | 0.000 |
| Model (LlaMa 3.1 405B IT) | 11.23 | 0.75 | 14.88 | 0.000 |
| Model (GPT OSS 120B) | 14.78 | 0.75 | 19.60 | 0.000 |
| Model (GPT OSS 20B) | 12.14 | 0.75 | 16.09 | 0.000 |
| Variant (Irrelevant) | −0.77 | 0.91 | −0.85 | 0.398 |
| Variant (Incorrect) | −0.77 | 0.91 | −0.85 | 0.397 |
| Variant (Irrelevant Incorrect) | 0.00 | 1.04 | 0.00 | 1.00 |
| Model (LlaMa 3.1 405B IT) × Variant (Irrelevant) | −10.46 | 1.18 | −8.87 | 0.000 |
| Model (GPT OSS 120B) × Variant (Irrelevant) | 2.59 | 0.94 | 2.74 | 0.006 |
| Model (GPT OSS 20B) × Variant (Irrelevant) | 8.94 | 0.94 | 9.47 | 0.000 |
| Model (LlaMa 3.1 405B IT) × Variant (Incorrect) | −13.50 | 1.00 | −13.53 | 0.000 |
| Model (GPT OSS 120B) × Variant (Incorrect) | 2.60 | 0.94 | 2.76 | 0.006 |
| Model (GPT OSS 20B) × Variant (Incorrect) | −11.37 | 1.18 | −9.64 | 0.000 |
| Model (LlaMa 3.1 405B IT) × Variant (Irr. Incorrect) | −14.27 | 1.12 | −12.80 | 0.000 |
| Model (GPT OSS 120B) × Variant (Irr. Incorrect) | 1.22 | 1.07 | 1.14 | 0.253 |
| Model (GPT OSS 20B) × Variant (Irr. Incorrect) | −12.58 | 1.22 | −10.36 | 0.000 |

| | Coeff. | Std. Err. | z | p |
|---|---|---|---|---|
| Intercept | 1.24 | 0.38 | 3.27 | 0.001 |
| Model (LlaMa 3.1 405B IT) | 11.98 | 0.41 | 29.20 | 0.000 |
| Model (GPT OSS 120B) | 2.43 | 1.08 | 2.24 | 0.025 |
| Model (GPT OSS 20B) | 2.43 | 1.08 | 2.25 | 0.025 |
| Variant (Irrelevant) | 0.32 | 0.56 | 0.56 | 0.574 |
| Variant (Incorrect) | −0.72 | 0.50 | −1.45 | 0.147 |
| Variant (Irrelevant Incorrect) | −0.93 | 0.50 | −1.89 | 0.059 |
| Model (LlaMa 3.1 405B IT) × Variant (Irrelevant) | −11.79 | 0.73 | −16.09 | 0.000 |
| Model (GPT OSS 120B) × Variant (Irrelevant) | 10.80 | 1.17 | 9.22 | 0.000 |
| Model (GPT OSS 20B) × Variant (Irrelevant) | −0.32 | 1.54 | −0.21 | 0.836 |
| Model (LlaMa 3.1 405B IT) × Variant (Incorrect) | −12.29 | 0.61 | −20.04 | 0.000 |
| Model (GPT OSS 120B) × Variant (Incorrect) | 0.00 | 1.34 | 0.00 | 1.00 |
| Model (GPT OSS 20B) × Variant (Incorrect) | −0.74 | 1.24 | −0.59 | 0.553 |
| Model (LlaMa 3.1 405B IT) × Variant (Irr. Incorrect) | −12.08 | 0.61 | −19.81 | 0.000 |
| Model (GPT OSS 120B) × Variant (Irr. Incorrect) | 0.21 | 1.34 | 0.16 | 0.873 |
| Model (GPT OSS 20B) × Variant (Irr. Incorrect) | 13.53 | 1.14 | 11.91 | 0.000 |

| | Coeff. | Std. Err. | z | p |
|---|---|---|---|---|
| Intercept | −16.41 | 0.18 | −90.55 | 0.000 |
| Model (LlaMa 3.1 405B IT) | 0.79 | 0.29 | 2.79 | 0.005 |
| Model (GPT OSS 120B) | 14.21 | 0.77 | 18.51 | 0.000 |
| Model (GPT OSS 20B) | 15.02 | 0.59 | 25.59 | 0.000 |
| Variant (Irrelevant) | 13.47 | 1.04 | 12.92 | 0.000 |
| Variant (Incorrect) | −2.57 | 0.65 | −3.95 | 0.000 |
| Variant (Irrelevant Incorrect) | −2.53 | 0.00 | 0.00 | 1.00 |
| Model (LlaMa 3.1 405B IT) × Variant (Irrelevant) | 0.41 | 1.23 | 0.34 | 0.738 |
| Model (GPT OSS 120B) × Variant (Irrelevant) | −12.37 | 1.38 | −8.96 | 0.000 |
| Model (GPT OSS 20B) × Variant (Irrelevant) | −11.88 | 1.26 | −9.40 | 0.000 |
| Model (LlaMa 3.1 405B IT) × Variant (Incorrect) | 17.34 | 0.82 | 21.13 | 0.000 |
| Model (GPT OSS 120B) × Variant (Incorrect) | 4.36 | 1.06 | 4.12 | 0.000 |
| Model (GPT OSS 20B) × Variant (Incorrect) | 5.34 | 1.02 | 5.24 | 0.000 |
| Model (LlaMa 3.1 405B IT) × Variant (Irr. Incorrect) | 16.41 | 0.62 | 26.47 | 0.000 |
| Model (GPT OSS 120B) × Variant (Irr. Incorrect) | 6.11 | 0.88 | 6.96 | 0.000 |
| Model (GPT OSS 20B) × Variant (Irr. Incorrect) | 4.12 | 0.69 | 5.98 | 0.000 |

| | Coeff. | Std. Err. | z | p |
|---|---|---|---|---|
| Intercept | −1.10 | 0.52 | −2.13 | 0.033 |
| Model (LlaMa 3.1 405B IT) | 0.25 | 0.71 | 0.35 | 0.724 |
| Model (GPT OSS 120B) | 1.50 | 0.69 | 2.18 | 0.029 |
| Model (GPT OSS 20B) | 1.30 | 0.69 | 1.90 | 0.058 |
| Variant (Irrelevant) | 0.00 | 0.73 | 0.00 | 1.00 |
| Variant (Incorrect) | −1.85 | 1.15 | −1.61 | 0.108 |
| Variant (Irrelevant Incorrect) | −1.85 | 1.15 | −1.61 | 0.108 |
| Model (LlaMa 3.1 405B IT) × Variant (Irrelevant) | −0.54 | 1.04 | −0.52 | 0.605 |
| Model (GPT OSS 120B) × Variant (Irrelevant) | 0.44 | 0.99 | 0.45 | 0.655 |
| Model (GPT OSS 20B) × Variant (Irrelevant) | 0.20 | 0.97 | 0.21 | 0.833 |
| Model (LlaMa 3.1 405B IT) × Variant (Incorrect) | 3.79 | 1.35 | 2.81 | 0.005 |
| Model (GPT OSS 120B) × Variant (Incorrect) | 2.29 | 1.33 | 1.72 | 0.085 |
| Model (GPT OSS 20B) × Variant (Incorrect) | 2.49 | 1.33 | 1.88 | 0.060 |
| Model (LlaMa 3.1 405B IT) × Variant (Irr. Incorrect) | 1.31 | 1.37 | 0.96 | 0.339 |
| Model (GPT OSS 120B) × Variant (Irr. Incorrect) | 3.64 | 1.44 | 2.52 | 0.012 |
| Model (GPT OSS 20B) × Variant (Irr. Incorrect) | 3.03 | 1.35 | 2.24 | 0.025 |

| | Coeff. | Std. Err. | z | p |
|---|---|---|---|---|
| Intercept | −0.62 | 0.47 | −1.32 | 0.187 |
| Model (LlaMa 3.1 405B IT) | 0.62 | 0.65 | 0.96 | 0.339 |
| Model (GPT OSS 120B) | 3.57 | 1.13 | 3.15 | 0.002 |
| Model (GPT OSS 20B) | 2.35 | 0.78 | 3.01 | 0.003 |
| Variant (Irrelevant) | −1.12 | 0.78 | −1.43 | 0.154 |
| Variant (Incorrect) | −1.11 | 0.78 | −1.43 | 0.154 |
| Variant (Irrelevant Incorrect) | −13.45 | 0.52 | −25.89 | 0.000 |
| Model (LlaMa 3.1 405B IT) × Variant (Irrelevant) | 2.21 | 1.04 | 2.13 | 0.033 |
| Model (GPT OSS 120B) × Variant (Irrelevant) | 13.57 | 1.31 | 10.35 | 0.000 |
| Model (GPT OSS 20B) × Variant (Irrelevant) | 1.12 | 1.18 | 0.94 | 0.345 |
| Model (LlaMa 3.1 405B IT) × Variant (Incorrect) | 1.96 | 1.02 | 1.91 | 0.056 |
| Model (GPT OSS 120B) × Variant (Incorrect) | −0.74 | 1.39 | −0.53 | 0.597 |
| Model (GPT OSS 20B) × Variant (Incorrect) | 0.77 | 1.15 | 0.67 | 0.504 |
| Model (LlaMa 3.1 405B IT) × Variant (Irr. Incorrect) | 12.83 | 0.83 | 15.45 | 0.000 |
| Model (GPT OSS 120B) × Variant (Irr. Incorrect) | 12.70 | 1.37 | 9.25 | 0.000 |
| Model (GPT OSS 20B) × Variant (Irr. Incorrect) | 14.66 | 1.31 | 11.20 | 0.000 |
Appendix D
Appendix D.1. Effect of Problem Complexity and Variations When Premise Order Is Randomized

| Block | Number of Premises | LlaMa 3.3 70B | LlaMa 3.3 405B | GPT OSS 20B | GPT OSS 120B |
|---|---|---|---|---|---|
| Same and Different | One | 100.00 | 98.75 | 100.00 | 90.00 |
| | Two | 97.50 | 100.00 | 98.75 | 90.83 |
| | Three | 86.56 | 97.81 | 98.75 | 88.13 |
| | Four | 71.50 | 87.50 | 100.00 | 91.00 |
| | Five | 65.63 | 78.33 | 99.58 | 95.83 |
| Same and Opposite | One | 100.00 | 100.00 | 100.00 | 96.25 |
| | Two | 82.81 | 87.50 | 93.75 | 93.13 |
| | Three | 59.70 | 67.66 | 89.84 | 87.03 |
| | Four | 54.38 | 55.39 | 90.00 | 88.13 |
| | Five | 49.96 | 52.15 | 88.20 | 88.83 |
| More Than and Less Than | One | 98.75 | 97.50 | 100.00 | 100.00 |
| | Two | 100.00 | 100.00 | 100.00 | 95.63 |
| | Three | 100.00 | 100.00 | 98.75 | 93.13 |
| | Four | 98.75 | 100.00 | 100.00 | 91.88 |
| | Five | 98.13 | 100.00 | 97.50 | 92.50 |
| Before and After | One | 97.50 | 98.75 | 100.00 | 91.25 |
| | Two | 100.00 | 100.00 | 99.38 | 96.88 |
| | Three | 98.13 | 99.38 | 100.00 | 100.00 |
| | Four | 97.50 | 97.50 | 98.75 | 92.50 |
| | Five | 96.88 | 99.38 | 100.00 | 89.38 |
| Contains and Is Part Of | One | 98.75 | 100.00 | 98.75 | 88.75 |
| | Two | 96.25 | 100.00 | 87.50 | 92.50 |
| | Three | 89.38 | 92.50 | 84.38 | 88.75 |
| | Four | 89.38 | 98.13 | 85.63 | 93.75 |
| | Five | 86.88 | 93.75 | 83.75 | 98.75 |

| Block | Problem Variant | LlaMa 3.3 70B | LlaMa 3.1 405B | GPT OSS 20B | GPT OSS 120B |
|---|---|---|---|---|---|
| Same and Different | Regular | 86.58 | 91.84 | 99.74 | 91.84 |
| | Incorrect Conclusion | 75.91 | 90.68 | 99.32 | 90.46 |
| | Irrelevant Premise | 88.57 | 91.43 | 99.43 | 92.29 |
| | Irrelevant Incorrect | 62.57 | 82.86 | 99.14 | 93.14 |
| Same and Opposite | Regular | 75.25 | 84.10 | 85.33 | 85.57 |
| | Incorrect Conclusion | 39.46 | 36.46 | 94.46 | 93.00 |
| | Irrelevant Premise | 76.02 | 71.78 | 85.85 | 83.98 |
| | Irrelevant Incorrect | 31.70 | 41.53 | 91.78 | 92.37 |
| More Than and Less Than | Regular | 100.00 | 100.00 | 99.44 | 95.00 |
| | Incorrect Conclusion | 99.09 | 100.00 | 99.09 | 95.91 |
| | Irrelevant Premise | 98.13 | 98.75 | 99.38 | 92.50 |
| | Irrelevant Incorrect | 99.38 | 100.00 | 98.75 | 91.88 |
| Before and After | Regular | 98.89 | 99.44 | 100.00 | 95.56 |
| | Incorrect Conclusion | 97.73 | 99.09 | 99.09 | 94.55 |
| | Irrelevant Premise | 98.13 | 98.75 | 100.00 | 93.13 |
| | Irrelevant Incorrect | 97.50 | 98.75 | 99.38 | 93.75 |
| Contains and Is Part Of | Regular | 95.00 | 95.56 | 96.11 | 94.44 |
| | Incorrect Conclusion | 88.64 | 95.00 | 75.46 | 87.73 |
| | Irrelevant Premise | 88.13 | 96.25 | 96.25 | 95.00 |
| | Irrelevant Incorrect | 94.38 | 100.00 | 82.50 | 96.25 |
Appendix D.2. Deictic Responding

| Block | Problem Variant | LlaMa 3.3 70B | LlaMa 3.1 405B | GPT OSS 20B | GPT OSS 120B |
|---|---|---|---|---|---|
| Interpersonal | Regular | 100.00 | 70.00 | 100.00 | 90.00 |
| | Incorrect Conclusion | 95.00 | 100.00 | 70.00 | 95.00 |
| | Irrelevant Premise | 100.00 | 85.00 | 100.00 | 100.00 |
| | Irrelevant Incorrect | 95.00 | 100.00 | 80.00 | 85.00 |
| | Reversal | 80.00 | 85.00 | 85.00 | 100.00 |
| | Reversal Incorrect | 60.00 | 95.00 | 50.00 | 35.00 |
| | Reversal + Irrelevant | 75.00 | 85.00 | 90.00 | 95.00 |
| | Reversal Inc. + Irrelevant | 50.00 | 100.00 | 55.00 | 40.00 |
| Temporal | Regular | 100.00 | 80.00 | 100.00 | 100.00 |
| | Incorrect Conclusion | 100.00 | 100.00 | 95.00 | 100.00 |
| | Irrelevant Premise | 100.00 | 75.00 | 85.00 | 95.00 |
| | Irrelevant Incorrect | 100.00 | 100.00 | 100.00 | 100.00 |
| | Reversal | 100.00 | 95.00 | 100.00 | 100.00 |
| | Reversal Incorrect | 35.00 | 95.00 | 90.00 | 100.00 |
| | Reversal + Irrelevant | 95.00 | 100.00 | 100.00 | 100.00 |
| | Reversal Inc. + Irrelevant | 55.00 | 95.00 | 90.00 | 100.00 |
| Spatial | Regular | 100.00 | 100.00 | 100.00 | 90.00 |
| | Incorrect Conclusion | 100.00 | 100.00 | 90.00 | 95.00 |
| | Irrelevant Premise | 100.00 | 100.00 | 100.00 | 95.00 |
| | Irrelevant Incorrect | 100.00 | 100.00 | 75.00 | 90.00 |
| | Reversal | 100.00 | 100.00 | 100.00 | 90.00 |
| | Reversal Incorrect | 55.00 | 100.00 | 100.00 | 80.00 |
| | Reversal + Irrelevant | 100.00 | 100.00 | 100.00 | 100.00 |
| | Reversal Inc. + Irrelevant | 35.00 | 100.00 | 95.00 | 75.00 |
Appendix D.3. Analogy

| Block | Problem Variant | LlaMa 3.3 70B | LlaMa 3.1 405B | GPT OSS 20B | GPT OSS 120B |
|---|---|---|---|---|---|
| Same and Different | Regular | 86.67 | 100.00 | 96.67 | 93.33 |
| | Incorrect Conclusion | 99.67 | 56.67 | 90.00 | 90.00 |
| | Irrelevant Premise | 90.00 | 93.33 | 83.33 | 93.33 |
| | Irrelevant Incorrect | 96.67 | 50.00 | 96.67 | 90.00 |
| Same and Opposite | Regular | 67.50 | 95.00 | 97.50 | 97.50 |
| | Incorrect Conclusion | 60.00 | 57.50 | 90.00 | 95.00 |
| | Irrelevant Premise | 77.50 | 77.50 | 95.00 | 95.00 |
| | Irrelevant Incorrect | 57.50 | 50.00 | 92.50 | 92.50 |
| More Than and Less Than | Regular | 5.00 | 5.00 | 15.00 | 5.00 |
| | Incorrect Conclusion | 0.00 | 20.00 | 65.00 | 35.00 |
| | Irrelevant Premise | 5.00 | 15.00 | 40.00 | 30.00 |
| | Irrelevant Incorrect | 0.00 | 0.00 | 75.00 | 80.00 |
| Before and After | Regular | 40.00 | 35.00 | 55.00 | 75.00 |
| | Incorrect Conclusion | 0.00 | 60.00 | 35.00 | 65.00 |
| | Irrelevant Premise | 25.00 | 35.00 | 60.00 | 65.00 |
| | Irrelevant Incorrect | 0.00 | 40.00 | 65.00 | 85.00 |
| Contains and Is Part Of | Regular | 40.00 | 85.00 | 70.00 | 70.00 |
| | Incorrect Conclusion | 5.00 | 35.00 | 75.00 | 70.00 |
| | Irrelevant Premise | 55.00 | 80.00 | 85.00 | 90.00 |
| | Irrelevant Incorrect | 0.00 | 20.00 | 75.00 | 95.00 |
Appendix D.4. Transformation of Function

| Block | Problem Variant | LlaMa 3.3 70B | LlaMa 3.1 405B | GPT OSS 20B | GPT OSS 120B |
|---|---|---|---|---|---|
| Same and Different | Regular | 92.50 | 82.50 | 99.50 | 90.00 |
| | Incorrect Conclusion | 52.50 | 79.50 | 100.00 | 91.50 |
| | Irrelevant Premise | 92.00 | 80.00 | 99.50 | 92.00 |
| | Irrelevant Incorrect | 52.50 | 77.00 | 90.00 | 86.00 |
| Same and Opposite | Regular | 70.00 | 74.84 | 90.16 | 93.07 |
| | Incorrect Conclusion | 39.84 | 42.10 | 96.13 | 93.23 |
| | Irrelevant Premise | 70.48 | 70.00 | 86.29 | 91.77 |
| | Irrelevant Incorrect | 43.07 | 62.42 | 87.42 | 91.61 |
| More Than and Less Than | Regular | 100.00 | 100.00 | 99.00 | 92.00 |
| | Incorrect Conclusion | 94.00 | 99.00 | 100.00 | 95.00 |
| | Irrelevant Premise | 100.00 | 99.00 | 98.00 | 94.00 |
| | Irrelevant Incorrect | 82.00 | 98.00 | 92.00 | 92.00 |
| Before and After | Regular | 100.00 | 100.00 | 100.00 | 99.00 |
| | Incorrect Conclusion | 83.00 | 94.00 | 99.00 | 95.00 |
| | Irrelevant Premise | 100.00 | 99.00 | 100.00 | 97.00 |
| | Irrelevant Incorrect | 68.00 | 93.00 | 96.00 | 98.00 |

| Block | Number of Premises | LlaMa 3.3 70B | LlaMa 3.1 405B | GPT OSS 20B | GPT OSS 120B |
|---|---|---|---|---|---|
| Same and Different | One | 78.75 | 92.50 | 100.00 | 87.50 |
| | Two | 85.83 | 69.67 | 100.00 | 84.17 |
| | Three | 80.63 | 88.75 | 100.00 | 89.38 |
| | Four | 63.50 | 0.0068 | 100.00 | 92.50 |
| | Five | 65.42 | 70.83 | 90.83 | 91.67 |
| Same and Opposite | One | 93.75 | 98.75 | 100.00 | 100.00 |
| | Two | 70.00 | 81.25 | 98.75 | 96.88 |
| | Three | 56.25 | 91.25 | 94.38 | 90.63 |
| | Four | 51.88 | 53.91 | 90.31 | 89.53 |
| | Five | 53.59 | 62.19 | 97.03 | 93.28 |
| More Than and Less Than | One | 93.75 | 96.25 | 100.00 | 100.00 |
| | Two | 96.25 | 100.00 | 100.00 | 92.50 |
| | Three | 96.25 | 100.00 | 98.75 | 93.75 |
| | Four | 92.50 | 98.75 | 98.75 | 87.50 |
| | Five | 91.25 | 100.00 | 88.75 | 92.50 |
| Before and After | One | 97.50 | 100.00 | 100.00 | 95.00 |
| | Two | 91.25 | 98.75 | 98.75 | 100.00 |
| | Three | 88.75 | 95.00 | 100.00 | 100.00 |
| | Four | 81.25 | 92.50 | 100.00 | 95.00 |
| | Five | 80.00 | 96.25 | 95.00 | 96.25 |

| 1 | More information, as well as illustrations of the derivation tables and the scripts used to create or implement them, can be found on the OSF: https://osf.io/bjxqg/overview?view_only=76aea06fae764230b8fa5dea3a5c4728 (accessed on 29 October 2025). |
| 2 | https://osf.io/78u36/overview?view_only=8f2df70d8ff845e9ad393d407f4c27c1 (accessed on 29 October 2025). |
| 3 | The preregistered model sample (https://osf.io/78u36/overview?view_only=8f2df70d8ff845e9ad393d407f4c27c1, accessed on 29 October 2025) also contained two models from Google’s Gemma 3 model family and two from DeepSeek AI. We did not test these models for technical reasons (we could not test the Gemma models via Together AI’s API service), practical reasons (querying the DeepSeek models took much longer than querying the other models), and financial reasons (the cost of running the DeepSeek models). |
| 4 | https://docs.together.ai/reference/chat-completions-1 (accessed on 29 October 2025). |



| Model (Group) | Training | Parameters |
|---|---|---|
| LlaMa 3.1 405B IT (LLM) | Pre: 15T+ multilingual, open-source text tokens. Post: supervised fine-tuning, rejection sampling, and direct preference optimization 1. | 405B |
| LlaMa 3.3 70B IT (LLM) | Pre: 15T+ multilingual, open-source text tokens. Post: supervised fine-tuning and reinforcement learning with human feedback 1. | 70B |
| GPT OSS 20B (LRM) | Pre: trillions of text tokens, focus on STEM, coding, and general knowledge. Post: supervised fine-tuning and high-compute RL 2. | 21B total, 3.6B active |
| GPT OSS 120B (LRM) | Pre: trillions of text tokens, focus on STEM, coding, and general knowledge. Post: supervised fine-tuning and high-compute RL 2. | 117B total, 5.1B active |

| Block\Model | LlaMa 3.3 70B | LlaMa 3.1 405B | GPT OSS 20B | GPT OSS 120B |
|---|---|---|---|---|
| Same and Different | 70.59 | 83.16 | 99.32 | 99.80 |
| Same and Opposite | 55.18 | 61.25 | 91.11 | 93.56 |
| More Than and Less Than | 100.00 | 99.58 | 99.58 | 100.00 |
| Before and After | 99.31 | 99.17 | 99.58 | 99.58 |
| Contains and Is Part Of | 92.64 | 98.61 | 86.81 | 98.61 |
| Deictic | 86.67 | 92.17 | 87.50 | 89.79 |
| Analogy | 47.31 | 55.19 | 83.27 | 83.85 |
| Transformation of Function | 69.04 | 82.06 | 96.74 | 99.24 |

| Block\Model | LlaMa 3.3 70B | LlaMa 3.1 405B | GPT OSS 20B | GPT OSS 120B |
|---|---|---|---|---|
| Same and Different | 78.42 | 89.34 | 99.41 | 91.84 |
| Same and Opposite | 55.37 | 58.14 | 89.45 | 88.81 |
| More Than and Less Than | 99.17 | 99.72 | 99.17 | 94.03 |
| Before and After | 98.06 | 99.03 | 99.58 | 94.31 |
| Contains and Is Part Of | 91.39 | 96.53 | 86.81 | 92.92 |
| Deictic | 84.58 | 94.17 | 89.58 | 89.58 |
| Analogy | 48.27 | 55.39 | 77.50 | 79.81 |
| Transformation of Function | 65.96 | 72.70 | 92.99 | 92.48 |
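
The two block-level summary tables above can be compared cell by cell. The sketch below subtracts the first table from the second for every block and model. It assumes that both tables report percentage accuracy and that the first corresponds to the original premise order and the second to the randomized-order replication; this section does not label them, so that reading, like the names `first_table`, `second_table`, and `accuracy_change`, is an assumption.

```python
MODELS = ["LlaMa 3.3 70B", "LlaMa 3.1 405B", "GPT OSS 20B", "GPT OSS 120B"]

# Values copied from the first summary table (percent correct per block).
first_table = {
    "Same and Different": [70.59, 83.16, 99.32, 99.80],
    "Same and Opposite": [55.18, 61.25, 91.11, 93.56],
    "More Than and Less Than": [100.00, 99.58, 99.58, 100.00],
    "Before and After": [99.31, 99.17, 99.58, 99.58],
    "Contains and Is Part Of": [92.64, 98.61, 86.81, 98.61],
    "Deictic": [86.67, 92.17, 87.50, 89.79],
    "Analogy": [47.31, 55.19, 83.27, 83.85],
    "Transformation of Function": [69.04, 82.06, 96.74, 99.24],
}

# Values copied from the second summary table.
second_table = {
    "Same and Different": [78.42, 89.34, 99.41, 91.84],
    "Same and Opposite": [55.37, 58.14, 89.45, 88.81],
    "More Than and Less Than": [99.17, 99.72, 99.17, 94.03],
    "Before and After": [98.06, 99.03, 99.58, 94.31],
    "Contains and Is Part Of": [91.39, 96.53, 86.81, 92.92],
    "Deictic": [84.58, 94.17, 89.58, 89.58],
    "Analogy": [48.27, 55.39, 77.50, 79.81],
    "Transformation of Function": [65.96, 72.70, 92.99, 92.48],
}

def accuracy_change(first: dict, second: dict) -> dict:
    """Return second minus first, in percentage points, per block and model."""
    return {
        block: {
            model: round(after - before, 2)
            for model, before, after in zip(MODELS, first[block], second[block])
        }
        for block in first
    }

change = accuracy_change(first_table, second_table)

if __name__ == "__main__":
    for block, per_model in change.items():
        cells = ", ".join(f"{m}: {d:+.2f}" for m, d in per_model.items())
        print(f"{block}: {cells}")
```

These are raw differences in percentage points and carry no information about statistical reliability.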
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Raemaekers, M.; Finn, M.; De Houwer, J. Assessing the Relational Abilities of Large Language Models and Large Reasoning Models. Behav. Sci. 2026, 16, 45. https://doi.org/10.3390/bs16010045

