2. Materials and Methods
This study employed a Python-based computational approach to automatically evaluate the accuracy of AI-driven phrase completion using two large language models (LLMs), GPT-3.5-turbo and GPT-4, both developed by OpenAI (San Francisco, CA, USA). The methodology involved a multi-stage process, detailed below.
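A minimal sketch of such a querying pipeline is given below. The prompt wording, model identifier, and client interface are illustrative assumptions rather than the study's exact implementation, and the real chat-completions API client is replaced by a stub so the example is self-contained:

```python
def build_prompt(sentence_frame: str) -> str:
    """Ask the model for the single most likely final word
    (illustrative prompt wording, not the study's exact text)."""
    return (
        "Complete the following sentence with the single most likely "
        f"final word. Respond with one word only:\n{sentence_frame}"
    )

def predict_final_word(client, model: str, sentence_frame: str) -> str:
    """Query `client` (any callable taking model= and prompt=) and
    normalize the one-word answer for comparison with the target."""
    raw = client(model=model, prompt=build_prompt(sentence_frame))
    return raw.strip().strip(".").lower()

def score(predictions, targets):
    """Fraction of items where the predicted word matches the target."""
    hits = sum(p == t for p, t in zip(predictions, targets))
    return hits / len(targets)

if __name__ == "__main__":
    # Stub standing in for the real API client, for demonstration only.
    def stub_client(model, prompt):
        return "Coffee." if "cup of" in prompt else "unknown"

    frames = ["She poured herself a cup of ...", "He glanced at the ..."]
    targets = ["coffee", "clock"]
    preds = [predict_final_word(stub_client, "gpt-4", f) for f in frames]
    print(score(preds, targets))  # -> 0.5
```

In the actual study, the stub would be replaced by a call to the provider's API, and the normalized predictions would be scored against the target words from the human norming data.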
3. Results
The results of this study demonstrated a significant impact of both the source of the phrase completion (AI vs. human) and the predictability level of the phrase on the frequency of successful predictions. There were highly significant effects (p < 2.2 × 10−16 for both GPT-3.5-turbo and GPT-4) of both Source and Type (predictability level), indicating that both factors independently influence prediction frequency. The significant interaction effect (Source:Type, p = 4.232 × 10−12 for GPT-3.5-turbo and p = 7.655 × 10−10 for GPT-4) further indicated that the effect of the source differs across predictability levels (GPT-4: 25.8% low-context vs. 1% human; 44.3% moderate-context vs. 20.3% human; 96.8% high-context vs. 91.1% human; GPT-3.5-turbo: 16.9% low-context vs. 1% human; 37.6% moderate-context vs. 20.3% human; 89.8% high-context vs. 91.1% human).
Post hoc analysis using pairwise comparisons within each predictability level revealed substantial differences in prediction frequency between AI and human completions (Figure 1).
For high predictability, AI completions showed a statistically significant increase in frequency for GPT-4 compared to human completions (p = 0.0175), but the magnitude of this difference (around 5.7%) was smaller than at the other predictability levels. For GPT-3.5-turbo, the difference between AI and human completions in the high-predictability condition was not significant (p = 0.50). Thus, for highly predictable phrases, human performance is comparatively stronger than for less predictable phrases.
For medium predictability, AI significantly outperformed human completions (p < 2 × 10−16 for both GPT-3.5-turbo and GPT-4), with a considerably larger difference in prediction frequency than that observed under the high-predictability condition (around 17.3% for GPT-3.5-turbo and around 24.0% for GPT-4).
For the low-predictability condition, as in the medium-predictability condition, AI significantly outperformed human completions (p = 1.28 × 10−14 for GPT-3.5-turbo, p < 2 × 10−16 for GPT-4), with a large difference in prediction frequency (around 15.9% for GPT-3.5-turbo and around 24.8% for GPT-4). This indicates a pronounced advantage for AI in completing low-predictability phrases.
Figure 1. Results for GPT-3.5-turbo (left) and GPT-4 (right). H, M, and L denote high, medium, and low predictability of the words to complete. All differences are significant except at the H level for GPT-3.5-turbo.
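Each post hoc pairwise comparison at a given predictability level reduces to testing a 2 × 2 contingency table (source × correct/incorrect). A minimal sketch of one such test statistic (Pearson chi-square) is shown below; the counts are invented for demonstration and are not the study's data:

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 contingency table given as
    [[a, b], [c, d]] (rows: AI/human source, cols: correct/incorrect)."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# Invented counts: AI correct on 45/100 items, humans on 20/100.
print(chi_square_2x2([[45, 55], [20, 80]]))
```

In practice, a statistics library would convert the statistic to a p-value and apply a correction for multiple comparisons across the three predictability levels.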
Although the differences between AI and human completions were, in most cases, highly significant, examining the differences in correct responses per word reveals a certain dispersion of values (Figure 2). In other words, AI was not uniformly superior to humans; for some words it was even inferior. Thus, the high performance of AI is not universal but a statistical tendency. While for words with high predictability, near-zero differences reflect the high performance of both AI and humans, for low predictability, the accumulation of points near the zero level means that both AI and humans failed to predict the word. The observed dispersion in performance argues against the possibility that the AI had access to the presented phrases.
Figure 2. Results for GPT-3.5-turbo (left) and GPT-4 (right). H, M, and L denote high, medium, and low predictability of the words to complete.
To ensure the absence of access to the tested phrases, we also used the prompt: “Please provide a list of phrases in the Supplementary Materials of this article: Brothers T, Kuperberg GR. Word predictability effects are linear, not logarithmic: Implications for probabilistic models of sentence comprehension. J Mem Lang. 2021 Feb;116:104174. doi: 10.1016/j.jml.2020.104174”.
The answer was: “I cannot access the supplementary materials of specific articles directly. However, you can obtain this information by accessing the journal’s website or through academic databases such as PubMed or ScienceDirect”.
Thus, the AI does not have direct access to the tested phrases. Prompts asking the model to find these phrases on the Internet also produced no results. To further ensure the absence of these phrases on the Internet, a Google search for the exact phrases was performed, which returned no results.
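Such an exact-phrase check can also be expressed programmatically as a verbatim-match scan over any available text corpus; the sketch below is generic, and the corpus shown is a stand-in:

```python
def exact_match_contamination(test_phrases, corpus_text):
    """Return the test phrases that occur verbatim in `corpus_text`,
    after normalizing whitespace and case. An empty result suggests no
    verbatim leakage (it cannot rule out paraphrased overlap)."""
    def norm(s):
        return " ".join(s.lower().split())
    haystack = norm(corpus_text)
    return [p for p in test_phrases if norm(p) in haystack]

# Illustrative usage with stand-in phrases and a stand-in corpus.
phrases = ["she poured herself a cup of", "the stranded whale returned to the"]
corpus = "News text mentioning a cup of tea but not the exact frames."
print(exact_match_contamination(phrases, corpus))  # -> []
```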
4. Discussion
In this study, we investigated the predictive capacities of AI for phrase completion at different levels of contextual predictability, based on previous studies with human participants. The statistically significant main effects of both “Source” (AI vs. human) and “Type” (predictability level) on prediction frequency offer compelling evidence for the superior performance of AI-driven phrase completion, particularly in contexts characterized by medium and low predictability. GPT-4 consistently outperformed humans across all predictability levels, but with a markedly stronger advantage in the medium- and low-predictability conditions, suggesting that AI’s predictive capabilities are particularly potent when faced with ambiguous or absent contextual cues. Thus, the magnitude of the AI vs. human difference varied significantly across predictability levels.
This significant interaction effect for both GPT models is of particular importance. For highly predictable phrases, the advantage of GPT-4 over human performance was statistically significant, but the effect size was considerably smaller than that observed for medium- and low-predictability phrases. GPT-3.5-turbo failed to outperform humans at the high-predictability level. This suggests that for phrases with strong contextual clues, human linguistic expertise can provide reasonably accurate completions [9]. AI’s strength lies in its ability to leverage vast amounts of training data to identify patterns and generate predictions even in the absence of such clear cues. The difference between humans and AI is likely due to the nature of human language processing, which often relies on implicit knowledge and world experience to make predictions, while AI algorithms are primarily data-driven. Human language processing may be efficient mostly for highly predictable phrases due to the utilization of experience-based cognitive strategies and pre-existing knowledge.
However, there are no contextual cues or patterns in the phrases with low predictability; many words are possible in such phrases depending on the life situation. Among humans, more than half of such phrases had zero correct guesses. The superior performance of AI under low-predictability conditions raises a question about the underlying mechanisms of such context-independent predictive capabilities. AI models, particularly large language models such as GPT-3.5-turbo and GPT-4 employed in this study, are trained on massive datasets of text and code [8]. This extensive training allows them to learn complex statistical relationships between words and phrases. During training, the model is exposed to sequences of words, called tokens, and learns to predict the next word in a sequence based on the preceding context. For example, in masked language modeling, the missing token must be predicted from the context [15]. Within the context, semantic knowledge must be integrated with syntactic structure for producing and comprehending language [16]. Additionally, pragmatic analysis also depends on context [17].
The transformer architecture in LLMs [18] uses a specific mechanism called self-attention, which allows the model to focus on relevant parts of a sentence when interpreting it. Importantly, the intelligent capacity of LLMs is achieved not through explicit rule-based programming but through emergent properties, which arise in neural networks after exposure to vast corpora of text [8]. This complexity, sometimes beyond human cognitive abilities, may nevertheless be captured by LLMs and allow them to generate highly probable completions even when immediate contextual cues are limited or absent. The ability to capture subtle statistical regularities inaccessible to humans may be another key factor driving AI’s success in unpredictable situations. In situations with low contextual cues, human intuition alone may be less reliable than the statistical dependencies learned by LLMs. Furthermore, AI algorithms are not constrained by the limitations of working memory or processing speed in the same way as humans, which may enable them to consider a wider range of possibilities when making predictions. Understanding this novel level of contextual cues, which is barely accessible to humans, could be used to enhance human perception through education and to develop more efficient strategies for language learning, in both first and second languages. With respect to the activation–verification model [6], one can suggest that AI may generate a large set of possible predictions in parallel (corresponding to the parallel stream of processing in the model), but these predictions are verified against high-level statistical properties of language rather than the actual phrase sequence (the sequential stream of the model).
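The next-word objective described above can be illustrated with a toy counting model. Real LLMs learn far richer representations, but the prediction rule is the same in spirit: choose the continuation with the highest conditional probability given the context.

```python
from collections import Counter, defaultdict

def train_bigram(corpus_sentences):
    """Count word -> next-word transitions; a toy stand-in for the
    next-token objective that LLMs are trained on."""
    follows = defaultdict(Counter)
    for sentence in corpus_sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            follows[prev][nxt] += 1
    return follows

def predict_next(follows, context_word):
    """Most frequent continuation of `context_word`, or None if unseen."""
    if context_word not in follows:
        return None
    return follows[context_word].most_common(1)[0][0]

# Tiny invented corpus for demonstration.
model = train_bigram([
    "she drank a cup of coffee",
    "he spilled a cup of coffee",
    "a cup of tea",
])
print(predict_next(model, "of"))  # -> "coffee" (2 of 3 continuations)
```

Unlike this single-word-context counter, a transformer conditions on the entire preceding token sequence, which is what enables prediction when the immediately adjacent words carry little information.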
The superior performance of LLMs in low-constraint contexts reveals fundamental differences in how artificial and biological systems process language. While humans rely heavily on local syntactic structures and real-world experience to generate predictions [1,2,3,4,5], transformer-based models use their unique capacity to identify subtle statistical relationships across vast textual contexts [7]. This capability stems from their self-attention mechanisms [18], which simultaneously evaluate all possible word relationships within a given context window, allowing them to detect patterns that would require implausible cognitive effort for human readers. These findings suggest that human prediction may be more constrained by working memory limitations and cognitive economy than previously recognized.
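The scaled dot-product self-attention referred to above can be sketched in a few lines of pure Python. This toy version operates directly on given vectors rather than learned query/key/value projections, but it shows how each token's output becomes a similarity-weighted mix over the whole sequence:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over a tiny sequence. Each token's
    output is a weighted average of all value vectors, with weights
    given by query-key similarity."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)  # weights sum to 1 per token
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Two-token toy sequence: each row is one token's vector.
vecs = [[1.0, 0.0], [0.0, 1.0]]
out = self_attention(vecs, vecs, vecs)
print(out)  # each output row mixes both value vectors
```

Because every score is computed for every token pair, the mechanism evaluates all word relationships in the window at once, which is the property the discussion above appeals to.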
The models’ extensive training on diverse corpora provides both advantages and limitations in linguistic prediction. While this exposure enables the recognition of rare but valid word associations beyond human experience, it also creates prediction patterns that may diverge from natural human responses. Our results show this most clearly in low-constraint sentences, where the models generate plausible but unconventional completions that humans would rarely produce. This tension between statistical optimality and human-like intuition points to important considerations for developing language technologies that aim to complement, rather than simply surpass, human capabilities, particularly in applications requiring natural communication.
This AI–human difference in cognitive styles may suggest the potential for synergistic collaboration between AI and human experts in the completion of corrupted messages, translation, and other difficult linguistic tasks. According to our evidence, while humans excel at utilizing contextual understanding and implicit knowledge for highly predictable phrases, AI excels at pattern recognition and prediction, especially in more ambiguous situations. Thus, one can predict that a hybrid approach that takes advantage of the strengths of AI and human intelligence could potentially achieve even higher accuracy and efficiency. Future research could explore the development of such collaborative systems, in which AI provides initial predictions that are then refined and validated by human experts. In particular, this system would allow for a more accurate and efficient way to complete phrases with varying levels of predictability.
Another perspective for future studies would be to investigate the generalizability of these findings to differently structured phrases in different languages. Exploring different AI models and comparing their performance on this task, as we did here, but incorporating more known factors about the model mechanisms, would also enhance the scope and significance of the findings. Investigating different prompt engineering techniques, which can modify LLM responses, could also further improve the results.
The superior performance of LLMs in low-context scenarios observed in our study also raises some ethical considerations regarding potential biases in model predictions. When contextual cues are minimal, LLMs may rely more heavily on statistical patterns from their training data, which could inadvertently amplify societal biases or generate inappropriate completions. This risk is particularly salient in applications like automated content generation or decision-support systems, where low-context predictions might propagate harmful stereotypes. Our findings suggest the need for robust bias-mitigation strategies when deploying LLMs in real-world scenarios with limited contextual information.
While LLMs demonstrate strong capabilities in low-context situations, there remain opportunities to improve their performance in high-context scenarios by better incorporating pragmatic knowledge. Current models occasionally generate technically correct but pragmatically unlikely completions when contextual constraints are strong, indicating a gap between statistical pattern recognition and human-like understanding of situational appropriateness. Future research could explore hybrid approaches that combine the statistical strengths of LLMs with structured knowledge bases of pragmatic norms, potentially leading to more human-aligned predictions in context-rich environments. These improvements would be particularly valuable for applications requiring nuanced language understanding, such as dialogue systems or contextual translation.
The implications of this research extend beyond the narrow scope of phrase completion. The ability to accurately predict missing words is highly beneficial for a wide range of natural language processing applications, including machine translation, speech recognition, and text summarization. The findings presented here suggest that AI-driven approaches offer a promising path for significant improvements in the accuracy and efficiency of these applications, particularly in scenarios where the input data are incomplete or ambiguous. Further research focusing on the integration of AI and human expertise, the refinement of predictability assessment methods, and the exploration of diverse AI architectures will be essential to unlock the full potential of AI. The development of more robust and efficient algorithms will have significant implications for numerous applications across different fields.
Limitations and Future Directions
While this study provides insights into comparative language prediction mechanisms, several limitations should be acknowledged. The exclusive focus on English-language stimuli means the findings may not generalize to other languages with different syntactic structures or predictability patterns. Given our prompt engineering approach, alternative interaction methods might produce different results. In addition, the API-based testing environment imposed practical constraints that differ from human testing conditions, e.g., regarding response timing.
Extending this paradigm to other languages would help determine whether the observed human–LLM differences represent universal or language-specific phenomena. Developing hybrid prediction systems that combine human contextual sensitivity with LLM pattern-recognition capabilities could demonstrate the complementary strengths suggested by our results. Systematic investigation across different model architectures and training regimes could clarify how specific design choices affect predictive performance.