Challenges and Opportunities in Causality Analysis Using Large Language Models
Abstract
1. Introduction
- (1)
- It was recognized that any <C> assessment of glycemia in early pregnancy </C> would also result in <E> detection of milder degrees of hyperglycemia short of overt diabetes </E>.
- (2)
- <causal-relation> When a <cause> policyholder or insured person becomes sick or hurt </cause>, the company <trigger> pays </trigger> <outcome> cash benefits fairly and promptly for eligible claims </outcome> </causal-relation>.
- (3)
- If he had [reduced his sugar intake], he would be [free from diabetes].
- (4)
- [[A] Timely referral is indicated] if [[CO] chronic or recurrent symptoms severely affect the patient’s productivity or quality of life].
- Overview of the article: The article presents an overview of the field and then showcases recent findings up to early Fall 2025. It is a personal view, informed by the author’s experience and the findings of his students. (The overview part of the article is based on the author’s tutorial at FLAIRS-38, May 2025 [9].) It complements other surveys, e.g., [10,11,12,13] (with still-maintained pointers to the relevant articles at https://github.com/zhijing-jin/CausalNLP_Papers (accessed on 10 October 2025)), as well as another repository on causality extraction [14], https://github.com/causaltext/causal-text-papers (accessed on 10 October 2025).
2. Why Care About Extracting Causality-Related Information?
3. Views on Causality
- NLP adheres to a common sense view of causality as expressed in dictionaries. Tests for causality, administered by text annotators, are used to prepare data for information extraction, question answering, reasoning, and other AI tasks.
- Aristotle introduced the “Four Types of Causes” and argued that all four are necessary and sufficient for explanations [18].
- Other views will be alluded to mostly through references. These include the “Ontology of Causation” and seven views of causality in the medical literature [21].
3.1. Common Sense View of Causality in NLP
- Why: The example is not causal if the reader is unable to construct a “Why” question regarding the effect.
- Temporal order: The example is not causal if the cause does not precede the effect in time.
- Counterfactual: The example is not causal if the effect is equally likely to occur or not occur without the cause.
- Ontological asymmetry: The example is not causal if the reader can readily swap the cause and effect claims in place.
- Linguistic: The example is likely to be causal if it can be rephrased into “X causes Y” or “Due to X, Y.”
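The last of these tests suggests a crude automatic pre-filter: check whether a sentence already contains an explicit causal connective. A minimal sketch (the connective list and function name are illustrative, not from the article; real annotation guidelines use far richer criteria):

```python
import re

# Illustrative (incomplete) inventory of explicit causal connectives.
CAUSAL_CONNECTIVES = [
    "because", "due to", "caused", "causes", "leads to", "led to",
    "results in", "resulted in", "as a result", "therefore",
]

def passes_linguistic_test(sentence: str) -> bool:
    """Crude version of the 'Linguistic' test: flag a sentence as a
    likely causal candidate if it contains an explicit connective.
    The other tests (Why, temporal order, counterfactual, asymmetry)
    still require human or model judgment."""
    s = sentence.lower()
    return any(re.search(r"\b" + re.escape(c) + r"\b", s)
               for c in CAUSAL_CONNECTIVES)

print(passes_linguistic_test("The flight was cancelled due to fog."))  # True
print(passes_linguistic_test("The girl skipped school."))              # False
```

Such a filter only surfaces explicit causality; implicitly causal sentences, with no connective at all, are exactly the harder cases for extraction systems.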
3.2. Aristotle’s Four Causes
3.3. Pearl’s Ladder of Causation
3.4. Other Views of Causation
- Association: Causality implied through correlation or co-occurrence, not mechanisms. Example: Genetic factors are associated with disease susceptibility.
- Determinism: Causation as universal law: if A, then B. Example: Radiation causes cancer.
- Temporal order: Cause must precede effect in time sequence. Example: Exposure to a toxin precedes the onset of symptoms.
- Disposition: Objects have inherent powers to manifest effects. Example: A patient’s disposition to experience adverse effects from a drug.
- Causal chain: Events cause subsequent events in a linear sequence. Example: A pathogen triggers inflammation, which leads to organ damage.
- Influence: One process modifies the likelihood of another occurring. Example: Nicotine withdrawal positively influences smoking relapse.
- Production: Cause directly generates or triggers the effect. Example: A virus produces respiratory symptoms.
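For annotation or error-analysis tooling, it can be convenient to carry this taxonomy around as data. A minimal sketch (the dictionary encoding and function name are mine; the glosses paraphrase the seven views above):

```python
# Seven views of causality from the medical literature [21], as a lookup
# table; keys and glosses paraphrase the bulleted list above.
SEVEN_VIEWS = {
    "association":    "causality implied through correlation or co-occurrence",
    "determinism":    "causation as universal law: if A, then B",
    "temporal order": "cause must precede effect in time",
    "disposition":    "objects have inherent powers to manifest effects",
    "causal chain":   "events cause subsequent events in a linear sequence",
    "influence":      "one process modifies the likelihood of another",
    "production":     "cause directly generates or triggers the effect",
}

def describe(view: str) -> str:
    """Return the short gloss for a view name (case-insensitive)."""
    return SEVEN_VIEWS[view.lower()]

print(describe("Production"))  # cause directly generates or triggers the effect
```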
3.5. Why Care About This Diversity of Views?
4. Finding Causal Expressions: Datasets and Results
4.1. Finding Causal Expressions Within Sentences
4.2. Charlotte NLP Lab Work on Causality Extraction
- (1) Causality extraction from medical guidelines: Clinical Practice Guidelines (CPGs) can contain complex concepts expressed in complex language. Our work started by extending the results of [34], creating new public datasets covering several hundred sentences from hypertension, rhinosinusitis, and asthma guidelines (https://github.com/hematialam (accessed on 15 May 2025)), annotated primarily for medical conditions, actions, and consequences. Several models were used for causality extraction; example scores include an F1 of 60–70% and accuracy of 85–90+% for logistic regression, and an F1 of about 85% with accuracy of 90–95% for BioBERT [35]. The details of dozens of experiments are given in [36,37,38].
- (2) Causality extraction from business texts: The ORG Dataset (Organizational Behavior) was developed with business school collaborators. It has over 14K examples for fine-tuning and over 2.2K annotated examples (in the BIO style) for training and testing, and is available at https://github.com/GoPeaks-AI/text2causalgraph (accessed on 15 May 2025). The dataset comes with a taxonomy, which helps with causality extraction [29,41]. The extraction results are comparable with those for the medical guidelines: 85–91% F1 with BERT (various versions), which was better (as of Fall 2024) than Llama 3.5 7B and GPT-3.5.
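The BIO annotation style mentioned above tags each token as Beginning, Inside, or Outside of a span; here, cause and effect spans. Decoding such tags back into spans is a standard utility; a minimal sketch (the tag names `B-CAUSE`/`I-CAUSE`/`B-EFFECT` are assumptions for illustration, not taken from the ORG dataset release):

```python
def bio_to_spans(tokens, tags):
    """Convert parallel token/BIO-tag lists into (label, text) spans.
    Assumes tags like 'B-CAUSE', 'I-CAUSE', 'O'."""
    spans, current_label, current_tokens = [], None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_label:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [tok]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(tok)
        else:  # 'O' or an inconsistent 'I-' tag closes any open span
            if current_label:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_label:
        spans.append((current_label, " ".join(current_tokens)))
    return spans

tokens = "job autonomy increases employee satisfaction".split()
tags = ["B-CAUSE", "I-CAUSE", "O", "B-EFFECT", "I-EFFECT"]
print(bio_to_spans(tokens, tags))
# → [('CAUSE', 'job autonomy'), ('EFFECT', 'employee satisfaction')]
```

A token classifier (e.g., a fine-tuned BERT variant, as in the experiments above) would produce the `tags` list; this decoder then recovers the cause and effect phrases for graph construction.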
4.3. Other Datasets and Studies on Causality Extraction
“The girl skipped school.”, “The girl had not done her homework.”
- A recent article [47] reports GPT-4o performance of 50–80%, depending on the task.
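Scores such as these, and the F1/accuracy figures reported earlier for the medical and business datasets, follow the standard binary definitions; a self-contained sketch with toy labels (not actual experiment data):

```python
def precision_recall_f1(gold, pred, positive="causal"):
    """Binary precision/recall/F1 for sentence-level causality labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

gold = ["causal", "causal", "other", "other", "causal"]
pred = ["causal", "other", "other", "causal", "causal"]
prec, rec, f1 = precision_recall_f1(gold, pred)
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(round(f1, 2), round(accuracy, 2))  # → 0.67 0.6
```

Note that when non-causal sentences dominate a corpus, accuracy can sit well above F1, which is consistent with the 85–90+% accuracy versus 60–70% F1 pattern reported in Section 4.2.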
4.4. Overall Impression
5. LLMs, Hallucinations, and Causality
5.1. LLMs Limitations: Hallucinations Occur and Are Unavoidable
5.2. Hallucinations and Causality Extraction
6. LLMs for Causality Analysis
Probing LLMs’ Understanding of Causality
- Initial experiments: Several small-scale experiments were performed by this author in Spring of 2025. A text from the CNC corpus was used (3463_ecbplus.xml.xml—full text is given in Appendix A). It concerned a landslide that struck a quarry, burying a 2000-square-meter plant. Seventeen workers were on site: fourteen were safely evacuated and three remained missing. Over 300 rescuers and multiple teams from various departments were mobilized for large-scale search and rescue operations. Possibly, the landslide was triggered by prolonged dry weather that loosened the mountain slope.
- Substantive experiments: In the Fall of 2025, a larger experiment on a random sample of 25 texts from the earlier-mentioned CNC corpus was performed by this author. The texts were translated into English (by DeepSeek). GPT-5 and Gemini 2.5 Pro were prompted to extract pairs of events and establish causal relations between them. The experiments were conducted using Google Colab Pro and the respective APIs, with default temperature settings. The prompts asked the models to extract ten events from each text, order them chronologically, and find direct or indirect causal relations between them. After finding causal relationships, each model was asked to critique the reasoning of the other LLM.
“(e4, e5): Incorrect. The response was initiated because of the incident, not because of the confirmation of 14 trapped.”
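The protocol described above (extract ten events, order them chronologically, link them causally, then have each model critique the other) can be sketched as two prompt builders; the wording below is my paraphrase, not the author’s actual prompts:

```python
def extraction_prompt(text: str, n_events: int = 10) -> str:
    """Prompt asking a model to extract events and causal relations
    (paraphrase of the protocol described in the article)."""
    return (
        f"Read the news text below. Extract {n_events} events, label them "
        f"e1..e{n_events}, order them chronologically, and list every "
        "direct or indirect causal relation between them as a pair "
        "(ei, ej) with a one-sentence justification.\n\nTEXT:\n" + text
    )

def critique_prompt(text: str, other_models_answer: str) -> str:
    """Prompt asking one model to critique the other model's causal pairs."""
    return (
        "Below are a news text and another model's list of causal pairs. "
        "For each pair, answer Correct or Incorrect and explain why.\n\n"
        f"TEXT:\n{text}\n\nPAIRS:\n{other_models_answer}"
    )

print("e1..e10" in extraction_prompt("A landslide struck a quarry..."))  # True
```

These strings would then be sent through the respective APIs at default temperature; verdicts such as the quoted “(e4, e5): Incorrect” are what the second prompt elicits.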
7. Concluding Discussion: Overcoming Challenges in Causal Analysis
7.1. Challenges in Causality Extraction—What Have We Learned?
7.2. Can Causality Analysis with LLMs Help Address These Four Challenges?
Funding
Institutional Review Board Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
|---|---|
| LLM(s) | Large Language Model(s) |
| NLP | Natural Language Processing |
Appendix A. Working with GPT-4o on a Longer Text
- Translation of the text from Chinese to English (here using GPT; earlier in the article DeepSeek was used, so as to separate translation from interpretation).
- Semantic translation including annotations.
- Finding causal verbs.
- Listing the numbered points of what happened, in temporal order.
- Causal analysis for each pair of the points.
- Creation of the causal graph (several attempts).
Response:
On the morning of 17 December at around 10:43 a.m. a landslide occurred at the quarry of Anluyuan Mechanized Sand Co., Ltd. in Xinxing Town, Songyang County, Zhejiang Province. The landslide buried the company’s plant, covering approximately 2000 m2. An initial investigation identified 17 workers present at the scene; three workers were missing, while the remaining workers were safely evacuated. Immediately after the incident, county party-committee and government leaders organized firefighting, public-security, emergency-management and township officials to rush to the scene and search for the missing persons. Rescue operations were launched with combined forces from public security, firefighting, emergency response, militia, medical, electrical, telecommunications and civil-rescue teams. By noon on 17 December, Pan Saiqiu and other members of the Lishui Civil Defense Rescue Team had been notified of the incident. By 2:00 p.m., Pan Saiqiu’s team arrived at the scene to assist in the rescue efforts. Some team members entered the site, but owing to the large amount of equipment already present, others remained on standby outside. By 4:00 p.m. on 17 December, more than 300 rescuers, ten excavators and two loaders had been mobilized to continue full-scale search and rescue operations. A preliminary analysis concluded that the landslide was likely caused by prolonged abnormal dryness of the weather, which loosened the mountain slope.
Causality Analysis with Gemini Advanced 2.5 Pro; Early May 2025
This text describes an emergency incident and the subsequent rescue efforts. Here’s an analysis:
- Key Event: A landslide occurred at the quarry of Anluyuan Mechanized Sand Co., Ltd.
- Location: Xinxing Town, Songyang County, Lishui, Zhejiang Province, China.
- Date and Time of Incident: 17 December, around 10:43 a.m. (The year is not specified in the text, but the reporting is from 17 December of that unspecified year.)
- Impact: Human Casualties: 17 workers were on site; 3 are missing; the other 14 were safely evacuated. Property Damage: The company’s plant, covering approximately 2000 square meters, was buried.
- Response Efforts (as of 4:00 p.m. on 17 December): Mobilization: Over 300 rescuers, 10 excavators, 2 loaders; follow-up rescue forces were still being increased. Participating Agencies/Groups: A multi-agency response is underway, including: (…)
- Quoted Individual: Pan Saiqiu (Lishui Civil Defense Rescue Team): Confirms his team’s arrival by 2:00 p.m. and mentions that some members have entered the site, while others are on standby due to the amount of rescue equipment.
- Source of Information for the Report: Reporter Zhou Yulong learned the details from the Publicity Department of Songyang, Zhejiang.
- Preliminary Cause of Landslide: Prolonged abnormal dryness of the weather, leading to the loosening of the mountain.
- Overall Tone: The tone is factual and urgent, focusing on the details of the incident and the scale of the rescue operation. It highlights the immediate and comprehensive response from various authorities and volunteer groups.
- Missing Information: The specific year of the incident; the condition of the three missing workers; specific details about the “prolonged abnormal dryness” (e.g., duration).

In summary, the text provides a snapshot of an active emergency situation (a landslide at a quarry), detailing the immediate aftermath, the significant rescue operation launched, the human impact and the preliminary suspected cause. It emphasizes a coordinated, large-scale response from multiple official and civil organizations.
(…)
2. Did Bullet 2 (Landslide) cause Bullet 3 (Plant buried)?
Yes. Why: The landslide is the physical event that directly resulted in the company’s plant being covered by debris.
(…)
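Pairwise verdicts of this kind can be assembled into a causal graph over the numbered bullets. A minimal sketch, assuming events are already chronologically numbered so that edges point forward in time (the toy judgments below loosely mirror the landslide narrative and are not the model’s full output):

```python
from collections import defaultdict

def build_causal_graph(judgments):
    """judgments: iterable of (cause_id, effect_id, is_causal) triples,
    e.g. the yes/no answers to 'Did bullet i cause bullet j?'."""
    graph = defaultdict(list)
    for cause, effect, is_causal in judgments:
        if is_causal:
            graph[cause].append(effect)
    return dict(graph)

# Toy judgments: dryness -> landslide -> {plant buried, rescue mobilized}.
judgments = [
    (1, 2, True),   # prolonged dryness caused the landslide
    (2, 3, True),   # the landslide buried the plant
    (2, 4, True),   # the landslide triggered the rescue operation
    (3, 4, False),  # burial alone did not initiate the response
]
print(build_causal_graph(judgments))  # {1: [2], 2: [3, 4]}
```

Restricting questions to chronologically ordered pairs keeps the resulting graph acyclic by construction, which matches the article’s prompting setup of ordering events before linking them.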
Appendix B. Four Causes and Counterfactuals
“Economic aim: extract construction aggregate at lowest cost; Corporate mandate: maximize output, minimize downtime; Local demand for sand/concrete in Zhejiang’s building sector.” And it commented “Teleologically, the quarry wall was shaped the way it was so that large volumes of rock could be removed rapidly and cheaply. Reduced catch-bench spacing or steeper faces keep haul distances short, satisfying the final cause (profit), but at the expense of higher geotechnical risk.”
“The final cause, or telos, refers to the end, purpose, or the sake for which an event occurs. This is challenging for natural events which don’t have intention.”
Appendix B.1. Four Causes Analysis by Gemini 2.5 Pro
- Material Cause (What is it made of?) The material cause refers to the underlying matter from which the event or object is constituted. For the landslide:
- The Mountain Itself: This includes the rock, soil, earth and any other geological materials that comprised the slope before it failed.
- Water (or lack thereof, then its presence indirectly): While the prolonged dryness is cited as a factor for loosening, the actual movement in many landslides involves the interaction of water with the soil/rock. In this specific case, the dryness led to “loosening,” implying a change in the material properties that made it susceptible to movement.
- The Company’s Plant and Equipment: These became part of the debris and the consequence of the landslide, but the primary material of the landslide itself is the earth material.
Essentially, the material cause is the physical “stuff” that moved and was affected.
- Formal Cause (What is its form or essence?) The formal cause is the pattern, model, or structure that defines the event; what makes it recognizable as a “landslide.”
- The Definition of a Landslide: The form is the rapid movement of a mass of rock, debris, or earth down a slope.
- The Specific Characteristics of This Landslide:
- Occurred at a quarry.
- Buried a plant of approximately 2000 square meters.
- Involved a specific volume and type of material (though not detailed in the text beyond “mountain”).
- Had a specific failure mechanism (e.g., a slump, a debris flow, a rockfall; the text doesn’t specify, but the “loosening” suggests a less cohesive failure).
- The Shape and Trajectory of the Moving Mass: The way the earth detached, moved and came to rest.
The formal cause is what allows us to identify this event as a “landslide” rather than, say, an earthquake or a flood.
- Efficient Cause (What brought it into being?) The efficient (or moving) cause is the agent or process that initiated the event. This is often what we think of as the “cause” in modern scientific terms.
- Primary Efficient Cause (as per text): “Prolonged abnormal dryness of the weather, leading to loosening of the mountain.”
- Contributing/Underlying Efficient Causes (Implied or General):
- Gravity: The constant force pulling the loosened material downwards.
- Geological Conditions: The inherent properties of the rock and soil, slope angle and geological structure of the mountain at the quarry site.
- Quarrying Activities (Potentially): The landslide occurred at a quarry. This human activity could be a significant contributing efficient cause.
- The specific moment of failure: Some final, perhaps imperceptible, trigger that caused the already loosened material to give way.
The efficient cause is the “why it happened now” and “what made it happen.”
- Final Cause (Telos: What is its purpose or end?) The final cause, or telos, refers to the end, purpose, or the sake for which an event occurs. This is challenging for natural events, which don’t have intention.
- In a Purely Naturalistic Sense:
- Achieving a More Stable State: From a physics perspective, the materials moved to reach a lower energy state, a more stable equilibrium under the force of gravity.
- The Natural Process of Erosion: Landslides are a natural geological process that shapes the Earth’s surface over time.
- From a Human Perspective (Consequences, not “Purpose”):
- The text describes the consequences: workers missing, a plant buried, a large-scale rescue operation.
- It’s important not to anthropomorphize the landslide as having an intention to cause harm.
The human response (rescue) has its own final causes (saving lives, understanding the event).
Appendix B.2. Causal Analysis Using Aristotle’s Four Causes—GPT-o3
- Material + Formal Causes: Supplied the latent vulnerability:
- Weak rock.
- Over-steep configuration.
- Efficient Cause: Acted on 17 December to actualise that vulnerability:
- Crack propagation → block detachment.
- Final Cause: Explains why the slope was engineered so aggressively:
- Economic rationale shaped both formal geometry and tolerance for risk.
Aristotle’s framework clarifies that addressing only the efficient cause (e.g., blaming drought) is insufficient.
- Preventive Interventions by Causal Type:
- Material fixes: rock bolting, shotcrete, drainage to improve inherent strength.
- Formal fixes: redesign bench angles, add catch benches, flatten overall slope.
- Efficient controls: drought monitoring, blast vibration limits, continuous slope radar.
- Final-cause alignment: integrate safety margin into the quarry’s business objectives so the purpose itself no longer incentivises risky geometry.
- Conclusion: By tackling all four causal dimensions, future failures of this type become far less probable.
References
- Devlin, J. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
- OpenAI. OpenAI o3 and o4-Mini System Card; Technical Report; OpenAI: San Francisco, CA, USA, 2025. [Google Scholar]
- OpenAI. GPT-5 System Card; System Card; OpenAI: San Francisco, CA, USA, 2025. [Google Scholar]
- Wikipedia Contributors. List of Large Language Models—Wikipedia, The Free Encyclopedia. 2025. Available online: https://en.wikipedia.org/wiki/List_of_large_language_models (accessed on 17 May 2025).
- Rettenberger, L.; Reischl, M.; Schutera, M. Assessing political bias in large language models. J. Comput. Soc. Sci. 2025, 8, 1–17. [Google Scholar] [CrossRef]
- Bang, Y.; Chen, D.; Lee, N.; Fung, P. Measuring Political Bias in Large Language Models: What Is Said and How It Is Said. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 11142–11159. [Google Scholar] [CrossRef]
- Mariko, D.; Abi Akl, H.; Trottier, K.; El-Haj, M. The financial causality extraction shared task (FinCausal 2022). In Proceedings of the 4th Financial Narrative Processing Workshop@ LREC2022, Marseille, France, 15–24 June 2022; pp. 105–107. [Google Scholar]
- Zadrozny, W. Large Language Models (LLMs) and Causality Extraction from Text: Tutorial at FLAIRS-38. In Proceedings of the Florida Artificial Intelligence Research Society Conference (FLAIRS-38), LibraryPress@UF, Daytona Beach, FL, USA, 20–23 May 2025; Volume 38. [Google Scholar] [CrossRef]
- Li, X.; Cai, Z.; Wang, S.; Yu, K.; Chen, F. A Survey on Enhancing Causal Reasoning Ability of Large Language Models. arXiv 2025, arXiv:2503.09326. [Google Scholar] [CrossRef]
- Liu, X.; Xu, P.; Wu, J.; Yuan, J.; Yang, Y.; Zhou, Y.; Liu, F.; Guan, T.; Wang, H.; Yu, T.; et al. Large Language Models and Causal Inference in Collaboration: A Comprehensive Survey. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2025, Albuquerque, NM, USA, 29 April–4 May 2025; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 7668–7684. [Google Scholar]
- Wang, L.; Shen, Y. Evaluating Causal Reasoning Capabilities of Large Language Models: A Systematic Analysis Across Three Scenarios. Electronics 2024, 13, 4584. [Google Scholar] [CrossRef]
- Jin, Z. Causality for NLP Reading List. 2021. Available online: https://github.com/zhijing-jin/Causality4NLP_Papers (accessed on 10 December 2025).
- Feder, A.; Keith, K.A.; Manzoor, E.; Pryzant, R.; Sridhar, D.; Wood-Doughty, Z.; Eisenstein, J.; Grimmer, J.; Reichart, R.; Roberts, M.E.; et al. Causal Inference in Natural Language Processing: Estimation, Prediction, Interpretation and Beyond. Trans. Assoc. Comput. Linguist. 2022, 10, 1138–1158. [Google Scholar] [CrossRef]
- THORS eLearning Solutions. 7QC Tools: Fishbone Diagrams. 2025. Available online: https://thors.com/manufacturing-courses/manufacturing-fundamentals/quality/7qc-tools-fishbone-diagrams/ (accessed on 26 May 2025).
- Wikipedia Contributors. Causality—Wikipedia, The Free Encyclopedia. 2022. Available online: https://en.wikipedia.org/wiki/Causality (accessed on 10 December 2025).
- Wang, P. Introduction to Artificial General Intelligence (AGI). 2023. Available online: https://cis.temple.edu/~pwang/AGI-Intro.html (accessed on 27 June 2025).
- Shields, C. Aristotle. 2022. Available online: https://plato.stanford.edu/archives/fall2022/entries/aristotle/#FouCauAccExpAde (accessed on 14 December 2025).
- Halpern, J.Y.; Pearl, J. Causes and explanations: A structural-model approach. Part I: Causes. Br. J. Philos. Sci. 2005. Available online: https://www.journals.uchicago.edu/doi/abs/10.1093/bjps/axi147 (accessed on 14 December 2025). [CrossRef]
- Bareinboim, E.; Correa, J.D.; Ibeling, D.; Icard, T.F. On Pearl’s Hierarchy and the Foundations of Causal Inference. In Probabilistic and Causal Inference; Association for Computing Machinery: New York, NY, USA, 2022. [Google Scholar]
- Sawesi, S.; Rashrash, M.; Dammann, O. The Representation of Causality and Causation with Ontologies: A Systematic Literature Review. Online J. Public Health Inform. 2022, 14, e4. [Google Scholar] [CrossRef]
- Tan, F.A.; Hürriyetoğlu, A.; Caselli, T.; Oostdijk, N.; Nomoto, T.; Hettiarachchi, H.; Ameer, I.; Uca, O.; Liza, F.F.; Hu, T. The Causal News Corpus: Annotating Causal Relations in Event Sentences from News. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France, 20–25 June 2022; pp. 2298–2310. [Google Scholar]
- Wikipedia Contributors. Four Causes. 2024. Available online: https://en.wikipedia.org/wiki/Four_causes (accessed on 5 February 2024).
- Pearl, J. Causality: Models, Reasoning and Inference, 2nd ed.; Cambridge University Press: Cambridge, UK, 2009. [Google Scholar]
- Pearl, J. The seven tools of causal inference, with reflections on machine learning. Commun. ACM 2019, 62, 54–60. [Google Scholar] [CrossRef]
- Lorkowski, C.M. Hume, David: Causation. Internet Encyclopedia of Philosophy 2011. Available online: https://iep.utm.edu/hume-causation/ (accessed on 1 June 2025).
- VanderWeele, T.J.; Knol, M.J. A tutorial on interaction. Epidemiol. Methods 2014, 3, 33–72. [Google Scholar] [CrossRef]
- VanderWeele, T. Explanation in Causal Inference: Methods for Mediation and Interaction; Oxford University Press: Oxford, UK, 2015. [Google Scholar]
- Gopalakrishnan, S.; Chen, V.Z.; Dou, W.; Hahn-Powell, G.; Nedunuri, S.; Zadrozny, W. Text to Causal Knowledge Graph: A Framework to Synthesize Knowledge from Unstructured Business Texts into Causal Graphs. Information 2023, 14, 367. [Google Scholar] [CrossRef]
- Hendrickx, I.; Kim, S.N.; Kozareva, Z.; Nakov, P.; Séaghdha, D.O.; Padó, S.; Pennacchiotti, M.; Romano, L.; Szpakowicz, S. Semeval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. arXiv 2019, arXiv:1911.10422. [Google Scholar]
- Diaz-Garcia, J.A.; Lopez, J.A.D. A survey on cutting-edge relation extraction techniques based on language models. arXiv 2024, arXiv:2411.18157. [Google Scholar] [CrossRef]
- Norouzi, R.; Kleinberg, B.; Vermunt, J.K.; van Lissa, C.J. Capturing causal claims: A fine-tuned text mining model for extracting causal sentences from social science papers. Res. Synth. Methods 2025, 16, 139–156. [Google Scholar] [CrossRef]
- Demirhan, H.; Zadrozny, W. Survey of multimodal medical question answering. BioMedInformatics 2024, 4, 50–74. [Google Scholar] [CrossRef]
- Wenzina, R.; Kaiser, K. Identifying condition-action sentences using a heuristic-based information extraction method. In Process Support and Knowledge Representation in Health Care; Springer: Cham, Switzerland, 2013; pp. 26–38. [Google Scholar]
- Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240. [Google Scholar] [CrossRef]
- Hematialam, H.; Zadrozny, W. Identifying condition-action statements in medical guidelines using domain-independent features. arXiv 2017, arXiv:1706.04206. [Google Scholar]
- Hematialam, H. Knowledge Extraction and Analysis of Medical Text with Particular Emphasis on Medical Guidelines. Ph.D. Thesis, The University of North Carolina at Charlotte, Charlotte, NC, USA, 2021. [Google Scholar]
- Hematialam, H.; Zadrozny, W. Identifying Condition-action Statements in Medical Guidelines: Three Studies using Machine Learning and Domain Adaptation. 2021. Available online: https://www.researchsquare.com/article/rs-500521/v1 (accessed on 20 December 2025).
- Gopalakrishnan, S. Building Computational Representations of Medical Guidelines Using Large Language Models and Transfer Learning. Ph.D. Thesis, The University of North Carolina at Charlotte, Charlotte, NC, USA, 2023. [Google Scholar]
- Gopalakrishnan, S.; Chen, V.Z.; Dou, W.; Zadrozny, W. On the relation between K–L divergence and transfer learning performance on causality extraction tasks. Nat. Lang. Process. J. 2024, 6, 100055. [Google Scholar] [CrossRef]
- Chen, V.Z.; Montano-Campos, F.; Zadrozny, W.; Canfield, E. Machine reading of hypotheses for organizational research reviews and pre-trained models via R shiny app for non-programmers. arXiv 2021, arXiv:2106.16102. [Google Scholar] [CrossRef]
- Gopalakrishnan, S.; Garbayo, L.; Zadrozny, W. Causality extraction from medical text using large language models (LLMs). Information 2024, 16, 13. [Google Scholar] [CrossRef]
- Liu, X.; Yang, W.; Wei, F.; Wu, Z. Semantic aware enhanced event causality identification. (Nat.) Sci. Rep. 2024, 14, 32078. [Google Scholar] [CrossRef]
- Caselli, T.; Vossen, P. The Event StoryLine Corpus: A New Benchmark for Causal and Temporal Relation Extraction. In Proceedings of the Events and Stories in the News Workshop, Vancouver, BC, Canada, 4 August 2017; pp. 77–86. [Google Scholar] [CrossRef]
- Tan, F.A.; Hettiarachchi, H.; Hürriyetoğlu, A.; Caselli, T.; Uca, O.; Liza, F.F.; Oostdijk, N. Event Causality Identification with Causal News Corpus–Shared Task 3, CASE 2022. arXiv 2022, arXiv:2211.12154. [Google Scholar]
- Wu, M.; Zhang, Q.; Wu, C.; Wang, G. End-to-end multi-granulation causality extraction model. Digit. Commun. Netw. 2024, 10, 1864–1873. [Google Scholar] [CrossRef]
- Miliani, M.; Auriemma, S.; Bondielli, A.; Chersoni, E.; Passaro, L.; Sucameli, I.; Lenci, A. ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models. arXiv 2025, arXiv:2502.15487. [Google Scholar] [CrossRef]
- AI Futures Project (Daniel Kokotajlo, Eli Lifland, Thomas Larsen, Romeo Dean). AI 2027: A Research-Backed AI Scenario Forecast. 2025. Available online: https://ai-2027.com/ (accessed on 15 June 2025).
- OpenAI. GPT-4o system card. arXiv 2024, arXiv:2410.21276. [Google Scholar] [CrossRef]
- Rein, D.; Hou, B.L.; Stickland, A.C.; Petty, J.; Pang, R.Y.; Dirani, J.; Michael, J.; Bowman, S.R. GPQA: A Graduate-Level Google-Proof Q&A Benchmark. In Proceedings of the First Conference on Language Modeling, Philadelphia, PA, USA, 7–9 October 2024. [Google Scholar]
- Marcus, G. The Latest AI Scaling Graph, and Why It Should Worry You. 2024. Available online: https://garymarcus.substack.com/p/the-latest-ai-scaling-graph-and-why (accessed on 27 May 2025).
- Venkit, P.N.; Chakravorti, T.; Gupta, V.; Biggs, H.; Srinath, M.; Goswami, K.; Rajtmajer, S.; Wilson, S. An Audit on the Perspectives and Challenges of Hallucinations in NLP. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 6528–6548. [Google Scholar]
- Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst. 2025, 43, 1–55. [Google Scholar] [CrossRef]
- OpenAI. Fine-Tuning Guide. 2025. Available online: https://platform.openai.com/docs/guides/fine-tuning (accessed on 22 May 2025).
- Xu, Z.; Jain, S.; Kankanhalli, M. Hallucination is inevitable: An innate limitation of large language models. arXiv 2024, arXiv:2401.11817. [Google Scholar] [CrossRef]
- Kalai, A.T.; Vempala, S.S. Calibrated language models must hallucinate. In Proceedings of the 56th Annual ACM Symposium on Theory of Computing, Vancouver, BC, Canada, 24–28 June 2024; pp. 160–171. [Google Scholar]
- Wu, C.; Grama, A.; Szpankowski, W. No Free Lunch: Fundamental Limits of Learning Non-Hallucinating Generative Models. arXiv 2024, arXiv:2410.19217. [Google Scholar] [CrossRef]
- Suzuki, A.; He, Y.; Tian, F.; Wang, Z. Hallucinations are inevitable but statistically negligible. arXiv 2025, arXiv:2502.12187. [Google Scholar] [CrossRef]
- Simhi, A.; Itzhak, I.; Barez, F.; Stanovsky, G.; Belinkov, Y. Trust Me, I’m Wrong: High-Certainty Hallucinations in LLMs. arXiv 2025, arXiv:2502.12964. [Google Scholar]
- Zubić, N.; Soldá, F.; Sulser, A.; Scaramuzza, D. Limits of deep learning: Sequence modeling through the lens of complexity theory. arXiv 2024, arXiv:2405.16674. [Google Scholar] [CrossRef]
- Niess, G.; Razouk, H.; Mandic, S.; Kern, R. Addressing Hallucination in Causal Q&A: The Efficacy of Fine-tuning over Prompting in LLMs. In Proceedings of the Joint Workshop of the 9th Financial Technology and Natural Language Processing (FinNLP), the 6th Financial Narrative Processing (FNP), and the 1st Workshop on Large Language Models for Finance and Legal (LLMFinLegal), Abu Dhabi, United Arab Emirates, 19–20 January 2025; pp. 253–258. [Google Scholar]
- Dündar-Coecke, S.; Goldin, G.; Sloman, S.A. Causal reasoning without mechanism. PLoS ONE 2022, 17, e0268219. [Google Scholar] [CrossRef] [PubMed]
- Mollick, E. On Jagged AGI: o3, Gemini 2.5, and Everything After. 2025. Available online: https://www.oneusefulthing.org/p/on-jagged-agi-o3-gemini-25-and-everything (accessed on 25 May 2025).
- Gopalakrishnan, S.; Chen, V.; Hahn-Powell, G.; Tirunagar, B. Computer-assisted construct classification of organizational performance concerning different stakeholder groups. arXiv 2021, arXiv:2107.05133. [Google Scholar] [CrossRef]
- Gopalakrishnan, S.; Padithala, S.; Demirhan, H.; Zadrozny, W. MDS_UNCC Question Answering System for Biomedical Data with Preliminary Error Analysis. In Proceedings of the CLEF (Working Notes), Bucharest, Romania, 21–24 September 2021; pp. 231–240. [Google Scholar]
- Zadrozny, W.; Garbayo, L. A Sheaf Model of Contradictions and Disagreements. Preliminary Report and Discussion. arXiv 2018, arXiv:1801.09036. [Google Scholar] [CrossRef]
- Hematialam, H.; Garbayo, L.; Gopalakrishnan, S.; Zadrozny, W.W. A Method for Computing Conceptual Distances between Medical Recommendations: Experiments in Modeling Medical Disagreement. Appl. Sci. 2021, 11, 2045. [Google Scholar] [CrossRef]
- Gemini Team, Google. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv 2024, arXiv:2403.05530. [Google Scholar] [CrossRef]

| Cause Type | Description |
|---|---|
| Material | That from which something is generated and out of which it is made, e.g., the bronze of a statue. |
| Formal | The structure which the matter realizes and in terms of which it comes to be something determinate, e.g., the shape of the president, in virtue of which this quantity of bronze is said to be a statue of a president. |
| Efficient | The agent responsible for a quantity of matter’s coming to be informed, e.g., the sculptor who shaped the quantity of bronze into its current shape, the shape of the president. |
| Final | The purpose or goal of the compound of form and matter, e.g., the statue was created for the purpose of honoring the president. |
| Layer (Symbolic) | Typical Activity | Typical Question | Example | Machine Learning |
|---|---|---|---|---|
| Associational | Seeing | What is? How would seeing X change my belief in Y? | What does a symptom tell us about the disease? | Supervised and Unsupervised Learning |
| Interventional | Doing | What if? What if I do X? | What if I take aspirin, will my headache be cured? | Reinforcement Learning |
| Counterfactual | Imagining | Why? What if I had acted differently? | Was it the aspirin that stopped my headache? | |
© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Zadrozny, W.W. Challenges and Opportunities in Causality Analysis Using Large Language Models. Entropy 2026, 28, 23. https://doi.org/10.3390/e28010023
