🌻 Research on the ability of LLMs to detect causal claims

11 Dec 2025

Summary#

This note explains why modern Large Language Models (LLMs), especially since the arrival of instruction-tuned chat models, often seem to have a “native” ordinary-language grasp of causation: they can spot, generate, and elaborate “A influences B” talk even when it is informal, implicit, or socially framed.

The core claim is hybrid: the raw potential for causal talk comes from pre-training on web-scale text that is saturated with causal description, while the ability to reliably surface and articulate causal claims on demand comes from specific training choices, chiefly instruction tuning and RLHF.

Historically, this is framed as a transition from causality-as-extraction to causality-as-generation: systems from roughly 2015-2019 classified causal relations between marked spans (SemEval-style extraction), whereas the instruction-tuned chat models of 2022 onwards generate, explain, and elaborate causal claims conversationally.

To ground “ordinary language causation,” this note uses two cognitive-linguistic lenses that match what chat models often do well: Leonard Talmy’s Force Dynamics (causation as an interplay of forces) and the psycholinguistics of Implicit Causality (IC) verbs.

This note is also explicit about limits and failure modes that matter for “detecting causal claims” in text: the “causal parrot” debate, the tendency to confuse temporal sequence with causation, brittleness on counterfactual reasoning, and the unresolved symbol grounding problem.

Current frontiers (2024-2025) are framed as attempts to make causal reasoning more checkable and structured, including addressing the above shortcomings: “reasoning models” that perform intermediate causal checks (often hidden) and pipelines where LLMs extract candidate causal edges into explicit graphs (DAG-like representations) for downstream formal analysis.

1. Introduction: The Emergence of "Native" Causal Fluency#

The capacity of Large Language Models (LLMs) to identify, generate, and reason about causal relationships in ordinary language is a notable (and still debated) development in artificial intelligence over the last decade. Since the release of ChatGPT (based on GPT-3.5) and its successors, these systems have often appeared able to process prompts involving influence, consequence, and mechanism without the extensive few-shot examples or rigid schema engineering that characterised previous generations of Natural Language Processing (NLP). This note investigates the trajectory of this capability from 2015 to 2025, asking how much is a by-product of scale versus the result of specific (often implicit) training choices.

Furthermore, the note explores the philosophical and linguistic dimensions of this capability, using frameworks such as Leonard Talmy’s Force Dynamics and the theory of Implicit Causality (IC) verbs to benchmark LLM performance against human cognitive patterns. The evidence suggests that while LLMs can often handle the linguistic interface of causality — the "language game" of cause and effect — significant questions remain regarding the grounding of these symbols in a genuine world model.


2. The Pre-Generative Landscape (2015-2019): Causality as Extraction#

To appreciate the "native" fluency of 2025-era models, one must first analyse the fragmented and rigid methodologies that dominated the field between 2015 and 2019. During this period, the "ordinary language concept of causation" was operationalised not as a generative understanding, but as a classification task known as Causal Relation Extraction (CRE).

2.1 The Legacy of SemEval-2010 Task 8#

For much of the decade, the benchmark defining the field was SemEval-2010 Task 8, which framed causality as a relationship between two nominals marked by specific directionality. Systems were tasked with identifying whether a sentence like "The fire was triggered by the spark" contained a Cause-Effect(e2, e1) relationship.   

Research from this era was characterised by a heavy reliance on feature engineering and pipeline architectures. Early approaches used Support Vector Machines (SVMs) and later, Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). These models did not "understand" causality in any holistic sense; rather, they learned to detect explicit lexical triggers — words like "caused," "led to," or "resulted in."   

The limitation of this paradigm was its inability to handle implicit causality — relationships where the causal link is inferred from world knowledge rather than stated explicitly. For instance, in the sentence "The rain stopped; the sun came out," a human reader infers a temporal and potentially causal sequence. Pre-transformer models, lacking a comprehensive probabilistic model of how events co-occur in the world, consistently failed to identify such links, achieving F1 scores that rarely exceeded 0.60 on implicit datasets. This era treated causality as a syntactic puzzle rather than a semantic reality.   
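To make concrete what "explicit lexical trigger" detection amounted to, here is a toy sketch in the spirit of those pipeline systems (a deliberately naive illustration, not a reconstruction of any specific system; the trigger list is illustrative only):

```python
import re

# Toy illustration of pre-transformer causal "detection": scan for explicit
# lexical triggers. Anything phrased implicitly slips straight through.
CAUSAL_TRIGGERS = [r"caused", r"led to", r"resulted in", r"triggered by", r"because of"]
TRIGGER_RE = re.compile("|".join(CAUSAL_TRIGGERS), re.IGNORECASE)

def has_explicit_causal_trigger(sentence: str) -> bool:
    return TRIGGER_RE.search(sentence) is not None

print(has_explicit_causal_trigger("The fire was triggered by the spark"))  # True
print(has_explicit_causal_trigger("The rain stopped; the sun came out"))   # False: implicit link missed
```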

2.2 The Shift to Event-Centric Resources: EventStoryLine and Causal-TimeBank#

Between 2015 and 2018, the research community began to move beyond sentence-level extraction toward document-level understanding, driven by the creation of corpora like the EventStoryLine Corpus and Causal-TimeBank.   

Despite these richer datasets, the methods remained fundamentally discriminative. Systems like CATENA (2016) used "sieves" — rule-based filters — to extract causal links. These systems could identify likely causal passages, but they did so through rigid, handcrafted logic rather than conversational explanation. They could not generate an explanation or reason about counterfactuals; they could only point to where a human annotator might say a cause existed.   

2.3 The BERT Revolution and Contextual Embeddings#

The release of BERT (Bidirectional Encoder Representations from Transformers) in 2018 marked a pivotal transition. BERT introduced deep contextual embeddings, allowing models to distinguish the semantic nuance of causal words based on their surrounding text.

Comparative studies from this period show a dramatic jump in performance. Fine-tuned BERT models (such as BioBERT) achieved F1-scores of approximately 0.72 on medical causality tasks, significantly outperforming previous architectures. BERT represents a "careful reader" — a model that can attend to the entire sentence simultaneously to resolve ambiguities.   

However, BERT was still an encoder-only architecture. It was designed to understand (classify/tag), not to speak. While it could identify causal passages with greater accuracy than ever before, it lacked the autoregressive capability to generate causal narratives. The "native ordinary language concept" requires not just recognition, but the ability to formulate causal thoughts — a capability that would only emerge with the Generative Pre-trained Transformer (GPT) series.
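For a rough sense of what the encoder-only paradigm looks like in code, the sketch below classifies a sentence with marked entity spans using a sequence-classification head. The model name, entity-marker convention, and label set are placeholders rather than the exact setup of the studies cited above, and the weights would still need fine-tuning on a causal relation dataset:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Sketch of encoder-style causal relation classification (BERT-era paradigm).
# "bert-base-uncased" and the three labels are placeholders.
LABELS = ["Cause-Effect(e1,e2)", "Cause-Effect(e2,e1)", "Other"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)  # would require fine-tuning on a causal relation dataset before use

sentence = "The <e1>fire</e1> was triggered by the <e2>spark</e2>."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(LABELS[int(logits.argmax(dim=-1))])  # untrained head, so the prediction is arbitrary
```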


3. The Generative Era (2020-2025): Structural Induction of Causal Logic#

The observation that models since roughly ChatGPT (GPT-3.5, released late 2022) exhibit a distinct causal proficiency aligns with the industry's shift toward Instruction Tuning (IT) and Reinforcement Learning from Human Feedback (RLHF). The research surveyed here suggests that this proficiency is not just a coincidence, but is materially shaped by training methodologies that (often unintentionally) act as a large "causal curriculum."

3.1 The "Coincidence" of Pre-training: Implicit World Models#

Before discussing specific training, one must acknowledge the foundation: pre-training on web-scale corpora (The Pile, Common Crawl, C4). The primary objective of these models is next-token prediction.
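For reference, that objective is ordinary token-level cross-entropy; the following minimal sketch shows the shifted next-token loss (shapes and names are illustrative, and any autoregressive LM producing per-position logits fits):

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the next-token prediction objective (the pre-training loss).
def next_token_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    # logits:    (batch, seq_len, vocab_size) -- model predictions at each position
    # token_ids: (batch, seq_len)             -- the observed text
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))  # predict position t+1 from positions <= t
    target = token_ids[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target)
```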

Theoretical research suggests that optimising for prediction error on a diverse corpus forces the model to learn a compressed representation of the data generating process — effectively, a "world model". Because human language is intrinsically causal (we tell stories of why things happen), a model trained to predict the next word in a narrative must implicitly model causal physics.   

Recent theoretical work on Semantic Characterization Theorems argues that the latent space of these models evolves to map the topological structure of these semantic relationships. Thus, the "native" understanding is partially a coincidence of the data's nature: the model learns causality because causality is the glue of human discourse.   

3.2 The Instruction Tuning Hypothesis: Specific Training via Templates#

The transition from "text completer" (GPT-3) to "helpful assistant" (ChatGPT) was mediated by Instruction Tuning. This process involves fine-tuning the model on datasets of (Instruction, Output) pairs. An analysis of major instruction datasets (FLAN, OIG, and Dolly) reveals that they are saturated with causal reasoning tasks.

3.2.1 The FLAN Collection: The Template Effect#

The FLAN (Finetuned Language Net) project was instrumental in this development. Researchers took existing NLP datasets, including causal and commonsense-causality datasets such as COPA and e-SNLI, and converted them into natural language templates (for example, rephrasing a labelled premise as a "What was the cause of this?" question).

This contradicts the idea that the capability is purely coincidental. The models were specifically drilled on millions of "causal identification" exercises, disguised as instruction following.
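A minimal sketch of what such template conversion looks like, using a COPA-style item; the field names and template wording are illustrative rather than FLAN's actual templates:

```python
# Sketch of converting a COPA-style item into an instruction-tuning example.
def copa_to_instruction(item: dict) -> dict:
    question_word = item["question"]  # "cause" or "effect"
    instruction = (
        f'Premise: "{item["premise"]}"\n'
        f'What was the {question_word}?\n'
        f'(a) {item["choice1"]}\n(b) {item["choice2"]}'
    )
    answer = "(a)" if item["label"] == 0 else "(b)"
    return {"instruction": instruction, "output": answer}

example = {
    "premise": "The man broke his toe.",
    "choice1": "He got a hole in his sock.",
    "choice2": "He dropped a hammer on his foot.",
    "question": "cause",
    "label": 1,
}
print(copa_to_instruction(example))
```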

3.2.2 Open Instruction Generalist (OIG) and Dolly#

The OIG and Dolly datasets expanded this to open-domain interactions. These datasets contain thousands of "brainstorming" and "advice" prompts, which implicitly teach means-end reasoning: a course of action is recommended because it is expected to produce a desired result.

3.3 Reinforcement Learning from Human Feedback (RLHF): The Coherence Filter#

The final layer of "specific training" is RLHF. In this phase, human annotators rank model outputs by preference. Because coherent, well-structured explanations tend to win these comparisons, the reward signal acts as a coherence filter: outputs whose causal narrative hangs together are reinforced, while non-sequiturs are pruned.
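As a sketch of the mechanism, the reward models trained in this stage commonly use a pairwise (Bradley-Terry style) preference loss over the annotators' rankings; this is the standard textbook formulation, not necessarily the exact recipe behind any particular chat model:

```python
import torch
import torch.nn.functional as F

# Standard pairwise preference loss for an RLHF reward model (Bradley-Terry style).
# reward_chosen / reward_rejected: scalar scores the reward model assigns to the
# annotator-preferred response and the rejected response for the same prompt.
def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy check: scoring the preferred answer higher yields a lower loss.
print(preference_loss(torch.tensor([2.0]), torch.tensor([0.5])))  # small loss
print(preference_loss(torch.tensor([0.5]), torch.tensor([2.0])))  # larger loss
```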

Conclusion on Training vs. Coincidence: The capability is a hybrid. The potential to understand causality is a coincidence of pre-training scale (World Models), but the ability to natively identify and articulate it in response to a prompt is the result of specific Instruction Tuning and RLHF regimens that prioritise causal templates and coherent explanation.


4. Linguistic Frameworks: Analysing "Ordinary" Causation#

This note emphasises the "native ordinary language concept of causation." To understand this, we must look beyond computer science to Cognitive Linguistics. Recent research has benchmarked LLMs against human linguistic theories, particularly Talmy’s Force Dynamics and Implicit Causality (IC).

4.1 Force Dynamics: Agonists and Antagonists in Latent Space#

Leonard Talmy’s theory of Force Dynamics posits that human causal understanding is rooted in the interplay of forces: an Agonist (the entity with a tendency towards motion or rest) and an Antagonist (the opposing force). Benchmarking summarised in Table 2 suggests that GPT-4-class models preserve these roles well, for example when paraphrasing or translating "letting" and "hindering" verbs.

4.2 Implicit Causality (IC) Verbs#

Another major area of inquiry is Implicit Causality (IC), which refers to the bias native speakers have regarding who is taken to be the cause of an event, based on the verb used.

In this sense, "bias" means a useful working expectation about which participant in the sentence is the cause: in "John amazed Mary", speakers typically treat John (the subject) as the cause, whereas a verb like "admired" shifts the expectation towards the object.

Benchmarking Results: Research comparing LLM continuations to human psycholinguistic data reveals a high degree of alignment.

4.3 The Limits of "Native" Understanding: The Causal Parrot Debate#

Despite these successes, a vigorous debate persists regarding whether this constitutes "understanding" or merely "stochastic parroting".   

A caution about dated negative findings: many "LLMs cannot do causal reasoning" results from around 2020-2022 are best read as results about a specific model family and evaluation setup (often base models, short prompts, and narrow benchmarks). Newer instruction-tuned models (and more careful prompting protocols) can reduce some of these gaps on standard tests, but the picture remains mixed and sensitive to benchmark design, leakage, and what is being counted as "causal reasoning" versus plausible explanation.


5. Benchmarking the "Informal": From Social Media to Counterfactuals#

The evaluation of causal understanding has evolved from F1 scores on extraction tasks to sophisticated benchmarks that test the model's ability to handle the messy, informal causality of the real world.

5.1 CausalTalk: Informal Causality in Social Media#

The CausalTalk dataset focuses on "passages where one thing influences another" in informal contexts, testing whether models can pick up implicit, "gist"-level assertions of influence rather than relying on explicit trigger words.

5.2 Explicit vs. Temporal Confusion (ExpliCa)#

The ExpliCa benchmark investigates a specific failure mode: the confusion of time and cause, i.e. sliding from "B happened after A" to "B happened because of A".

Again, a caution about dated negative findings: while these weaknesses are interesting, frontier models in 2025 are much less likely to display them.

5.3 Counterfactuals and "What If" (CRASS)#

The CRASS (Counterfactual Reasoning Assessment) benchmark tests the model's ability to reason about what didn't happen, by posing "What would have happened if...?" questions about a described scenario.


6. Philosophical Dimensions: Symbol Grounding and World Models#

The impressive performance of LLMs on causal tasks raises profound philosophical questions about the nature of meaning. Can a system that has never physically interacted with the world truly understand "force," "push," or "cause"?

6.1 The Symbol Grounding Problem#

Cognitive scientists have long argued that human concepts are grounded in sensorimotor experience. We understand "heavy" because we have felt gravity. On this view, an LLM's causal vocabulary is ungrounded: the model manipulates the words "push", "force", and "cause" without the bodily experience those words name, which is one reason its causal competence can look fluent yet remain schematic.


7. Current Frontiers (2024-2025): Reasoning Models and Future Directions#

The field is currently undergoing another shift with the introduction of "Reasoning Models" (e.g., OpenAI's o1/o3 series, DeepSeek R1).

7.1 Chain-of-Thought Monitoring and "Thinking" Tokens#

Newer models are trained to produce hidden "chains of thought" before generating a final answer, which can include intermediate causal checks; because these traces are often not shown to the user, they are also hard to audit.

7.2 Causal Graph Construction#

Recent work has moved back to structure, using LLMs to extract and construct Causal Graphs (DAGs) from unstructured text, a process sometimes known as causal mapping.   
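A minimal sketch of such a pipeline is shown below, with a placeholder extract_causal_edges function standing in for the actual LLM call (any chat API that can return structured output would do); only the graph handling uses a real library (networkx):

```python
import networkx as nx

def extract_causal_edges(passage: str) -> list[tuple[str, str]]:
    # Placeholder for an LLM call that returns (cause, effect) pairs as structured
    # output, e.g. by prompting a chat model for JSON. Hard-coded here for the sketch.
    return [("heavy rain", "flooding"), ("flooding", "road closures")]

def build_causal_graph(passages: list[str]) -> nx.DiGraph:
    graph = nx.DiGraph()
    for passage in passages:
        for cause, effect in extract_causal_edges(passage):
            graph.add_edge(cause, effect, source=passage)
    return graph

graph = build_causal_graph(["Heavy rain caused flooding, which led to road closures."])
print(list(graph.edges()))
print(nx.is_directed_acyclic_graph(graph))  # check the extracted structure is DAG-like
```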


8. Conclusion#

The research of the last decade suggests that the "native" causal understanding of LLMs is a constructed capability, developed through large-scale training on human text and refined by human preference signals. It is not just a coincidence, but a plausible consequence of optimising models to predict a world that is described in strongly causal terms.

  1. Origin: The capability originates in pre-training, where the model learns the distributional "shadow" of causation cast by billions of human sentences.

  2. Development: It is sharpened by Instruction Tuning (FLAN, Dolly), which explicitly teaches the model the "language game" of explanation and consequence through millions of templates.

  3. Refinement: It is polished by RLHF, which imposes a human preference for logical coherence and narrative flow, effectively pruning non-causal outputs.

  4. Nature: This understanding is linguistic and schematic. It often mirrors the force dynamics and implicit biases of human language, but can remain brittle when faced with novel physical interactions or rigorous counterfactual logic.

Overall, these systems can simulate many of the linguistic patterns humans use when describing causes and effects. That makes them useful for drafting, paraphrase, and extraction, but it should not be treated as evidence of intervention-level causal knowledge.


9. Comparative Data Tables#

Table 1: Evolution of Causal Tasks and Metrics (2015-2025)#

| Era | Primary Focus | Methodology | Dominant Datasets | Typical Metric | "Native" Capability |
| --- | --- | --- | --- | --- | --- |
| 2015-2018 | Relation Classification | SVM, RNN, Sieves | SemEval-2010 Task 8, EventStoryLine | F1 Score (~0.50-0.60) | None (Pattern Matching) |
| 2019-2021 | Span/Context Extraction | BERT, RoBERTa | Causal-TimeBank, BioCausal | F1 Score (~0.72) | Contextual Recognition |
| 2022-2025 | Generative Reasoning | GPT-4, Llama, Instruction Tuning | CausalTalk, CRASS, ExpliCa | Accuracy, Human Eval | Generative/Schematic |

Table 2: Performance on Causal Benchmarks (Selected Studies)#

| Benchmark | Task Description | Model Class | Performance Note |
| --- | --- | --- | --- |
| SemEval Task 8 | Classify relation between nominals | BERT-based (BioBERT) | ~0.72-0.80 F1 (high accuracy on explicit triggers) |
| CRASS | Counterfactual "what if" reasoning | GPT-3.5 / Llama | Moderate baseline; significantly improved with LoRA/PEFT |
| CausalProbe | Causal relations in fresh (unseen) text | GPT-4 / Claude | Significant drop compared to training data; suggests memorisation |
| Implicit Causality | Predicting subject/object bias ("John amazed Mary") | GPT-4 | High alignment with human psycholinguistic baselines |
| Force Dynamics | Translating "letting/hindering" verbs | GPT-4 | High accuracy in preserving agonist/antagonist roles |

Table 3: Key Instruction Tuning Datasets Influencing Causal Capability#

| Dataset | Content Type | Causal Relevance | Mechanism of Training |
| --- | --- | --- | --- |
| FLAN | NLP tasks converted to instructions | High (COPA, e-SNLI templates) | Explicitly maps "Premise" -> "Cause/Effect" in mixed prompts |
| OIG | Open generalist dialogues | High (Advice, How-to) | Teaches Means-End reasoning (Action -> Result) |
| Dolly | Human-generated Q&A | High (Brainstorming, QA) | Reinforces human-like explanatory structures |
| CausalTalk | Social media claims | High (Implicit assertions) | Captures "gist" causality in informal discourse |