!manuscript

2026

1. Introduction: an orthogonal yes

The arrival of large language models in qualitative research has produced a debate that is usually drawn as a straight line with two ends. At one end is rejection. A widely endorsed commentary, written by 416 qualitative researchers, rejects the use of generative AI for reflexive thematic analysis and other reflexive approaches, on the grounds that the model is "simulated intelligence only" and "fundamentally incapable of genuinely making meaning from language", that reflexive qualitative work is "a distinctly human practice", and that the wider harms of the technology are serious (Jowsey et al. 2025). At the other end is enthusiasm: the prospect of working at scales that manual coding cannot reach, and of letting the model do much of the analytic work.

This paper takes a position that does not sit at either end of that line, nor at a midpoint along it. It runs at a right angle to it. We share a good deal of the instinct behind the rejection. We do not think a research question can be reduced to a decision procedure that an algorithm executes, and we do not think interpretation can be automated. We are not making a positivist argument that meaning is waiting in the text to be read off mechanically. And yet a deliberately generic and minimalist form of causal mapping, of all the approaches one might reach for, turns out to be surprisingly useful for getting to the centre of what applied researchers and evaluators most often want to know: what, according to the people they spoke to, causes what.

The reason is specific, and it is the argument of this paper. Causal mapping rests on one small coding act: record a single causal claim made by a source, as an ordered pair of a cause and an effect, attached to the verbatim quote that supports it and to a source identifier (Axelrod 1976; Eden 1988; Powell et al. 2024; Evaluation 2024). That act is narrow enough that its output can be checked locally, link by link, against the quote. This is the kind of task a model can do at scale without being asked to interpret, summarise or theorise. Every other analytic decision, framing the question, curating the vocabulary, building the pipeline, writing the account, stays with the human. In the middle sits a links table that the analyst, a co-researcher or a reviewer can read.

Two features of this arrangement matter for the special issue's concerns. The first is that it works as a one-way pipeline rather than a conversation. The work moves from narrow extraction, through deterministic transforms, to human-authored synthesis. There is a to-and-fro, but it lies in refining prompts and research questions rather than in interpreting the corpus inside a chat window. We will argue that the one-way character is the source of the method's accountability rather than a deficiency in it. The second is that the chain from any analytic claim back to a supporting quote is short and inspectable, which is exactly the "accountable links between analytic claims and empirical materials" the special issue asks for.

We should be plain about scope. This is closer to a small-q than a Big-Q argument, in Kidder and Fine's sense as developed by Braun and Clarke, where small q describes scientifically descriptive, often (post)positivist qualitative technique and Big Q describes artfully interpretive, non-positivist, reflexive work (Braun & Clarke 2023; Braun & Clarke 2025). We are not proposing causal mapping as a way of doing reflexive thematic analysis, and we are not proposing it as a route to the kind of interpretive insight that Big-Q work exists to produce. The contribution is more modest: a way of getting useful work done for researchers whose questions are about what other people think causes what, where the alternative is either much slower manual coding or a synthesis that reads well but is hard to audit.

The paper proceeds as follows. Section 2 maps the current debate as a set of orthogonal positions rather than a single line. Section 3 develops a bridge to the interpretive tradition: even within thematic analysis, the difference between a mere code and a proper theme is, in a sense we make precise, answer-shaped. Sections 4 and 5 set out causal mapping as a form of qualitative data analysis and explain why its coding act is the right shape for bounded AI assistance. Section 6 works through a real corpus, 48 interviews on loneliness, comparing a causal-mapping pass with a fully AI-led conversational thematic analysis of the same data. Section 7 is the core of the paper: we state the strongest criticisms of our position and try to neutralise each. Sections 8 and 9 deal with accountability and ethics, and with limits.

2. The debate as it stands

It helps to see the current literature as several positions that are not all arguing about the same thing.

The first is the rejection just described (Jowsey et al. 2025). Its scope is worth reading carefully, because the letter is precise about it. The signatories reject generative AI for "Big Q Qualitative approaches, such as reflexive thematic analysis, or various phenomenological approaches". Their central argument is that reflexive analysis is "an inherently meaning-based technique" and that the model cannot make meaning. They draw an explicit line: just as the meaning-based requirement of reflexive thematic analysis "distinguishes it methodologically from word-counting techniques such as content analysis (which can be automated), so too it must also exclude GenAI". That parenthesis matters for us, and we return to it.

A second position rejects the rejection. De Paoli argues that prohibiting generative AI on what he reads as metaphysical grounds risks turning a methodological discussion into dogma and closes off useful innovation (n.d.). This is a corrective to the first position rather than an opposite of it.

A third position holds that the dominant way of using AI, asking it to imitate human line-by-line coding, is itself a mistake, but for a reason almost opposite to the rejection's. Nguyen-Trung and Friese argue that coding is a "skeuomorphic" workaround for human cognitive limits, "a human crutch" built to support our limited processing power, which an AI does not share, so forcing a model to imitate it commits a "skeuomorphic fallacy" and produces "methodological incongruence" between qualitative values and actual practice (n.d.). The recommended remedy is to move "from partitioning data to holistic querying", treating the model as an interlocutor rather than a coder. The same impulse animates a family of dialogic, post-coding frameworks: Friese's Conversational Analysis with AI (Friese 2025), Morgan's Query-Based Analysis, which "reverses" coding by developing themes through an extended conversation with the data before examining the supporting quotations (n.d.), and Nguyen-Trung and Nguyen's Narrative-Integrated Thematic Analysis, which supports theme generation without coding (Nguyen-Trung & Nguyen 2026). There is also work, including our own, that argues for moving beyond the binary of acceptance and rejection altogether (Friese et al. 2026).

A fourth position is empirical and cautionary. Ashwin, Chhabra and Rao show, with interviews from Rohingya refugees and their Bengali hosts in Bangladesh, that the errors LLMs make when coding are "not random with respect to the characteristics of the interview subjects", so that LLM-generated codes can lead to "misleading inferences"; they recommend training a bespoke supervised model on a subset of transcripts coded by trained researchers rather than using an LLM directly (Ashwin et al. 2025). Related work finds cultural alignment effects when GPT and human coders are compared (Wei et al. 2025).

These positions pull in different directions, but most of them share a target: the use of an LLM to perform open-ended interpretive synthesis, whether by imitating human coding or by carrying a conversation toward themes. The rejection says a machine cannot do this. The incongruence argument says imitating human coding is the wrong way to ask. The bias literature says the output is systematically skewed. Our position is orthogonal to all three because it does not ask the model to do interpretive synthesis at all. It asks the model for one narrow, checkable extraction, and keeps the interpretation, the synthesis and the authority with the human. The rejection's own line, that content analysis can be automated because it is not meaning-based in the reflexive sense, places a narrow, rule-governed extraction on the automatable side of the divide. Our coding act lives on that side too.

3. Code, theme, and the answer-shape of a finding

There is a bridge from this orthogonal position back into the interpretive tradition, and it runs through the difference between a code and a theme.

In their guidance on good practice, Braun and Clarke draw a distinction that they regard as fundamental and that they find researchers often blur: between themes understood as "topic summaries" and themes understood as "meaning-based interpretive stories" (Braun & Clarke 2023). A topic summary collects what participants said about a subject, such as "good experiences in healthcare". A meaning-based theme is built around a central organising idea and tells an interpretive story; their example contrasts the topic "good experiences of healthcare" with the meaning-based theme "validation of my personhood". The test they offer is telling: if you could have written the theme before analysing the data, or if it maps onto a data-collection question, it is probably a topic summary. A real theme could not have been written in advance, because it says something.

The thing we want to draw out of this is not a claim about causation in Braun and Clarke. It is a claim about the shape of a finding. The difference between a code and a real theme, on their own account, is that a code labels and a theme makes a point in relation to the research question. A theme is answer-shaped: it is a response to a real query about something that matters, rather than a folder into which observations are filed. This is why a one-word theme name that names a topic, "Doctors", is a warning sign, while "validation of my personhood" is not. The first answers nothing; the second answers something.

Causal claims are one species of answer-shaped finding. When a participant says that a clinic's Saturday opening let them keep their job and therefore attend, the content is not a topic to be filed under "access" or "opening hours". It is a small answer to a small question about what made a difference. Causal mapping is built to capture exactly this species of content, at scale, and to keep each instance attached to the words that carried it. We are not claiming that a causal map is a reflexive thematic analysis, or that every worthwhile finding is causal; much of what matters in talk is not causal at all, and we return to this in the limits. The point is narrower and, we think, useful: the move from a code to a finding that answers a real question is not foreign to causal mapping. It is the thing causal mapping does, for the causal subset of what people say, and it does it in a way that keeps the answer tied to its evidence.

4. Causal mapping as small-q qualitative data analysis

In ordinary qualitative coding a code typically denotes a theme or concept. In causal mapping the basic coded unit denotes a causal claim made by a source: a cause label, an effect label, a verbatim quote that supports the claim, and a source identifier. A coding act yields an ordered pair, written Cause -> Effect, attached to evidence and provenance. The dataset of such acts is a links table, and that links table is the core qualitative product rather than a step on the way to a narrative (Narayanan 2005; Ackermann & Maytorena-Sanchez 2024).

This is a small unit by design. We do not code strength, polarity, necessity, sufficiency, or role as moderator, the features that some systems-dynamics and grounded-theory traditions attach to links (Kim & Andersen 2012). Most respondents do not state those features, and most analysts cannot reliably extract them from text. By holding the coded unit to bare causation, we keep the chain of evidence intact and keep the act of coding within reach of both human coders and AI assistance. We have called this stance minimalist coding.

The links table can be queried directly. Every node is a factor that appears as a cause or effect in at least one claim. Every edge is a claim with a quote behind it. Natural questions include which factors are the most frequently mentioned upstream influences on a given outcome, how the pathways into a target factor differ across subgroups, and which links are contested, with both X -> Y and X -> not-Y appearing in different sources. Each question is answered by an explicit, reversible operation on the table, and each answer remains traceable to the underlying quotes. For worked examples in evaluation practice, see Remnant et al. (2025) and Powell et al. (2025).
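The kinds of query listed above can be sketched in a few lines of Python. This is a minimal illustration rather than our production tooling: the rows, source identifiers and subgroup labels are invented, quotes are left out for brevity, and the convention that an opposite is written with a "Not " prefix is an assumption made only for this example.

```python
from collections import Counter, defaultdict

# Hypothetical links table: (cause, effect, source_id, subgroup).
# Quotes are omitted for brevity; in practice every row carries one.
links = [
    ("Saturday opening", "Attendance", "s1", "employed"),
    ("Saturday opening", "Attendance", "s2", "employed"),
    ("Free transport", "Attendance", "s3", "unemployed"),
    ("Free transport", "Attendance", "s1", "employed"),
    ("Long waiting times", "Not Attendance", "s4", "unemployed"),
    ("Free transport", "Not Attendance", "s4", "unemployed"),
]

def top_causes(links, outcome):
    """Count the distinct sources naming each direct cause of `outcome`."""
    sources = defaultdict(set)
    for cause, effect, source_id, _ in links:
        if effect == outcome:
            sources[cause].add(source_id)
    return Counter({c: len(s) for c, s in sources.items()}).most_common()

def causes_by_subgroup(links, outcome):
    """Tally the direct causes of `outcome` separately for each subgroup."""
    table = defaultdict(Counter)
    for cause, effect, _, subgroup in links:
        if effect == outcome:
            table[subgroup][cause] += 1
    return {group: dict(counts) for group, counts in table.items()}

def contested(links, negation_prefix="Not "):
    """Causes claimed to lead both to Y and to 'Not Y' (assumed label convention)."""
    effects = defaultdict(set)
    for cause, effect, _, _ in links:
        effects[cause].add(effect)
    return sorted(
        (cause, e) for cause, es in effects.items()
        for e in es if negation_prefix + e in es
    )

print(top_causes(links, "Attendance"))
print(causes_by_subgroup(links, "Attendance"))
print(contested(links))
```

Each function is a plain filter or tally over rows, so any number it returns can be traced back to the rows, and hence the quotes, that produced it.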

This makes causal mapping a practically focused form of qualitative data analysis. The research question is causal, the coded unit is a causal claim, and the analysis is an explicit pipeline of operations on a structured intermediate product. Compared with a thematic analysis, it gives up breadth of interpretive scope in exchange for a much tighter audit trail from any map back to the original quotes. For research questions about drivers, barriers, mechanisms and pathways, the trade is usually worth making. It has a recognisable relative in Mayring's qualitative content analysis, which also seeks a systematic, rule-guided procedure with explicit category development, while preserving qualitative depth (Mayring 2000); causal mapping differs in that its unit is an ordered pair rather than a category, which is what makes pathway analysis possible.

A short worked illustration. If an interviewee says,

After the clinic started opening on Saturdays I did not have to miss work, so I could actually attend.

a thematic pass might code Access, Clinic opening hours, Employment constraints and Attendance. A causal pass records two links:

Saturday opening -> Not missing work
Not missing work -> Attendance

Both representations have value. Only the causal one is queryable as a mechanism. A reader can ask which other factors point into Attendance, which contexts mention Saturday opening, and which paths run from clinic operations to attendance, and every answer stays accountable, line by line, to the original quotes.
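As a small sketch of what "queryable" means here, the two links from the illustration can be held as records and filtered directly. The field names and the source identifier `interview_01` are assumptions made for the example; the text does not specify a schema or name the interviewee.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Link:
    """One coding act: a causal claim with its evidence and provenance."""
    cause: str
    effect: str
    quote: str
    source_id: str  # hypothetical identifier

QUOTE = ("After the clinic started opening on Saturdays I did not have to "
         "miss work, so I could actually attend.")

links = [
    Link("Saturday opening", "Not missing work", QUOTE, "interview_01"),
    Link("Not missing work", "Attendance", QUOTE, "interview_01"),
]

def pointing_into(links, factor):
    """Links whose effect is `factor`, i.e. its claimed direct causes."""
    return [link for link in links if link.effect == factor]

def mentioning(links, factor):
    """Source ids in which `factor` appears as a cause or as an effect."""
    return sorted({l.source_id for l in links if factor in (l.cause, l.effect)})

for link in pointing_into(links, "Attendance"):
    print(f"{link.cause} -> {link.effect} | {link.source_id}: {link.quote}")
print(mentioning(links, "Saturday opening"))
```

Because the quote travels with every record, each query result arrives with its own evidence attached.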

5. Why the causal coding act is the right shape for AI

The standard worry about AI in qualitative analysis is a worry about synthesis. "Find the main themes in this corpus" or "summarise what these interviewees say about X" are open-ended instructions whose output depends on the model's implicit theory of what counts as a theme. The output may read well and may even be roughly right, but it is hard to audit. The model may have downweighted minority views, smoothed over disagreement, drifted from the text, or produced categories that fit its training distribution better than the data. Or it may have done none of these things. The difficulty is that we cannot easily tell, and when the analyst inspects the output, the fluency of the writing tends to discourage scrutiny.

The minimalist coding act has a different shape. The instruction to the model is, in effect:

Identify each passage where the text says that one thing influenced another. For each, record the cause, the effect, and the exact quote that supports the claim.

This instruction refers to features already present in the text, namely explicit causal claims. It produces outputs whose unit, a link with a quote, can be verified by reading the quote. It does not ask the model to weigh, summarise or theorise. Because each link is a separate unit, errors are local: a wrong link can be removed or corrected without unravelling the rest of the analysis. There is supporting evidence that narrow, codebook-anchored deductive coding is the regime in which LLM coding is most usable (Xiao et al. 2023), and a validation study of AI-assisted causal mapping in particular reports that the extraction is accurate enough for the rule to be worth applying (n.d.).
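Local checkability can be made mechanical for one part of the task: whether the returned quote really occurs in the chunk. The sketch below assumes candidate links arrive as dictionaries with `cause`, `effect` and `quote` keys, which is our illustrative format rather than a fixed one, and the second candidate is an invented paraphrase included to show a failure. It checks only that the quote is verbatim; whether the quote supports the claimed link still needs a human reader.

```python
import re

def normalise(text):
    """Collapse whitespace so line breaks do not cause false mismatches."""
    return re.sub(r"\s+", " ", text).strip()

def is_verbatim(quote, chunk):
    """True if a non-empty quote appears word for word in its chunk."""
    needle = normalise(quote)
    return bool(needle) and needle in normalise(chunk)

chunk = ("After the clinic started opening on Saturdays I did not have to\n"
         "miss work, so I could actually attend.")

# Candidate links as a model might return them (invented for illustration).
candidates = [
    {"cause": "Saturday opening", "effect": "Not missing work",
     "quote": "After the clinic started opening on Saturdays "
              "I did not have to miss work"},
    {"cause": "Not missing work", "effect": "Attendance",
     "quote": "I didn't miss work so I attended"},  # paraphrase, not verbatim
]

accepted = [c for c in candidates if is_verbatim(c["quote"], chunk)]
flagged = [c for c in candidates if not is_verbatim(c["quote"], chunk)]

print(len(accepted), "accepted;", len(flagged), "flagged for human review")
```

A flagged link is dropped or corrected on its own; nothing else in the table depends on it.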

In practice we run the extraction on short chunks of text, often a single passage at a time, and recover the corpus-level structure by aggregating the resulting links. The model is not asked to hold the whole corpus in attention or to decide what matters across documents. It does the same small job repeatedly. When the output is wrong we usually do not patch individual links; we adjust the prompt, recode and iterate. Because a coding pass is cheap and its output is locally checkable, this experimentation is cheap too, and the analyst stays in control of what the model is being asked to do.

This is a qualitative version of the split-apply-combine strategy (Wickham 2011). The split is the minimalist coding act; the apply step is a deterministic pipeline of operations on the links table; the combine step is human-authored synthesis. The mapping between strategy and method is what makes the AI question tractable.
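The shape of the pipeline can be sketched as follows. The function `extract_links` is a stand-in that returns canned links for a keyword; it is not a real extractor and makes no model call. The paragraph-based chunking rule and the source identifier are likewise assumptions made for the example.

```python
from collections import Counter

def split_into_chunks(transcript):
    """One chunk per blank-line-separated passage (an assumed chunking rule)."""
    return [p.strip() for p in transcript.split("\n\n") if p.strip()]

def extract_links(chunk):
    """Stand-in for the model call: a hypothetical lookup, NOT a real extractor.

    A real pipeline would send `chunk` to a model with the extraction prompt.
    """
    canned = {
        "Saturdays": [("Saturday opening", "Not missing work"),
                      ("Not missing work", "Attendance")],
    }
    found = []
    for keyword, pairs in canned.items():
        if keyword in chunk:
            found += [(cause, effect, chunk) for cause, effect in pairs]
    return found

def build_links_table(transcripts):
    """Split: run the same small coding act on every chunk and pool the rows."""
    table = []
    for source_id, transcript in transcripts.items():
        for chunk in split_into_chunks(transcript):
            for cause, effect, quote in extract_links(chunk):
                table.append((cause, effect, quote, source_id))
    return table

def edge_counts(table):
    """Apply: a deterministic transform of the links table."""
    return Counter((cause, effect) for cause, effect, _, _ in table)

transcripts = {  # hypothetical source id
    "interview_01": "I work six days a week.\n\nAfter the clinic started "
                    "opening on Saturdays I did not have to miss work, "
                    "so I could actually attend.",
}
table = build_links_table(transcripts)
print(edge_counts(table))
# Combine: interpreting these counts is left to the analyst, not to code.
```

Only `extract_links` would involve a model. Everything downstream of the table is ordinary, inspectable code, and the final step is deliberately absent from it.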

A division of labour follows. The model is a clerk: fast, consistent, willing to apply a stable rule across thousands of passages, and locally accurate enough for the rule to be worth applying. The human is the architect, responsible for everything that takes judgement: framing the research question, choosing and curating the factor vocabulary rather than letting the model invent it freely, deciding where hierarchies or paired opposites are warranted, designing the analytic pipeline, and writing the interpretation, including its limits, its contested claims and the cases that do not fit. None of this is delegated to the model. The consequence is that the model never produces an analytic claim. It produces candidate evidence in a fixed format, which the human assembles into claims.

It is worth being explicit about the boundary, because this is where criticism tends to land. In our workflow the model sees one chunk of text at a time and is asked to extract causal claims from it. It does not choose which research questions matter, which sources to include, what counts as a factor, where opposites or hierarchies belong, which paths through the map are interesting, how to summarise across documents, or how to write any part of the account. It proposes candidate cause and effect labels for each apparent causal claim in the chunk, pairs each with the exact supporting quote, and can flag uncertain cases for human review. Every analytic move beyond that is either the analyst's authorship or a deterministic transform whose code can be inspected.

6. A worked comparison: loneliness in London

To make the argument concrete, and to put it next to its main rival on equal terms, consider a single corpus analysed two ways.

The corpus is 48 interviews on the experience of loneliness with young adults aged 18 to 24, recruited from four deprived London boroughs and collected in 2019, available as a de-identified open dataset (n.d.). It is good-quality, publicly available data on a topic that matters, which is why we use it here.

6.1 A causal-mapping pass

Coding the 48 interviews with the minimalist causal act, using AI for the extraction step only, produced a links table of around 3,392 quote-grounded causal claims, in roughly twenty minutes of processing. This was a light, illustrative analysis rather than a full empirical study, but it is enough to show the workflow. The resulting knowledge graph can be interrogated as a whole or filtered to any question of interest. One can ask which factors young Londoners most often say drive loneliness and which they say relieve it; one can trace pathways, for example from threat-appraised public space, through withdrawal, to sustained isolation, or from structured shared-purpose settings to a sense of connection; one can compare the pathways that recur across subgroups. Every arrow in the map carries the verbatim quote behind it, so reading the map and reading the evidence are the same action: pick an edge, see the claims and the quotes that produced it.
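Tracing a pathway while keeping the evidence on each arrow might look like the sketch below. The factor labels echo the pathways named above, but the rows, source identifiers and bracketed quote placeholders are invented for illustration and are not drawn from the corpus.

```python
from collections import defaultdict

# Hypothetical rows: (cause, effect, quote placeholder, source_id).
links = [
    ("Threat-appraised public space", "Withdrawal", "[quote A]", "p07"),
    ("Withdrawal", "Sustained isolation", "[quote B]", "p07"),
    ("Withdrawal", "Sustained isolation", "[quote C]", "p21"),
    ("Structured shared-purpose settings", "Sense of connection", "[quote D]", "p33"),
]

def build_graph(links):
    """Index evidence by edge: graph[cause][effect] -> [(quote, source_id)]."""
    graph = defaultdict(lambda: defaultdict(list))
    for cause, effect, quote, source_id in links:
        graph[cause][effect].append((quote, source_id))
    return graph

def find_paths(graph, start, goal, path=None):
    """Yield every cycle-free path from `start` to `goal`, depth first."""
    path = (path or []) + [start]
    if start == goal:
        yield path
        return
    # .get avoids creating empty entries for factors with no outgoing links
    for nxt in list(graph.get(start, {})):
        if nxt not in path:
            yield from find_paths(graph, nxt, goal, path)

graph = build_graph(links)
for path in find_paths(graph, "Threat-appraised public space", "Sustained isolation"):
    for cause, effect in zip(path, path[1:]):
        print(f"{cause} -> {effect}: {graph[cause][effect]}")
```

Printing a path prints its quotes, which is the sense in which reading the map and reading the evidence coincide.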

The point is not that the map is a finished account of loneliness. It is not, and it is not meant to be. The point is that it is an instrument for exploring the corpus that stays accountable to the corpus at every step. The human work, deciding which questions to ask of the map, which pathways to follow, how to curate the factor vocabulary, and what the evidence does and does not support, remains human, and the AI's contribution is confined to proposing quote-backed links.

6.2 A fully autonomous AI thematic pass on the same data

The same 48 interviews were also analysed in a contrasting way: an AI agent was given the transcripts and a high-level instruction to develop and apply a thematic analysis method of its own choosing, keeping its own memo, codebook, theory and evidence files as it went. The agent planned the workflow, carried it out, and wrote the paper, with the human role confined to selecting the data and writing the opening instruction. It produced a coherent and readable account of loneliness as four interacting "mechanism stories": misrecognition and social performance, place-based constraint, connection infrastructure, and digital and material filters.

This case is worth distinguishing carefully, because it is neither our position nor the dialogic one. There was no real conversation: the human did not interrogate the data through the model turn by turn, as Query-Based Analysis or Conversational Analysis with AI prescribe. The machine ran the whole interpretive procedure on its own. In that respect it is, oddly, the most positivist of the three, because it treats analysis as a procedure to be executed rather than as situated human interpretation. The conversational camp would disown it as readily as the rejection camp, since both insist that the human rather than the model must do the interpreting (Jowsey et al. 2025; n.d.). It is a clarifying limit case rather than a method we recommend.

What happened when its output was checked is instructive. The account read as systematic, and that was itself a hazard: it was easy to assume that because it looked systematic it must be right. On checking the quotations against the sources, some were not verbatim, and at least one was attributed to the wrong interview. On checking the described method against the files the agent had actually produced, some process claims overstated what had been done: a timestamped journal was not maintained as fully as the paper claimed, and a "systematic negative-case search" looked, on the evidence trail, more like a check of a deliberately chosen sample. None of these errors is fatal, and the experiment was a serious and revealing one. But they are exactly the errors that handing interpretation wholesale to the model makes hard to catch, because the analytic state lives inside the model's working notes and its narrative about itself, and the reader has to trust both.
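The two quotation problems described above, wording that is not verbatim and attribution to the wrong interview, are the kind of thing a short script can screen for. The following is a sketch under stated assumptions: the two-source corpus and the reported quotations are invented and are not taken from the loneliness dataset, and matching is exact substring search after collapsing whitespace, so a near-verbatim quote is reported as not found.

```python
def audit_quote(quote, cited_source, corpus):
    """Classify a quotation against the corpus it claims to come from.

    Returns "verbatim", "misattributed:<source_id>", or "not found".
    """
    def squash(text):
        return " ".join(text.split())

    needle = squash(quote)
    if not needle:
        return "not found"
    if needle in squash(corpus.get(cited_source, "")):
        return "verbatim"
    for source_id, text in corpus.items():
        if needle in squash(text):
            return f"misattributed:{source_id}"
    return "not found"

# Invented corpus and quotations, for illustration only.
corpus = {
    "int_01": "I stopped going out in the evenings. It did not feel safe.",
    "int_02": "The football group gave me a reason to leave the house.",
}
reported = [
    ("It did not feel safe.", "int_01"),
    ("The football group gave me a reason to leave the house.", "int_01"),
    ("I never felt safe going out.", "int_01"),
]
for quote, cited in reported:
    print(cited, "|", audit_quote(quote, cited, corpus), "|", quote)
```

A check of this sort catches misquotation and misattribution. It does not catch overstated claims about process, which still require comparing the written method with the files actually produced.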

6.3 What the comparison shows

Three cases sit side by side here. A human-AI dialogue, in which the analyst converses with the model and keeps the interpretation by staying in the exchange. A fully autonomous pass, in which the model runs the interpretation itself. And a one-way causal-mapping pipeline, in which the model does only the narrow extraction and the human does everything else. The first two reach for meaning-based interpretive stories of the kind reflexive thematic analysis values, and a causal map does not and should not try to produce those. But on the specific matter the special issue foregrounds, accountable links between claims and materials, the contrast is sharp. In both the dialogic and the autonomous passes, the quote is something that must be requested and then verified after the fact, and the account of method is something the reader takes partly on trust. In the causal-mapping pass, the quote is the unit of coding, present on every link by construction, and the account of method is a pipeline whose steps can be rerun. The one-way street is not a poorer conversation. It is a different thing, and for this kind of question it is the more accountable thing.

7. The strongest objections, and why they do not land

We now state the strongest criticisms of the position as clearly as we can, in the form their proponents would recognise, and try to neutralise each. None of them, we think, is really an opponent of our view so much as an objection that becomes sharp when our view is stated carelessly. Stating it carefully is most of the answer.

7.1 The incongruence objection

The strongest version runs as follows. Reflexive, non-positivist qualitative analysis treats meaning as constructed through the researcher's situated subjectivity; a research question is not a specification to be executed. Standardised, automatable coding smuggles a positivist epistemology in under a procedural guise. Worse, coding is a skeuomorphic crutch built for human cognitive limits that an AI does not have, so building a method on AI coding imitates an obsolete workaround (n.d.). Dressing the model up as "just a clerk" hides the fact that deciding what counts as a causal claim is itself a theory-laden, interpretive act.

We grant almost all of the premise. For Big-Q reflexive work, we agree that meaning is constructed, that coding-as-imitation is the wrong way to use a model, and that calling such work automatable would be a category mistake. We do not claim our method for that work. The objection bites against automating theme generation; it does not reach a quote-grounded links table that a human reads, challenges and rebuilds. And the skeuomorphism point, which is a good one, actually misses our case. The criticism is that imposing line-by-line coding on a model imitates a memory crutch the model does not need, because the model can hold a whole interview at once. But our links table is not a memory aid for the model. The model is stateless from chunk to chunk; it gets no benefit from the table at all. The table exists for the humans and the readers, as the audit artefact that keeps interpretation visible. Holding the analysis "in the model's head", which the incongruence argument recommends as the cure, is precisely the move that removes the artefact and reinstates the black box. As for the last clause: yes, deciding what counts as a causal claim is theory-laden, and we do not pretend otherwise. The difference is that in our workflow that theory is written down, as an extraction prompt that functions as the codebook, and shared, so that its effects can be inspected and argued with rather than left inside a coder's head.

7.2 The reduction objection

The second objection comes from the interpretive traditions the special issue serves. Cause-to-effect arrows flatten situated, indexical, interactionally produced talk into decontextualised propositions, and so lose exactly what conversation analysis, discourse analysis and narrative analysis exist to study.

The answer is scope, and it has three parts. First, a coded link records what a source claimed rather than a fact about the world, and it keeps the verbatim quote attached; the arrow indexes the talk rather than replacing it, and the analyst who wants to read the surrounding turn can always do so. Second, causal mapping addresses the causal layer of meaning and leaves the rest alone; it is compatible with, and can run alongside, a fine-grained interactional analysis of a subset of episodes. Third, the map is an instrument for closer reading rather than a substitute for it. Picking an edge and reading the quotes beneath it is usually where analysis begins. The reduction objection is decisive against any claim that a causal map is a complete account of a corpus. We make no such claim.

7.3 The bias objection

The third objection is empirical and, of the five, the one we take most seriously. LLM coding has been shown to introduce systematic, non-random error correlated with the characteristics of interview subjects, which can drive misleading inferences; the proposed remedy is a bespoke supervised model trained on human codes rather than direct LLM use (Ashwin et al. 2025), and cultural-alignment effects appear when machine and human coding are compared (Wei et al. 2025). Even proponents of conversational AI analysis concede the point: Morgan notes that models can return non-existent quotes and that "there is no guaranteed system for detecting and eliminating bias in AI-based analysis" (n.d.).

We do not dispute the finding. Three things follow, and none of them rescues open-ended synthesis at our method's expense. First, our workflow relocates bias into an explicit, shareable prompt and makes each unit locally checkable against its quote, so that whether the extraction is faithful, for any subgroup, is something a reader can inspect rather than infer. Bias is most dangerous exactly where no one looks at the quote; our workflow requires looking at the quote. Second, the bias literature's own recommendation, a narrow model trained on a sample of carefully human-coded transcripts, points toward standardised, checkable coding and away from holistic LLM synthesis; it is closer to our position than to the dialogic alternative. Third, where bias would do the most damage, in moving from counts of claims to inferences about magnitude, our method is explicit that frequency is not effect size, and that the move from a tally of claims to a conclusion about the world requires the same care it would in any study. The bias warning is real, and it tells against trusting an unread synthesis far more than against a workflow whose every unit is presented for reading.

7.4 The post-coding objection

The fourth objection says that coding is the past. LLMs free us from the reductive coding bottleneck, and the future is dialogic: query the data, converse with it, let themes emerge from the exchange (Friese 2025; n.d.; Nguyen-Trung & Nguyen 2026). Re-instituting a rigid coding step is a step backwards.

This is the objection most directly opposed to our design, and the disagreement is real rather than merely apparent, so it is worth being exact. The dialogic paradigm puts the analytic state inside the model's working memory and the analyst's conversation with it. That is the very location the rejection camp distrusts, and with reason: a conversation is hard to reproduce, and a model's account of its own process can be confabulated, as our worked comparison showed. A strong intermediate representation, the quote-grounded links table, is what keeps the work accountable, and we are not willing to give it up for fluency. The one-way pipeline is a feature of the design rather than a cost of it. We are not against dialogue as such; we relocate it to where it earns its keep, in refining the research question and the extraction prompt, and we keep it out of the place where it does damage, in the unrepeatable interpretation of the corpus. There is room for hybrids, and a conversational front end that proposes questions to ask of a links table is a plausible one. But the trade-offs are not symmetrical, and accountability is on our side of them.

7.5 The "where is the gain" objection

The last objection is deflationary. You admit it is not a one-click workflow. You have simply moved the labour into prompt engineering and vocabulary curation, which are idiosyncratic and unvalidated; the human effort has not gone away, it has hidden.

The effort has not gone away, and we do not claim it has. But its concentration is the point. In a manual coding study, judgement is spread thinly and invisibly across many thousands of small decisions, each made once, none recorded. In this workflow, judgement moves to a few high-leverage moments: framing the causal question, curating the factor vocabulary, designing the pipeline, and authoring the synthesis. Those moments are fewer, they are explicit, and they are shareable. A prompt that functions as a codebook is a more inspectable object than the internalised judgement of a coder, precisely because it is written down. The gain is not that the human does less thinking. It is that the human's thinking is focused where it matters and is left on the record.

8. Accountability, transparency and ethics

A causal map produced this way carries a complete audit trail. Pick any edge in the final map and you can see the bundle of underlying claims, each with its source, its verbatim quote and its raw extracted labels, and you can see which transforms have been applied between the raw coding and the rendered view: which sources were filtered, which labels were rewritten, which evidence thresholds were imposed. Every analytic claim in the paper reduces to a sequence of explicit operations on a links table whose rows are quote-grounded extractions. A reader who doubts the analysis can rerun the pipeline, change one step and observe the consequence. A coder who disagrees with a particular link can reject it without disturbing the rest.
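To make the idea of "explicit operations on a links table" concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the implementation used in our tooling: the `Link` fields, the helper names (`extracted`, `filter_sources`, `relabel`, `bundle`, `threshold`) and the three example rows are all hypothetical. The point it shows is that each transform is a small, named, deterministic step, and that every edge in the final view still carries its sources, quotes and raw labels.

```python
from collections import defaultdict
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Link:
    """One quote-grounded causal claim (hypothetical schema)."""
    source_id: str
    quote: str
    raw_cause: str   # label exactly as extracted, never overwritten
    raw_effect: str
    cause: str       # current label, may be rewritten by a transform
    effect: str


def extracted(source_id, cause, effect, quote):
    """Build a freshly extracted link: current labels start as the raw ones."""
    return Link(source_id, quote, cause, effect, cause, effect)


def filter_sources(links, keep):
    """Transform: keep only links from the named sources."""
    return [l for l in links if l.source_id in keep]


def relabel(links, mapping):
    """Transform: rewrite labels via an explicit, shareable mapping."""
    return [
        replace(l, cause=mapping.get(l.cause, l.cause),
                effect=mapping.get(l.effect, l.effect))
        for l in links
    ]


def bundle(links):
    """Group links into edges; each edge keeps its underlying claims."""
    bundles = defaultdict(list)
    for l in links:
        bundles[(l.cause, l.effect)].append(l)
    return dict(bundles)


def threshold(bundles, min_sources):
    """Transform: keep edges supported by at least `min_sources` sources."""
    return {
        edge: claims for edge, claims in bundles.items()
        if len({c.source_id for c in claims}) >= min_sources
    }


# Invented example rows, for illustration only.
links = [
    extracted("S1", "training sessions", "better knowledge",
              "After the training sessions I understood the method much better."),
    extracted("S2", "Training", "better knowledge",
              "The training gave us better knowledge of the approach."),
    extracted("S3", "better knowledge", "adoption",
              "Once we knew more, we started using it."),
]
mapping = {"training sessions": "Training",
           "better knowledge": "Knowledge",
           "adoption": "Adoption"}

view = threshold(
    bundle(relabel(filter_sources(links, {"S1", "S2", "S3"}), mapping)),
    min_sources=2,
)

# Audit trail: for any edge in the view, list the claims behind it.
for (cause, effect), claims in view.items():
    for c in claims:
        print(cause, "->", effect, "|", c.source_id,
              "|", c.raw_cause, "->", c.raw_effect, "|", c.quote)
```

Changing one step, for example lowering `min_sources` to 1, and rerunning is the kind of check a sceptical reader can make for themselves.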

This lets us give direct answers to the three disclosures the special issue requires.

Type of application. Commercial large language models are used at the extraction step. Specific model versions are reported, because output behaviour changes with versions and a replication needs to know which version produced which links.
Role in the workflow. The model is used only to propose candidate links from short text chunks, each paired with a verbatim quote. It is not used to summarise across documents, to choose the codebook, to write the report, or to make any claim about the world.
How accountability was maintained. Every link in the final analysis is traceable to a specific quote and source. Every analytic step beyond extraction is deterministic and human-authored. The links table is an artefact that can be shared with reviewers and co-researchers and inspected line by line. Because the extraction prompt is the codebook in another form, sharing the prompt is sharing the method, and a reviewer can read it and judge whether it framed the task fairly.
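The traceability claim can itself be checked mechanically. The sketch below is one hedged way to do that, assuming links are stored as dictionaries with `source_id` and `quote` keys and that the full source texts are available in a dictionary keyed by source identifier. Those names, the whitespace normalisation and the example rows are our assumptions for illustration. The check flags any link whose quote is empty or does not appear verbatim in its source.

```python
def normalise(text):
    """Collapse runs of whitespace so line breaks do not cause false alarms."""
    return " ".join(text.split())


def ungrounded_links(links, source_texts):
    """Return links whose quote is empty or not found verbatim in the source."""
    failures = []
    for link in links:
        text = normalise(source_texts.get(link["source_id"], ""))
        quote = normalise(link["quote"])
        if not quote or quote not in text:
            failures.append(link)
    return failures


# Invented example data, for illustration only.
source_texts = {
    "S1": "After the training   sessions I understood\nthe method much better.",
}
links = [
    {"source_id": "S1", "cause": "Training", "effect": "Knowledge",
     "quote": "After the training sessions I understood the method much better."},
    {"source_id": "S1", "cause": "Training", "effect": "Adoption",
     "quote": "The training made everyone adopt the method."},
]

failures = ungrounded_links(links, source_texts)
for link in failures:
    print("No verbatim match:", link["source_id"], link["quote"])
```

A check of this kind confirms only that the quote exists in the source. Whether the quote actually supports the coded link remains a human judgement.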

There are practical ethical considerations beyond declaration. Where the source material is sensitive, the choice of model matters, because some commercial services may train on inputs or store data in ways that conflict with consent and confidentiality commitments. We use settings that disable training on inputs where they are available, prefer providers with explicit data-handling commitments, and remove personally identifying information at the chunking stage where possible. Consent processes should make AI use explicit, including the kind of model, the role it plays and the safeguards applied. These are the same data-protection obligations the special issue rightly treats as integral rather than as add-ons.

9. Limits

The workflow does not solve every qualitative research problem, and we do not claim that it should.

Non-causal meaning. Much valuable content in talk is not causal: identity work, norms, emotion, metaphor, the turn-by-turn structure of interaction. Causal coding ignores these, much as conversation analysis ignores other features. For questions about how participants account for what makes a difference, causal coding is on point; for questions about how they do identity or manage stance, it is not.
Claims are not facts. A coded link records what a source claimed. It does not establish what is true in the world. The map is a structured record of evidence, and the move from evidence to truth needs the same care it would in any qualitative study.
Frequency is not effect size. Counts of sources and claims measure how widely something is said in the corpus. They do not measure the magnitude of any underlying effect.
The transitivity trap. If one source says Training -> Knowledge and another says Knowledge -> Adoption, it is tempting to read a path from training to adoption. Unless within-source thread tracing is imposed and reported, that stitches together a mechanism no one in the corpus actually claimed.
Bias is relocated rather than removed. The framing choices that a human codebook would embody now sit in the extraction prompt. The advantage is that the prompt is an explicit, shareable artefact whose effects can be inspected and revised, but it is still doing the work a human codebook would otherwise do.
Attention is local. Because the model sees one chunk at a time, it cannot pick up cross-references or implicit qualifications that span chunks. For questions where cross-document context is the point, this workflow is not enough on its own.
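The transitivity trap listed above can be made visible with a small check. The sketch below, in Python, assumes links are held as `(source_id, cause, effect)` tuples; the function name `two_step_paths` and the example rows are hypothetical. For every two-step path in the map it reports which sources made both claims. An empty set means the path exists only because claims from different sources were stitched together.

```python
from collections import defaultdict


def two_step_paths(links):
    """Map each path (a, b, c) to the set of sources that claimed both steps.

    `links` is an iterable of (source_id, cause, effect) tuples.
    """
    sources_by_edge = defaultdict(set)
    for source_id, cause, effect in links:
        sources_by_edge[(cause, effect)].add(source_id)

    paths = {}
    for (a, b), first_sources in sources_by_edge.items():
        for (b2, c), second_sources in sources_by_edge.items():
            if b == b2 and a != c:
                # Only sources that made both claims can support the path.
                paths[(a, b, c)] = first_sources & second_sources
    return paths


# Invented example rows, for illustration only.
stitched = [
    ("S1", "Training", "Knowledge"),
    ("S2", "Knowledge", "Adoption"),
]
threaded = stitched + [
    ("S3", "Training", "Knowledge"),
    ("S3", "Knowledge", "Adoption"),
]

for name, links in [("stitched", stitched), ("threaded", threaded)]:
    for path, sources in two_step_paths(links).items():
        label = ", ".join(sorted(sources)) or "no single source"
        print(name, " -> ".join(path), "| claimed in full by:", label)
```

A shared source is a necessary condition for a within-source thread, not a sufficient one: a source may make both claims in unrelated passages, so the flagged paths still need to be read against their quotes.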

10. Conclusion

The AI question in qualitative work is usually posed at the wrong altitude. Asked whether to trust a model with thematic synthesis or with the interpretation of a corpus, most qualitative researchers reasonably refuse, and the rejection letter states that refusal with care and force. Asked instead whether to use a model for one narrow extraction at which it is locally accurate and locally checkable, while every other analytic decision stays in human hands, the answer can be yes, and it can be a yes that the anti-positivist is free to give, because nothing in it asks a machine to make meaning. That is the sense in which our position is orthogonal rather than opposed: we are not at the embrace end of the line, and the people at the reject end are not, on this narrow matter, our opponents.

Causal mapping earns its modest place in such a workflow because the unit of coding is small, the intermediate product is structured, and the chain from any analytic claim back to a supporting quote is short and inspectable. Those are also the properties that keep AI assistance accountable. The picture to hold on to is plain: a model that stays inside the narrow limits it is given, a human who never hands over judgement, and a quote-grounded links table in the middle that anyone can read.

References

Ackermann, & Maytorena-Sanchez (2024). Overlooked and Underused? The Benefits and Challenges of Using Causal Mapping for Project Studies. https://doi.org/10.1016/j.plas.2024.100161.

Ashwin, Chhabra, & Rao (2025). Using Large Language Models for Qualitative Analysis Can Introduce Serious Bias. SAGE Publications Inc. https://doi.org/10.1177/00491241251338246.

Axelrod (1976). The Analysis of Cognitive Maps. In Structure of Decision: The Cognitive Maps of Political Elites.

Braun, & Clarke (2023). Toward Good Practice in Thematic Analysis: Avoiding Common Problems and Be(Com)Ing a Knowing Researcher. Taylor & Francis. https://doi.org/10.1080/26895269.2022.2129597.

Braun, & Clarke (2025). Reporting Guidelines for Qualitative Research: A Values-Based Approach. Routledge. https://doi.org/10.1080/14780887.2024.2382244.

Eden (1988). Cognitive Mapping. https://doi.org/10.1016/0377-2217(88)90002-1.

Evaluation (2024). Causal Mapping. https://www.betterevaluation.org/methods-approaches/methods/causal-mapping.

Friese (2025). Conversational Analysis with AI - CA to the Power of AI: Rethinking Coding in Qualitative Analysis. https://doi.org/10.2139/ssrn.5232579.

Friese, Nguyen-Trung, Powell, & Morgan (2026). Beyond Binary Positions: Making Space for Critical and Reflexive GenAI Integration in Qualitative Research. https://doi.org/10.2139/ssrn.5962174.

Jowsey, Braun, Clarke, Lupton, & Fine (2025). We Reject the Use of Generative Artificial Intelligence for Reflexive Qualitative Research. https://doi.org/10.2139/ssrn.5676462.

Kim, & Andersen (2012). Building Confidence in Causal Maps Generated from Purposive Text Data: Mapping Transcripts of the Federal Reserve. https://doi.org/10.1002/sdr.1480.

Mayring (2000). Qualitative Content Analysis. Forum Qualitative Sozialforschung / Forum: Qualitative Social Research. https://doi.org/10.17169/FQS-1.2.1089.

Narayanan (2005). Causal Mapping: An Historical Overview. In Causal Mapping for Research in Information Technology. https://www.google.co.uk/books/edition/_/61z36j6QgmAC?hl=en&gbpv=1.

Nguyen-Trung, & Nguyen (2026). Narrative-Integrated Thematic Analysis (NITA): How Can LLMs Support Theme Generation without Coding? Routledge. https://doi.org/10.1080/14780887.2026.2638348.

Powell, Copestake, & Remnant (2024). Causal Mapping for Evaluators. https://doi.org/10.1177/13563890231196601.

Powell, Cabral, & Mishan (2025). A Workflow for Collecting and Understanding Stories at Scale, Supported by Artificial Intelligence. SAGE Publications. https://doi.org/10.1177/13563890251328640.

Remnant, Copestake, Powell, & Channon (2025). Qualitative Causal Mapping in Evaluations. In Handbook of Health Services Evaluation: Theories, Methods and Innovative Practices. https://doi.org/10.1007/978-3-031-87869-5_12.

Wei, Liu, Barany, Ocumpaugh, Mehta, Nasiar, Baker, Zambrano, Vanacore, & Giordano (2025). Cultural Alignment and Biases in Qualitative Coding: Comparing GPT and Human Coders. https://doi.org/10.35542/osf.io/h8u4f_v1.

Xiao, Yuan, Liao, Abdelghani, & Oudeyer (2023). Supporting Qualitative Analysis with Large Language Models: Combining Codebook with GPT-3 for Deductive Coding. In 28th International Conference on Intelligent User Interfaces. https://doi.org/10.1145/3581754.3584136.

Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. https://doi.org/10.18637/jss.v040.i01.