Pearl frequently refers to what he calls a “causal ladder”, a hierarchy of three types of problems of increasing difficulty: (1) prediction, (2) intervention, and (3) counterfactuals. If we wish to ascend the ladder and increase the range of causal questions we can answer, it is crucial to understand what makes each level more difficult than the last and what additional knowledge we need at each level.

## Distinct levels: the ladder abstracted

So first off, why are the three levels distinct?

Prediction. To do prediction within a system, we only need to ask questions about the system as it currently is. Therefore, it’s sufficient to have the joint distribution $P$ over all variables $V$ in the system. If we have the joint distribution $P$, then we can answer any question of the form, “Given that the variables $X \subseteq V$ are observed to be $x$, what is the probability that the variables $Y \subseteq V$ are equal to $y$?”
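To make the prediction level concrete, here is a minimal sketch that answers such a query directly from a joint distribution. The three binary variables `A`, `B`, `C` and all of the probabilities are made up for illustration; nothing here comes from Pearl.

```python
# Hypothetical joint distribution P over three binary variables (A, B, C).
# The numbers are invented for illustration and sum to 1.
VARIABLES = ("A", "B", "C")
JOINT = {
    (0, 0, 0): 0.20, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.15,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.05, (1, 1, 1): 0.30,
}


def probability(event):
    """Sum the joint over every assignment consistent with `event` (a dict)."""
    total = 0.0
    for assignment, p in JOINT.items():
        world = dict(zip(VARIABLES, assignment))
        if all(world[name] == value for name, value in event.items()):
            total += p
    return total


def predict(query, evidence):
    """P(query | evidence) = P(query, evidence) / P(evidence)."""
    return probability({**query, **evidence}) / probability(evidence)


# "Given that A is observed to be 1, what is the probability that C is 1?"
print(predict({"C": 1}, {"A": 1}))
```

With these made-up numbers the answer comes out to about 0.8. The point is that nothing beyond the table `JOINT` is needed: no graph, no mechanisms.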

Intervention. By performing an exogenous intervention in a system, we change its distribution. The original distribution $P$ may no longer be valid. Thus, in order to answer questions about intervention, we need a family of distributions $\{P_{X=x}\}_{X\subseteq V}$. Once we have this family, we can answer any question of the form, “If we force the variables $X \subseteq V$ to $x$, then what is the probability that the variables $Y \subseteq V$ are equal to $y$?”

Since $P_{X=x}$ is a valid probability distribution, we can also condition on observations after intervention, i.e., “If we force the variables $X \subseteq V$ to $x$, and then observe that variables $E \subseteq V$ are equal to $e$, then what is the probability that the variables $Y \subseteq V$ are equal to $y$?” The answer to this question is given by $P_{X=x}(Y=y \mid E=e)$.

Finally, note that the distribution with no intervention (i.e. $X = \emptyset$) is equal to our original distribution. Thus, intervention subsumes prediction, justifying its place on rung two of the ladder.
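A small sketch may help show why $P_{X=x}(Y=y)$ is a different object from $P(Y=y \mid X=x)$. The model below is hypothetical: a common cause `C` drives both `X` and `Y`, and `X` has no effect on `Y` at all. The function names and noise probabilities are my own choices for illustration.

```python
from itertools import product

# Hypothetical model: C -> X and C -> Y, with no arrow from X to Y.
# C is a fair coin; N is a noise bit that flips Y with probability 0.1.
P_C = {0: 0.5, 1: 0.5}
P_N = {0: 0.9, 1: 0.1}


def simulate(c, n, do_x=None):
    """Return the world (C, X, Y); `do_x` overrides X's usual mechanism."""
    x = c if do_x is None else do_x
    y = c ^ n  # Y depends on C and noise only, never on X
    return {"C": c, "X": x, "Y": y}


def distribution(do_x=None):
    """Enumerate the noise to get the joint, with or without an intervention."""
    return [(simulate(c, n, do_x), P_C[c] * P_N[n]) for c, n in product(P_C, P_N)]


def matches(world, event):
    return all(world[name] == value for name, value in event.items())


def prob(dist, query, evidence=None):
    """P(query | evidence) under the given (possibly intervened) distribution."""
    evidence = evidence or {}
    den = sum(p for w, p in dist if matches(w, evidence))
    num = sum(p for w, p in dist if matches(w, evidence) and matches(w, query))
    return num / den


observational = distribution()      # P, i.e. intervening on the empty set
intervened = distribution(do_x=1)   # P_{X=1}

seeing = prob(observational, {"Y": 1}, {"X": 1})  # P(Y=1 | X=1)
doing = prob(intervened, {"Y": 1})                # P_{X=1}(Y=1)
print(seeing, doing)
```

In this toy model, seeing $X=1$ makes $Y=1$ very likely (about 0.9) because it reveals $C$, while forcing $X=1$ leaves $Y$ at about 0.5. Conditioning after intervention also works as described above: `prob(intervened, {"Y": 1}, {"C": 1})` computes $P_{X=1}(Y=1 \mid C=1)$.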

Counterfactuals. At the counterfactual level, we are allowed to ask questions of the form, “Given that variables $Z \subseteq V$ were observed to be $z$, if variables $X \subseteq V$ were forced to be $x$, then how likely is it that variables $Y \subseteq V$ would have been equal to $y$?” The new piece of information we have added to the question is the conditioning on $Z=z$. You might think, “Wait, but we could already condition on variables in interventions!” You’re right, except at the intervention level, we could only condition on variables after intervention, not before intervention.

To understand why conditioning on variables before intervention is difficult, consider the question, “Given that Ann and Bob are not at the party, if Ann were at the party, would Bob be at the party?” The observed $Z=z$ (Ann not at the party) contradicts the intervention $X=x$ (Ann being at the party). And such a contradiction doesn’t need to arise explicitly either. Sometimes the world before intervention can conflict with the world after intervention because of the causal structure. For example, “Given that the sky was totally clear, if the volcano had erupted, would Ann go to work?” In this case, the intervention (the volcano erupts) would result in ash in the sky, leading to a contradiction with the antecedent (the sky is clear).

To compute a counterfactual, we need to define some way in which the world before intervention can provide evidence about the world after intervention, even if the two worlds conflict. In other words, we need a family of distributions, $\{P_{Z=z,X=x}\}_{Z,X\subseteq V}$. The difficulty, of course, comes from specifying a coherent family of distributions that matches what we mean when we ask counterfactual questions. Many philosophers have proposed a “closest worlds” approach to counterfactuals, in which the intervention should be thought of as occurring in the closest world to the real world in which we observed $Z=z$ [Lewis, 1973]. A drawback of the closest worlds approach is that it is unclear what the similarity metric between worlds should be. Pearl operationalizes the closest worlds approach through the exogenous variables $U$ in his functional causal models. The exogenous variables are assumed to take the same values in both worlds. Thus, the observed $Z=z$ in the real world can provide evidence about the exogenous, latent variables $U$, and this new belief $P(U=u\mid Z=z)$ carries over into the hypothetical world after the intervention.
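The recipe in the previous paragraph (update the belief over $U$ with the evidence, apply the intervention, then predict) can be sketched by brute-force enumeration. The structural equations, the names `U_X`, `U_1`, `U_2`, and the prior probabilities below are all hypothetical; this illustrates the idea and is not Pearl's own example.

```python
from itertools import product

# Hypothetical priors P(U_i = 1) for independent binary exogenous variables.
PRIORS = {"U_X": 0.5, "U_1": 0.6, "U_2": 0.2}


def prior(u):
    """P(U = u) for a full assignment u of the exogenous variables."""
    p = 1.0
    for name, value in u.items():
        p *= PRIORS[name] if value == 1 else 1.0 - PRIORS[name]
    return p


def solve(u, do=None):
    """Run the structural equations for unit u; `do` overrides mechanisms."""
    do = do or {}
    x = do.get("X", u["U_X"])
    y = do.get("Y", int((x and u["U_1"]) or u["U_2"]))
    return {"X": x, "Y": y}


def all_units():
    names = list(PRIORS)
    for values in product((0, 1), repeat=len(names)):
        yield dict(zip(names, values))


def consistent(world, event):
    return all(world[name] == value for name, value in event.items())


def counterfactual(query, evidence, do):
    """P(query would hold under `do` | evidence observed in the actual world)."""
    num = den = 0.0
    for u in all_units():
        # Abduction: keep only units whose actual world matches the evidence,
        # weighted by the prior, which yields P(U = u | Z = z) after normalizing.
        if not consistent(solve(u), evidence):
            continue
        weight = prior(u)
        den += weight
        # Action and prediction: rerun the same unit u under the intervention.
        if consistent(solve(u, do), query):
            num += weight
    return num / den


# "Given that X=0 and Y=0 were observed, would Y have been 1 had X been 1?"
print(counterfactual({"Y": 1}, {"X": 0, "Y": 0}, {"X": 1}))
# With no evidence this reduces to the interventional quantity P_{X=1}(Y=1).
print(counterfactual({"Y": 1}, {}, {"X": 1}))
```

In this toy model the counterfactual comes out to about 0.6 while the plain interventional probability is about 0.68: observing $Y=0$ rules out the units with `U_2 = 1`, and that information carries over into the hypothetical world.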

Finally, note that by setting $Z=\emptyset$, we recover a family of interventional distributions. Thus counterfactuals subsume intervention, and rightfully earn their place at the top of the ladder.

## Increasing need for data and mechanistic models

As we ascend the ladder, we need more sophisticated data and more mechanistic models.

Prediction. To predict the system in its unaltered state, all we need is observational data.

Intervention. In general, observational data is not sufficient to predict what happens after intervention. To answer a question about what will happen after an intervention, we need either a model of the mechanisms in the domain or experimental (interventional) data, e.g. from randomized controlled trials. If it is possible to perform the intervention, then technically we don’t need a proper model; we just see what happens after the intervention (and perhaps repeat it for many trials to reduce noise). But for cases where direct intervention or experimentation is not possible, we have to rely on a model of the domain. For example, we cannot actually force an earthquake to happen (and that’s probably a good thing); however, we still have a mechanistic model of the domain that allows us to agree with the statement, “Earthquakes can cause damage to buildings.”

Counterfactuals. The nature of counterfactual questions makes a model-free approach implausible. The reason is that the evidence $Z=z$ can conflict with the hypothetical intervention $X=x$, making it impossible to both observe $Z=z$ and perform the intervention $X=x$. For example, consider again the question, “Given that Ann and Bob are not at the party, if Ann were at the party, would Bob be at the party?” Or the question, “What fraction of patients who are treated and died would have survived if they were not treated?” The world after the intervention is only a hypothetical that cannot be observed directly, so in order to compute the counterfactual, we require a model of how reality (in which we observe $Z=z$) connects to this hypothetical world.

(Maybe you think that you can trivially answer the patient question by ignoring the $Z=z$ part and just looking at how many untreated patients survived. But in fact, the validity of this answer rests on the validity of a particular functional causal model. See [Pearl, 2009; Section 1.4.4] for a discussion and this stackexchange answer for a summary. Similarly, see [Balke and Pearl, 1994] for a discussion of the party example.)

Sometimes I hear someone advocate for (usually deep) reinforcement learning (RL) by arguing that “unlike purely statistical models, reinforcement learning agents perform interventions, thus allowing them to learn causality.” The caveat with this statement is that it doesn’t distinguish between causality at level two (intervention) and causality at level three (counterfactuals). Model-free RL approaches, which have received a disproportionate amount of attention, cannot compute counterfactuals. Counterfactuals, by definition, require models.

## Are functional causal models and causal Bayes nets at different levels?

I was initially very confused by the difference between level two (intervention) and level three (counterfactuals) because of an additional claim Pearl makes about their distinction. He claims that causal Bayes nets (CBN) can do interventions, but that in order to do counterfactuals, you need functional causal models (FCM). The only consistent way I’ve found of interpreting this claim is as the statement, “Given a functional causal model of the world, a causal Bayes net whose variables only consist of the endogenous nodes of the functional causal model can do intervention, but not counterfactuals.”

In the examples I’ve seen where Pearl argues for the separation between FCMs and CBNs, he argues for the claim precisely by showing this statement. In the patient example [Pearl, 2009; Section 1.4.4], he constructs an FCM for the example, and then shows that a CBN which only consists of the endogenous nodes of the FCM can compute interventions, but not counterfactuals. In the party example [Balke and Pearl, 1994], the claim is argued for through exactly the same procedure.

Pearl actually explicitly attributes the weaker capacity of CBNs to the lack of exogenous variables [Pearl, 2009; Section 1.4.4]:

The three-step model of counterfactual reasoning also uncovers the real reason why stochastic causal models are insufficient for computing probabilities of counterfactuals. Because the $U$ variables do not appear explicitly in stochastic models, we cannot apply step 1 so as to update $P(u)$ with the evidence $e$ at hand. This implies that several ubiquitous notions based on counterfactuals – including probabilities of causes (given the effects), probabilities of explanations, and context-dependent causal effect – cannot be defined in such models.

As explained earlier, computing counterfactuals requires some way to connect the real world to a hypothetical, counterfactual world. In Pearl’s framework, the exogenous variables, which are assumed to remain constant between the real world and the counterfactual world, play this role. So, it makes sense that if you take them away, you can no longer compute counterfactuals.
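Here is a sketch in the spirit of the patient example. The two sets of structural equations are my own reconstruction (treatment `X`, death `Y`, fair-coin exogenous variables) and should be checked against the book before being attributed to Pearl. The two models agree on every observational and interventional distribution over the endogenous variables, yet disagree completely on the counterfactual.

```python
from itertools import product

# Two hypothetical functional models over treatment X and death Y (1 = died).
# In both, u1 and u2 are independent fair coins and X = u1 unless intervened on.


def model_1(u1, u2, do_x=None):
    x = u1 if do_x is None else do_x
    return {"X": x, "Y": u2}  # treatment never changes the outcome


def model_2(u1, u2, do_x=None):
    x = u1 if do_x is None else do_x
    # Each unit's outcome under treatment is the opposite of its outcome without.
    return {"X": x, "Y": x * u2 + (1 - x) * (1 - u2)}


UNITS = list(product((0, 1), repeat=2))  # every (u1, u2) has prior 1/4


def joint(model, do_x=None):
    """Distribution over (X, Y), observational or under do(X = do_x)."""
    dist = {}
    for u1, u2 in UNITS:
        world = model(u1, u2, do_x)
        key = (world["X"], world["Y"])
        dist[key] = dist.get(key, 0.0) + 0.25
    return dist


def would_have_survived(model):
    """P(Y_{X=0} = 0 | X = 1, Y = 1): treated and died, but survives untreated."""
    # Units are equally likely a priori, so counting them gives the posterior.
    actual = [u for u in UNITS if model(*u) == {"X": 1, "Y": 1}]
    saved = [u for u in actual if model(*u, do_x=0)["Y"] == 0]
    return len(saved) / len(actual)


for do_x in (None, 0, 1):
    print(do_x, joint(model_1, do_x) == joint(model_2, do_x))
print(would_have_survived(model_1), would_have_survived(model_2))
```

Any object that only records the distributions over `X` and `Y`, which is all a CBN over the endogenous nodes has, cannot tell these two models apart, so it has no basis for choosing between the answers 0 and 1.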

But what I found so confusing is that this is not the same as the general claim that functional causal models can do counterfactuals, but causal Bayes nets can only do interventions. If the variables in the CBN are unrestricted, then I believe FCMs and CBNs are formally equivalent. Pearl himself agrees that any causal Bayes net can be represented by a functional causal model; however, he doesn’t agree that the reverse is true [Pearl, 2009; Section 1.4]:

Every stochastic model can be emulated by many functional relationships (with stochastic inputs), but not the other way around; functional relationships can only be approximated, as a limiting case, using stochastic models.

But can’t one always represent a deterministic functional relationship by a degenerate point mass distribution? In fact, in the paper with the party example [Balke and Pearl, 1994], after arguing that CBNs are insufficient, in a later section (titled “Party again”), they show how to represent the party example with a CBN that includes the exogenous variables and uses point mass distributions. I guess Pearl would categorize this as a limiting case?
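To spell out what I mean by a point mass representation: suppose each endogenous variable is given by a structural equation $v_i = f_i(\mathrm{pa}_i, u_i)$, where $f_i$ and $\mathrm{pa}_i$ (the endogenous parents of $v_i$) are my notation, and suppose the exogenous variables $u_i$ are mutually independent. Then, under those assumptions, one could add each $u_i$ to the network as a root node and write

$$
P(v_i \mid \mathrm{pa}_i, u_i) = \mathbb{1}\left[v_i = f_i(\mathrm{pa}_i, u_i)\right],
\qquad
P(v, u) = \prod_i P(u_i) \prod_i P(v_i \mid \mathrm{pa}_i, u_i).
$$

Every conditional distribution for an endogenous node is then degenerate, and all of the randomness lives in the root nodes $u_i$.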

In [Pearl, 2009; Section 7.2.2], Pearl addresses intrinsic non-determinism, which cannot be represented by an FCM. In this case, he constructs a causal Bayes net that performs valid counterfactuals. He contrasts the validity of this causal Bayes net with ordinary ones:

This evaluation can, of course, be implemented in ordinary causal Bayesian networks (i.e., not only in ones that represent intrinsic nondeterminism), but in that case the results computed would not represent the probability of the counterfactual $Y_x = y$. Such evaluation amounts to assuming that units are homogenous, with each possessing the stochastic properties of the population. Such an assumption may be adequate in quantum-level phenomena, where units stand for specific experimental conditions, but it will not be adequate in macroscopic phenomena, where units may differ appreciably from each other. In the example of Chapter 1 (Section 1.4.4, Figure 1.6), the stochastic attribution amounts to assuming that no individual is affected by the drug (as dictated by model 1) while ignoring the possibility that some individuals may, in fact, be more sensitive to the drug than others (as in model 2).

But again, the patient example that he references uses a causal Bayes net with only the endogenous nodes. It amounts to assuming that the units are homogenous precisely because the exogenous nodes are no longer included.

So, in conclusion, my best interpretation of Pearl’s claim is that no causal Bayes net with only the endogenous nodes can compute counterfactuals. If my interpretation is incorrect, please let me know; I would like to understand.

Edit:

A word about the exogenous variables U: These variables specify a “unit”, be it an individual, an agricultural plot, time of day, etc, whatever refinement is needed to make all relationships deterministic. I hope this clarifies the dilemma posed in your last paragraph.

## References

1. Balke, Alexander, and Pearl, Judea. “Probabilistic evaluation of counterfactual queries.” AAAI, 1994.
2. Lewis, David. Counterfactuals. John Wiley & Sons, 2013. First published 1973.
3. Pearl, Judea. Causality. Cambridge University Press, 2009.