Smitha Milli
http://smithamilli.com/
Translation of Peggy Gou's Han Jan (한잔)<p>I’m a fan of <a href="https://www.residentadvisor.net/dj/peggygou/biography">Peggy Gou</a>.
While listening to one of her songs (<a href="https://www.youtube.com/watch?v=XUIHG8-Kamg">Han
Jan</a>), I realized I could actually
understand it (because it’s the simplest song she has lyric-wise :P). I googled for the English lyrics and there were none. All I found was this <a href="https://www.reddit.com/r/Korean/comments/8n3wob/transcribe_lyrics_for_this_song_for_me_please/">reddit post</a>:</p>
<blockquote>
<p>Hey dudes me and my girl here are high and were digging this new Peggy Gou
song but we wanted to sing the lyrics without being like totally racist and
stuff so we wanted the lyrics but there arent any so we were wondering if
anyone could write them down for us.</p>
<p>If you dont want to do the whole thing, just the chorus is cool, thats our fave
part. The bit that says han jan. Not the “enjoy your night” bit we already know
english. Thanks.</p>
</blockquote>
<p>Somehow the song is so fitting for that context. To help out future homies:</p>
<p>어젯밤에 데낄라 마셨네 = Yesterday night, drank tequila</p>
<p>집에가서 침대에 뻗었네 = Went home to crash on my bed</p>
<p>한잔 두잔 세잔 네잔 원샷 = 1 drink, 2 drinks, 3 drinks, 4 drinks, 1 shot</p>
<p>한잔 두잔 세잔 = 1 drink, 2 drinks, 3 drinks</p>
<p>어젯밤에 생각해 보니 = After last night, now that I think about it</p>
<p>취했어도 난 행복했네 = Even though I was drunk, I was happy</p>
<p>한잔 두잔 세잔 네잔 원샷 = 1 drink, 2 drinks, 3 drinks, 4 drinks, 1 shot</p>
<p>한잔 두잔 세잔 = 1 drink, 2 drinks, 3 drinks</p>
<p>You gotta do it right</p>
<p>Enjoy your night</p>
<p>You gotta do it right</p>
<hr />
<p>Peggy Gou has an interesting <a href="https://daily.bandcamp.com/2018/03/21/the-year-of-peggy-gou/">interview on
bandcamp</a> where
she talks about the decision to write lyrics in Korean:
<blockquote>
<p>“If you speak in English, then maybe it will be easier for people to sing
along,” Gou says, “but it has been done too many times. I wanted to do
something different, so I thought, ‘OK, why don’t I do it in Korean, which is
my language.’</p>
</blockquote>
<p>When she talks about It Makes You Forget/Itgehane (잊게하네), another song on the same album as Han Jan, she says:</p>
<blockquote>
<p>Even with Koreans, it’s hard to understand the lyrics, because some of them
are very philosophical, and some of the words in this track we don’t use
anymore</p>
</blockquote>
<p>Yeah, I’ll stick to translating her drinking songs for now :P</p>
Mon, 04 Nov 2019 01:00:00 +0000
http://smithamilli.com/blog/peggy-gou/
http://smithamilli.com/blog/peggy-gou/When a Better Human Model Means Worse Reward Inference<p>Imagine I just lost a game of chess. You might infer that I’m disappointed or not very good at chess. Without any additional information, you probably wouldn’t infer that I <em>wanted</em> to lose the game. Yet, that is the inference that most <em>inverse reinforcement learning</em> (IRL) methods would make. Nearly all of them assume, incorrectly, that the human is (approximately) rational.</p>
<p>Unsurprisingly, an inaccurate model of the human can lead to perverse
inferences; many have constructed and pointed out examples of this (<a href="https://jsteinhardt.wordpress.com/2017/02/07/model-mis-specificationand-inverse-reinforcement-learning/">Steinhardt and Evans, 2017</a>; <a href="https://arxiv.org/abs/1512.05832">Evans, 2016</a>; <a href="http://proceedings.mlr.press/v97/shah19a.html">Shah, 2019</a>). Here, however, I’m going to talk about a seemingly contradictory phenomenon, one that was originally quite perplexing to us: a <em>better</em> human model can lead to <em>worse</em> reward inference. The reason is that we usually evaluate human models in terms of prediction. However, a more predictive human model does not necessarily imply better <em>inference</em>.</p>
<h2 id="case-study-better-human-model-but-worse-reward-inference">Case study: better human model, but worse reward inference?</h2>
<p>One setting where the standard assumption fails is in <em>collaborative</em> environments where the human is aware that the system (“robot”) needs to learn from their demonstrations. In collaborative settings, humans may act <em>pedagogically</em> and try to teach the robot the reward function. However, optimizing for <em>teaching</em> the reward function is different from optimizing for the reward function itself (the standard assumption).</p>
<p>Intuitively, using a more accurate model of the human should result in better
reward inference. Thus, we would expect that when the human is pedagogic, the robot can improve its inference by explicitly modeling the human as pedagogic. In theory, this is certainly true, and indeed it is the direction suggested and pursued by prior work (<a href="https://arxiv.org/abs/1707.06354">Fisac, 2018</a>; <a href="https://arxiv.org/abs/1806.03820">Malik, 2018</a>). However, as part of <a href="https://arxiv.org/abs/1903.03877">our upcoming UAI 2019 paper</a>, we tested this in practice, and found it to be much more nuanced than expected.</p>
<h3 id="experimental-setup">Experimental setup</h3>
<p>We tested the impact of modeling pedagogy on reward inference by performing additional analyses on human studies originally performed by Ho et al (2016). <a href="https://markkho.github.io/documents/NIPS_2016_Teaching_by_demonstration.pdf">Ho et al (2016)</a> conducted an experiment to test whether humans do indeed pick more informative demonstrations in pedagogic settings. In their study, participants were split into two conditions, which I’ll refer to as the <em>literal</em> and the <em>pedagogic</em> condition. Participants in the literal group tried to maximize the reward of their demonstration, while participants in the pedagogic group tried to teach the reward to a partner who would see their demonstration. The environments they tested were gridworlds that were made up of three differently colored tiles. Each color could either be safe or dangerous, leading to eight possible reward functions.</p>
<p><img src="/img/human_mispec/experiment.jpg" />
<small><em>(a) The instructions given to participants in the literal and pedagogic condition, and a sample demonstration from both conditions. (b) All possible reward functions. Each tile color can be either safe (0 points) or dangerous (-2 points). Figure modified from Ho et al (2018).</em></small></p>
<p>Participants in the pedagogic condition did indeed choose more communicative demonstrations. For example, they were more likely to visit all safe colors and to loop over safe tiles multiple times. To quantitatively model the difference between the humans in both conditions, Ho et al (2016,
2018) developed what I’ll call the <em>literal model</em> and the <em>pedagogic model</em>. The literal model is the standard model – the human optimizes for the reward function directly. The pedagogic model, on the other hand, assumes that the human optimizes for teaching the reward function. Compared to the literal model, they found that the pedagogic model was a much better fit to humans in the pedagogic condition. In the figure below, we plotted the log-likelihood of demonstrations under both models for both conditions:</p>
<div class="center"><img src="/img/human_mispec/ll.jpg" width="500px" /></div>
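<p>To make the two models concrete, here is a minimal Python sketch of this style of model (not Ho et al’s exact formulation; the demonstrations, rewards, and rationality coefficient below are invented for illustration). The literal model is Boltzmann-rational in the reward, while the pedagogic model picks demonstrations in proportion to how strongly a literal observer would infer the intended reward from them:</p>

```python
import math

# Toy setup (invented for illustration, not Ho et al's environments):
# two candidate reward functions and three candidate demonstrations;
# R[theta][d] is demonstration d's return under reward function theta.
R = {"safe":   {"short": 1.0, "loop_safe": 0.9, "cross": 1.0},
     "danger": {"short": 1.0, "loop_safe": -2.0, "cross": -4.0}}
thetas = list(R)
demos = list(R["safe"])
beta = 2.0  # assumed rationality coefficient

def literal(theta):
    """Literal model: P(d | theta) proportional to exp(beta * R[theta][d])."""
    w = {d: math.exp(beta * R[theta][d]) for d in demos}
    z = sum(w.values())
    return {d: w[d] / z for d in demos}

def literal_posterior(d):
    """A literal observer's inference: P(theta | d) by Bayes' rule, uniform prior."""
    w = {t: literal(t)[d] for t in thetas}
    z = sum(w.values())
    return {t: w[t] / z for t in thetas}

def pedagogic(theta):
    """Pedagogic model: P(d | theta) proportional to how well d teaches theta."""
    w = {d: literal_posterior(d)[theta] for d in demos}
    z = sum(w.values())
    return {d: w[d] / z for d in demos}
```

<p>In this sketch, the pedagogic demonstrator for the “safe” reward shifts probability mass away from the reward-maximizing but ambiguous demonstration (“short”, which is also optimal under “danger”) toward demonstrations that disambiguate the reward, mirroring the looping behavior observed in the pedagogic condition.</p>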
<h3 id="improving-the-robots-inference">Improving the robot’s inference?</h3>
<p>Turning to the robot side, we analyzed whether the robot’s inference could benefit from explicitly modeling the human as pedagogic. We tested a literal robot, which uses the literal model of the human, and a pedagogic (often referred to as
“pragmatic”) robot, which uses the pedagogic model of the human. If we test these two robots with humans <em>simulated</em> according to the literal and pedagogic
human model, then we get exactly what we would expect. When the human is
pedagogic, the robot has a higher accuracy of reward inference if it models the
human as being pedagogic (first bar from left) than if it models the human as
being literal (second bar), and the advantage is quite large:</p>
<div class="center"><img src="/img/human_mispec/sim_results.jpg" width="500px" /></div>
<p><br /></p>
<p>But, if we test the two models with actual humans (abbreviated <strong>AH</strong>), then we
get puzzling results. The large advantage that the pedagogic robot had
disappears. When the human is pedagogic, the pedagogic robot now does no
better than the literal robot, and in fact it does slightly worse. What’s the
deal?</p>
<div class="center"><img src="/img/human_mispec/human_results.jpg" width="500px" /></div>
<h2 id="an-explanation-forward-versus-reverse-model">An explanation: forward versus reverse model</h2>
<p>Why does using the pedagogic model not improve reward inference? My first
reaction was that the pedagogic model was “wrong”. But then I plotted the log-likelihoods under the pedagogic and literal model (see figure above) and verified that the pedagogic model <em>is</em> a much better fit than the literal model to the humans from the pedagogic condition. So simply saying that the pedagogic model is misspecified is too crude an answer: in that case, the literal model is “more” misspecified, so why doesn’t it do worse?</p>
<p>It’s not about generalization either. We’re testing reward inference in the
exact same population that the literal and pedagogical models were fit to. So surely, at least in this idealized example, the pedagogic model should improve reward inference.</p>
<p>Finally, I realized why the results felt paradoxical. I had assumed the pedagogic model was “better” because it was a “better fit” to human behavior, i.e., because it was better at predicting human demonstrations. However, just because a model is better for
<em>prediction</em> does not mean that it is better for <em>inference</em>. In other words,
even if a model is better at predicting human behavior, it is not necessarily a better model for the robot to use to infer the reward.</p>
<p>Another way to see it is that the best forward model is not the same as the best reverse model. When we talk about a “human model”, we typically mean a <em>forward model</em> for the human, a model that goes from reward to behavior. To perform reward inference, we need a <em>reverse model</em>, a model that goes from behavior to reward. For the sake of inference, it is common to convert a forward model into a reverse model, for example through Bayes’ rule, which is what we do in the pedagogy case study. But, crucially, the ranking of models is not guaranteed to be preserved after applying Bayes’ rule (or a more general conversion method). That is, suppose we have forward models A and B such that A is better than B (in terms of prediction). Once they are converted into reverse models A’ and B’, it may be the case that B’ is better than A’ (in terms of inference).</p>
<p>I think the reason our results felt paradoxical is that we have an intuition that inversion, i.e. applying Bayes’ rule, should preserve the ranking of models, but this is just not the case. (And no, it’s not about the prior being misspecified. In our controlled experiment, there is a ground-truth prior, the uniform distribution.) Maybe it doesn’t feel counterintuitive to you, but if it does, see the toy example in the next section.</p>
<h3 id="prediction-vs-inference-a-toy-example">Prediction vs inference: a toy example</h3>
<p>Suppose there is a latent variable <script type="math/tex">\theta \in \Theta</script> with prior distribution
<script type="math/tex">p(\theta)</script> and observed data <script type="math/tex">x \in \mathcal{X}</script> generated by some
distribution <script type="math/tex">p(x \mid \theta)</script>. In our setting, <script type="math/tex">\theta</script> corresponds to the
reward parameters and <script type="math/tex">x</script> corresponds to the human behavior. For simplicity, we assume
<script type="math/tex">\Theta</script> and <script type="math/tex">\mathcal{X}</script> are finite. We have access to a training dataset
<script type="math/tex">\mathcal{D} = \{(\theta_i, x_i)\}_{i=1}^n</script> of size <script type="math/tex">n</script>. A <em>predictive model</em>
<script type="math/tex">m(x \mid \theta)</script> models the conditional probability of the data <script type="math/tex">x</script> given
latent variable <script type="math/tex">\theta</script> for all <script type="math/tex">x \in \mathcal{X}, \theta \in \Theta</script>. In
our case, the predictive model is the forward model of the human. The <em>predictive
likelihood</em> <script type="math/tex">\mathcal{L}_{\mathcal{X}}</script> of a predictive model <script type="math/tex">m</script> is simply
the likelihood of the data under the model:</p>
<script type="math/tex; mode=display">\mathcal{L}_{\mathcal{X}}(m) = \prod_{i=1}^n m(x_i \mid \theta_i)\,.</script>
<p>The <em>inferential likelihood</em> is the likelihood of the latent variables
after applying Bayes’ rule:</p>
<script type="math/tex; mode=display">\mathcal{L}_{\Theta}(m) = \prod_{i=1}^{n} \frac{m(x_i \mid
\theta_i)p(\theta_i)}{\sum_{\theta}m(x_i \mid \theta)p(\theta)} \,.</script>
<p>A higher predictive likelihood does not necessarily imply a higher
inferential likelihood. In particular, there exist settings in which there are
two predictive models <script type="math/tex">m_1, m_2</script> such that <script type="math/tex">\mathcal{L}_{\mathcal{X}}(m_1) >
\mathcal{L}_{\mathcal{X}}(m_2)</script>, but <script type="math/tex">% <![CDATA[
\mathcal{L}_{\Theta}(m_1) <
\mathcal{L}_{\Theta}(m_2) %]]></script>.</p>
<p>For example, suppose that <script type="math/tex">\Theta = \{\theta_1, \theta_2\}</script> and <script type="math/tex">\mathcal{X} = \{x_1, x_2,
x_3\}</script>, the prior <script type="math/tex">p(\theta)</script> is uniform over <script type="math/tex">\Theta</script>, and the dataset
<script type="math/tex">\mathcal{D}</script> contains the following <script type="math/tex">n=9</script> items: <script type="math/tex">\mathcal{D} =
\{(\theta_1, x_1), (\theta_1, x_1), (\theta_1, x_2), (\theta_2, x_2), (\theta_2,
x_2), (\theta_2, x_3), (\theta_2, x_3), (\theta_2, x_3), (\theta_2, x_3)\}</script></p>
<p>Define the models <script type="math/tex">m_1(x \mid \theta)</script> and <script type="math/tex">m_2(x \mid \theta)</script> by the following conditional probability tables.</p>
<div style="float: left; width: 50%">
<div style="float: left; width: 80%">
<table style="float: left;">
$$m_1(x \mid \theta)$$
<tr><td></td><td>$$x_1$$</td><td>$$x_2$$</td><td>$$x_3$$</td></tr>
<tr><td>$$\theta_1$$</td><td>$$2/3$$</td><td>$$1/3$$</td><td>$$0$$</td></tr>
<tr><td>$$\theta_2$$</td><td>$$0$$</td><td>$$1/3$$</td><td>$$2/3$$</td></tr>
</table>
</div>
</div>
<div style="float: right; width: 50%">
<div style="float: right; width: 80%">
<table style="float: left;">
$$m_2(x \mid \theta)$$
<tr><td></td><td>$$x_1$$</td><td>$$x_2$$</td><td>$$x_3$$</td></tr>
<tr><td>$$\theta_1$$</td><td>$$2/3$$</td><td>$$1/3$$</td><td>$$0$$</td></tr>
<tr><td>$$\theta_2$$</td><td>$$0$$</td><td>$$2/3$$</td><td>$$1/3$$</td></tr>
</table>
</div>
</div>
<p>The model <script type="math/tex">m_1</script> has predictive likelihood <script type="math/tex">\mathcal{L}_{\mathcal{X}}(m_1) = (2/3)^6(1/3)^3</script> and inferential likelihood <script type="math/tex">\mathcal{L}_{\Theta}(m_1) = (1/2)^3</script>. The model <script type="math/tex">m_2</script> has predictive likelihood <script type="math/tex">\mathcal{L}_{\mathcal{X}}(m_2) = (2/3)^4(1/3)^5</script> and inferential likelihood <script type="math/tex">\mathcal{L}_{\Theta}(m_2) = (1/3)(2/3)^2</script>. Thus, <script type="math/tex">\mathcal{L}_{\mathcal{X}}(m_1) > \mathcal{L}_{\mathcal{X}}(m_2)</script>, but <script type="math/tex">% <![CDATA[
\mathcal{L}_{\Theta}(m_1) < \mathcal{L}_{\Theta}(m_2) %]]></script>.</p>
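<p>As a sanity check, both likelihoods can be computed numerically. A minimal Python sketch, with the dataset and conditional probability tables taken directly from the toy example above:</p>

```python
# Dataset from the toy example: (theta, x) pairs.
D = [("t1", "x1"), ("t1", "x1"), ("t1", "x2"),
     ("t2", "x2"), ("t2", "x2"),
     ("t2", "x3"), ("t2", "x3"), ("t2", "x3"), ("t2", "x3")]
thetas = ["t1", "t2"]
prior = {t: 0.5 for t in thetas}  # ground-truth uniform prior

# Conditional probability tables m(x | theta) from the two tables above.
m1 = {"t1": {"x1": 2/3, "x2": 1/3, "x3": 0.0},
      "t2": {"x1": 0.0, "x2": 1/3, "x3": 2/3}}
m2 = {"t1": {"x1": 2/3, "x2": 1/3, "x3": 0.0},
      "t2": {"x1": 0.0, "x2": 2/3, "x3": 1/3}}

def predictive_likelihood(m):
    """Product over the dataset of m(x_i | theta_i)."""
    L = 1.0
    for theta, x in D:
        L *= m[theta][x]
    return L

def inferential_likelihood(m):
    """Product over the dataset of the Bayes posterior p(theta_i | x_i)."""
    L = 1.0
    for theta, x in D:
        evidence = sum(m[t][x] * prior[t] for t in thetas)
        L *= m[theta][x] * prior[theta] / evidence
    return L
```

<p>Running this confirms that <script type="math/tex">m_1</script> has the higher predictive likelihood while <script type="math/tex">m_2</script> has the higher inferential likelihood: the better predictive model is the worse inferential one.</p>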
<p>The key reason for the discrepancy is that predictive likelihood normalizes over the space of observations <script type="math/tex">\mathcal{X}</script>, while inferential likelihood normalizes over the space of latent variables <script type="math/tex">\Theta</script>. It may seem surprising that the normalization can have such an impact, but this difference is precisely what separates optimizing for prediction from optimizing for inference, and is what leads to, e.g., the difference between predictable robot motion and legible robot motion (<a href="https://dl.acm.org/citation.cfm?id=2447672">Dragan et al, 2013</a>).</p>
<h2 id="conclusion">Conclusion</h2>
<p>My claim that a better human model can lead to worse reward inference was, admittedly, underspecified. As we’ve seen, it depends on how you define “better” – in terms of prediction or inference. But I left it underspecified for a reason. I believe most people automatically fill in “better” with “more predictive”, and I wanted to demonstrate how that inclination can lead us astray. While better prediction probably does lead to better inference most of the time, as this example illustrates, it is not a logical necessity, and we should be wary of that.</p>
<p>For details on everything in this post, as well other points about misspecification in reward learning, you can check out <a href="https://arxiv.org/abs/1903.03877">our paper</a>.</p>
<p><em>Thanks to Jacob Steinhardt, Frances Ding, and Andreas Stuhlmüller for comments on this post.</em></p>
<h2 id="references">References</h2>
<p>Dragan, A. D., Lee, K. C., & Srinivasa, S. S. “Legibility and
predictability of robot motion.” HRI 2013.</p>
<p>Evans, O., Stuhlmüller, A., and Goodman N. “Learning the preferences
of ignorant, inconsistent agents.” AAAI 2016.</p>
<p>Fisac, J., Gates, M. A., Hamrick, J. B., Liu, C., Hadfield-Menell, D.,
Palaniappan, M., Malik, D., Sastry S. S., Griffiths T. L., and Dragan A. D.
“Pragmatic-pedagogic value alignment.” ISRR 2018.</p>
<p>Ho, M. K., Littman, M., MacGlashan, J., Cushman, F. and Austerweil, J. L. “Showing versus doing: Teaching by demonstration.” NeurIPS 2016.</p>
<p>Ho, M. K., Littman, M. L., Cushman, F., and Austerweil, J. L. “Effectively Learning from Pedagogical Demonstrations.” CogSci 2018.</p>
<p>Malik, D., Palaniappan, M., Fisac, J. F., Hadfield-Menell, D., Russell, S., and Dragan, A. D. “An Efficient, Generalized Bellman Update For Cooperative
Inverse Reinforcement Learning.” ICML 2018.</p>
<p>Milli, S., Dragan, A. D. “Literal or Pedagogic Human? Analyzing Human Model
Misspecification in Objective Learning” UAI 2019.</p>
<p>Shah, R., Gundotra, N., Abbeel, P. and Dragan, A. D. “On the Feasibility of
Learning, Rather than Assuming, Human Biases for Reward Inference.” ICML 2019.</p>
<p>Steinhardt, J. and Evans, O. “Model misspecification and inverse reinforcement
learning.” <a href="https://jsteinhardt.wordpress.com/2017/02/07/model-mis-specificationand-inverse-reinforcement-learning/">https://jsteinhardt.wordpress.com/2017/02/07/model-mis-specificationand-inverse-reinforcement-learning/</a>. 2017.</p>
Wed, 26 Jun 2019 01:00:00 +0000
http://smithamilli.com/blog/predict-vs-inf/
http://smithamilli.com/blog/predict-vs-inf/Pearl's Causal Ladder<p>Pearl frequently refers to what he calls a “causal ladder”, a hierarchy of three types of
problems of increasing difficulty: (1) prediction, (2) intervention, and (3) counterfactuals. If we wish to ascend the so-called ladder, and increase the range of causal questions we can answer, it is crucial to understand what makes each level more difficult than the last and what additional knowledge we need at each level.</p>
<h2 id="distinct-levels-the-ladder-abstracted">Distinct levels: the ladder abstracted</h2>
<p>So first off, why are the three levels distinct?</p>
<p><strong>Prediction.</strong> To do prediction within a system, we only need to ask questions about the system as it currently is. Therefore, it’s sufficient to have the joint distribution <script type="math/tex">P</script> over all variables <script type="math/tex">V</script> in the system. If we have the joint distribution <script type="math/tex">P</script>, then we can answer any question of the form, “Given that the variables <script type="math/tex">X \subseteq V</script> are observed to be <script type="math/tex">x</script>, what is the probability that the variables <script type="math/tex">Y \subseteq V</script> are equal to <script type="math/tex">y</script>?”</p>
<p><strong>Intervention.</strong> By performing an exogenous intervention in a system, we change its distribution. The original distribution <script type="math/tex">P</script> may no longer be valid. Thus, in order to
answer questions about intervention, we need a family of distributions <script type="math/tex">\{P_{X=x}\}_{X\subseteq V}</script>. Once we have this family, we can answer any question of the form, “If we force the variables <script type="math/tex">X \subseteq V</script> to <script type="math/tex">x</script>, then what is the probability that the variables <script type="math/tex">Y \subseteq V</script> are equal to <script type="math/tex">y</script>?”</p>
<p>Since <script type="math/tex">P_{X=x}</script> is a valid probability distribution, we can also condition on observations <em>after</em> intervention, i.e., “If we force
the variables <script type="math/tex">X \subseteq V</script> to <script type="math/tex">x</script>, and then observe that variables <script type="math/tex">E
\subseteq V</script> are equal to <script type="math/tex">e</script>, then what is the probability that the
variables <script type="math/tex">Y \subseteq V</script> are equal to <script type="math/tex">y</script>?” The answer to this question is
given by <script type="math/tex">P_{X=x}(Y=y \mid E=e)</script>.</p>
<p>Finally, note that the distribution with no intervention (i.e. <script type="math/tex">X = \emptyset</script>) is equal to our original distribution. Thus, intervention subsumes prediction, justifying its place on rung two of the ladder.</p>
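<p>The gap between conditioning and intervening can be made concrete with a toy two-variable system (invented here for illustration: rain causes wet grass). Observing an effect is evidence about its cause, but forcing the effect severs the mechanism that made it evidence:</p>

```python
from itertools import product

# Rain ~ Bernoulli(0.2); grass is wet with prob 0.9 if it rains, 0.1 otherwise.
p_rain = 0.2
p_wet_given_rain = {True: 0.9, False: 0.1}

# Level one (prediction): the joint distribution P over (rain, wet)
# suffices for any observational query.
joint = {(r, w): (p_rain if r else 1 - p_rain)
               * (p_wet_given_rain[r] if w else 1 - p_wet_given_rain[r])
         for r, w in product([True, False], repeat=2)}

def p_rain_given_wet():
    """Observational query P(rain | wet): read off the joint."""
    return joint[(True, True)] / (joint[(True, True)] + joint[(False, True)])

def p_rain_do_wet():
    """Interventional query P_{wet=true}(rain): forcing the grass to be wet
    cuts the rain -> wet mechanism, so rain keeps its prior probability."""
    return p_rain
```

<p>Observing wet grass raises the probability of rain to <script type="math/tex">0.18/0.26 \approx 0.69</script>, while hosing the grass down (an exogenous intervention) leaves it at the prior <script type="math/tex">0.2</script>; the joint distribution alone cannot distinguish these two queries.</p>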
<p><strong>Counterfactuals.</strong> At the counterfactual level, we are allowed to ask
questions of the form, “Given that variables
<script type="math/tex">Z \subseteq V</script> were observed to be <script type="math/tex">z</script>, if variables <script type="math/tex">X \subseteq V</script> were forced to be <script type="math/tex">x</script>,
then how likely is it that variables <script type="math/tex">Y \subseteq V</script> would have been equal to <script type="math/tex">y</script>?” The new piece of information we have added to the question is the conditioning on <script type="math/tex">Z=z</script>. You might think, “Wait, but we could already condition on variables in interventions!” You’re right, except at the intervention level, we could only condition on variables <em>after</em> intervention, not <em>before</em> intervention.</p>
<p>To understand why conditioning on variables <em>before</em> intervention is difficult,
consider the question, “Given that Ann and Bob are not at the party, if Ann were
at the party, would Bob be at the party?” The observed <script type="math/tex">Z=z</script> (Ann not at the
party) contradicts the intervention <script type="math/tex">X=x</script> (Ann being at the party). And such
a contradiction doesn’t need to arise explicitly either. Sometimes the world before intervention
can conflict with the world after intervention because of the causal structure.
For example, “Given that the sky was totally clear, if the volcano had erupted,
would Ann go to work?” In this case, the intervention (the volcano erupts) would result in ash in the sky, leading to a contradiction with the antecedent (the sky is clear).</p>
<p>To compute a counterfactual, we need to define some way in which the world
before intervention can provide evidence about the world after intervention,
even if the two worlds conflict. In other words, we need a family of distributions, <script type="math/tex">\{P_{Z=z,X=x}\}_{Z,X\subseteq V}</script>. The difficulty, of course, comes from specifying a coherent family of distributions that matches what we mean when we ask counterfactual questions. Many philosophers have proposed a “closest worlds” approach to counterfactuals, in
which the intervention should be thought of as occurring in the closest world to
the real world in which we observed <script type="math/tex">Z=z</script> [Lewis, 1973]. A drawback of the closest worlds approach
is that it is unclear what the similarity metric between worlds should
be. Pearl operationalizes the closest worlds approach through the exogenous
variables <script type="math/tex">U</script> in his functional causal models. The exogenous variables between
different worlds are assumed to be the same. Thus, the observed <script type="math/tex">Z=z</script> in the
real world can provide evidence about the exogenous, latent variables <script type="math/tex">U</script>, and this new belief <script type="math/tex">P(U=u\mid
Z=z)</script> carries over into the hypothetical world after the intervention.</p>
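<p>This three-step recipe (abduction, action, prediction) can be sketched in code. The tiny functional causal model below is invented for illustration: independent exogenous coins <script type="math/tex">U_1, U_2</script>, with endogenous mechanisms <script type="math/tex">X = U_1</script> and <script type="math/tex">Y = X \oplus U_2</script>:</p>

```python
from itertools import product

# Exogenous variables: independent fair coins U1, U2.
p_u = {(u1, u2): 0.25 for u1, u2 in product([0, 1], repeat=2)}

# Endogenous mechanisms of the (invented) functional causal model.
def f_x(u1):    return u1        # X = U1
def f_y(x, u2): return x ^ u2    # Y = X XOR U2

def counterfactual_y(observed_x, observed_y, forced_x):
    """P(Y | Z=z, do(X=forced_x)) for evidence z = (X=observed_x, Y=observed_y)."""
    # Step 1 (abduction): update P(U) on the real-world evidence.
    post = {u: p for u, p in p_u.items()
            if f_x(u[0]) == observed_x and f_y(observed_x, u[1]) == observed_y}
    z = sum(post.values())
    post = {u: p / z for u, p in post.items()}
    # Step 2 (action): replace X's equation with X = forced_x.
    # Step 3 (prediction): propagate the *same* exogenous U through the
    # modified model to get the counterfactual distribution over Y.
    dist = {0: 0.0, 1: 0.0}
    for (u1, u2), p in post.items():
        dist[f_y(forced_x, u2)] += p
    return dist
```

<p>Here the factual evidence <script type="math/tex">X=1, Y=1</script> pins down <script type="math/tex">U_2=0</script>, so “had <script type="math/tex">X</script> been 0, <script type="math/tex">Y</script> would have been 0” holds with probability 1, whereas a model over <script type="math/tex">X</script> and <script type="math/tex">Y</script> alone could only give the interventional answer <script type="math/tex">P_{X=0}(Y=0) = 1/2</script>.</p>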
<p>Finally, note that by setting <script type="math/tex">Z=\emptyset</script>, we recover a family of
interventional distributions. Thus counterfactuals subsume intervention,
and rightfully earn their place at the top of the ladder.</p>
<h2 id="increasing-need-for-data-and-mechanistic-models">Increasing need for data and mechanistic models</h2>
<p>As we ascend the ladder, we need more sophisticated data and more mechanistic models.</p>
<p><strong>Prediction.</strong> To predict the system in its unaltered state, all we need is observational data.</p>
<p><strong>Intervention.</strong> In general, observational data is not sufficient to predict what happens after an intervention. To answer a question about what will happen after an intervention, we either need a model of the mechanisms in the domain or have to collect experimental/interventional data (e.g. from randomized controlled trials). If it is possible to perform the intervention, then technically we don’t need a proper model; we can just see what happens after the intervention (and perhaps repeat it for many trials to reduce noise). But in cases where direct intervention/experimentation is not possible, we have to rely on a model of the domain. For example, we cannot actually force an earthquake to happen (and that’s probably a good thing), yet we still have a mechanistic model of the domain that allows us to agree with the statement, “Earthquakes can cause damage to buildings.”</p>
<p><strong>Counterfactuals.</strong> The nature of counterfactual questions makes a model-free approach implausible. The reason is that the evidence <script type="math/tex">Z=z</script> can conflict with the hypothetical intervention <script type="math/tex">X=x</script>, making it impossible to both observe <script type="math/tex">Z=z</script> and perform the intervention <script type="math/tex">X=x</script>. For example, consider again the question, “Given that Ann and Bob are not at the party, if Ann were at the party, would Bob be at the party?” Or the question, “What fraction of patients who are treated and died would have survived if they were not treated?” The world after the intervention is only a hypothetical that cannot be observed directly, so in order to compute the counterfactual, we require a model of how reality (in which we observe <script type="math/tex">Z=z</script>) connects to this hypothetical world.</p>
<p>(Maybe you think that you can trivially answer the patient question by ignoring the <script type="math/tex">Z=z</script> part and just looking at how many untreated patients survived. But in fact, the validity of this answer rests on the validity of a particular functional causal model. See [Pearl, 2009; Section 1.4.4] for a discussion and <a href="https://stats.stackexchange.com/questions/379799/difference-between-rungs-two-and-three-in-the-ladder-of-causation">this stackexchange answer</a> for a summary. Similarly, see [Balke and Pearl, 1994] for a discussion of the party example.)</p>
<p>Sometimes I hear someone advocate for (usually deep-)reinforcement learning (RL) by arguing that “unlike purely statistical models, reinforcement learning agents perform interventions, thus allowing them to learn causality.” The caveat with this statement is that it doesn’t distinguish between causality at level two (intervention) and causality at level three (counterfactuals). Model-free RL approaches, which have received a disproportionate amount of attention, have no chance of computing counterfactuals. Counterfactuals, by definition, require models.</p>
<h2 id="are-functional-causal-models-and-causal-bayes-nets-at-different-levels">Are functional causal models and causal Bayes nets at different levels?</h2>
<p>I was initially very confused by the difference between level two (intervention)
and level three (counterfactuals) because of an additional claim Pearl makes about
their distinction. He claims that causal Bayes nets (CBN) can do
interventions, but that in order to do counterfactuals, you need functional causal
models (FCM). The only consistent way I’ve found of interpreting this claim is
as the statement, “Given a functional causal model of the world, a causal Bayes net whose variables only consist of the endogenous nodes of the functional causal model can do intervention, but not counterfactuals.”</p>
<p>In the examples I’ve seen where Pearl argues for the separation between FCMs and
CBNs, he argues for the claim precisely by showing this statement. In the patient
example [Pearl, 2009; Section 1.4.4], he constructs a FCM for the example, and
then shows that a CBN which only consists of the endogenous nodes of the FCM can
compute interventions, but not counterfactuals. In the party example [Balke and
Pearl, 1994], the claim is argued for through exactly the same procedure.</p>
<p>Pearl actually explicitly attributes the weaker capacity of CBNs to the lack of exogenous
variables [Pearl, 2009; Section 1.4.4]:</p>
<blockquote>
<p>The three-step model of counterfactual reasoning also uncovers the real reason
why stochastic causal models are insufficient for computing probabilities of
counterfactuals. Because the <script type="math/tex">U</script> variables do not appear explicitly in
stochastic models, we cannot apply step 1 so as to update <script type="math/tex">P(u)</script> with the
evidence <script type="math/tex">e</script> at hand. This implies that several ubiquitous notions based on
counterfactuals – including probabilities of causes (given the effects),
probabilities of explanations, and context-dependent causal effect – cannot
be defined in such models.</p>
</blockquote>
<p>As explained earlier, computing counterfactuals requires some way to connect the
real world to a hypothetical, counterfactual world. In Pearl’s framework, the exogenous variables, which are assumed to remain constant between
the real world and the counterfactual world, play this role. So, it makes sense
that if you take them away, you can no longer compute counterfactuals.</p>
<p>But what I found so confusing is that this is not the same as the general claim that
functional causal models can do counterfactuals, but causal Bayes nets can only
do interventions. If the variables in the CBN are unrestricted, then I believe FCMs and CBNs are formally equivalent. Pearl himself agrees that any causal Bayes net can be represented by a
functional causal model, however he doesn’t agree that the reverse
is true [Pearl, 2009; Section 1.4]:</p>
<blockquote>
<p>Every stochastic model can be emulated by many functional relationships (with
stochastic inputs), but not the other way around; functional relationships can
only be approximated, as a limiting case, using stochastic models.</p>
</blockquote>
<p>But can’t one always represent a deterministic functional relationship by a
degenerate point mass distribution? In fact, in the paper with the party example [Balke and Pearl, 1994], after arguing that CBNs are
insufficient, in a later section (titled “Party again”), they show how to represent the party example with a CBN
that includes the exogenous variables and uses point mass distributions. I guess Pearl would categorize this as a limiting case?</p>
<p>In [Pearl, 2009; Section 7.2.2], Pearl addresses intrinsic non-determinism, which cannot be represented by an FCM. In this case, he constructs a causal Bayes net that performs valid counterfactuals. He contrasts the validity of this causal Bayes net with
ordinary ones:</p>
<blockquote>
<p>This evaluation can, of course, be implemented in ordinary causal Bayesian
networks (i.e., not only in ones that represent intrinsic nondeterminism), but
in that case the results computed would not represent the probability of the
counterfactual <script type="math/tex">Y_x = y</script>. Such evaluation amounts to assuming that units are
homogenous, with each possessing the stochastic properties of the population.
Such an assumption may be adequate in quantum-level phenomena, where units
stand for specific experimental conditions, but it will not be adequate in
macroscopic phenomena, where units may differ appreciably from each other. In
the example of Chapter 1 (Section 1.4.4, Figure 1.6), the stochastic
attribution amounts to assuming that no individual is affected by the drug (as
dictated by model 1) while ignoring the possibility that some individuals may,
in fact, be more sensitive to the drug than others (as in model 2).</p>
</blockquote>
<p>But again, the patient example that he references
uses a causal Bayes net with only the endogenous nodes. The reason it amounts to
assuming that the units are homogeneous is that the exogenous nodes are no
longer included.</p>
<p>So, in conclusion,
my best interpretation of Pearl’s claim is that no causal Bayes net <em>with only the endogenous nodes</em> can compute counterfactuals. If my interpretation is incorrect, please let me know; I would like to understand.</p>
<p>Edit:</p>
<p><a href="https://twitter.com/yudapearl/status/1127860712770916352">Pearl adds</a>:</p>
<blockquote>
<p>A word about the exogenous variables U: These variables specify a “unit”, be it an individual, an agricultural plot,
time of day, etc, whatever refinement is needed to make all relationships
deterministic. I hope this clarifies the dilemma posed in your last
paragraph.</p>
</blockquote>
<h2 id="references">References</h2>
<ol>
<li>Balke, Alexander, and Pearl, Judea. “Probabilistic evaluation of
counterfactual queries.” AAAI, 1994.</li>
<li>Lewis, David. Counterfactuals. John Wiley & Sons, 2013.</li>
<li>Pearl, Judea. Causality. Cambridge University Press, 2009.</li>
</ol>
Sun, 12 May 2019 00:00:00 +0000
http://smithamilli.com/blog/causal-ladder/
http://smithamilli.com/blog/causal-ladder/Thanksgiving<p>I had a traditional Thanksgiving — a puja at our family friend’s house, which, as per tradition, we (minus my mom) skipped in order to smoothly arrive at noon, just before the food is served. But as we walk in and hear the Sanskrit chanting, we realize, as per tradition, our attempts have been foiled because everything is running on the usual Indian Standard Time. Somehow, even after a lifetime of practice, my friends and I still have no clue what to do when it’s our turn to go up to the altar. Our mothers give us the look, and in loud whispers, they start scoldingly coaching from the sidelines—use your right hand, rotate the plate three times, throw the rice over the flowers. The only change in the whole arrangement is that when the banquet begins the traditional banana leaves have been modernized and replaced by leaf-shaped paper plates.</p>
<p>Nikki finally arrives last and zooms straight to me. “I saw your laugh from outside,” she says. Everyone laughs and teases her for the “saw”, but then upon reflection agrees that “saw” is more appropriate than “heard”. Suddenly we’re all reminiscing about this or that. Remember when Rohit thought those palm trees were real, remember when Lavanya was the tallest, remember when Puneet ripped that band-aid off my knee.</p>
<p>It’s been years since we’ve all seen each other. Growing up, we fluidly roamed in and out of each other’s houses, sleeping at Pratyush’s one weekend, at mine another, at Mani’s the next. It wouldn’t be right to say any of us grew up in a single house. We grew up in all of our homes.</p>
<p>As if it were the most natural thing in the world, the seven of us spend the next thirty hours with each other. We effortlessly switch between roasting each other, impersonating scammy swamis, and dancing anywhere and everywhere. We stop by Nikki’s house because we have to light 365 candles. Why? No one knows. We get lost infinite-looping in a roundabout. My sides hurt so much from laughing non-stop. Nikki and I have our first drink together. The boiz update me on their relationships. I suddenly realize I’ve missed out on a lot, being away.</p>
<p>“I know Kiwibots are cool and all, but we’re cooler”-Prat</p>
<p>It’s true. I miss them. I don’t know what moment we all became so close, but us “chaddi buddies” are inseparable.</p>
Sat, 24 Nov 2018 12:00:01 +0000
http://smithamilli.com/blog/thanksgiving/
http://smithamilli.com/blog/thanksgiving/Paradigm Shift<p>She stared at the keys of her laptop. It wasn’t that she had nothing to write about—so much had happened. Her mind had effortfully packaged the ideas into words that were ready to be transmitted. However, her hands, which rested upon her keyboard, remained unresponsive, deaf to the requests sent by the mental authorities, or perhaps merely pretending to be deaf, already acting in allegiance with a subversive, subconscious agenda.</p>
<p>Her treasonous hands sparked her curiosity, triggering her instinctive reflex to experiment. Here, lying in her bed, she had none of her normal equipment: no electrodes, no MRIs, no statistical procedures that provided the requisite pretense of rigor. But reflexes have no care for the best practices one has carefully drilled in; they simply grasp for whatever is at hand, desperately trying to resolve the trigger. So, she unapologetically latched onto the only tools she had available—her imagination and her awareness—because her first and foremost priority was fulfilling her primal urge to understand.</p>
<p>Her hands had already made it clear that they were opposed to what she had intended to write, so instead she tried imagining something as far from it as possible, some sort of silly made-up fable. A cute, boisterous rabbit searching in the nearby bushes for carrots. As the rabbit jumped around, bouncing from one bush to another, she felt the tension in her hands pour out from the bottom, making space for a warmth that poured in from the top, collecting in her palms and spreading until it was present everywhere from her wrist to the tips of the fingers. Movement was restored.</p>
<p>Next, she began to move her focus back, began to imagine her original intention. Patterns of neuronal activity rearranged themselves, and like the way the magma flowing in the Earth’s mantle pulls upon the tectonic plates it supports, as the patterns rearranged themselves, they also rearranged higher-level human concepts. But humans, whose awareness is too weak, remain oblivious to the complex motion that supports the ground they consider solid, only recognizing it when they are forced to, when the plates collide strongly enough, producing a quake that cuts through their abstraction. In the same way, despite her best efforts to apply her tool of awareness, she would have had no indication of the processes supporting her context-switch, if not for how forcefully the incoming concepts pushed through the pastoral imageries and prancing rabbits. The collision of these incongruous concepts and their incompatible feelings gave rise to numerous quivers and shakes, visual flashbacks that lasted a fraction of a second.</p>
<p>The day she started her PhD, sitting in the office of her academic hero, and soon-to-be advisor, marveling at the rows of books, impassioned by the mountain of knowledge she had yet to glimpse, and secretly, just momentarily, noticing the half shelf still empty, and wondering whether there was any space for her.</p>
<p>Her friendly labmates waving, trying to catch her attention from her monitor, leaving to go home for the night, already used to her not taking their offers to walk with them, and lingering for a moment with eyes of sympathy (pity?) before exiting as usual.</p>
<p>The growing list of publications, awards, and invited talks on her website. Scroll. The headshots of her students, each accompanied with a short biography of their passions and dreams. Zoom in on one student. Brightly smiling, a promising 2nd-year student who “aimed to understand”.</p>
<p>The same student, not with his bright smile, but his conflicted face a month ago when he had sought her out, nervous and ashamed, but still confiding to her, as a desperate last attempt, that he just could not see—he paused—what was new. And her opening her mouth, ready to give her normal motivational spiel, but suddenly her neck muscles tensed, vocal cords unresponsive.</p>
<p>Finally, the surprise and confusion on her department chair’s face as he came to her office to question her out-of-the-blue request for an early sabbatical. Her looking at him, then looking around her office, looking at her bookshelf, secretly glancing at the minuscule section she made up, looking at him again, and shrugging.</p>
<p>In the same second it took to pass through the past fourteen years of her life, the warmth occupying her hands froze into solid ice. She shivered.</p>
<p>She needed to isolate further, to test something in between. She tried to write it again, but indirectly, padded with fluffy, protective layers of metaphor. Her hands were choppy, but they moved. She wondered if this was what the patients with prosthetics felt, who were in possession of hands that were ostensibly theirs, but which would every once in a while—whenever the user got too comfortable—throw a jerk here or there, reminding them of who did or did not have autonomy.</p>
<p><em>But why?</em> She still did not have a satisfying explanation. Once again, she retreated into her mental laboratory. She again imagined writing the original prompt. She listened, sensing for what associations this stimulus triggered. A knot in her stomach and a tightening in her chest. Panic? Anxiety? But why? She continued replaying the stimulus, slowly going over the data, smoothing out the noise, searching for the key difference, until finally a pattern struck her. Obvious in hindsight, but so infused into everything that it had been difficult to extricate as a potential cause: “I”.</p>
<p>I. I. <em>I.</em> Every use of the word conjured up images of herself. She saw herself from the outside, delivering a keynote at a podium, a translucent blanket of respect and admiration enveloping her. Yes, she had accomplished something. <em>I have done something. I have done something. I am doing something.</em></p>
<p>But done what? She saw herself again, this time from her own perspective, in a vast desert, toiling day by day under the hot sun, carefully probing at the ground with a tiny shovel, brushing away dust in swift, controlled sweeps, probing again, until finally she uncovered a tiny finger bone. She held the beautiful, delicate thing up to examine and felt a wave of ecstasy roll down her. (How absurd!) She turned it over again and again, relishing the discovery, admiring its shimmer, basking in its novelty.</p>
<p>As night approached, however, her focus, which during the day stretched like a belt to tighten and subdue the core parts of her mind, began to falter in strength, no longer fueled by the day’s minutiae. The core, taking advantage of the weakened state of its oppressor, pulsed against the belt, generating waves that struck against her skull, asserting its presence to her in the form of a throbbing headache, until finally it burst through the barrier, and the reservoir of angst was released. <em>Wasn’t it obvious that the finger bone would be near the hand?</em> The distal phalange was after the intermediate phalange after the proximal phalange. It followed trivially. She had simply known the proper angle, the proper force that was required to officially unearth it.</p>
<p>But far worse than the insignificance of the finger was the fear that clutched her and made her pan her gaze across the rest of the desert. The fear that it wasn’t the finger, nor the hand, nor the rest of the body, the fear that it simply wasn’t the right spot at all, that the real spot was somewhere else out <em>there</em>, elusively hidden under untouched, pristine sands. She (a version of her) jumped into the remaining landscape, ready to maniacally dig at random. But she (another version of her) held her back, repeatedly whispering the odds, i.e. ~0, she would discover anything.</p>
<p>These two were in constant friction, frequently colliding with enough force to ignite a powerful fire, one whose territory stretched to the boundaries of her mind, and caused all other operations in the dysfunctional town to halt. Until finally, one day, an innocent townsperson, no longer able to take the ever increasing burden of mediating the war between the two, let a suppressed scream loose: “Fuck it!” Turning to the first she said, “Fuck you.” Then she pointed to the second and said, “And fuck you.” And then she went on, “And fuck everyone. Continue digging up the goddamn fingers and be a success that no one remembers. Or go wild and rogue and be a failure that no one remembers.”</p>
<p>The oil had been thrown—the fire roared up, shot high, asserted its dominance, irreverently burst through buildings, melted down long-standing frameworks, embraced the inhabitants with a hot and deadly smoke, until finally, everything had been consumed, and the crackles pittered and pattered, becoming quieter and quieter, until softly, slowly, the fire retracted back down into nothingness. The only sign left behind of the chaos was the absence of any sign at all.</p>
<p>Finally, her mind had recognized what her hands already had. Her right index finger rested at ease upon the letter “I”, unable to press it, but no longer tense. Finally, she had found the paradigm shift she had longed for. Or to be more accurate, it had found her, grasped her by the throat, told her to fuck off, and left her scrambling to repair her core. It was unfair. As a scientist, you could wait, only switching once an alternative became available. But as a person, she was not offered that basic courtesy. The “I” was gone, but without any replacement for her to embrace.</p>
<p>All her hands could do was end as usual: Future Work.</p>
Sat, 15 Sep 2018 01:00:00 +0000
http://smithamilli.com/blog/paradigm-shift/
http://smithamilli.com/blog/paradigm-shift/Undergraduate Research Tips<p>There’s a lot of <a href="https://github.com/smilli/research-advice">research advice</a> out there, but not much focused specifically on
undergrads. So here I’ve tried outlining undergrad-focused tips.</p>
<ol>
<li>
<p><strong>Consider trying other options (e.g. software internship) first.</strong> Some types of
research will require more advanced knowledge that you probably won’t have yet
as, say, a freshman, so try other options earlier. I would especially recommend
becoming fast and effective at coding, so that isn’t a bottleneck in your
research later.</p>
</li>
<li>
<p><strong>How do you get research? Just ask.</strong> This is something that is far simpler than
people think. You literally just ask people. Email professors or grad students
whose research you think is interesting and tell them you’d like to get
involved. Sending an email takes a max of 5 minutes, so why not just try?</p>
</li>
<li><strong>Be selective in the research you choose.</strong> Because people think it’s harder to
find research opportunities than it actually is (#2) they tend to be more afraid
to reject research opportunities than they should be. Here are some criteria for
picking research:
<ol>
<li><em>Pick research you find interesting.</em> You won’t do well if you don’t find it
interesting.</li>
<li><em>Pick research that helps you figure out whether you like research.</em> You will not know a priori whether or not you like research. Instead, you can have a loose theory of
why you think you might like doing research and seek out experiences that help
evaluate your theory. For example, if like me, you think a big component of why
you would like research is having autonomy, then don’t do research where you’re
always told exactly what to do. This is another reason you should not do
research you’re not interested in (3a). If you do research on a topic you know
you’re not interested in, then when you find out that (unsurprisingly) you
didn’t like the research, you can’t tell if it’s because of general aspects of
research or because you found the topic boring.</li>
<li><em>Pick a mentor that communicates well and cares about your research
growth.</em> You
won’t know anything when you start (even if you think you do) and having a
mentor you work well with is essential to your success. You also ultimately want
to become a good researcher, so pick someone who cares about your research
growth and will spend time teaching you about the field and how to do research.
Don’t pick e.g. a grad student who just wants you to code something they didn’t
want to do themselves.</li>
<li><em>Avoid “research” that is mainly, e.g., making material for a class.</em>
Sometimes a research opportunity is less about research and more about
some other useful output that no grad student or professor wants to do
themselves.</li>
</ol>
</li>
<li>
<p><strong>Time management is the #1 reason you will not make progress.</strong> If you put less
than ten hours in per week, it’s unlikely you’ll make any progress in research.
As an undergrad, it can be easy to fall into the trap of putting off research
because you have a wave of midterms or something, but you need to make time. A
generic recommendation is spending ~15 hours a week on research. I probably
spent more like 20-30 hours. Also consider spending a summer doing research, you
can get a lot more done when research is your only focus.</p>
</li>
<li><strong>Prioritize research over uninteresting and irrelevant classes.</strong> For better or worse, the type of undergrad reading this post will probably still
feel like they need to do well in a class, even if it’s not interesting or relevant. The most common reason I’ve seen for undergrads being too busy is
having a midterm, regardless of whether the class is worth it or not. So,
the easiest way to open up time in your schedule is to be more selective and take fewer classes.</li>
</ol>
Sun, 12 Mar 2017 04:38:00 +0000
http://smithamilli.com/blog/undergrad-tips/
http://smithamilli.com/blog/undergrad-tips/Bounded Optimality<p>A friend recently asked me why I find bounded optimality interesting. Here’s
why:</p>
<ol>
<li>
<p>It is necessary to have a normative framework for how agents should act under
computational pressure because this is what the real world is like. In the
real world an agent should understand not to think long when it is about to get
hit by a car, but should definitely perform more computation before declaring
war. (This is related to our work on <a href="http://smithamilli.com/pubs/2017aaai_meta.pdf">bounded-optimal
metareasoning</a>!)
See <a href="https://people.eecs.berkeley.edu/~russell/papers/ptai13-intelligence.pdf">Rationality and
Intelligence: A Brief Update</a>
(Russell 2014) for more on this point.</p>
</li>
<li>
<p>It’s an elegant framework because it provides <a href="http://gershmanlab.webfactional.com/pubs/GershmanHorvitzTenenbaum15.pdf">“a converging paradigm
for intelligence in brains, minds, and
machines”</a>
that may allow for more transfer of insights between cognitive science and
artificial intelligence (Gershman, Horvitz, and Tenenbaum 2015). For example, understanding the bounded-optimal
solutions that humans use may be useful for creating better
approximation strategies for artificial agents to use under computational pressure. On the other hand,
when we have an optimal solution that an artificial agent can implement, we can
then ask what the bounded-optimal solution for the real-world environment that
humans live in would be and see if that is the kind of behavior that humans
showcase.</p>
</li>
<li>
<p>It provides an appealing way to <a href="https://cocosci.berkeley.edu/tom/papers/RationalUseOfCognitiveResources.pdf">bridge the gap between Marr’s computational and
algorithmic
level</a>
(Griffiths, Lieder, and Goodman 2014). As mentioned in (1)
bounded-optimality is necessary as a normative framework for artificial agents
because the costs of computation are an important factor to decision-making in
real-world environments. But humans also have costs to computation that arise
from intrinsic biological bounds, rather than the environment, so an
interesting question is what are the fundamental limitations on human
intelligence that arise from this and how close can human bounded-optimality
ever get to AI bounded-optimality? Being able to take into account computational
constraints at varying levels of abstraction may be useful for progress on this
question.</p>
</li>
<li>
<p>It can potentially be used as a more
principled, versatile way to predict when people will use different types of
approximations, which is useful for many applications. For example, it could
help you understand the circumstances under which an employee’s decisions are
likely to be less accurate than usual, or help an artificial agent decide how
much to trust your “expert knowledge”.</p>
</li>
</ol>
<p>I’m curious as to what the other reasons are that people find bounded optimality
interesting, so please let me know. :)</p>
Sun, 18 Sep 2016 12:00:00 +0000
http://smithamilli.com/blog/bounded-optimality/
http://smithamilli.com/blog/bounded-optimality/Kneser-Ney Smoothing<p><a href="https://en.wikipedia.org/wiki/Language_model">Language modeling</a> is important for almost all natural language processing tasks:
speech recognition, spelling correction, machine translation, etc. Today I’ll
go over Kneser-Ney smoothing, a historically important technique for language
model smoothing.</p>
<h2>Language Models</h2>
<p>A language model estimates the probability of an n-gram from a training corpus.
The simplest way to get a probability distribution over n-grams from a corpus is
to use the <a href="https://en.wikipedia.org/wiki/Maximum_likelihood">MLE</a>.
That is, the probability of an n-gram <script type="math/tex">(w_{1}, \dots, w_{n})</script> is simply the number of times
it appears divided by the number of n-grams. Usually we’re interested in the
conditional probability of the last word, given the context of the last (n-1)
words:</p>
<p><script type="math/tex">P(w_n | w_1, \dots, w_{n - 1}) = \frac{C(w_1, \dots , w_n
)}{\sum_{w' \in L}C(w_1, \dots , w')}</script>
where C(x) is the number of times that x appears and L is the set of all possible
words.</p>
<p>The problem with the MLE arises when the n-gram you want a probability
for was not seen in the data; in these cases the MLE will simply assign the
sequence a probability of zero. This is an inevitable problem for language
tasks because no matter how large your corpus is, it cannot contain all
possible n-grams of the language.</p>
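<p>To make the zero-probability problem concrete, here is a minimal bigram MLE sketch (the toy corpus and function names are invented for illustration):</p>

```python
from collections import Counter

corpus = "the cat sat on the mat the cat pounced".split()

# Count bigrams and the number of times each word serves as a context.
bigrams = Counter(zip(corpus, corpus[1:]))
context_counts = Counter(corpus[:-1])

def p_mle(w, context):
    # P(w | context) = C(context, w) / C(context)
    if context_counts[context] == 0:
        return 0.0
    return bigrams[(context, w)] / context_counts[context]

print(p_mle("cat", "the"))  # 2/3: seen bigram
print(p_mle("dog", "the"))  # 0.0: unseen bigram gets zero probability
```

<p>Any bigram absent from the corpus gets probability zero, no matter how plausible it is, which is exactly what smoothing has to fix.</p>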
<p>(About a
month ago I also wrote about how to use a <a href="/blog/anagrams">trigram character model</a> to generate pronounceable
anagrams. Can you see why smoothing was unnecessary for a character model?)</p>
<h2>Kneser-Ney Smoothing</h2>
<p>The solution is to “smooth” the language models to move some probability towards
unknown n-grams. There are many ways to do this, but the method with the <a href="https://en.wikipedia.org/wiki/Perplexity">best
performance</a> is interpolated modified <a href="https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing">Kneser-Ney
smoothing</a>. I’ll explain the intuition behind Kneser-Ney in three parts:</p>
<h3>Absolute-Discounting</h3>
<p>To retain a valid probability distribution (i.e. one that sums to one) we must
remove some probability mass from the MLE to use for n-grams that were not seen
in the corpus. Absolute discounting does this by subtracting a fixed number D from all n-gram counts. The adjusted count of an
n-gram is <script type="math/tex">A(w_{1}, \dots, w_{n}) = C(w_{1}, \dots, w_{n}) - D</script>.</p>
<h3>Interpolation</h3>
<p>After we’ve assured that we have probability mass to use for unknown n-grams,
now we still need to figure out how to actually estimate the probability of unknown n-grams.</p>
<p>A clever way to do this is to use lower order models. Suppose your language model estimates the probabilities of trigrams. When you come across an unknown
trigram, e.g. (‘orange’, ‘cat’, ‘pounced’), although the trigram may be
unknown, the bigram suffix, (‘cat’, ‘pounced’), may be present in the corpus.
So, when creating a language model, we don’t merely calculate the probabilities
of all N-grams, where N is the highest order of the language model, we estimate
probabilities for all k-grams where <script type="math/tex">k \in \{1, \dots, N\}</script>.</p>
<p>Interpolation recursively combines probabilities of all lower-order models to
get the probability of an n-gram:</p>
<script type="math/tex; mode=display">P_{s}(w_{n} | w_{i}, \dots, w_{n - 1}) = \frac{C(w_i, \dots , w_n
) - D}{\sum_{w' \in L}C(w_i, \dots , w')} + \gamma(w_{i}, \dots, w_{n -
1})P_{s}(w_{n} | w_{i + 1} \dots, w_{n - 1})</script>
<p>The recursion stops at the unigram model:
<script type="math/tex">P_{s}(w) = \frac{C(w)}{\sum_{w' \in L} C(w')}</script></p>
<p><script type="math/tex">\gamma(w_{i}, \dots, w_{n -1})</script> is known as the back-off weight. It is simply
the amount of probability mass we left for the next lower order model.</p>
<script type="math/tex; mode=display">\gamma(w_{i}, \dots, w_{n -1}) = \frac{D \cdot |\{(w_{i}, \dots, w_{n -1},
w') : C(w_{i}, \dots, w_{n -1}, w') > 0 \}| }{\sum_{w' \in L}C(w_i, \dots , w')}</script>
<p>After interpolating the probabilities, if a sequence has any k-gram suffix
present in the corpus, it will have a non-zero probability.</p>
<p>It’s also easier to see why absolute discounting works so well now. Notice how
the fewer words there are that follow the context (the sequence of words we’re
conditioning on), the lower the associated back-off weight for that context is.
This makes sense: if only a few words follow a given context, it’s less
likely that a new word following that context is valid.</p>
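<p>Putting absolute discounting and interpolation together for a bigram model that backs off to unigrams, a minimal sketch might look like this (the toy corpus, the discount value, and all names are invented for illustration; a full model would recurse through all orders):</p>

```python
from collections import Counter

corpus = "the cat sat on the mat the cat pounced".split()
D = 0.75  # fixed discount, chosen for illustration

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
context_total = Counter(corpus[:-1])

def p_unigram(w):
    return unigrams[w] / sum(unigrams.values())

def p_smoothed(w, context):
    total = context_total[context]
    if total == 0:
        return p_unigram(w)  # back off entirely for an unseen context
    # Discounted MLE term: max(C - D, 0) / C(context).
    discounted = max(bigrams[(context, w)] - D, 0) / total
    # Back-off weight gamma: the mass removed by discounting, i.e.
    # D times the number of word types seen after this context.
    n_types = sum(1 for b in bigrams if b[0] == context)
    gamma = D * n_types / total
    return discounted + gamma * p_unigram(w)

# The smoothed probabilities still sum to 1 over the vocabulary.
print(sum(p_smoothed(w, "the") for w in unigrams))
```

<p>Note that ‘pounced’ never follows ‘the’ in the corpus, yet it now gets a non-zero probability via the interpolated unigram term.</p>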
<h3>Word Histories</h3>
<p>This is the part that is actually attributed to Kneser & Ney. When predicting
the probability of a word given a context, we not only want to take into account
the current context, but the number of contexts that the word appears in. Remember how absolute discounting works well because if there are only a few words that come
after a context, a novel word in that context should be less likely? It also works the other way. If a word appears after a small number of contexts, then it should be less likely to appear in a novel context.</p>
<p>The quintessential example is ‘San Francisco’. Francisco alone may have a high
count in a corpus, but it should never be predicted unless it follows ‘San’.
This is the motivation for replacing the MLE unigram probability with the
‘continuation probability’ that estimates how likely the unigram is to continue
a new context.</p>
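<p>This intuition can be sketched numerically. The toy corpus below is invented so that ‘francisco’ and ‘fog’ have equal unigram counts but different numbers of histories:</p>

```python
corpus = "san francisco san francisco san francisco heavy fog light fog thick fog".split()

# Unique bigram types in the corpus.
bigrams = set(zip(corpus, corpus[1:]))

def p_continuation(w):
    # Number of unique left-contexts of w, over the number of bigram types.
    return sum(1 for b in bigrams if b[1] == w) / len(bigrams)

# 'francisco' and 'fog' each appear 3 times, but 'francisco' only ever
# follows 'san', while 'fog' follows three different words.
print(p_continuation("francisco"))  # 1/8
print(p_continuation("fog"))        # 3/8
```

<p>Despite identical raw counts, the continuation probability of ‘francisco’ is much lower, which is exactly the San Francisco effect described above.</p>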
<p>Let <script type="math/tex">N_{1+}(\bullet w_{1}, \dots, w_{k}) =
|\{(w', w_{1}, \dots, w_{k}) : C(w', w_{1}, \dots, w_{k}) > 0 \}|</script></p>
<script type="math/tex; mode=display">P_{KN}(w) = \frac{N_{1+}(\bullet w)}{N_{1+}(\bullet \bullet)}</script>
<p>The unigram Kneser-Ney probability is the number of unique words the unigram
follows divided by the total number of unique bigram types. The Kneser-Ney
unigram probability can be extended to k-grams, where 1 <= k < N, as follows:</p>
<script type="math/tex; mode=display">P_{KN}(w_{n} | w_{i}, \dots, w_{n - 1}) = \frac{N_{1+}(\bullet w_{i}, \dots,
w_{n}) - D}{N_{1+}(\bullet w_{i}, \dots, w_{n - 1}, \bullet)} +
\lambda(w_{i}, \dots, w_{n - 1})P_{KN}(w_{n} | w_{i + 1}, \dots, w_{n - 1})</script>
<p>Note that the above equation does NOT apply to the highest order; we have no data
on the ‘word histories’ for the highest order N-grams. When i = 1 in the above
equation, we instead use normal counts discounted and interpolated with the remaining Kneser-Ney probabilities:</p>
<script type="math/tex; mode=display">P_{KN}(w_{n} | w_{1}, \dots, w_{n - 1}) = \frac{C(w_{1}, \dots,
w_{n}) - D}{\sum_{w' \in L} C(w_{1}, \dots, w_{n - 1}, w')} +
\lambda(w_{1}, \dots, w_{n - 1})P_{KN}(w_{n} | w_{2}, \dots, w_{n - 1})</script>
<p>Side note:
In reality, there are normally three different discount values, <script type="math/tex">D_{k, 1}</script>,
<script type="math/tex">D_{k, 2}</script>,
and <script type="math/tex">D_{k, 3+}</script>, computed for each k-gram order (1 <= k <= N). <script type="math/tex">D_{k, i}</script>
is used if <script type="math/tex">C(w_{N - k + 1}, \dots, w_{N}) = i</script>. The closed-form estimate for the optimal discounts (see <a href="http://www.cs.berkeley.edu/~klein/cs294-5/chen_goodman.pdf">Chen &
Goodman</a>) is</p>
<script type="math/tex; mode=display">D_{k, i} = i - (i + 1)Y_{k}\frac{N_{k, i + 1}}{N_{k, i}}</script>
<p>where <script type="math/tex">Y_{k} = \frac{N_{k, 1}}{N_{k, 1} + 2N_{k, 2}}</script>. If k = N,
<script type="math/tex">N_{k, i} = |\{w_{N - k + 1}, \dots, w_{N} : C(w_{N - k + 1}, \dots, w_{N}) =
i\}|</script>. Otherwise, <script type="math/tex">N_{k, i} = |\{w_{N - k + 1}, \dots, w_{N} : N_{1+}(\bullet w_{N - k + 1}, \dots, w_{N}) = i\}|</script></p>
<p>The use of multiple discount values is the ‘modified’ part of modified
Kneser-Ney smoothing.</p>
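<p>As a sketch, the discounts for one k-gram order can be computed directly from the counts-of-counts; the N values below are invented for illustration:</p>

```python
# N[i] = number of k-grams (for one fixed order k) appearing exactly i times.
# These counts-of-counts are invented for illustration.
N = {1: 2000, 2: 800, 3: 450, 4: 300}

Y = N[1] / (N[1] + 2 * N[2])

# D_i = i - (i + 1) * Y * N_{i+1} / N_i, for i = 1, 2, 3 ("3+" uses i = 3).
D = {i: i - (i + 1) * Y * N[i + 1] / N[i] for i in (1, 2, 3)}
print(D)  # roughly D[1]≈0.56, D[2]≈1.06, D[3]≈1.52
```

<p>Higher counts get larger discounts, which matches the empirical finding of Chen & Goodman that a single discount value is suboptimal.</p>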
<h2>Language Modeling Toolkits</h2>
<p>How do you actually create a Kneser-Ney language model? I put a pretty
bare-bones, unoptimized <a href="https://github.com/smilli/kneser-ney">implementation of Kneser-Ney smoothing on Github</a> in the hopes that it
would be easy to learn from / use for small datasets.</p>
<p>But there exist several free and open-source language modeling toolkits that are
much more optimized for memory/performance. I recommend <a href="https://kheafield.com/code/kenlm/">KenLM</a>.
It’s written in C++, but there’s <a href="http://victor.chahuneau.fr/notes/2012/07/03/kenlm.html">also an article on how
to use KenLM in Python</a>. Others include <a href="https://code.google.com/p/berkeleylm/">BerkeleyLM</a>, <a href="http://www.speech.sri.com/projects/srilm/">SRILM</a>, and <a href="https://code.google.com/p/mitlm/">MITLM</a>.</p>
<h2>Further Reading</h2>
<ul>
<li>NLP Courses
<ul>
<li><a href="https://class.coursera.org/nlp/lecture">Stanford NLP
Coursera</a></li>
<li><a href="http://www.cs.columbia.edu/~cs4705/">Columbia's NLP class</a>: Michael Collins' lecture notes are
really good.</li>
</ul>
</li>
<li>Smoothing
<ul>
<li><a href="http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf">
Stanford NLP Smoothing Tutorial</a>: Easy explanations of
different smoothing techniques.</li>
<li><a href="http://www.cs.berkeley.edu/~klein/cs294-5/chen_goodman.pdf">An
Empirical Study of Smoothing Techniques for Language Modeling
</a>: Compares performance of different smoothing techniques.</li>
<li><a href="http://www.aclweb.org/anthology/D07-1090.pdf">
Stupid Backoff
</a>: An extremely simplistic type of smoothing that does as
well as Kneser-Ney smoothing for very large datasets.</li>
</ul>
</li>
<li>Language Model Estimation
<ul>
<li><a href="https://kheafield.com/professional/edinburgh/estimate_paper.pdf">Scalable
Modified Kneser-Ney Language Model Estimation</a>: This is the paper
that explains how KenLM does language model estimation. Section
three, "Estimation Pipeline", is really helpful.</li>
<li><a href="http://nlp.cs.berkeley.edu/pubs/Pauls-Klein_2011_LM_paper.pdf">Faster
and Smaller N-Gram Language Models</a>: BerkeleyLM paper</li>
<li><a href="http://www.aclweb.org/anthology/W09-1505">Tightly Packed
Tries: How to Fit Large Models into Memory,
and Make them Load Fast, Too</a></li>
</ul>
</li>
</ul>
Tue, 30 Jun 2015 16:00:00 +0000
http://smithamilli.com/blog/kneser-ney/
http://smithamilli.com/blog/kneser-ney/