Smitha Milli
http://smithamilli.com/
09.14.18

<p>She was briskly walking downtown when the pressure attacked her, as if a hand
were clenching and releasing her heart, clenching and releasing, clenching and
releasing. It toyed with her, tantalizing her with momentary relief, before
tightening its grip again. With each repetition the hand grew and expanded its
control to a larger area. Its domain grew from her heart to her chest to her
torso. Soon, even the brief reprieves were dutifully eliminated and replaced by
the mechanical launching of sharp swords straight to her skull.</p>
<p>She fell to her knees, holding her real, physical hands upon her chest. <em>Why now? Please, not now.</em></p>
<p>The sea of bankers bustling forward flowed around a fictional sphere of five-foot diameter whose singular focus was the woman. It was the flow who not only sculpted the object but <del>prevented</del> protected its members from seeing it. A benevolent creator.</p>
<p><em>One chance, just one chance.</em> She gasped and exhaled a nonsensical cry. At
that moment, a vertical blow had whipped her whole being. Instantly, she
curled into a compact ball. But to say that it was “she” is an imprecise
misnomer. The rearrangement of the constituent parts had no need for
consciousness, for agency; it was the inexorable actualization of the
reactants, the preconditions. However, in this shriveled, defeated position,
she, with all her effort, squinted her face, to voice her final word:
“Motherfuckers.”</p>
Fri, 14 Sep 2018 12:00:00 +0000
http://smithamilli.com/blog/cons/
conversation with a rationalist

<p>“How do you know Tania?”</p>
<p>“Oh, from the rationalist sphere, of course. How about you?”</p>
<p>“Oh, well we’re both in the same PhD program.”</p>
<p>“Oh, nice,” he nodded.</p>
<p><em>Pause.</em></p>
<p>“Cool dog tag necklace,” she pointed.</p>
<p>“It’s for the Cryonics Institute.” He held it up, so she could read the engraving detailing the number to call in the case of death.</p>
<p>“Wait, cryonics? Like, freezing yourself?”</p>
<p>“Yeah. You should consider it.”</p>
<p>“Uhhhh… you believe they’re going to be able to revive you?”</p>
<p>“I believe there’s a non-negligible probability that one day we will have the technology to do so, yes.”</p>
<p>“Why would they care about keeping their promise to you if you’re dead?”</p>
<p>“That’s a common critique. But presumably, family members of people who had just died would rally to revive their loved ones. Then the people who got revived would want to revive their loved ones and so on.”</p>
<p>“So… you pay for this?”</p>
<p>“Yes,” he answered back (leaving the “isn’t it obvious?” implied).</p>
<p>“How much?”</p>
<p>“It’s $120 per year and a $35,000 one-time fee.”</p>
<p>“What?! Are you serious? That’s so expensive.”</p>
<p>“No, it’s actually really cheap. Even if you discount future years, it would be rational to pay a small $35k, for the chance of infinite future life.”</p>
<p>She stared at him, shaking her head in disbelief.</p>
<p>“Plus, it has great signaling value,” he said gleefully.</p>
<p>“Well… that makes sense.” She leaned back. “I’m going to get a drink.”</p>
Wed, 12 Sep 2018 12:00:00 +0000
http://smithamilli.com/blog/convo/
Bounded Optimality

<p>A friend recently asked me why I find bounded optimality interesting. Here’s
why:</p>
<ol>
<li>
<p>It is necessary to have a normative framework for how agents should act under
computational pressure, because that is what the real world demands. In the
real world, an agent should know not to deliberate for long when it is about to get
hit by a car, but should definitely perform more computation before declaring
war. (This is related to our work on <a href="http://smithamilli.com/pubs/2017aaai_meta.pdf">bounded-optimal
metareasoning</a>!)
See <a href="https://people.eecs.berkeley.edu/~russell/papers/ptai13-intelligence.pdf">Rationality and
Intelligence: A Brief Update</a>
(Russell 2014) for more on this point.</p>
</li>
<li>
<p>It’s an elegant framework because it provides <a href="http://gershmanlab.webfactional.com/pubs/GershmanHorvitzTenenbaum15.pdf">“a converging paradigm
for intelligence in brains, minds, and
machines”</a>
that may allow for more transfer of insights between cognitive science and
artificial intelligence (Gershman, Horvitz, and Tenenbaum 2015). For example, understanding the bounded-optimal
solutions that humans use may be useful for creating better
approximation strategies for artificial agents to use under computational pressure. Conversely,
when we have an optimal solution that an artificial agent can implement, we can
ask what the bounded-optimal solution would be in the real-world environment that
humans live in, and check whether that is the kind of behavior humans actually
exhibit.</p>
</li>
<li>
<p>It provides an appealing way to <a href="https://cocosci.berkeley.edu/tom/papers/RationalUseOfCognitiveResources.pdf">bridge the gap between Marr’s computational and
algorithmic
level</a>
(Griffiths, Lieder, and Goodman 2014). As mentioned in (1)
bounded-optimality is necessary as a normative framework for artificial agents
because the costs of computation are an important factor to decision-making in
real-world environments. But humans also face costs of computation that arise
from intrinsic biological bounds rather than from the environment. An
interesting question, then, is what fundamental limitations on human
intelligence arise from these bounds, and how close human bounded-optimality
can ever get to AI bounded-optimality. Being able to take into account computational
constraints at varying levels of abstraction may be useful for progress on this
question.</p>
</li>
<li>
<p>It can potentially be used as a more
principled, versatile way to predict when people will use different types of
approximations, which is useful for many applications: for example, understanding the
circumstances under which an employee’s decisions are likely to be less
accurate than usual, or helping an artificial agent decide how much to trust
your “expert knowledge”.</p>
</li>
</ol>
<p>I’m curious what other reasons people have for finding bounded optimality
interesting, so please let me know. :)</p>
Sun, 18 Sep 2016 12:00:00 +0000
http://smithamilli.com/blog/bounded-optimality/
Kneser-Ney Smoothing

<p><a href="https://en.wikipedia.org/wiki/Language_model">Language modeling</a> is important for almost all natural language processing tasks:
speech recognition, spelling correction, machine translation, etc. Today I’ll
go over Kneser-Ney smoothing, a historically important technique for language
model smoothing.</p>
<h2>Language Models</h2>
<p>A language model estimates the probability of an n-gram from a training corpus.
The simplest way to get a probability distribution over n-grams from a corpus is
to use the <a href="https://en.wikipedia.org/wiki/Maximum_likelihood">MLE</a>.
That is, the probability of an n-gram <script type="math/tex">(w_{1}, \dots, w_{n})</script> is simply the number of times
it appears divided by the total number of n-grams in the corpus. Usually we’re interested in the
conditional probability of the last word, given the context of the last (n-1)
words:</p>
<p><script type="math/tex">P(w_n | w_1, \dots, w_{n - 1}) = \frac{C(w_1, \dots , w_n
)}{\sum_{w' \in L}C(w_1, \dots, w_{n - 1}, w')}</script>
where C(x) is the number of times that x appears and L is the set of all possible
words.</p>
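<p>As a quick illustrative sketch (my own, with a made-up toy corpus), the MLE conditional probability for a bigram model can be computed directly from counts:</p>

```python
from collections import Counter

# Toy corpus (made up for illustration).
corpus = "the cat sat on the mat the cat pounced".split()

# Bigram counts C(w1, w2) and context counts C(w1).
bigram_counts = Counter(zip(corpus, corpus[1:]))
context_counts = Counter(corpus[:-1])

def p_mle(word, context):
    """MLE estimate of P(word | context) = C(context, word) / C(context)."""
    if context_counts[context] == 0:
        return 0.0
    return bigram_counts[(context, word)] / context_counts[context]

print(p_mle("cat", "the"))  # "cat" follows "the" in 2 of 3 occurrences
print(p_mle("dog", "the"))  # unseen bigram: probability zero
```

<p>The second call is exactly the zero-probability problem described above: any bigram absent from the corpus gets probability zero.</p>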
<p>The problem with the MLE arises when the n-gram you want a probability
for was not seen in the data; in these cases the MLE simply assigns such
sequences a probability of zero. This is an inevitable problem for language
tasks because, no matter how large your corpus is, it cannot contain
all possible n-grams of the language.</p>
<p>(About a
month ago I also wrote about how to use a <a href="/blog/anagrams">trigram character model</a> to generate pronounceable
anagrams. Can you see why smoothing was unnecessary for a character model?)</p>
<h2>Kneser-Ney Smoothing</h2>
<p>The solution is to “smooth” the language models to move some probability towards
unknown n-grams. There are many ways to do this, but the method with the <a href="https://en.wikipedia.org/wiki/Perplexity">best
performance</a> is interpolated modified <a href="https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing">Kneser-Ney
smoothing</a>. I’ll explain the intuition behind Kneser-Ney in three parts:</p>
<h3>Absolute-Discounting</h3>
<p>To retain a valid probability distribution (i.e. one that sums to one) we must
remove some probability mass from the MLE to use for n-grams that were not seen
in the corpus. Absolute discounting does this by subtracting a fixed discount D from the count of every observed n-gram. The adjusted count of an
n-gram is <script type="math/tex">A(w_{1}, \dots, w_{n}) = C(w_{1}, \dots, w_{n}) - D</script>.</p>
<h3>Interpolation</h3>
<p>After we’ve ensured that there is probability mass reserved for unknown n-grams,
we still need a way to actually estimate their probabilities.</p>
<p>A clever way to do this is to use lower order models. Suppose your language model estimates the probabilities of trigrams. When you come across an unknown
trigram, e.g. (‘orange’, ‘cat’, ‘pounced’), although the trigram may be
unknown, the bigram suffix, (‘cat’, ‘pounced’), may be present in the corpus.
So, when creating a language model, we don’t merely calculate the probabilities
of all N-grams, where N is the highest order of the language model, we estimate
probabilities for all k-grams where <script type="math/tex">k \in \{1, \dots, N\}</script>.</p>
<p>Interpolation recursively combines probabilities of all lower-order models to
get the probability of an n-gram:</p>
<script type="math/tex; mode=display">P_{s}(w_{n} | w_{i}, \dots, w_{n - 1}) = \frac{C(w_i, \dots , w_n
) - D}{\sum_{w' \in L}C(w_i, \dots, w_{n - 1}, w')} + \gamma(w_{i}, \dots, w_{n -
1})P_{s}(w_{n} | w_{i + 1}, \dots, w_{n - 1})</script>
<p>The recursion stops at the unigram model:
<script type="math/tex">P_{s}(w) = \frac{C(w)}{\sum_{w' \in L} C(w')}</script></p>
<p><script type="math/tex">\gamma(w_{i}, \dots, w_{n -1})</script> is known as the back-off weight. It is simply
the amount of probability mass we left for the next lower order model.</p>
<script type="math/tex; mode=display">\gamma(w_{i}, \dots, w_{n -1}) = \frac{D \cdot |\{(w_{i}, \dots, w_{n -1},
w') : C(w_{i}, \dots, w_{n -1}, w') > 0 \}| }{\sum_{w' \in L}C(w_i, \dots, w_{n - 1}, w')}</script>
<p>After interpolating the probabilities, if a sequence has any k-gram suffix
present in the corpus, it will have a non-zero probability.</p>
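<p>To make the recursion concrete, here is a minimal sketch (mine, not the author’s implementation) of an absolutely-discounted bigram model interpolated with the unigram MLE; the discount value and corpus are illustrative:</p>

```python
from collections import Counter

corpus = "the cat sat on the mat the cat pounced".split()
D = 0.75  # fixed discount; in practice tuned or estimated

bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus)
total_words = sum(unigram_counts.values())

def p_interp(word, context):
    """Discounted bigram probability, interpolated with the unigram MLE."""
    p_unigram = unigram_counts[word] / total_words
    c_context = sum(c for (w1, _), c in bigram_counts.items() if w1 == context)
    if c_context == 0:
        return p_unigram  # unseen context: fall back entirely to the unigram model
    discounted = max(bigram_counts[(context, word)] - D, 0.0) / c_context
    # Back-off weight gamma: the discount mass removed from seen bigrams,
    # proportional to the number of distinct words following the context.
    n_followers = sum(1 for (w1, _) in bigram_counts if w1 == context)
    gamma = D * n_followers / c_context
    return discounted + gamma * p_unigram
```

<p>The unseen bigram (‘the’, ‘pounced’) now gets non-zero probability through the unigram term, and the distribution over the vocabulary still sums to one.</p>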
<p>Now it’s also easier to see why absolute discounting works so well. Notice that
the fewer words there are that follow the context (the sequence of words we’re
conditioning on), the lower the associated back-off weight for that context is.
This makes sense: if only a few words follow a given
context, it’s less likely that a new word following the context is valid.</p>
<h3>Word Histories</h3>
<p>This is the part that is actually attributed to Kneser & Ney. When predicting
the probability of a word given a context, we want to take into account
not only the current context, but also the number of contexts that the word appears in. Remember how absolute discounting works well because, if there are only a few words that come
after a context, a novel word in that context should be less likely? It also works the other way: if a word appears after only a small number of contexts, then it should be less likely to appear in a novel context.</p>
<p>The quintessential example is ‘San Francisco’. Francisco alone may have a high
count in a corpus, but it should never be predicted unless it follows ‘San’.
This is the motivation for replacing the MLE unigram probability with the
‘continuation probability’ that estimates how likely the unigram is to continue
a new context.</p>
<p>Let <script type="math/tex">N_{1+}(\bullet w_{1}, \dots, w_{k}) =
|\{(w', w_{1}, \dots, w_{k}) : C(w', w_{1}, \dots, w_{k}) > 0 \}|</script></p>
<script type="math/tex; mode=display">P_{KN}(w) = \frac{N_{1+}(\bullet w)}{N_{1+}(\bullet \bullet)}</script>
<p>The unigram Kneser-Ney probability is the number of unique words the unigram
follows divided by the total number of unique bigrams. The Kneser-Ney unigram probability can be
extended to k-grams, where 1 <= k < N, as such:</p>
<script type="math/tex; mode=display">P_{KN}(w_{n} | w_{i}, \dots, w_{n - 1}) = \frac{N_{1+}(\bullet w_{i}, \dots,
w_{n}) - D}{N_{1+}(\bullet w_{i}, \dots, w_{n - 1}, \bullet)} +
\lambda(w_{i}, \dots, w_{n - 1})P_{KN}(w_{n} | w_{i + 1}, \dots, w_{n - 1})</script>
<p>Note that the above equation does NOT apply to the highest order; we have no data
on the ‘word histories’ for the highest order N-grams. When i = 1 in the above
equation, we instead use normal counts discounted and interpolated with the remaining Kneser-Ney probabilities:</p>
<script type="math/tex; mode=display">P_{KN}(w_{n} | w_{1}, \dots, w_{n - 1}) = \frac{C(w_{1}, \dots,
w_{n}) - D}{\sum_{w' \in L} C(w_{1}, \dots, w_{n - 1}, w')} +
\lambda(w_{1}, \dots, w_{n - 1})P_{KN}(w_{n} | w_{2}, \dots, w_{n - 1})</script>
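<p>A small sketch (again mine, with a made-up corpus) of the continuation count behind the Kneser-Ney unigram probability:</p>

```python
from collections import Counter

corpus = "the cat sat on the mat the cat pounced".split()

# Unique bigram *types*, not token counts.
bigram_types = set(zip(corpus, corpus[1:]))

def p_continuation(word):
    """P_KN(w): number of unique left contexts of `word`,
    divided by the number of unique bigram types."""
    n_left_contexts = sum(1 for (_, w2) in bigram_types if w2 == word)
    return n_left_contexts / len(bigram_types)

# "cat" occurs twice in the corpus but only ever follows "the", so its
# continuation probability reflects a single context, not its raw count.
```

<p>This is the ‘San Francisco’ effect in miniature: a frequent word that appears after only one context gets a low continuation probability.</p>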
<p>Side note:
In reality, there are normally three different discount values, <script type="math/tex">D_{k, 1}</script>,
<script type="math/tex">D_{k, 2}</script>,
and <script type="math/tex">D_{k, 3+}</script>, computed for each k-gram order (1 <= k <= N). <script type="math/tex">D_{k, i}</script>
is used if <script type="math/tex">C(w_{N - k + 1}, \dots, w_{N}) = i</script>. The closed-form estimate for the optimal discounts (see <a href="http://www.cs.berkeley.edu/~klein/cs294-5/chen_goodman.pdf">Chen &
Goodman</a>) is</p>
<script type="math/tex; mode=display">D_{k, i} = i - (i + 1)Y_{k}\frac{N_{k, i + 1}}{N_{k, i}}</script>
<p>where <script type="math/tex">Y_{k} = \frac{N_{k, 1}}{N_{k, 1} + 2N_{k, 2}}</script>. If k = N,
<script type="math/tex">N_{k, i} = |\{w_{N - k + 1}, \dots, w_{N} : C(w_{N - k + 1}, \dots, w_{N}) =
i\}|</script>. Otherwise, <script type="math/tex">N_{k, i} = |\{w_{N - k + 1}, \dots, w_{N} : N_{1+}(\bullet w_{N - k + 1}, \dots, w_{N}) = i\}|</script>.</p>
<p>The use of multiple discount values is what the ‘modified’ in ‘modified
Kneser-Ney smoothing’ refers to.</p>
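<p>The closed-form discounts are easy to compute once you have the counts-of-counts; here is a sketch with illustrative numbers (not from any real corpus):</p>

```python
def modified_discounts(n1, n2, n3, n4):
    """Chen & Goodman closed-form estimates of D_1, D_2, D_3+ for one
    k-gram order, where n_i is the number of k-grams occurring exactly
    i times (counts-of-counts)."""
    y = n1 / (n1 + 2 * n2)
    d1 = 1 - 2 * y * n2 / n1       # D_{k,1} = 1 - 2 * Y * n2 / n1
    d2 = 2 - 3 * y * n3 / n2       # D_{k,2} = 2 - 3 * Y * n3 / n2
    d3_plus = 3 - 4 * y * n4 / n3  # D_{k,3+} = 3 - 4 * Y * n4 / n3
    return d1, d2, d3_plus

# Illustrative counts-of-counts for a single k-gram order.
d1, d2, d3 = modified_discounts(n1=100, n2=40, n3=20, n4=10)
```

<p>With typical Zipfian counts-of-counts (many singletons, fewer higher counts), the three discounts come out increasing: higher-count k-grams absorb a larger absolute discount.</p>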
<h2>Language Modeling Toolkits</h2>
<p>How do you actually create a Kneser-Ney language model? I put a pretty
bare-bones, unoptimized <a href="https://github.com/smilli/kneser-ney">implementation of Kneser-Ney smoothing on Github</a> in the hopes that it
would be easy to learn from / use for small datasets.</p>
<p>But there exist several free and open-source language modeling toolkits that are
much more optimized for memory and performance. I recommend <a href="https://kheafield.com/code/kenlm/">KenLM</a>.
It’s written in C++, but there’s <a href="http://victor.chahuneau.fr/notes/2012/07/03/kenlm.html">also an article on how
to use KenLM in Python</a>. Others include <a href="https://code.google.com/p/berkeleylm/">BerkeleyLM</a>, <a href="http://www.speech.sri.com/projects/srilm/">SRILM</a>, and <a href="https://code.google.com/p/mitlm/">MITLM</a>.</p>
<h2>Further Reading</h2>
<ul>
<li>NLP Courses
<ul>
<li><a href="https://class.coursera.org/nlp/lecture">Stanford NLP
Coursera</a></li>
<li><a href="http://www.cs.columbia.edu/~cs4705/">Columbia's NLP class</a>: Michael Collins' lecture notes are
really good.</li>
</ul>
</li>
<li>Smoothing
<ul>
<li><a href="http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf">
Stanford NLP Smoothing Tutorial</a>: Easy explanations of
different smoothing techniques.</li>
<li><a href="http://www.cs.berkeley.edu/~klein/cs294-5/chen_goodman.pdf">An
Empirical Study of Smoothing Techniques for Language Modeling
</a>: Compares performance of different smoothing techniques.</li>
<li><a href="http://www.aclweb.org/anthology/D07-1090.pdf">
Stupid Backoff
</a>: An extremely simplistic type of smoothing that does as
well as Kneser-Ney smoothing for very large datasets.</li>
</ul>
</li>
<li>Language Model Estimation
<ul>
<li><a href="https://kheafield.com/professional/edinburgh/estimate_paper.pdf">Scalable
Modified Kneser-Ney Language Model Estimation</a>: This is the paper
that explains how KenLM does language model estimation. Section
three, "Estimation Pipeline", is really helpful.</li>
<li><a href="http://nlp.cs.berkeley.edu/pubs/Pauls-Klein_2011_LM_paper.pdf">Faster
and Smaller N-Gram Language Models</a>: BerkeleyLM paper</li>
<li><a href="http://www.aclweb.org/anthology/W09-1505">Tightly Packed
Tries: How to Fit Large Models into Memory,
and Make them Load Fast, Too</a></li>
</ul>
</li>
</ul>
Tue, 30 Jun 2015 16:00:00 +0000
http://smithamilli.com/blog/kneser-ney/