Smitha Milli
http://smithamilli.com/
Fairness Criteria<p>A reference list of the observational criteria for fairness from <a href="https://fairmlclass.github.io/4.html#/">Moritz Hardt’s slides</a>. See the slides for intuition and the pros and cons of each criterion.</p>
<h2 id="setup">Setup</h2>
<p><script type="math/tex">X</script>: features of an individual<br />
<script type="math/tex">A</script>: sensitive attribute (race, gender)<br />
<script type="math/tex">C(X, A)</script>: classifier mapping <script type="math/tex">X</script> and <script type="math/tex">A</script> to some prediction<br />
<script type="math/tex">Y</script>: actual outcome<br />
Assume <script type="math/tex">C, Y</script> are binary 0/1 variables.<br />
<script type="math/tex">P_{a}(E) = P(E|A=a)</script></p>
<h2 id="criteria">Criteria</h2>
<p>For all groups <script type="math/tex">a, b</script>:<br />
<strong>Demographic parity</strong>:<br />
<script type="math/tex">P_{a}(C=1) = P_{b}(C=1)</script><br />
<strong>Accuracy parity</strong>:<br />
<script type="math/tex">P_{a}(C=Y) = P_{b}(C=Y)</script></p>
<h3 id="positive-rate-parity">Positive rate parity</h3>
<p><strong>True positive parity (TPP)</strong>:<br />
<script type="math/tex">P_{a}(C=1|Y=1) = P_{b}(C=1|Y=1)</script><br />
<strong>False positive parity (FPP)</strong>:<br />
<script type="math/tex">P_{a}(C=1|Y=0) = P_{b}(C=1|Y=0)</script><br />
<strong>Positive rate parity/equalized odds</strong>:<br />
Both TPP and FPP hold.</p>
<h3 id="predictive-value-parity">Predictive value parity</h3>
<p><strong>Positive predictive value parity</strong>:<br />
<script type="math/tex">P_{a}(Y=1|C=1) = P_{b}(Y=1|C=1)</script><br />
<strong>Negative predictive value parity</strong>:<br />
<script type="math/tex">P_{a}(Y=1|C=0) = P_{b}(Y=1|C=0)</script><br />
<strong>Predictive value parity</strong>:<br />
Both positive and negative predictive value parity hold.</p>
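To make the definitions concrete, here’s a small sketch (mine, not from the slides) that computes each per-group quantity from arrays of predictions, outcomes, and group labels; a criterion holds when the corresponding rates match across groups:

```python
def group_rates(c, y, a):
    """Per-group rates for binary predictions c, outcomes y, group labels a."""
    def frac(pairs, cond, given=lambda p: True):
        # fraction of pairs satisfying cond, among those satisfying given
        sub = [p for p in pairs if given(p)]
        return sum(cond(p) for p in sub) / len(sub) if sub else None

    rates = {}
    for g in set(a):
        pairs = [(c[i], y[i]) for i in range(len(a)) if a[i] == g]
        rates[g] = {
            "P(C=1)":     frac(pairs, lambda p: p[0] == 1),                       # demographic parity
            "P(C=Y)":     frac(pairs, lambda p: p[0] == p[1]),                    # accuracy parity
            "P(C=1|Y=1)": frac(pairs, lambda p: p[0] == 1, lambda p: p[1] == 1),  # TPP
            "P(C=1|Y=0)": frac(pairs, lambda p: p[0] == 1, lambda p: p[1] == 0),  # FPP
            "P(Y=1|C=1)": frac(pairs, lambda p: p[1] == 1, lambda p: p[0] == 1),  # PPV parity
            "P(Y=1|C=0)": frac(pairs, lambda p: p[1] == 1, lambda p: p[0] == 0),  # NPV parity
        }
    return rates
```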
Sun, 12 Nov 2017 00:00:00 +0000
http://smithamilli.com/blog/fairness-criteria/
Undergraduate Research Tips<p>There’s a lot of <a href="https://github.com/smilli/research-advice">research advice</a> out there, but not much focused specifically on
undergrads. So here I’ve tried outlining undergrad-focused tips.</p>
<ol>
<li>
<p><strong>Consider trying other options (e.g. a software internship) first.</strong> Some types of
research require more advanced knowledge than you will have as, say, a
freshman, so it can pay to try other options first. In particular, I recommend
becoming fast and effective at coding, so that coding isn’t a bottleneck in your
research later.</p>
</li>
<li>
<p><strong>How do you get research? Just ask.</strong> This is far simpler than
people think. You literally just ask people. Email professors or grad students
whose research you find interesting and tell them you’d like to get
involved. Sending an email takes five minutes at most, so why not try?</p>
</li>
<li><strong>Be selective in the research you choose.</strong> Because people think it’s harder to
find research opportunities than it actually is (#2), they tend to be more afraid
to turn down opportunities than they should be. Here are some criteria for
picking research:
<ol>
<li><em>Pick research you find interesting.</em> You won’t do well if you don’t find it
interesting.</li>
<li><em>Pick research that helps you figure out whether you like research.</em> You will not know a priori whether you like research. You should have a loose theory of
why you think you might like doing research and seek out experiences that help
evaluate your theory. For example, if like me, you think a big component of why
you would like research is having autonomy, then don’t do research where you’re
always told exactly what to do. This is another reason you should not do
research you’re not interested in (3a). If you do research on a topic you know
you’re not interested in, then when you find out that (unsurprisingly) you
didn’t like the research, you can’t tell if it’s because of general aspects of
research or because you found the topic boring.</li>
<li><em>Pick a mentor that communicates well and cares about your research
growth.</em> You
won’t know anything when you start (even if you think you do) and having a
mentor you work well with is essential to your success. You also ultimately want
to become a good researcher, so pick someone who cares about your research
growth and will spend time teaching you about the field and how to do research.
Don’t pick e.g. a grad student who just wants you to code something they didn’t
want to do themselves.</li>
<li><em>Pick research that will become publication quality.</em> If you eventually want to go to grad school, it’s important to publish. If a project looks like something no
one but your mentor cares about, you may want to avoid it.</li>
</ol>
</li>
<li>
<p><strong>Time management is the #1 reason you will not make progress.</strong> If you put less
than ten hours in per week, it’s unlikely you’ll make any progress in research.
As an undergrad, it can be easy to fall into the trap of putting off research
because you have a wave of midterms or something, but you need to make time. A
generic recommendation is spending ~15 hours a week on research. I probably
spent more like 20-30 hours. Also consider spending a summer doing research; you
can get a lot more done when research is your only focus.</p>
</li>
<li><strong>To get into grad school prioritize (publishable) research over classes.</strong> Grad
school is very competitive. For example, only ~3% of applicants get into the top
AI/ML programs and this is only going to get worse in the future. Whenever you
have a situation where only the top x% (where x is small) get a prize and
everyone else gets nothing, you should focus on the important and high variance
properties that distinguish you from the rest. In this case, what you should not
focus on is your grades. Everyone gets high grades, so it’s low variance and not
useful in picking who gets admitted. What actually provides information to
distinguish people is their research. The problem is that you can’t get bad grades
(below an A average) either, because that’s a red flag. So the trick is to <strong>take fewer
classes and use your extra time to do more research.</strong></li>
</ol>
Sun, 12 Mar 2017 04:38:00 +0000
http://smithamilli.com/blog/undergrad-tips/
Bounded Optimality<p>A friend recently asked me why I find bounded optimality interesting. Here’s
why:</p>
<ol>
<li>
<p>We need a normative framework for how agents should act under
computational pressure because that is what the real world demands. In the
real world, an agent should know not to deliberate long when it is about to get
hit by a car, but should certainly compute more before declaring
war. (This is related to our work on <a href="http://smithamilli.com/pubs/2017aaai_meta.pdf">bounded-optimal
metareasoning</a>!)
See <a href="https://people.eecs.berkeley.edu/~russell/papers/ptai13-intelligence.pdf">Rationality and
Intelligence: A Brief Update</a>
(Russell 2014) for more on this point.</p>
</li>
<li>
<p>It’s an elegant framework because it provides <a href="http://gershmanlab.webfactional.com/pubs/GershmanHorvitzTenenbaum15.pdf">“a converging paradigm
for intelligence in brains, minds, and
machines”</a>
that may allow for more transfer of insights between cognitive science and
artificial intelligence (Gershman, Horvitz, and Tenenbaum 2015). For example, understanding the bounded-optimal
solutions that humans use may be useful for creating better
approximation strategies for artificial agents to use under computational pressure. On the other hand,
when we have an optimal solution that an artificial agent can implement, we can
then ask what the bounded-optimal solution for the real-world environment that
humans live in would be and see if that is the kind of behavior that humans
showcase.</p>
</li>
<li>
<p>It provides an appealing way to <a href="https://cocosci.berkeley.edu/tom/papers/RationalUseOfCognitiveResources.pdf">bridge the gap between Marr’s computational and
algorithmic
level</a>
(Griffiths, Lieder, and Goodman 2014). As mentioned in (1)
bounded-optimality is necessary as a normative framework for artificial agents
because the costs of computation are an important factor to decision-making in
real-world environments. But humans also have costs to computation that arise
from intrinsic biological bounds, rather than the environment, so an
interesting question is what are the fundamental limitations on human
intelligence that arise from this and how close can human bounded-optimality
ever get to AI bounded-optimality? Being able to take into account computational
constraints at varying levels of abstraction may be useful for progress on this
question.</p>
</li>
<li>
<p>It can potentially serve as a more
principled, versatile way to predict when people will use different types of
approximations, which is useful for many applications: for example, understanding the
circumstances under which an employee’s decisions are likely to be less
accurate than usual, or helping artificial agents decide how much to trust
your “expert knowledge”.</p>
</li>
</ol>
<p>I’m curious what other reasons people have for finding bounded optimality
interesting, so please let me know. :)</p>
Sun, 18 Sep 2016 12:00:00 +0000
http://smithamilli.com/blog/bounded-optimality/
Kneser-Ney Smoothing<p><a href="https://en.wikipedia.org/wiki/Language_model">Language modeling</a> is important for almost all natural language processing tasks:
speech recognition, spelling correction, machine translation, etc. Today I’ll
go over Kneser-Ney smoothing, a historically important technique for language
model smoothing.</p>
<h2>Language Models</h2>
<p>A language model estimates the probability of an n-gram from a training corpus.
The simplest way to get a probability distribution over n-grams from a corpus is
to use the <a href="https://en.wikipedia.org/wiki/Maximum_likelihood">MLE</a>.
That is, the probability of an n-gram <script type="math/tex">(w_{1}, \dots, w_{n})</script> is simply the number of times
it appears divided by the total number of n-grams in the corpus. Usually we’re interested in the
conditional probability of the last word, given the context of the previous (n-1)
words:</p>
<p><script type="math/tex">P(w_n | w_1, \dots, w_{n - 1}) = \frac{C(w_1, \dots , w_n
)}{\sum_{w' \in L}C(w_1, \dots , w')}</script>
where C(x) is the number of times that x appears and L is the set of all possible
words.</p>
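As a quick illustration (my own toy example, not from the post), here is the MLE for a bigram model:

```python
from collections import Counter

# Toy corpus; the bigram MLE is C(context, word) / C(context, any word).
corpus = "the cat sat on the mat the cat ran".split()
bigrams = Counter(zip(corpus, corpus[1:]))

def p_mle(word, context):
    """MLE conditional probability of `word` given a one-word context."""
    total = sum(n for (w1, _), n in bigrams.items() if w1 == context)
    return bigrams[(context, word)] / total if total else 0.0
```

Note that an unseen bigram like (‘the’, ‘ran’) gets probability zero, which is exactly the problem described next.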
<p>The problem with the MLE arises when the n-gram you want a probability
for was not seen in the data; in these cases the MLE simply assigns such
sequences a probability of zero. This is an inevitable problem for language
tasks because, no matter how large your corpus is, it cannot contain
every possible n-gram of the language.</p>
<p>(About a
month ago I also wrote about how to use a <a href="/blog/anagrams">trigram character model</a> to generate pronounceable
anagrams. Can you see why smoothing was unnecessary for a character model?)</p>
<h2>Kneser-Ney Smoothing</h2>
<p>The solution is to “smooth” the language models to move some probability towards
unknown n-grams. There are many ways to do this, but the method with the <a href="https://en.wikipedia.org/wiki/Perplexity">best
performance</a> is interpolated modified <a href="https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing">Kneser-Ney
smoothing</a>. I’ll explain the intuition behind Kneser-Ney in three parts:</p>
<h3>Absolute-Discounting</h3>
<p>To retain a valid probability distribution (i.e. one that sums to one) we must
remove some probability mass from the MLE to use for n-grams that were not seen
in the corpus. Absolute discounting does this by subtracting a fixed discount D (typically between 0 and 1) from each observed n-gram count. The adjusted count of an
n-gram is <script type="math/tex">A(w_{1}, \dots, w_{n}) = \max(C(w_{1}, \dots, w_{n}) - D, 0)</script>.</p>
<h3>Interpolation</h3>
<p>After we’ve assured that we have probability mass to use for unknown n-grams,
now we still need to figure out how to actually estimate the probability of unknown n-grams.</p>
<p>A clever way to do this is to use lower-order models. Suppose your language model estimates the probabilities of trigrams. When you come across an unknown
trigram, e.g. (‘orange’, ‘cat’, ‘pounced’), although the trigram may be
unknown, the bigram suffix, (‘cat’, ‘pounced’), may be present in the corpus.
So, when creating a language model, we don’t merely calculate the probabilities
of all N-grams, where N is the highest order of the language model; we estimate
probabilities for all k-grams where <script type="math/tex">k \in \{1, \dots, N\}</script>.</p>
<p>Interpolation recursively combines probabilities of all lower-order models to
get the probability of an n-gram:</p>
<script type="math/tex; mode=display">P_{s}(w_{n} | w_{i}, \dots, w_{n - 1}) = \frac{C(w_i, \dots , w_n
) - D}{\sum_{w' \in L}C(w_i, \dots , w')} + \gamma(w_{i}, \dots, w_{n -
1})P_{s}(w_{n} | w_{i + 1} \dots, w_{n - 1})</script>
<p>The recursion stops at the unigram model:
<script type="math/tex">P_{s}(w) = \frac{C(w)}{\sum_{w' \in L} C(w')}</script></p>
<p><script type="math/tex">\gamma(w_{i}, \dots, w_{n -1})</script> is known as the back-off weight. It is simply
the amount of probability mass we set aside for the next lower-order model.</p>
<script type="math/tex; mode=display">\gamma(w_{i}, \dots, w_{n -1}) = \frac{D \cdot |\{(w_{i}, \dots, w_{n -1},
w') : C(w_{i}, \dots, w_{n -1}, w') > 0 \}| }{\sum_{w' \in L}C(w_i, \dots , w')}</script>
<p>After interpolating the probabilities, if a sequence has any k-gram suffix
present in the corpus, it will have a non-zero probability.</p>
<p>It’s also easier to see why absolute discounting works so well now. Notice how
the fewer words there are that follow the context (the sequence of words we’re
conditioning on), the lower the associated back-off weight for that context is.
This makes sense: if only a few words follow a given
context, it’s less likely that a new word following that context is valid.</p>
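Putting absolute discounting and interpolation together for a bigram model, a minimal sketch might look like this (my own code; I assume D = 0.75 and stop the recursion at a plain MLE unigram, rather than the continuation-count unigram introduced in the next section):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
D = 0.75  # assumed fixed discount

def p_unigram(word):
    return unigrams[word] / sum(unigrams.values())

def p_interp(word, context):
    """Absolute-discounted bigram probability, interpolated with unigrams."""
    total = sum(n for (w1, _), n in bigrams.items() if w1 == context)
    if total == 0:                      # unseen context: back off entirely
        return p_unigram(word)
    discounted = max(bigrams[(context, word)] - D, 0) / total
    # back-off weight: D times the number of distinct continuations of context
    n_types = sum(1 for (w1, _) in bigrams if w1 == context)
    gamma = D * n_types / total
    return discounted + gamma * p_unigram(word)
```

Because the back-off weight is exactly the mass removed by discounting, the probabilities for any context still sum to one over the vocabulary.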
<h3>Word Histories</h3>
<p>This is the part that is actually due to Kneser and Ney. When predicting
the probability of a word given a context, we want to take into account
not only the current context but also the number of contexts the word appears in. Remember how absolute discounting works well because, if only a few words come
after a context, a novel word in that context should be less likely? It also works the other way: if a word appears after only a small number of contexts, it should be less likely to appear in a novel context.</p>
<p>The quintessential example is ‘San Francisco’. ‘Francisco’ alone may have a high
count in a corpus, but it should never be predicted unless it follows ‘San’.
This is the motivation for replacing the MLE unigram probability with the
‘continuation probability’ that estimates how likely the unigram is to continue
a new context.</p>
<p>Let <script type="math/tex">N_{1+}(\bullet w_{1}, \dots, w_{k}) =
|\{(w', w_{1}, \dots, w_{k}) : C(w', w_{1}, \dots, w_{k}) > 0 \}|</script></p>
<script type="math/tex; mode=display">P_{KN}(w) = \frac{N_{1+}(\bullet w)}{N_{1+}(\bullet \bullet)}</script>
<p>The unigram Kneser-Ney probability is the number of distinct words the unigram
follows divided by the number of distinct bigram types. The Kneser-Ney unigram probability can be
extended to k-grams, for k = 1, …, N − 1, as follows:</p>
<script type="math/tex; mode=display">P_{KN}(w_{n} | w_{i}, \dots, w_{n - 1}) = \frac{N_{1+}(\bullet w_{i}, \dots,
w_{n}) - D}{N_{1+}(\bullet w_{i}, \dots, w_{n - 1}, \bullet)} +
\lambda(w_{i}, \dots, w_{n - 1})P_{KN}(w_{n} | w_{i + 1}, \dots, w_{n - 1})</script>
<p>Note that the above equation does NOT apply to the highest order; we have no data
on the ‘word histories’ for the highest order N-grams. When i = 1 in the above
equation, we instead use normal counts discounted and interpolated with the remaining Kneser-Ney probabilities:</p>
<script type="math/tex; mode=display">P_{KN}(w_{n} | w_{1}, \dots, w_{n - 1}) = \frac{C(w_{1}, \dots,
w_{n}) - D}{\sum_{w' \in L} C(w_{1}, \dots, w_{n - 1}, w')} +
\lambda(w_{1}, \dots, w_{n - 1})P_{KN}(w_{n} | w_{2}, \dots, w_{n - 1})</script>
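The continuation probability is easy to compute from bigram types. Here is a sketch on a toy token list (my own example, treating the list as one flat sequence): ‘francisco’ only ever follows ‘san’, so it gets little continuation mass even relative to its unigram count.

```python
# Toy token list; only DISTINCT bigrams matter for continuation counts.
corpus = "san francisco is in california san jose is in california".split()
bigram_types = set(zip(corpus, corpus[1:]))

def p_continuation(word):
    """Number of distinct words preceding `word`, over all bigram types."""
    return sum(1 for (_, w2) in bigram_types if w2 == word) / len(bigram_types)
```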
<p>Side note:
in reality, there are normally three different discount values, <script type="math/tex">D_{k, 1}</script>,
<script type="math/tex">D_{k, 2}</script>,
and <script type="math/tex">D_{k, 3+}</script>, computed for each k-gram order k = 1, …, N. <script type="math/tex">D_{k, i}</script>
is used if <script type="math/tex">C(w_{N - k + 1}, \dots, w_{N}) = i</script>. The closed-form estimate of the optimal discounts (see <a href="http://www.cs.berkeley.edu/~klein/cs294-5/chen_goodman.pdf">Chen &amp;
Goodman</a>) is</p>
<script type="math/tex; mode=display">D_{k, i} = i - (i + 1)Y_{k}\frac{N_{k, i + 1}}{N_{k, i}}</script>
<p>where <script type="math/tex">Y_{k} = \frac{N_{k, 1}}{N_{k, 1} + 2N_{k, 2}}</script>. If k = N,
<script type="math/tex">N_{k, i} = |\{(w_{N - k + 1}, \dots, w_{N}) : C(w_{N - k + 1}, \dots, w_{N}) =
i\}|</script>. Otherwise, <script type="math/tex">N_{k, i} = |\{(w_{N - k + 1}, \dots, w_{N}) : N_{1+}(\bullet w_{N - k + 1}, \dots, w_{N}) = i\}|</script>.</p>
<p>The use of multiple discount values is what the ‘modified’ in ‘modified
Kneser-Ney smoothing’ refers to.</p>
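The closed-form estimates are simple to compute once you have count-of-counts. A sketch (mine), where n1 through n4 are the numbers of k-grams of a given order appearing exactly one through four times:

```python
def estimate_discounts(n1, n2, n3, n4):
    """Chen & Goodman closed-form discounts D_{k,1}, D_{k,2}, D_{k,3+},
    from D_{k,i} = i - (i + 1) * Y_k * N_{k,i+1} / N_{k,i}."""
    Y = n1 / (n1 + 2 * n2)
    d1 = 1 - 2 * Y * n2 / n1
    d2 = 2 - 3 * Y * n3 / n2
    d3 = 3 - 4 * Y * n4 / n3
    return d1, d2, d3
```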
<h2>Language Modeling Toolkits</h2>
<p>How do you actually create a Kneser-Ney language model? I put a pretty
bare-bones, unoptimized <a href="https://github.com/smilli/kneser-ney">implementation of Kneser-Ney smoothing on Github</a> in the hopes that it
would be easy to learn from / use for small datasets.</p>
<p>But there exist several free and open-source language modeling toolkits that are
much more optimized for memory/performance. I recommend <a href="https://kheafield.com/code/kenlm/">KenLM</a>.
It’s written in C++, but there’s <a href="http://victor.chahuneau.fr/notes/2012/07/03/kenlm.html">also an article on how
to use KenLM in Python</a>. Others include <a href="https://code.google.com/p/berkeleylm/">BerkeleyLM</a>, <a href="http://www.speech.sri.com/projects/srilm/">SRILM</a>, and <a href="https://code.google.com/p/mitlm/">MITLM</a>.</p>
<h2>Further Reading</h2>
<ul>
<li>NLP Courses
<ul>
<li><a href="https://class.coursera.org/nlp/lecture">Stanford NLP
Coursera</a></li>
<li><a href="http://www.cs.columbia.edu/~cs4705/">Columbia's NLP class</a>: Michael Collins' lecture notes are
really good.</li>
</ul>
</li>
<li>Smoothing
<ul>
<li><a href="http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf">
Stanford NLP Smoothing Tutorial</a>: Easy explanations of
different smoothing techniques.</li>
<li><a href="http://www.cs.berkeley.edu/~klein/cs294-5/chen_goodman.pdf">An
Empirical Study of Smoothing Techniques for Language Modeling
</a>: Compares performance of different smoothing techniques.</li>
<li><a href="http://www.aclweb.org/anthology/D07-1090.pdf">
Stupid Backoff
</a>: An extremely simplistic type of smoothing that does as
well as Kneser-Ney smoothing for very large datasets.</li>
</ul>
</li>
<li>Language Model Estimation
<ul>
<li><a href="https://kheafield.com/professional/edinburgh/estimate_paper.pdf">Scalable
Modified Kneser-Ney Language Model Estimation</a>: This is the paper
that explains how KenLM does language model estimation. Section
three, "Estimation Pipeline", is really helpful.</li>
<li><a href="http://nlp.cs.berkeley.edu/pubs/Pauls-Klein_2011_LM_paper.pdf">Faster
and Smaller N-Gram Language Models</a>: BerkeleyLM paper</li>
<li><a href="http://www.aclweb.org/anthology/W09-1505">Tightly Packed
Tries: How to Fit Large Models into Memory,
and Make them Load Fast, Too</a></li>
</ul>
</li>
</ul>
Tue, 30 Jun 2015 16:00:00 +0000
http://smithamilli.com/blog/kneser-ney/