Language modeling is important for almost all natural language processing tasks: speech recognition, spelling correction, machine translation, etc. Today I’ll go over Kneser-Ney smoothing, a historically important technique for language model smoothing.

## Language Models

A language model estimates the probability of an n-gram from a training corpus. The simplest way to get a probability distribution over n-grams from a corpus is to use the MLE. That is, the probability of an n-gram \((w_{1}, \dots, w_{n})\) is simply the number of times it appears divided by the number of n-grams. Usually we’re interested in the conditional probability of the last word, given the context of the last (n-1) words:

\(P(w_n | w_1, \dots, w_{n - 1}) = \frac{C(w_1, \dots , w_n )}{\sum_{w' \in L}C(w_1, \dots , w')}\) where C(x) is the number of times that x appears and L is the set of all possible words.

The problem with the MLE arises when the n-gram you want a probability for was not seen in the data; in these cases the MLE will simply assign a probability of zero to the sequences. This is an inevitable problem for language tasks because no matter how large your corpus is it’s impossible for it to contain all possibilities of n-grams from the language.

(About a month ago I also wrote about how to use a trigram character model to generate pronounceable anagrams. Can you see why smoothing was unnecessary for a character model?)

## Kneser-Ney Smoothing

The solution is to “smooth” the language models to move some probability towards unknown n-grams. There are many ways to do this, but the method with the best performance is interpolated modified Kneser-Ney smoothing. I’ll explain the intuition behind Kneser-Ney in three parts:

### Absolute-Discounting

To retain a valid probability distribution (i.e. one that sums to one) we must remove some probability mass from the MLE to use for n-grams that were not seen in the corpus. Absolute discounting does this by subtracting a fixed number D from all n-gram counts. The adjusted count of an n-gram is \(A(w_{1}, \dots, w_{n}) = C(w_{1}, \dots, w_{n}) - D\).

### Interpolation

After we’ve assured that we have probability mass to use for unknown n-grams, now we still need to figure out how to actually estimate the probability of unknown n-grams.

A clever way to do this is to use lower order models. Suppose your language model estimates the probabilities of trigrams. When you come across an unknown trigram, e.g. (‘orange’, ‘cat’, ‘pounced’), although the trigram may be unknown, the bigram suffix, (‘cat’, ‘pounced’), may be present in the corpus. So, when creating a language model, we don’t merely calculate the probabilities of all N-grams, where N is the highest order of the language model, we estimate probabilities for all k-grams where \(k \in {1, \dots, N}\).

Interpolation recursively combines probabilities of all lower-order models to get the probability of an n-gram:

\[P_{s}(w_{n} | w_{i}, \dots, w_{n - 1}) = \frac{C(w_i, \dots , w_n ) - D}{\sum_{w' \in L}C(w_i, \dots , w')} + \gamma(w_{i}, \dots, w_{n - 1})P_{s}(w_{n} | w_{i + 1} \dots, w_{n - 1})\]The recursion stops at the unigram model: \(P_{s}(w) = \frac{C(w)}{\sum_{w' \in L} C(w')}\)

\(\gamma(w_{i}, \dots, w_{n -1})\) is known as the back-off weight. It is simply the amount of probability mass we left for the next lower order model.

\[\gamma(w_{i}, \dots, w_{n -1}) = \frac{D \cdot |\{(w_{i}, \dots, w_{n -1}, w') : C(w_{i}, \dots, w_{n -1}, w') > 0 \}| }{\sum_{w' \in L}C(w_i, \dots , w')}\]After interpolating the probabilities, if a sequence has any k-gram suffix present in the corpus, it will have a non-zero probability.

It’s also easier to see why absolute discounting works so well now. Notice how the fewer words there are that follow the context (the sequence of words we’re conditioning on), the lower the associated back-off weight for that context is. This makes sense since if there are only a few words that follow a given contexts, it’s less likely that a new word following the context is valid.

### Word Histories

This is the part that is actually attributed to Kneser & Ney. When predicting the probability of a word given a context, we not only want to take into account the current context, but the number of contexts that the word appears in. Remember how absolute discounting works well because if there are only a few words that come after a context, a novel word in that context should be less likely? It also works the other way. If a word appears after a small number of contexts, then it should be less likely to appear in a novel context.

The quintessential example is ‘San Francisco’. Francisco alone may have a high count in a corpus, but it should never be predicted unless it follows ‘San’. This is the motivation for replacing the MLE unigram probability with the ‘continuation probability’ that estimates how likely the unigram is to continue a new context.

Let \(N_{1+}(\bullet w_{1}, \dots, w_{k}) = |\{(w', w_{1}, \dots, w_{k}) : C(w', w_{1}, \dots, w_{k}) > 0 \}|\)

\[P_{KN}(w) = \frac{N_{1+}(\bullet w)}{N_{1+}(\bullet \bullet)}\]The unigram Kneser-Ney probability is the number of unique words the unigram follows divided by all bigrams. The Kneser-Ney unigram probability can be extended to k-grams, where 1 <= k < N, as such:

\[P_{KN}(w_{n} | w_{i}, \dots, w_{n - 1}) = \frac{N_{1+}(\bullet w_{i}, \dots, w_{n}) - D}{N_{1+}(\bullet w_{i}, \dots, w_{n - 1}, \bullet)} + \lambda(w_{i}, \dots, w_{n - 1})P_{KN}(w_{n} | w_{i + 1}, \dots, w_{n - 1})\]Note that the above equation does NOT apply to the highest order; we have no data on the ‘word histories’ for the highest order N-grams. When i = 1 in the above equation, we instead use normal counts discounted and interpolated with the remaining Kneser-Ney probabilities:

\[P_{KN}(w_{n} | w_{1}, \dots, w_{n - 1}) = \frac{C(w_{1}, \dots, w_{n}) - D}{\sum_{w' \in L} C(w_{1}, \dots, w_{n - 1}, w')} + \lambda(w_{1}, \dots, w_{n - 1})P_{KN}(w_{n} | w_{2}, \dots, w_{n - 1})\]Side note: In reality, there are normally three different discount values, \(D_{k, 1}\), \(D_{k, 2}\), and \(D_{k, 3+}\), computed for each k-gram order (1 <= k <= N). \(D_{k, i}\) is used if \(C(w_{N - k + 1}, \dots, w_{N}) = i\). The closed-form estimate for the optimal discounts (see Chen & Goodman) is

\[D_{k, i} = i - (i + 1)Y_{k}\frac{N_{k, i + 1}}{N_{k, i}}\]where \(Y_{k} = \frac{N_{k, 1}}{N_{k, 1} + 2N_{k, 2}}\). If k = n, \(N_{k, i} = |\{w_{N - k + 1}, \dots, w_{N} : C(w_{N - k + 1}, \dots, w_{N}) = i\}|\) Otherwise, \(N_{k, i} = |\{w_{N - k + 1}, \dots, w_{N} : N_{1+}(\bullet w_{N - k + 1}, \dots, w_{N}) = i\}|\)

The use of multiple discount values is what the ‘modified’ part of ‘modified’ Kneser-Ney smoothing is.

## Language Modeling Toolkits

How do you actually create a Kneser-Ney language model? I put a pretty bare-bones, unoptimized implementation of Kneser-Ney smoothing on Github in the hopes that it would be easy to learn from / use for small datasets.

But there exist several free and open-source language modeling toolkits that are much more optimized for memory/performance. I recommend KenLM. It’s written in c++, but there’s also an article on how to use KenLM in Python. Others include BerkeleyLM, SRILM, MITLM.

## Further Reading

- NLP Courses
- Stanford NLP Coursera
- Columbia's NLP class: Michael Collins' lecture notes are really good.

- Smoothing
- Stanford NLP Smoothing Tutorial: Easy explanations of different smoothing techniques.
- An Empirical Study of Smoothing Techniques for Language Modeling : Compares performance of different smoothing techniques.
- Stupid Backoff : An extremely simplistic type of smoothing that does as well as Kneser-Ney smoothing for very large datasets.

- Language Model Estimation
- Scalable Modified Kneser-Ney Language Model Estimation: This is the paper that explains how KenLM does language model estimation. Section three, "Estimation Pipeline", is really helpful.
- Faster and Smaller N-Gram Language Models: BerkeleyLM paper
- Tightly Packed Tries: How to Fit Large Models into Memory, and Make them Load Fast, Too