Demo, Source

Remember when long ago you were amazed by a plot twist when a character in a book or video game or movie had a name that was an anagram of something else? You know, when you realize that the main villain, “Ivel”, is an anagram of “Evil”?!?! #Shock

When I was younger, I thought hiding secret messages in usernames sounded super cool. One of the ways of hiding a message in my username was to make it an anagram of something else.

For the first time in a long time, I was registering an account on a forum (online anonymous discussion? gasp!) and was reminded of how cool I used to think secret anagrams were.

Instead of just thinking of anagrams myself, this time I thought it would be fun to create an anagram generator. But, I didn’t want junk that was unpronounceable, so I had to filter out sequences of characters that had very low probability.

The probability of a sequence of characters is \[P(c_1, \ldots, c_m) = \prod_{i=1}^{m} P(c_i | c_1, \ldots, c_{i-1}).\] A trigram character model estimates the conditional probability of a character given its predecessors by only taking into account the last two characters. In other words, we estimate \(P(c_i | c_1, \ldots, c_{i-1})\) as \(P(c_i | c_{i-2}, c_{i-1})\).
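For example, under this approximation the probability of the word “boat” factors as \[P(\text{boat}) \approx P(b)\,P(o | b)\,P(a | b, o)\,P(t | o, a).\]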

Likewise, in a general n-gram model, we would predict the conditional probability of a character by using the last (n - 1) characters. I tried different values of n, but found that a trigram model seemed to work best.

I trained the model on text from the Gutenberg corpus, simply because it’s easily accessible through NLTK. To avoid acronyms, I restricted the training words to those made up only of letters, with at most one uppercase letter at the beginning of the word.

def valid_word(word):
    """Letters only, and lowercase except possibly the first character."""
    return word.isalpha() and word[1:].islower()
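To make the filter concrete, here's what it keeps and drops (the sample words are arbitrary):

print([w for w in ['Boat', 'boat', 'NLTK', "isn't", 'x2'] if valid_word(w)])
# ['Boat', 'boat']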

We get all trigrams from the valid words and say that \[P(c_3 | c_2, c_1) = \frac{C(c_1, c_2, c_3)}{\sum_{c \in L}C(c_1, c_2, c)}\] where C(x) is the number of times that x appears and L is the set of all possible characters.

When creating language models, it often makes implementation easier to pad your tokens. What I mean by this is that instead of finding the character trigrams of “boat”, which would just be “(‘b’, ‘o’, ‘a’), (‘o’, ‘a’, ‘t’)”, we pad the left and right of “boat” with n - 1 dummy symbols, where n is the order of your model.

So in a trigram model, n=3, we pad the left and right by 2 dummy symbols. Here I’ve set the dummy symbol to be an underscore. So we find the trigrams of “__boat__”, which are “(‘_’, ‘_’, ‘b’), (‘_’, ‘b’, ‘o’), (‘b’, ‘o’, ‘a’), (‘o’, ‘a’, ‘t’), (‘a’, ‘t’, ‘_’), (‘t’, ‘_’, ‘_’)”. This way we can easily calculate the probability of the first character as \(P(b | \_, \_)\).
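NLTK will do the padding for you; here's a quick check with a recent NLTK (where the pad symbols are passed as left_pad_symbol and right_pad_symbol):

from nltk.util import ngrams

# Character trigrams of "boat", padded with '_' on both sides
print(list(ngrams("boat", 3, pad_left=True, pad_right=True,
                  left_pad_symbol='_', right_pad_symbol='_')))
# [('_', '_', 'b'), ('_', 'b', 'o'), ('b', 'o', 'a'),
#  ('o', 'a', 't'), ('a', 't', '_'), ('t', '_', '_')]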

import math
from collections import defaultdict
from itertools import chain

from nltk.corpus import gutenberg
from nltk.util import ngrams

def generate_ngram_model(n=3):
    """Generate an ngram character model"""
    # Character ngrams from every valid word, padded with '_' on both sides
    char_ngrams = chain.from_iterable(
        ngrams(w.lower(), n, pad_left=True, pad_right=True,
               left_pad_symbol='_', right_pad_symbol='_')
        for w in gutenberg.words() if valid_word(w))
    # Count how often each character follows each (n - 1)-character context
    cond_freq_dist = defaultdict(lambda: defaultdict(int))
    for ngram in char_ngrams:
        cond_freq_dist[''.join(ngram[:-1])][ngram[-1]] += 1
    # Normalize the counts into (log) conditional probabilities
    cond_prob_dist = defaultdict(dict)
    for cond, char_freqs in cond_freq_dist.items():
        total_freq = sum(char_freqs.values())
        for char, freq in char_freqs.items():
            cond_prob_dist[cond][char] = math.log(freq / total_freq)
    return cond_prob_dist
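Once the model is built, transitions can be looked up directly; a quick check (the stored values are log probabilities, hence the exp below):

model = generate_ngram_model()
# Log probability of 'a' following the context 'bo', e.g. in "boat"
print(model['bo']['a'])
# Probability that a word starts with 'b', i.e. P(b | _, _)
print(math.exp(model['__']['b']))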

You’ll notice that I’m actually saving the log probabilities; this is to avoid underflow issues. Instead of calculating \(\prod_i P(c_i | c_{i-2}, c_{i-1})\), which will be an extremely small number, we calculate \(\sum_i \log P(c_i | c_{i-2}, c_{i-1})\).
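Scoring a candidate string then amounts to summing log probabilities over its padded trigrams. Here's a minimal sketch (the helper name word_log_prob is just for illustration, and unseen transitions are simply treated as impossible):

def word_log_prob(word, model, n=3):
    """Sum of trigram log probabilities for a padded word."""
    padded = '_' * (n - 1) + word.lower() + '_' * (n - 1)
    total = 0.0
    for i in range(len(padded) - n + 1):
        context, char = padded[i:i + n - 1], padded[i + n - 1]
        # Unseen transitions get probability zero, i.e. log probability -inf
        total += model.get(context, {}).get(char, float('-inf'))
    return total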

After that I dumped the conditional probability distribution for the model into a JSON file. I wrote some JavaScript to generate all possible anagrams in which every character's transition probability was at least 0.0035. (With the exception of the first letter, which I allowed to be anything, to avoid never picking letters like ‘z’.)
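The actual generator is JavaScript, but the idea is easy to sketch in Python against the model above (a rough illustration, not the real code; anagrams and MIN_LOG_PROB are just names for the sketch):

import math

MIN_LOG_PROB = math.log(0.0035)  # threshold on each transition probability

def anagrams(letters, model, prefix='__', chosen=''):
    """Yield orderings of the letters where every transition clears the threshold."""
    if not letters:
        yield chosen
        return
    for c in set(letters):
        log_p = model.get(prefix[-2:], {}).get(c, float('-inf'))
        # The first letter is exempt from the threshold, as described above
        if chosen and log_p < MIN_LOG_PROB:
            continue
        i = letters.index(c)
        yield from anagrams(letters[:i] + letters[i + 1:], model,
                            prefix + c, chosen + c)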

Unfortunately, generating all possible anagrams takes factorial time in the number of characters. But if you want, you can also generate anagrams as a Markov chain based on the trigram character model, and just keep trying until you get one that works.
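Here's a rough Python sketch of that sampling idea (again just an illustration; sample_anagram and max_tries are names made up for the sketch): pick each next letter from the remaining pool with probability proportional to the model's transition probabilities, and restart whenever you hit a dead end.

import math
import random

def sample_anagram(letters, model, max_tries=1000):
    """Sample letter orderings from the trigram model until one completes."""
    for _ in range(max_tries):
        pool, prefix, out = list(letters), '__', []
        while pool:
            weights = [math.exp(model.get(prefix[-2:], {}).get(c, float('-inf')))
                       for c in pool]
            if sum(weights) == 0:  # no seen transition from here: restart
                break
            c = random.choices(pool, weights=weights)[0]
            pool.remove(c)
            prefix += c
            out.append(c)
        if not pool:
            return ''.join(out)
    return None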

You can take a look at all of the remaining code here. The JavaScript is slightly more complicated than what I’ve outlined because I extended the anagrams to allow for multiple words, and due to the running time, I also added a timeout parameter to prevent stalling for too long.

And of course, play with the generator yourself! It’s not perfect, but it works decently well.