Nov 30, 2012 3 min read

Javascript search, a couple of words about the Trie structures and the Levenshtein distance

You have to love science! No, really, you gotta love it!

I was working on a search module and I was looking for faster ways to do the search and make a "as-you-type" autocomplete something feasible, as well as adding some similar results in case the original term was not found. I'm very proud to say that I came up with something similar to the Trie structures on my own, though I have to admit, not as proudly, that it wasn't nearly as efficient.

A Trie structure is a very neat way to store dictionary/index entries, invented by Edward Fredkin and you can read more about it on Wikipedia. I've barely started my Trie quest so I'm not qualified to talk about it, but the linked Wikipedia entry is very helpful. And just what is so great about the Trie? Think about the following scenario:

You have a list of keywords, a couple of thousands, and you want to let the user search for a keyword. If the keyword is found, you return the matching entry. Something along these lines, right?

var dictionary = [];
dictionary['aa'] = 'a value';
dictionary['ab'] = 'another value';

If the user searches for aa, all you have to do is check if the aa entry exists.

var mySearchedWord = 'aa';
if (dictionary[mySearchedWord]) {
	alert(dictionary[mySearchedWord]);
} else {
	alert('no results found');
}

This is probably the fastest way to implement a simple search. But what if the user made a typo and he actually wanted to search for aa? After all, aa, ab... they are very much alike. How can you implement a system which will know to search for ab when aa is not found, or suggest to the user ab as an alternative?

If aa and ab are to abstract for you, I'll give you another example. If you search for code, wouldn't you like the entries for codes or coding too? They are related, no? Or if you search for porgram, wouldn't you say there's a chance you actually searched for program and made a typo?

The key to detect similar words is something called the Levenshtein distance. Basically, it represents the number of edits required to get the word B from the word A. Example: the levenshtein distance between string and stang is 2.

from string to sting (removed the first r)
from sting to stang (changed i to a)

A total of two edits makes a Levenshtein distance of 2.

To realize just how powerful this search is, look at these numbers: on this page you can find a txt file containing 109.584 english words. It took my Python script close to 8 seconds to parse that simple words list and convert it to a javascript dictionary in the following format:

dictionary[0] = "a";
dictionary[1] = "aah";
dictionary[2] = "aahed";
dictionary[3] = "aahing";
dictionary[4] = "aahs";
dictionary[5] = "aardvark";
dictionary[6] = "aardvarks";
dictionary[7] = "aardwolf";

Then, in half a second (524 ms) I was able to generate a Trie object.

With that Trie, I was able to do this:

Of course, a more sane search for similar words, with a Levenshtein distance of 1 for word will take 4 ms and give you the next results:

cord
ford
lord
sword
ward
woad
wold
wood
word
words
wordy
wore
work
world
worm
worn
wort

I don't know about you, but I find this amazing! If I can do THIS in less than a second, with a dictionary of over 100.000 words, it will run incredibly smooth for my list of only a couple of hundreds or thousand words.

If you want to take a look behind the hood yourselves, you can check out my Trie implementation, Brie-JS, on my Github page.

Let me know what you think!

Popular tags