Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
The Levenshtein distance in production (2020) (vishnubharathi.codes)
83 points by williamsmj on June 7, 2021 | hide | past | favorite | 24 comments


> Problem statement: the Levenshtein distance is a string metric for measuring the difference between two sequences

Another variant is "I have a bunch of words (a dictionary) and one query word, and want to find all words from the dictionary that are close to the query word".

This leads to another interesting class of problems, because you can do clever things where you precompute search structures (e.g. Levenshtein automata [0], [1]) from the dictionary. The similarity queries then run (much) faster. In production, performance matters.

We recently merged a PR like that into Gensim [2].

This gave a ~1,500x speed-up compared to naively comparing all pairwise strings with Levenshtein distance. A difference between the training step running for months (=unusable) and minutes.

[0] http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levensht...

[1] Mihov, Stoyan & Schulz, Klaus. (2004). Fast Approximate Search in Large Dictionaries: https://www.aclweb.org/anthology/J04-4003.pdf

[2] https://github.com/RaRe-Technologies/gensim/pull/3146


Yea pairwise string matching for problems like record linkage is O(n^2) and is really infeasible for larger data.

I once had to deal with a problem like this (matching business entity names against a large database) and ended up preprocessing the names into vectors of trigram (3-letter groups) counts, performing TF-IDF on the trigrams and then taking cosine similarity (which is simply matrix multiplication which though O(mnp) is much faster)

The new solution also gave speedups in the 1000-2000x ballpark and had some nice properties:

1. Inverse document frequency naturally filtered out less important "mispellings" (e.g. PTE LTD vs PRIVATE LIMITED) 2. Trigrams maintain order within words but are less concerned about order between words, which happened to matter less in names


That's similar to what I do on our translation memory at the Commission. The issue we have is that we search for sentences, not words and the medium length of sentences in the database is around 120 characters and we have around 1.5 billion sentences in the database. A pairwise Levenshtein would be completely impossible as added to that we have to take care of replaceables in the segments (dates, numbers, months, weekdays, etc). To accelerate the search we use fuzzy keys which are trigram counts based and have them organized in a ternary tree. For the fuzzy distance, a simple difference calculation between 2 fuzzy keys is close enough to Levenhstein distance that we don't need more fancy metrics (for short sentences it is relatively bad but short sentences are mostly irrelevant for translation memories). Our fuzzy index reduces our search space for the Levenshtein distance calculation by 4 to 5 order of magnitudes (a sentence search is done on a space of 100K-300K sentences, after filtering the number of candidates rarely go beyond 100).


Interesting! What I noticed when approaching the problem was that there is quite little information on scaling up. I also don't think there are good out-of-the-box solutions covering a wide range of use cases. Dedup (basically cross-product) and linkage (highly dependent on the relative sizes of your search set and backing data) have very different optimizations when your data is large


There's a quite amusing blog post about implementing the automaton for Lucene: http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is...

As an aside, I know one of the authors of the original paper personally. Quite interesting guy, and the remark about the paper being "nearly unintelligible" is fitting.


Dramatic post ;) It'd be interesting to see concrete benchmarks of the Lucene implementation, on some public dataset we could try outside of Lucene too.

Btw I didn't find the Schulz & Mihov paper that cryptic. You can check its implementation in Python [0], pretty straightforward IMO.

But I should note that in the end, we chose a simpler approach: the FastSS index. FastSS bypasses constructing / intersecting Levenshtein automata altogether, and is super fast [1].

[0] https://github.com/antoinewdg/pyffs

[1] Boytsov, Leonid. (2011). Indexing methods for approximate dictionary searching: Comparative analysis. http://boytsov.info/pubs/sisap2012.pdf


I probably have some very old code somewhere I wrote for fun where I modified the Levenhstein edit distance DP algo to take keyboard layout into account. For example if you have "doga", the edit distance with "dogs" and "yoga" is the same. But if you're on an QWERTY or AZERTY keyboard, "dogs" is a much more likely typo than "yoga" for "s" is close to "a" while "y" is nowhere to be seen near the letters you entered. Something like that. So basically by taking the keyboard layout into account suggestions can be ranked more efficiently.

Pretty sure the modification was relatively simply and the algo was still a dynamic programming algo. I probably still have the code somewhere.

EDIT: memory being fuzzy too but as I remember it the Levenshtein edit distance ain't that great for typos for a typo where you invert two characters counts as 2 instead of 1? It's probably overused for it's DP implementation is beautiful...


> memory being fuzzy too but as I remember it the Levenshtein edit distance ain't that great for typos for a typo where you invert two characters counts as 2 instead of 1

In which case you want Damerau–Levenshtein distance instead of Levenshtein distance. Damerau–Levenshtein distance counts swapping adjacent characters as a single editing operation instead of two consecutive operations.



I used fuzzywuzzy [1], a python-based fuzzy string matching calculator that is based on Levenshtein's edit-distance for a parking enforcement product I built.

The product used an iOS client to capture license plates. The app would capture a single plate many times, de-duplicate using an edit-distance threshold matching plates up to a time period lookback. Then among those plates, send the one with the highest likely correct read up to the server.

The web app would look to see if that plate had been seen before or a plate similar had been seen before. If so, it would join a "plate group" which let you see the history of vehicle sightings.

A set of rules could be created to show vehicle in violation of a posted parking time limit. It worked when a person on patrol walked the lot twice and the app saw the "same" vehicle (thanks to edit distance)

I had Cummings Properties in New England and National Cathedral in DC as customers, but sold the product last year.

I did not like parking enforcement or the amount and type of data it handled.

The product was hard to sell into facility management.

Parking problems are a hot potato issue on private lots since they generally can’t ticket cars so it just takes time and money to deal with.

That said the other solutions require manual entry or even clipboards. So people would get desperate to reduce effort involved.

I really enjoyed building and iterating on the product, though, and learned a lot.

I wrote some about Vladimir Losifovich Levenshtein in a final high level blog entry about edit distance. There's a 2002 photo of Levenshtein at a conference if you're curious. [2]

[1] https://github.com/seatgeek/fuzzywuzzy

[2] https://blog.easyalpr.com/2019/09/16/calculating-license-pla...


Oh the memories. I remember using this to do entity reconciliation (~deduplication in this case) of multiple data sources based on unstructured textual user input in production some years ago. It was a nightmare but the Levenshtein function (which is available in Postgresql and where I used to pre-filter some of the entities) and other string similarities helped a bunch. Good times :') OP makes a great job of explaining the code behind it, great article!


I have also done this to help manage AdWords exclusions.

In the end I just extended mysql to have a Levenshtein function my Perl POC was a bit slow :-)


A fairly simple case. We made a design error at the start treating jobs as separate entities, when in reality the will be grouped and generally flow together. (Jobs in a production context are various objects made with the same set of specs. Jobs in a customer context often include some objects that are fancier than others.)

We took the lazy route and provided a means of linking jobs to a parent. I tweaked the weighting somewhat because some changes were more likely to be related than others and displayed the top 20 choices. The next step was going to be an input box to let the user pick--but it didn't get written because the desired job always showed up. That state persisted for years before it finally didn't cough up the desired job in the top 20 list. Looking over the problem case they had done something with the name that hadn't originally been envisioned that caused it to score badly--but even then it turned out to be in slot 21. Someday it will need an input box but the quick fix was to simply add another menu option to the list.


You can get an awesome speedup for Levenshtein distance if your application only needs to compare for an edit distance of 1 (often you don't need more than that).

Just tweak the algorithm to remove the extra branches beyond that. This can really help where you have a few hundred thousand strings in memory and where you only have a few ms to spare for things like auto-complete.


I used it for street address matching.

I've got a DB with sales tax data for about half the US states, and an API for looking up tax given a street address. I wanted to see if it was feasible to handle tax rate lookup ourselves instead of using a third party service.

Suppose the database contains the address "123 NW Maple Road, Hooterville, MI, 65026".

There are all kinds of variations in what people will give for that address. Some people might enter "Rd" instead of "Road". Some might might get the qualifier wrong and enter "street" or "avenue" instead of "road" (or abbreviations of "street" or "avenue"). Sometimes people put the directional prefix in the wrong place, so it becomes "Maple Road NW" instead of "NW Maple Road".

I found that people always get the street number right and usually get the zip code right.

What I would do then is take the address given, and lookup in the database all addresses that have that street number and zip code. For each found address, put it in the usual form (<number> <directional_prefixes> <street_name> <qualifiers> <directional_postfixes>, <city>, <state>, <zip>), and compute the Levenshtein distance between that and the input.

Whichever gave the lowest Levenshtein distance is the one I returned the tax rate for. I found that this worked quite well--way better than I expected it to.

Sometimes there would be more than one address with minimal distance, but it almost always turned out in that case that they all had the same tax rate.

Examples of where this would fail would be street names like "Martin Luther King". People will often write that as "MLK". All you need then is to have an "Elk" street in the same city and zip code, that has an address with the same number, and there isn't much hope of picking the right one.


I remember using this to compare people’s top 10 lists (convert them to strings, see how far off everyone was off from each other, and ultimately the ‘winning’ list (the distance should be near zero to when lists are compared)).

Really didn’t need to, but I was bored. Worked well enough. Adderral is a helluva drug.


I was somewhere with no internet connection and an old laptop with Golang on it. I needed a string similarity function. Remembering vaguely hearing there was some kind of 'distance' algorithm to do this I hit on the idea of treating the digits a multiple dimensions and calculating the pythagorean distance. If you fancy a laugh I suggest you try this approach, the numbers get astonishly huge very quickly. I started taking the logs of numbers and even then getting overflow errors. I wished I'd seen this!


The key insight you missed is that it doesn't matter how far away in the ASCII table a letter is. Just assigning a 0 distance if equal or a 1 if different, per character, would have made things a lot easier.


I guess Go's arbitrary precision numbers are a bit of a pain to work with.


Fun fact, Needleman Wunsch[1] algorithim for aligning genetic sequences is just Levenshtein distance with scores being assigned to different types of edits.

[1] https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algor...


I use it in production in PHP. First a pass with Soundex, then Levenshtein. It's used to cleanup certain errors in CSV files, that would otherwise not group on the same header.


I've used it in unit testing. Where the check is "this string exists in an array" and the failure message includes the nearest result from the array.


ghci also gives you a list of names if it can't find the variable you typed wrong. I'm not certain but i assume they use Levenshtein aswell. I definitivly use Demarau-Levenshtein for the same feature in pjass.


GHC also uses Demarau-Levenshtein [0]. The fuzzy lookup function gets re-used for a couple places too (including whatever you encountered in GHCi).

[0]: https://github.com/ghc/ghc/blob/25977ab542a30df4ae71d9699d01...




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: