A regex can be translated to a finite state machine. N regexes can also be compiled together into one (bigger) state machine. The longest-match-wins, ties-go-to-the-earlier-regex lexer semantics need a register/int value to track where the current favourite match ended and a second to track which regex matched. Can probably drop one of those registers if preferred.
That takes a pointer into your string and returns the index of whichever of your N regexes matched, or an indication of failure. Either it moves the pointer along or it returns a length as well.
No dynamic allocation. Constant stack space. A couple of machine registers in the interface. Linear time through the input array (slightly handwaving there, since some bytes get read multiple times to disambiguate which match is preferred, but the rereading is bounded by a constant factor determined by the N regexes).
Probably read a byte of the input at a time. The C output of re2c is especially clear for seeing why the compiled machine works.
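Here's a minimal hand-written sketch of that shape, with two hypothetical rules (rule 0 is the literal "if", rule 1 is [a-z]+) and a small transition function standing in for the table a generator like re2c would emit:

    #include <stdio.h>

    /* Hand-built DFA for two hypothetical rules:
     *   rule 0: "if"     (literal)
     *   rule 1: [a-z]+   (identifier)
     * States: 0 start, 1 saw "i", 2 other identifier, 3 saw "if". */
    static int transition(int state, unsigned char c) {
        int lower = (c >= 'a' && c <= 'z');
        switch (state) {
        case 0: return c == 'i' ? 1 : (lower ? 2 : -1);
        case 1: return c == 'f' ? 3 : (lower ? 2 : -1);
        case 2:
        case 3: return lower ? 2 : -1;
        default: return -1;
        }
    }

    /* Which rule each state accepts, -1 for none. State 3 accepts rule 0:
     * "if" matches both rules at the same length and ties go to the
     * earlier one. */
    static const int accepting[4] = { -1, 1, 1, 0 };

    /* The interface from above: takes a pointer into the string, returns
     * the index of the matched rule or -1 on failure, and moves the
     * pointer along. The two registers are best_rule and best_end. */
    static int lex(const char **cursor) {
        const char *p = *cursor;
        int state = 0;
        int best_rule = -1;        /* which regex matched */
        const char *best_end = p;  /* where the favourite match ended */
        for (;;) {
            if (accepting[state] >= 0) {
                best_rule = accepting[state];
                best_end = p;
            }
            if (*p == '\0')
                break;
            int next = transition(state, (unsigned char)*p);
            if (next < 0)
                break;
            state = next;
            p++;
        }
        if (best_rule < 0)
            return -1;
        *cursor = best_end;
        return best_rule;
    }

    int main(void) {
        const char *inputs[] = { "if", "iffy", "x", "42" };
        for (int i = 0; i < 4; i++) {
            const char *p = inputs[i];
            int rule = lex(&p);
            printf("%-4s -> rule %d, %d bytes\n",
                   inputs[i], rule, (int)(p - inputs[i]));
        }
        return 0;
    }

On "iffy" the rule-0 match at two bytes gets superseded by the rule-1 match at four; that's exactly what the two registers are doing.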
A nice quirk of the above is that equivalent regexes turn into the same minimal state machine, so it doesn't matter how the regexes are written. Literals get the same performance as a regex specifying that literal.
Which is an important bottleneck; SIMD-based lexers give you another order (or orders?) of magnitude of speedup. But writing one by hand is just... ugh. The guys behind simdjson did it, but not everyone is as great at those things as they are.
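Not simdjson's actual machinery, but a sketch of the basic trick under SSE2 (find_special is a hypothetical helper; __builtin_ctz is a GCC/Clang builtin): compare 16 bytes at a time against the interesting characters and use the movemask to find the first byte the scalar state machine actually needs to look at.

    #include <emmintrin.h>  /* SSE2 */
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical helper: offset of the first '"' or '\' in buf[0..len),
     * or len if there is none. Looks at 16 bytes per iteration. */
    static size_t find_special(const char *buf, size_t len) {
        const __m128i quote  = _mm_set1_epi8('"');
        const __m128i bslash = _mm_set1_epi8('\\');
        size_t i = 0;
        for (; i + 16 <= len; i += 16) {
            __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
            __m128i hit = _mm_or_si128(_mm_cmpeq_epi8(chunk, quote),
                                       _mm_cmpeq_epi8(chunk, bslash));
            int mask = _mm_movemask_epi8(hit);  /* bit per matching byte */
            if (mask)
                return i + (size_t)__builtin_ctz((unsigned)mask);
        }
        for (; i < len; i++)  /* scalar tail for the last <16 bytes */
            if (buf[i] == '"' || buf[i] == '\\')
                return i;
        return len;
    }

    int main(void) {
        const char *s = "a reasonably long run of boring bytes then \" here";
        printf("first special byte at %zu\n", find_special(s, strlen(s)));
        return 0;
    }

The win comes entirely from the boring stretches between hits, which is the next point.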
That speedup is not unconditional. It only occurs in portions of the input string that don't trigger any state transitions, and the length of those portions has to be some reasonable multiple of the size of your vector register. What constitutes a state transition will depend highly on the structure of your input string and the information you wish to extract from it. As an example, if I want to build a nested array/map of things from input JSON along these lines:
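    {"id": 7, "ok": true, "tags": ["a", "b"], "pos": {"x": 1.5, "y": -2.0}, "n": null}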
SIMD will not provide an appreciable performance boost because the data type changes every few characters. There's no string or whitespace or number in that JSON that is large enough to speed up scanning.
I'm still obsessing about that one. If you read a byte, there's one or two unlikely branches and then a tail call to one of up to 256 distinct subsequent states (or equivalently a read of a table of size up to 256). If you read two bytes, up to 64k subsequent states.
I suspect that reading multiple bytes is faster when the number of distinct subsequent states is smallish, and reading one byte (or maybe even four bits) is faster if the number of distinct states is near the upper bound.
It's also relatively likely that the optimum involves reading N bytes, where some out-edges consume all N and others consume fewer than N. E.g. check eight bytes in case they all match some literal sequence you're scanning for, but if they don't, branch on the first two bytes of that word.
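As a sketch of that shape, assuming a little-endian target and a hypothetical step() that reports how many bytes it consumed:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical transition step: try the 8-byte literal "function"
     * first; if the whole word doesn't match, dispatch on its first two
     * bytes instead. Assumes at least 8 readable bytes at p and a
     * little-endian target (low bytes of the word = first input bytes). */
    static int step(const char *p) {
        uint64_t w, lit;
        memcpy(&w, p, 8);            /* one unaligned 8-byte load */
        memcpy(&lit, "function", 8);
        if (w == lit)
            return 8;                /* this out-edge consumes all 8 bytes */
        switch ((uint16_t)w) {       /* up to 64k cases in principle */
        case 'i' | ('f' << 8):       /* "if" */
            return 2;
        default:
            return 1;                /* consume one byte and rescan */
        }
    }

    int main(void) {
        printf("%d %d %d\n",
               step("function("), step("if (x)  "), step("zzzzzzzz"));
        return 0;
    }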
This has really strong rabbit-hole vibes. What exactly is the right cost model for the target architecture, how should one trade icache for dcache, can any of the branches realistically be predicted for the distribution of input data, and so forth. Also, how many cycles should go on perfect-hashing the input byte(s) before the transition?
But even if you stick with re2c style read-a-byte-then-switch, and compile with something that knows how to do tail calls for your architecture, the code is going to establish a pretty solid performance baseline. Quite a lot of mispredicted branches but no syscalls, no stack or heap usage.
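For a feel of the tail-call flavour, here's a sketch with one function per state, using clang's musttail attribute (ordinary -O2 sibling-call optimization usually gets you the same thing without it); this hypothetical three-state machine accepts [0-9]+ and returns the run length:

    #include <stdio.h>

    typedef long scan_fn(const char *p, long n);

    static scan_fn start, digits, done;  /* one function per state; real
                                            generated code might index a
                                            256-entry table of these by
                                            the input byte */

    static long start(const char *p, long n) {
        if (*p >= '0' && *p <= '9')
            __attribute__((musttail)) return digits(p + 1, n + 1);
        return -1;  /* no match */
    }

    static long digits(const char *p, long n) {
        if (*p >= '0' && *p <= '9')
            __attribute__((musttail)) return digits(p + 1, n + 1);
        __attribute__((musttail)) return done(p, n);
    }

    static long done(const char *p, long n) {
        (void)p;
        return n;  /* length of the digit run */
    }

    int main(void) {
        printf("%ld %ld\n", start("12345x", 0), start("abc", 0));  /* 5 -1 */
        return 0;
    }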
Only because the escape sequence for a quote is \" instead of, say, \q. Otherwise the very straightforward regex "[^"]*" would match a string literal. On a slightly related note, making the escape sequence for a backslash \z or something like that instead of \\ mitigates the "leaning toothpick syndrome" quite a lot. Both of these choices are unpopular for no particular reason, IMHO, other than "it's different from what all the C-like languages do".
I think that for matching, ([^"\\]|\\.)* between the quotes should work. If you want to validate and interpret the backslash escapes then it's a little more work, especially if you accept \uXXXX.
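A direct transcription of that regex into a hand-rolled matcher, as a sketch (match_string_lit is a made-up name; it matches the surrounding quotes too and does none of the validation or interpretation):

    #include <stdio.h>

    /* Matches "([^"\\]|\\.)*" : a quoted string where a backslash escapes
     * whatever single character follows it. Returns the matched length
     * including both quotes, or -1. */
    static long match_string_lit(const char *s) {
        if (s[0] != '"')
            return -1;
        long i = 1;
        for (;;) {
            if (s[i] == '"')
                return i + 1;    /* closing quote */
            if (s[i] == '\0')
                return -1;       /* unterminated */
            if (s[i] == '\\') {
                if (s[i + 1] == '\0')
                    return -1;   /* trailing backslash */
                i += 2;          /* \\. consumes backslash + any char */
            } else {
                i += 1;          /* [^"\\] */
            }
        }
    }

    int main(void) {
        printf("%ld\n", match_string_lit("\"he said \\\"hi\\\"\" rest")); /* 16 */
        printf("%ld\n", match_string_lit("\"unterminated"));              /* -1 */
        return 0;
    }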