Don't miss how this works. It's not a server-side application - this code runs entirely in your browser using SQLite compiled to WASM, but rather than fetching a full 22GB database it instead uses a clever hack that retrieves just "shards" of the SQLite database needed for the page you are viewing.
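To make the pattern concrete, here's roughly what loading a single shard in the browser could look like with sql.js - a generic sketch, not the site's actual code (the shard URL and table names are made up):

```typescript
import initSqlJs from "sql.js";

// Hypothetical shard URL - the real site decides which shard(s) a page needs
// from a manifest, then fetches only those files.
const SHARD_URL = "https://example.com/shards/shard-0421.sqlite";

async function openShard(url: string) {
  const SQL = await initSqlJs({
    // sql.js needs to know where to load its WASM binary from.
    locateFile: (file) => `https://sql.js.org/dist/${file}`,
  });
  const bytes = new Uint8Array(await (await fetch(url)).arrayBuffer());
  // Each shard is itself a complete SQLite database, so it opens like any other.
  return new SQL.Database(bytes);
}

// Query just this shard instead of the whole 22GB archive.
// Table and column names are illustrative, not the site's real schema.
const db = await openShard(SHARD_URL);
const rows = db.exec("SELECT id, title FROM items ORDER BY id DESC LIMIT 30");
console.log(rows);
```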
I watched it in the browser network panel and saw it fetch:
It's reminiscent of that brilliant SQLite.js VFS trick from a few years ago: https://github.com/phiresky/sql.js-httpvfs - only that one used HTTP range headers, this one uses sharded files instead.
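Stripped of the VFS plumbing, the range-header version of the trick is just this (placeholder URL; sql.js-httpvfs hides it behind SQLite's VFS layer):

```typescript
// Fetch bytes [offset, offset + length) of a remote SQLite file.
// The host has to honor Range requests and answer 206 Partial Content.
async function fetchRange(url: string, offset: number, length: number): Promise<Uint8Array> {
  const res = await fetch(url, {
    headers: { Range: `bytes=${offset}-${offset + length - 1}` },
  });
  if (res.status !== 206) {
    throw new Error(`expected 206 Partial Content, got ${res.status}`);
  }
  return new Uint8Array(await res.arrayBuffer());
}

// e.g. read one 4 KiB SQLite page out of a single large database file.
const page = await fetchRange("https://example.com/hn-full.sqlite", 4096 * 100, 4096);
```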
This doesn't cache the data, right? It would always fetch from the network? By any chance, do you know of a solution/extension that caches the data? It would make this so much more efficient.
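For what it's worth, one way to bolt caching onto plain shard fetches would be the browser's Cache API - a rough sketch (cache name and URL shape are assumptions, not something the site does today):

```typescript
// Cache shard fetches in the browser so repeat visits skip the network.
// Works from a page or a Service Worker.
async function fetchShardCached(url: string): Promise<ArrayBuffer> {
  const cache = await caches.open("hn-shards-v1");
  const hit = await cache.match(url);
  if (hit) {
    return hit.arrayBuffer();
  }
  const res = await fetch(url);
  if (!res.ok) {
    throw new Error(`shard fetch failed: ${res.status}`);
  }
  // Store a clone, because a Response body can only be read once.
  await cache.put(url, res.clone());
  return res.arrayBuffer();
}
```

And if the shard files are immutable (new data lands in new shards), long-lived Cache-Control headers on the static host would get most of the same benefit with no code at all.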
Thanks! I'm glad you enjoyed the sausage being made. There's a little easter egg if you click on the compact disc icon.
And I just now added a 'me' view. Enter your username and it will show your comments/posts on any day. So you can scrub back through your 2006 - 2025 retrospective using the calendar buttons.
I almost got tricked into trying to figure out what was Easter eggy about August 9 2015 :-) There's a clarifying tooltip on the link, but it is mostly obscured by the image's "Archive" title attribute.
Oh, shit that was the problem! You solved the bug! I was trying to figure out why the right tooltip didn't display. A link wrapped in an image wrapped in an easter egg! Or something. Ha, thank you. Will fix :)
edit: Fixed! Also I just pushed a new version with a Dec 29th Data Dump, so ... updates - yay!
Yes — PMTiles is exactly that: a production-ready, single-file, static container for vector tiles built around HTTP range requests.
I’ve used it in production to self-host Australia-only maps on S3. We generated a single ~900 MB PMTiles file from OpenStreetMap (Australia only, up to Z14) and uploaded it to S3. Clients then fetch just the required byte ranges for each vector tile via HTTP range requests.
It’s fast, scales well, and bandwidth costs are negligible because clients only download the exact data they need.
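If anyone wants to see what the client side looks like: the pmtiles JS library hides the range-request mechanics entirely - roughly like this (placeholder bucket URL, and I'm going from memory on the exact API):

```typescript
import { PMTiles } from "pmtiles";

// Point the client at a single static archive on S3 (placeholder URL).
// Each tile lookup becomes one or two HTTP range requests under the hood.
const archive = new PMTiles("https://my-bucket.s3.amazonaws.com/australia-z14.pmtiles");

const header = await archive.getHeader();
console.log(`zoom ${header.minZoom}-${header.maxZoom}`);

// Fetch one vector tile by z/x/y; only the needed byte range is downloaded.
const tile = await archive.getZxy(11, 1853, 1236);
if (tile) {
  console.log(`tile is ${tile.data.byteLength} bytes`);
}
```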
Hadn't seen PMTiles before, but that matches the mental model exactly! I chose physical file sharding over range requests on a single DB because it felt safer for 'dumb' static hosts like CF - less risk of a single 22GB file getting stuck or cached weirdly. Maybe it would work, though.
My only gripe is that the tile metadata is stored as JSON, which I get is for compatibility reasons with existing software, but for e.g. a simple C program to implement the full spec you need to ship a JSON parser on top of the PMTiles parser itself.
At that point you're just I/O bound, no? I can easily parse JSON at 100+ GB/s on commodity hardware, but I'm gonna have a much harder time actually delivering that much data to parse.
Look into using duckdb with remote http/s3 parquet files. The parquet files are organized as columnar vectors, grouped into chunks of rows. Each row group stores metadata about the set it contains that can be used to prune out data that doesn’t need to be scanned by the query engine. https://duckdb.org/docs/stable/guides/performance/indexing
LanceDB has a similar mechanism for operating on remote vector embeddings/text search.
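As a sketch of what that looks like from code (Node duckdb package; the URL and column names are placeholders, and the same SQL works in the CLI or duckdb-wasm):

```typescript
import duckdb from "duckdb";

// In-memory database; the parquet file itself stays remote.
const db = new duckdb.Database(":memory:");

// Newer DuckDB builds auto-load the httpfs extension for https:// URLs;
// on older ones, run "INSTALL httpfs; LOAD httpfs;" first.
db.all(
  `SELECT author, count(*) AS comments
     FROM read_parquet('https://example.com/hn/items.parquet')
    WHERE type = 'comment'
      AND created_at >= TIMESTAMP '2024-01-01'
      AND created_at <  TIMESTAMP '2024-01-08'
    GROUP BY author
    ORDER BY comments DESC
    LIMIT 20`,
  (err, rows) => {
    if (err) throw err;
    // Only the row groups whose min/max metadata overlaps the WHERE clause
    // get fetched, via HTTP range requests.
    console.table(rows);
  }
);
```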
> Look into using duckdb with remote http/s3 parquet files. The parquet files are organized as columnar vectors, grouped into chunks of rows. Each row group stores metadata about the set it contains that can be used to prune out data that doesn’t need to be scanned by the query engine. https://duckdb.org/docs/stable/guides/performance/indexing
But when using this on the frontend, are portions of files fetched specifically with HTTP range requests? I tried to search for it but couldn't find details.
Yes, you should be able to see the byte range requests and 206 responses from an s3 compatible bucket or http server that supports those access patterns.
There was a UK government GitHub repo that did something interesting with this kind of trick against S3 but I checked just now and the repo is a 404. Here are my notes about what it did: https://simonwillison.net/2025/Feb/7/sqlite-s3vfs/
From reading the TIL, it doesn't appear as if Simon used an LLM for a large portion of what he did; only the initial suggestion to check the archive, and the web tool to make his process reproducible. Also, if you read the script from his chat with Claude Code, the prompt really does the heavy lifting.
Sure, the LLM fills in all the boilerplate and makes an easy-to-use, reproducible tool with loads of documentation, and credit for that. But is it not more accurate to say that Simon is absurdly efficient, LLM or sans LLM? :)
Nothing smart with HTTP range requests yet - I have https://lite.datasette.io which runs the full Python server app in the browser via WebAssembly and Pyodide but it still works by fetching the entire SQLite file at once.
I played around with this a while back. You can see a demo here. It also lets you pull new WAL segments in and apply them to the current database. I never got much time to go any further with it than this.
This is somewhat related to a large dataset browsing service a friend and I worked on a while back - we made index files, and the browser ran a lightweight query planner to fetch static chunks which could be served from S3/torrents/whatever. It worked pretty well, and I think there’s a lot of potential for this style of data serving infra.
I tried to implement something similar to optimize sampling semi-random documents from (very) large datasets on Huggingface, unfortunately their API doesn't support range requests well.
A recent change: I added date spans to the shard checkboxes on the query view, so it's easier to zero in on the dates you want if you have that in mind. Because if your copy isn't local, all those network pulls take a while.
The sequence of shards you saw as you paginated through days is facilitated by the static-manifest, which maps HN item ID ranges to shards. Since IDs are increasing and a pretty good proxy for time (a "HN clock"), we can also map the shards we cut by ID to the time spans their items cover. An in-memory table sorted by time is built from the manifest on load, so we can easily look up which shard we need when you pick a day.
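Roughly, the lookup is this shape - the field names and manifest format here are simplified for illustration, not the literal file:

```typescript
// One entry per shard: the item-ID range it was cut on, plus the
// "effective" time span derived from those items' timestamps.
interface ShardSpan {
  file: string;      // e.g. "shard-0421.sqlite"
  minId: number;
  maxId: number;
  startTime: number; // unix seconds
  endTime: number;
}

// Built once from the manifest at load time, sorted by start time.
function buildIndex(manifest: ShardSpan[]): ShardSpan[] {
  return [...manifest].sort((a, b) => a.startTime - b.startTime);
}

// Binary-search the sorted spans for the shard covering a given day.
function shardForDay(index: ShardSpan[], dayStart: number): ShardSpan | undefined {
  let lo = 0;
  let hi = index.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const s = index[mid];
    if (dayStart < s.startTime) hi = mid - 1;
    else if (dayStart > s.endTime) lo = mid + 1;
    else return s;
  }
  return undefined; // the day falls in a gap or outside the archive
}
```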
Funnily enough, this system was thrown off early on by a handful of ID/timestamp outliers in the data: items with weird future timestamps (offset by a couple of years), or null timestamps. To cleanse our pure data of this noise and restore proper adjacent-in-time shard cuts, we just did a 1st/99th percentile grouping and discarded the outliers, leaving shards with sensible 'effective' time spans.
Sometimes we end up fetching two shards when you enter a new day, because some items' comments exist "cross shard". We needed another index for that, and it lives in cross-shard-index.bin, which is just a list of 4-byte item IDs that have children in more than one shard (2 bytes), which happens when people have the self-indulgence to respond to comments a few days after a post has died down ;)
Thankfully HN imposes a 2-week horizon for replies, so there aren't that many cross-shard comments (those living outside the 2-3 day span that most recent shards cover). But I think there are still around 1M or so, IIRC.
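Reading an index like that in the browser is just a DataView walk - the fixed 6-byte record layout below (uint32 item ID plus a uint16) is illustrative rather than the exact on-disk format:

```typescript
// Parse a packed binary index of (uint32 itemId, uint16 extra) records.
// Little-endian, fixed-width records - an assumed layout for illustration.
function parseCrossShardIndex(buf: ArrayBuffer): Map<number, number> {
  const view = new DataView(buf);
  const RECORD = 6;
  const out = new Map<number, number>();
  for (let off = 0; off + RECORD <= buf.byteLength; off += RECORD) {
    const itemId = view.getUint32(off, true);     // 4-byte item ID
    const extra = view.getUint16(off + 4, true);  // e.g. a second shard number
    out.set(itemId, extra);
  }
  return out;
}

// Fetch the index once, then check membership before deciding
// whether a new day needs one shard or two.
const res = await fetch("https://example.com/cross-shard-index.bin"); // placeholder URL
const crossShard = parseCrossShardIndex(await res.arrayBuffer());
console.log(`${crossShard.size} items have children in more than one shard`);
```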
The GitHub page is no longer available, which is a shame because I'm really interested in how this works.
How was the entirety of HN stored in a single SQLite database? In other words, how was the data acquired? And how does the page load instantly if there's 22GB of data having to be downloaded to the browser?
The interactive SQL query interface at https://hackerbook.dosaygo.com/?view=query asks you to select which shards to run the query against; there are 1,636 in total.