PyPI has fewer than one million projects. The searchable content for each packag...

froh · 2026-01-01T04:03:12 1767240192

I wonder how a PyPI search index could be statically served and evaluated locally on `pip search`?
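A minimal sketch of what "locally evaluated" could look like, assuming the index is a static JSON file of name/summary records that the client downloads and caches. The schema, the sample entries, and the `load_index`/`search` helpers are all hypothetical; nothing here reflects an actual PyPI or pip interface.

```python
import json

# Hypothetical static index. In practice this would be a file fetched from a
# mirror and cached on disk; the schema is an assumption for illustration.
INDEX_JSON = """
[
  {"name": "requests", "summary": "Python HTTP for Humans."},
  {"name": "httpx", "summary": "The next generation HTTP client."},
  {"name": "numpy", "summary": "Fundamental package for array computing in Python"}
]
"""


def load_index(text):
    """Parse the downloaded index into a list of dicts."""
    return json.loads(text)


def search(index, query):
    """Return names whose name or summary contains every query term."""
    terms = query.lower().split()
    hits = []
    for entry in index:
        haystack = f"{entry['name']} {entry['summary']}".lower()
        if all(term in haystack for term in terms):
            hits.append(entry["name"])
    return sorted(hits)


index = load_index(INDEX_JSON)
print(search(index, "http client"))
```

This is plain substring matching with no ranking at all, which is the cheap part; how results should be ordered is a separate question.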

firesteelrain · 2026-01-01T04:25:20 1767241520

PyPI servers would have to be constantly rebuilding a central index and making it available for download. Seems inefficient.

ptx · 2026-01-01T16:15:35 1767284135

Debian is somehow able to manage it for apt.

firesteelrain · 2026-01-01T17:17:53 1767287873

1. Debian is local-first via a client-side cache.

2. apt repositories are cryptographically signed, centrally controlled, and legally accountable.

3. apt search is understood to be approximate, distro-scoped, and slow-moving. Results change slowly and rarely break scripts. PyPI search rankings change frequently by necessity.

4. Turning PyPI search into an apt-like experience would require distributing a signed, periodically refreshed global metadata corpus to every client. At PyPI’s scale, that is nontrivial in bandwidth, storage, and governance terms.

5. apt search works because the repository is curated, finite, and opinionated.

froh · 2026-01-01T18:52:09 1767293529

isn't this an incrementally updatable index that could be managed with a Merkle tree? git-like, essentially?

firesteelrain · 2026-01-01T19:21:22 1767295282

The install side is basically Merkle-friendly (immutable artifacts, append-only metadata, hashes, mirrors). Search isn’t. Search results are derived, subjective, and frequently rewritten (ranking tweaks, spam/malware takedowns, popularity signals). That’s more like constantly rebasing than appending commits.

You can Merklize “what files exist”; you can’t realistically Merklize “what should rank for this query today” without freezing semantics and turning CLI search into a hard API contract.
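To make the "what files exist" half concrete, here is a minimal sketch of a Merkle root over a list of file entries, using SHA-256 and duplicating the last node on odd-sized levels. The entry strings and the `merkle_root` helper are hypothetical, and this is not how PyPI or any mirror actually stores its metadata.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves):
    """Compute a Merkle root over a list of byte strings."""
    if not leaves:
        return _h(b"")
    level = [_h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        # Hash adjacent pairs to build the next level up.
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


# Hypothetical entries: one line per immutable artifact.
files = [b"example_pkg-1.0.0.tar.gz", b"example_pkg-1.0.1.tar.gz"]
before = merkle_root(files)
after = merkle_root(files + [b"example_pkg-1.1.0.tar.gz"])
print(before.hex() != after.hex())
```

Appending an artifact changes the root in a way any client can recompute and check. A ranking change has no equivalent leaf to hash, which is the asymmetry described above.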

froh · 2026-01-02T08:01:48 1767340908

are you saying PyPI search is spammed o-O ?

firesteelrain · 2026-01-02T12:42:14 1767357734

Yes, it was subject to abuse, so they had to shut down the XML-RPC search API.

froh · 2026-01-01T18:48:05 1767293285

that depends on how it can be downloaded incrementally.

woodruffw · 2026-01-01T04:55:07 1767243307

The searchable context for a distribution on PyPI is unbounded in the general case, assuming the goal is to allow search over READMEs, distribution metadata, etc.

(Which isn’t to say I disagree with you about scale not being the main issue, just to offer some nuance. Another piece of nuance is the fact that distributions are the source of metadata but users think in terms of projects/releases.)

bastawhiz · 2026-01-01T15:34:52 1767281692

> assuming the goal is to allow search over READMEs, distribution metadata, etc.

Why would you build a dedicated tool for this instead of just using a search engine? If I'm looking for a specific keyword in some project's very long README, I'm searching Kagi, not npm.

I'd expect that the most you should be indexing is the data in the project metadata (setup.py). That could be unbounded but I can't think of a compelling reason not to truncate it beyond a reasonable length.

woodruffw · 2026-01-01T15:43:15 1767282195

You would definitely use a search engine. I was just responding to a specific design constraint.

(Note, however, that PyPI can’t index metadata from a `setup.py`, since that would involve running arbitrary code. PyPI needs to be given structured metadata, and not all distributions provide that.)
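As a rough illustration of the difference: structured core metadata uses an email-header-style format, so it can be read with the standard library without executing anything from the package. The sample `PKG_INFO` text and the `searchable_fields` helper below are hypothetical, and the choice of fields is only a guess at what a search index might want.

```python
from email.parser import HeaderParser

# Hypothetical metadata for a made-up package, in the header-style format
# used by PKG-INFO / METADATA files.
PKG_INFO = """\
Metadata-Version: 2.1
Name: example-pkg
Version: 1.0.0
Summary: A hypothetical package used for illustration
Keywords: search,index
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Software Development
"""


def searchable_fields(pkg_info_text):
    """Extract a few fields by parsing headers; no package code is run."""
    msg = HeaderParser().parsestr(pkg_info_text)
    keywords = (msg["Keywords"] or "").split(",")
    return {
        "name": msg["Name"],
        "version": msg["Version"],
        "summary": msg["Summary"],
        "keywords": [k.strip() for k in keywords if k.strip()],
        "classifiers": msg.get_all("Classifier") or [],
    }


print(searchable_fields(PKG_INFO)["summary"])
```

Getting the same values out of a `setup.py` would mean importing or running the script.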

coldtea · 2026-01-01T17:42:48 1767289368

> The searchable context for a distribution on PyPI is unbounded in the general case, assuming the goal is to allow search over READMEs, distribution metadata, etc.

Even including those, it's what? Under 20-30 GB.
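The back-of-envelope behind a number like that, as a sketch: the one-million project ceiling comes from the top of the thread, while the 25 KB average of searchable text per project is purely an assumed figure, not a measurement.

```python
def corpus_size_gb(projects, avg_bytes_per_project):
    """Estimate total searchable text in decimal gigabytes."""
    return projects * avg_bytes_per_project / 1e9


# Upper bound on project count from the thread; the per-project average
# (README plus metadata) is a hypothetical value chosen for illustration.
PROJECTS = 1_000_000
AVG_BYTES = 25_000

print(corpus_size_gb(PROJECTS, AVG_BYTES))
```

Under those assumptions the total is about 25 GB uncompressed; a different average scales the estimate linearly.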