The Modern Document Processing Stack

gwern · on Feb 2, 2025

This would benefit from examples. What's a gnarly set of documents that this will process into clean, useful Markdown, which a much simpler stack like 'pdftotext' would fail on, and what would this buy me over just running Zerox or another OCR tool directly?

marcelmarais · on Feb 2, 2025

This should make the use case a bit clearer. It's basically a starting point / wrapper around a few tools for when you know you'll probably build something custom later, so you want to invest zero time at the beginning but still need something workable: https://www.differentiated.io/blog/the-modern-document-proce...

gwern · on Feb 2, 2025

That doesn't really answer my question. Like, I have a website, and I have many references; I also use LLM embeddings for nearest-neighbors recommendations of references to each other.

What might this... do... for me? Don't dump a bunch of JS which is how I would 'do' whatever it does. What does it do? Like, can I dump the URL 'https://pmc.ncbi.nlm.nih.gov/articles/PMC4543385/' into it and get out nice usable clean text of the abstract, say? What about a complicated PDF like https://gwern.net/doc/psychiatry/anxiety/2025-he.pdf (these are the last two references I added)? What do I get? Do I have to install the whole darn thing just to see what it does?

alhirzel · on Feb 3, 2025

Surprised to not see Pandoc.