Hacker News
The Modern Document Processing Stack (github.com/marcelmarais)
9 points by marcelmarais on Feb 2, 2025 | 4 comments


This would benefit from examples. What's a gnarly set of documents that this will process to clean useful Markdown, which a much simpler stack like 'pdftotext' would fail on, and what would this buy me over just running Zerox or another OCR tool directly?


This should make the use case a bit clearer. It's basically a starting point, a wrapper around a few tools, for when you know you'll probably build something custom later, so you want to invest zero time up front but still need something workable: https://www.differentiated.io/blog/the-modern-document-proce...


That doesn't really answer my question. Like, I have a website, and I have many references; I also use LLM embeddings for nearest-neighbors recommendations of references to each other.

What might this... do... for me? Don't dump a bunch of JS which is how I would 'do' whatever it does. What does it do? Like, can I dump the URL 'https://pmc.ncbi.nlm.nih.gov/articles/PMC4543385/' into it and get out nice usable clean text of the abstract, say? What about a complicated PDF like https://gwern.net/doc/psychiatry/anxiety/2025-he.pdf (these are the last two references I added)? What do I get? Do I have to install the whole darn thing just to see what it does?


Surprised to not see Pandoc.



