
This gets those cases right.

https://github.com/KnowSeams/KnowSeams

(On a beefy machine) It gets 1 TB/s throughput, including all IO and mapping positions back to the original text location. I used it to split Project Gutenberg novels. It does 20k+ novels in about 7 seconds.

Note that it keeps all dialog together, which may not be what others want, but it was what I wanted.
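
To make "keeps dialog together" and "position mapping" concrete, here is a minimal Python sketch of the idea. It is not KnowSeams code and does not use its API; `split_sentences` is a hypothetical helper, and the rules (straight double quotes only, a closing quote ends a sentence only when the next word is capitalized) are simplifying assumptions that a real splitter would need to go well beyond.

```python
def split_sentences(text):
    """Split text into (start, end, sentence) spans.

    Hypothetical illustration only: terminators inside straight double
    quotes never split, so a line of dialog stays in one span. Offsets
    index into the original string, so text[start:end] == sentence.
    """
    spans = []
    start = 0
    in_quote = False
    n = len(text)
    for i, ch in enumerate(text):
        boundary = False
        if ch == '"':
            closing = in_quote
            in_quote = not in_quote
            # A closing quote ends the sentence only if the quoted text
            # ended with a terminator and the next word is capitalized
            # (or the text ends). This is a crude heuristic.
            if closing and text[i - 1] in ".!?":
                j = i + 1
                while j < n and text[j].isspace():
                    j += 1
                boundary = j == n or text[j].isupper()
        elif ch in ".!?" and not in_quote:
            # Outside quotes, split on a terminator followed by whitespace.
            boundary = i + 1 == n or text[i + 1].isspace()
        if boundary:
            spans.append((start, i + 1, text[start:i + 1]))
            # The next sentence starts at the next non-space character.
            start = i + 1
            while start < n and text[start].isspace():
                start += 1
    end = len(text.rstrip())
    if start < end:
        # Keep any trailing text that has no terminator.
        spans.append((start, end, text[start:end]))
    return spans


sample = 'He stopped. "Wait! Come back," she said. "No." He left.'
for s, e, sentence in split_sentences(sample):
    print(s, e, sentence)
```

The "Wait!" inside the quote does not produce a split, so the whole line of dialog plus its attribution comes back as one span, and every span can be sliced straight out of the original text by its offsets.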


