Heuristically finding content in html (hydrogen18.com)
4 points by hydrogen18 on Feb 12, 2017 | 3 comments


Nice write-up.

This reminds me of goose-ng [0].

There are a number of implementations and ports of the heuristic approach out there, including for Go and Scala.

It works nicely, I'm a fan!

[0] https://github.com/jaytaylor/goose-ng


Nicely written post! I've had this issue when trying to focus on just the content of a page. It would be interesting to see whether you can find deeper filtering ideas in browsers, like Firefox's "reader view", which strips away cruft from pages. Or even look at the source code of open-source feed reader apps like Flym, which do a great job of scraping content and caching it online in a searchable format.


Thank you for the feedback. I intend to go back and try to find a better basis for the mathematics I used.



