Heuristically finding content in HTML

jaytaylor · on Feb 12, 2017

Nice write up.

This reminds me of goose-ng [0].

There are a number of implementations and ports of the heuristic approach out there, including for Go and Scala.

It works nicely, I'm a fan!

darshandsoni · on Feb 12, 2017

Nicely written post! I've had this issue when trying to focus on just the content of a page. It would be interesting to see whether you can find deeper filtering ideas in browsers - Firefox, for example, has a "reader view" that strips away cruft from pages. You could also look at the source code of open-source feed reader apps like Flym - they do a great job of scraping content and caching it online in a searchable format.

hydrogen18 · on Feb 12, 2017

Thank you for the feedback. I intend to go back and try to find a better basis for the mathematics I used.