Nicely written post! I've had this issue when trying to focus on just the content of the page.
It would be interesting to see whether other browsers offer any deeper filtering ideas; Firefox, for example, has a "reader view" that strips the cruft from pages. It might also be worth looking at the source code of open-source feed reader apps like Flym, which do a great job of extracting content and caching it offline in a searchable format.
This reminds me of goose-ng [0].
There are a number of implementations and ports of the heuristic approach out there, including for Go and Scala.
It works nicely, I'm a fan!
[0] https://github.com/jaytaylor/goose-ng