I run some forums, some of which are quite large. Recently the big increase in scraping by the search engines (Bing has had the greatest increase) caused me to question why.
It used to be that the cost of scraping came with the benefit of being search engine listed which drove traffic, but that feels less true than it used to (for a lot of reasons).
But now the cost of scraping no longer feels like it's in a website's favour.
Scraping and bots exist for search engine listings, technology tests/experiments, advert/audience measurement, brand protection, IP tracking, copyright enforcement, screenshots for links on other websites (e.g. Facebook), Pinterest linkbacks, training of LLMs (my hypothesis for Bing's massive increase), spam, etc, etc.
With the search engine value lowered by reduced traffic, yet a solid community still growing via word of mouth... the rest of those things offer no value to me or the community. So I asked the community: what do you want to do here? Leave them all? Ban some? Ban all? Some midway thing?
Almost unanimously the community (who fund the costs by donations; at least 30% of all traffic and costs were known to be associated with bots) chose to block every bot.
So that's what we've done.
We've blocked every major hosting and cloud ASN, or put a challenge up for the few known to be proxies (e.g. Google Data Saver). We've blocked hundreds of bot user agents, blocked requests where no Accept header was present where one should be, and blocked TLS cipher suites that modern web browsers don't use. I looked at requests made by Python, Go, curl, Wget, etc., and blocked everything that obviously differed from a valid browser.
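A minimal sketch of that style of request filtering, assuming a simple header-dict interface (the function name and UA substring list are illustrative, not our actual ruleset):

```python
# Hypothetical sketch: flag requests that obviously differ from a browser.
BOT_UA_SUBSTRINGS = ("python-requests", "go-http-client", "curl", "wget", "scrapy")

def looks_like_bot(headers: dict) -> bool:
    """Return True if the request headers look like a non-browser client."""
    ua = headers.get("User-Agent", "").lower()
    # Known scripting/HTTP-library user agents.
    if any(s in ua for s in BOT_UA_SUBSTRINGS):
        return True
    # Real browsers always send an Accept header on page requests.
    if "Accept" not in headers:
        return True
    return False
```

In production this would sit alongside the ASN and TLS-fingerprint rules rather than replace them; header checks alone are easy to spoof.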
In the end we blocked about 40% of our traffic, and so far not a single real human has said (and it's a tight-knit but large community with lots of ways of contacting me) that they've had any issue at all.
We appear to have reduced our traffic and associated costs, with no loss to us at all.
About a year ago I noticed one of my websites going down roughly weekly for an hour or so. The site had one endpoint available in a few hundred thousand versions, meant for users; for the bots it was just a few thousand variants. It was set up in the sitemap and in robots.txt, and included the right meta tags. It was meant to update every few months.
But, well, not with the Bing bot. It ignored my timeouts and queried hundreds of thousands of pages (identical, from its point of view) every single week. Not one connection, not two or three, but about 10 IPs hammering my servers at once. No pause between requests, not even when the server was going down, which even 'bad bots' usually do.
I assumed it was just any bot calling itself Bing. But no, it was their IP ranges.
I blocked nearly all of their IPs, which appears to be the only way to make sure it doesn't DDoS me again. Bing is like 1% of my traffic, not even worth the hassle.
Yeah, Bing has gone completely nuts the last six months or so. They'll happily send the equivalent of a small DDoS at sites hosted on shared hosting, knocking them completely offline for a while. Nuts.
Could have been within the last 6 months for me too. And yes, it's crazy: most of my sites have between 8 and 16 concurrent database connections available. In the real world this works for thousands of daily users, but for the Bing bot it's simply not enough.
Why aren't you caching your forum as static pages for users who aren't logged in at the very least? E.g., rebuild the cache every x time as a cron task, but even then, every page load shouldn't be incurring database overhead if someone isn't even logged in. Equally, you can force a rebuild of cache for x relevant pages when someone posts a new thread or comment.
Otherwise, if someone alt+clicks a bunch of a category's threads as they look interesting then you're going to have a bad time.
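The rebuild-on-write caching scheme described above can be sketched roughly like this (class and parameter names are hypothetical, and a real forum would back this with files or a CDN rather than an in-process dict):

```python
import time

class PageCache:
    """Tiny TTL cache: serve prebuilt HTML to logged-out users and
    rebuild a page only when its entry expires or is invalidated."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.store = {}  # path -> (rendered_html, expires_at)

    def get(self, path, render):
        entry = self.store.get(path)
        now = time.time()
        if entry and entry[1] > now:
            return entry[0]              # cache hit: zero database work
        html = render(path)              # cache miss: one rebuild
        self.store[path] = (html, now + self.ttl)
        return html

    def invalidate(self, path):
        # Call this when someone posts a new thread or comment on `path`,
        # forcing a rebuild on the next request.
        self.store.pop(path, None)
```

The point is that anonymous page views, including a burst of alt+clicked tabs, mostly hit the cache instead of the database.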
I am not the forum guy, and my content is not really static. It's cached, but for this specific sub-page I need at least one small query. Only a few requests rebuild the page, but if every 10th request triggers a rebuild and 9 other bots are hammering mostly cached pages, I still have the same issue.
The bot was never meant to query the same pages thousands of times; the pages were identical from its perspective. There were already bot-specific rules programmed in.
This website is already heavily cached and optimized. Even though it only has 16 database connections, there's maybe one timeout every few months. Users usually don't open tabs much faster than the short requests take; only Bing does.
Honestly, given the time that went into optimizing it, it makes me kind of sad when someone frames this as an effort-vs-payoff question. The bot ignored all the rules I gave it and brought barely any traffic in return. There is no payoff here, only effort.
This happened to me every week or so, not entirely predictable, on a small website that supports only a handful of users. Being an enterprise site with some corporate backing it's on quite a decent server, but it was no match. The DDoS would leave things in a semi-recovered state at best.
And you'll abandon those forums one day, and a month after, I'll search for some niche problem, someone will link to that forum, and there will be no archive(.org) page, no Google cached page, nothing to get the answer from.
I've only ever shuttered one forum, and that was at the behest of the community itself. A decade in and it had become a toxic place, everyone agreed that they wanted to give multiple new things a try and that the place in question should be deleted. Not archived, not available forever, but deleted and nothing kept of it. I obliged.
When the day comes that I shutter another I'll ask the active members at the time what they want to happen to their data. They may desire to leave it as a resource, they may want to delete it, if there's a clear majority in the decision I'll go with whatever they desire. I value the choice of those whose data it is, who contributed to creating it, over anyone else's hypothetical needs.
> When the day comes that I shutter another I'll ask the active members at the time what they want to happen to their data.
You might not have that chance; unless you have a co-admin with full access to everything, the reason for the forum to shut down might be because you're no longer there.
And? In this hypothetical situation where they "aren't around", I don't think that people searching for answers to tech support issues are high on their list of concerns, and almost certainly not on those of the forum members either.
"And" the availability of information is important.
Respectfully, on most forums, I don't care about the community, I care about the content, that's why I'm there, to have discourse and generate meaningful value in the form of knowledge. If someone passes, yes, that sucks, but that's life, we're all snuffing it at some point. However, the world carries on spinning, and that information should continue to be available, especially if the forum is for a niche and frequently generates useful information.
If a forum is becoming "toxic" then that sounds like a moderation problem.
It seems like these kinds of forums are not exactly your area of interest if they're community-focused. Information gets snuffed out all the time; with every death of a person we lose a large piece of information. But hoarding, and especially expecting others to hoard or assist hoarding, is not the correct approach to what is essentially a _you_ problem, so grab that terabyte disk and make a mirror yourself if you are so inclined. Nobody is or should be required to let corpo behemoths in for your convenience and to comply with your questionable opinions.
I'm part of several digital archivism projects. My personal disk array is 54TB of data. That's without even getting into 1PB+ of data on LTO carts.
Last time I checked, Archive.org et al weren't a "corpo behemoth", but consuming server resources is exactly what a normal user does.
Site owners should get with the times and serve up cached static pages to users who aren't logged in. Even then, they should be serving up cached static pages and rebuilding cache for relevant pages when someone posts new content when it comes to forums. Not being able to handle a few crawlers is an administration problem. Why should the community/public suffer for someone's inability to configure a server appropriately?
We live in primarily free societies where individuals have the right to decide upon their own actions. Telling people that there is only one "correct" way of doing things is obnoxious and toxic, and reflects an inability to see your opinion for what it is: an opinion.
My opinion is that "screw crawlers and scrapers" is a valid opinion. If I'm hosting a playground, it's my playground and my rules. If you want to play elsewhere, please do. If you want to preserve data, please do, but not at my expense.
Disagree with that? Feel free to, but don't think that you are somehow in the right, because if you go to court with this, you will be laughed out of the door.
You have no inherent right to other people's data, regardless of how they shared it or the visibility of it at the time they shared it. You are not owed the sum of human knowledge.
If people wished for their content to be available to all forever they'd run a blog and pay to ensure it is available, and would proactively seek to get it archived.
People on forums aren't doing that, and the data of any given individual is a contextless collection of semi-random mumblings on different topics because without the fullness of a conversation involving others none of it makes sense.
It is within that context that a forum admin can decide what to do: they have been granted the right (by the T&Cs) to the collection of all the forum members' comments, which restores the context and gives meaning to the content. Every individual on the forums I operate can obtain their own data, but it would be meaningless by itself.
As the operator of the collection of content, I get to determine what is best to do with it, and sometimes that may be to delete it all. Sometimes that may be to seek to archive it. And on this occasion it is to treat this knowledge as valuable to those already participating in the community, and not to share it beyond that.
Elsewhere you said this:
> Call it what most forums are: an ad-supported business. People generate content for the owner for free because they too derive value from the information that others share. The middleman is just a middleman
But the 300+ forums I run have no adverts, they are not a business, they are non-profit. Their value (if you want to measure everything in a capitalist way) is social, to help those in the community.
The purpose of the forums I run isn't to expand the sum of human knowledge, or to make myself personally wealthy off the back of the efforts of others. The purpose is to be a remedy for adult loneliness by connecting people through shared interests in geographically small areas, such that it builds relationships and forms bonds.
Yes there is a hell of a lot of expertise captured here around those interests... but no-one has any inherent right to it.
This tangent is in relation to my shuttering one forum.
That forum was around a music band in the UK, and the audience of the forum turned out to be younger than expected: university age. They were emotionally immature, over-shared online, slept with each other, had relationships and break-ups... all in public. The music forum did have lots of music info on it, but it was intertwined with a lot of very highly personal information, posted at a time when a reasonable expectation of the internet was ephemerality.
It was totally right to protect those individuals' future selves from their past selves, and I would delete it again.
There are certainly downsides to hoarding data. At the very least, information takes up space. It also tends to suck up mental bandwidth: you have to keep organizing, de-duplicating, and migrating to newer formats. It's much easier to just delete it, just like it's much easier to throw out old ratty t-shirts. IMO, data hoarding is just as much of a mental disorder as hoarding physical stuff.
This idea that all information must be preserved forever is also at odds with privacy. See, e.g., the right to be forgotten.
I think that the reason many people don't put much effort into archiving information is a cultural one. Most people simply haven't given much thought to the question of the fate of information or knowledge they happen to find, and the importance of preserving that knowledge for the health of society's discourse.
Why are forum admins beholden to archive their data in perpetuity in case someone wants free advice or knowledge?
Do you maintain a freely-available repository of all of your knowledge and experience, in case someone else wants to consult it one day?
While the openness of the (now-ending) early days of the internet was liberating and allowed knowledge sharing on an unprecedented scale, the downside is the huge devaluing of that knowledge and skills.
I do actually, but it's up to each person. The main reason for me to encourage it is that if knowledge is reserved for the high priests, it will eventually be lost. How many civilizations have we built by now? No one knows! We don't have the records. Think of all the stuff people must have figured out. Of course many would pretend it wasn't a big deal, but all those deleted forums had plenty of insights to offer, both practical and historically valuable.
The real value of knowledge doesn't change if you duplicate it or make it widely available. In the long term, blocking access and rent-seeking doesn't create value, it destroys it. It seems useful for the individual who wants to pay their bills, or for the one with insatiable greed, but in the end it will make us stupid.
For example: I would like a high-quality UV-B lamp that isn't insanely expensive. They are pretty ordinary lamps, but developing the coating is very expensive. The work has been done though, lots of times, over and over again. Most results are just bad.
About 35% of the US population, and about 1 billion people globally, have a vitamin D deficiency, and 50% have an insufficiency: fatigue, poor sleep, bone pain or achiness, depression or feelings of sadness, hair loss, muscle weakness, loss of appetite, getting sick more easily, etc.
Great loss of economic productivity or more opportunity for me? You decide!
If I contribute time in answering questions or solving problems, like with mailing lists still being available to view, something that I intentionally put into the public domain with the intent of helping people should remain available. Just because a forum exists as a business to someone doesn't mean that the content has no value to the general public. The forum itself has no value; only the content has value, which is what draws in the traffic to make money in the first place.
Call it what most forums are: an ad-supported business. People generate content for the owner for free because they too derive value from the information that others share. The middleman is just a middleman.
To not allow that content to be indexed/cached/archived/mirrored whilst making money off of it is pretty scummy in the long-term. There's tons of forums I used to visit whose information is now forever lost, that included a lot of very useful programs for niche bits of kit, which is now otherwise very expensive e-waste.
> Why are forum admins beholden to archive their data in perpetuity in case someone wants free advice or knowledge?
Because otherwise their work was wasted.
> Do you maintain a freely-available repository of all of your knowledge and experience, in case someone else wants to consult it one day?
I would if I could, I’ve already contributed what knowledge, bandwidth, and money I can to the Internet Archive. What about you?
> While the openness of the (now-ending) early days of the internet was liberating and allowed knowledge sharing on an unprecedented scale, the downside is the huge devaluing of that knowledge and skills.
I cannot even process how wrong this is. Objectively the preservation of knowledge and skills is a good thing, and you cannot devalue knowledge, which is itself priceless.
This argument really makes no sense. If I tell Bob how to fix his transmission down at the local diner, but nobody records the conversation, that wasn't wasted work. Bob fixed his transmission: mission accomplished.
So this data will not be lost forever? Also, do you mean that all data and all posts made by users should belong only to admins, and only admins should decide what to do with it?
They're not, but blocking all bots also blocks others that want to archive all that data forever, be it a private person using wget or a service like archive.org.
Why would I? It's online, I know where to find it... until it's gone from there. Also, that would mean I'd have to archive it before I actually needed it archived. And archiving would have to be done manually. And after it's gone, the only proof of it existing is a text somewhere else saying that the solution to my problem is here -> LINK, and the link is dead, the data gone. Not even on archive.org.
Have we really come to a phase of internet use where every time you see something you have to manually save it, and where on every post (even here, or on Reddit, Facebook or wherever) a link is not good enough, but you have to copy-paste the whole block of text just to make it a bit future-proof?
And there's the perpetual tale of the forum post that says "this has been asked before, use the search", where the first search result is that same person saying to use the search.
Bizarrely, I can't remember the last time DuckDuckGo (basically Bing) gave me a forum as a search result, though it used to do so regularly. Maybe it's the admins blocking crawlers, but it feels more like a conscious decision.
I’ve been wondering how much of it comes down to optimizing for ad impressions. If you search, get a result, and it answers your question they sell one page of keyword ads. If you go back and forth a dozen times, they sell a dozen times as many impressions.
Given that my usual behavior is to check two pages and then add !g, where I check two more pages and decide I don't need more info, I don't think that's a strong move.
I’m not saying it’s smart, just that I could easily imagine someone chasing the wrong metric or trying to balance revenue against the likelihood that you’ll stop using them. For example, in your scenario that’s still twice as many impressions, so unless you make Google your primary, maybe that’s a win.
I am definitely happy you asked your users what they thought and made your decision accordingly. But saying "no human complained" might not be a good metric if people use Google or whatever to discover your site or its info. People don't complain about things they don't know exist.
If you aren’t doing so already, I highly suggest working alongside the Internet Archive to preserve the information on your forums. One day they will close down, and your users will want to see their posts, refer to now broken bookmarks, and generally access the information.
How do you have costs that are directly attributable to scraping? Unless you are using a serverless platform that bills per request, or your pages are large enough that egress bandwidth gets expensive, I’m not convinced most sites would save much by doing this.
I'm not really sure how you arrive at the conclusion that only serverless platforms incur per-request costs. It's not just 40% of CPU or egress; it's 40% of database load, 40% of logging, 40% of APM/instrumentation.
40% is 40%. Maybe 40% of their cost isn't enough to warrant whatever time these efforts cost them, but for many people out there it will be.
Sounds like the original commenter had a reasonable case, but I just don’t think it’s likely to save anything for small sites on traditional stacks.
If you are running on, e.g., EC2 and RDS instances, you’re not saving anything by using 40% less of the CPU unless you can actually downsize the instance as a result. Read-only traffic is also not that hard to scale out, but with forums etc. you can be stuck with some legacy systems, for sure.
It's a multi-tenant platform (about 300 forums, the biggest being around 250K visitors per month). The database is on a vertically scaled box that is now excessive since the traffic reduced, and I was able to delete a few of the Linodes that were horizontally scaling the API and web UI (the web UI is just a client of the API, hence those could be saved too).
I've also noticed that my cache hit rate is extraordinary now, which I assume is because humans read recent stuff and bots read the long-tail of old stuff.
As someone who does targeted scraping of forums, I can say having a good open API and caching is probably the best way to decrease load.
If you use Cloudflare, turn off their anti-bot stuff. It is far more efficient to let them just serve bots from the cache than having scrapers use tricks to bypass them and go directly to your origin server.
I designed most of, and built a chunk of, the WAF and firewall stuff at Cloudflare. That includes wirefilter (a wireshark display filter inspired firewall), and coupled with Cloudflare using maxmind you get to block ASNs in addition to other characteristics of the request.
With that context, I used bgp.he.net to look up the big ones I know and then wrote the rules.
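Outside Cloudflare, the same idea can be approximated by matching client IPs against the prefixes published for an ASN (e.g. copied from its bgp.he.net page). A rough sketch using Python's stdlib; the prefixes below are placeholders, not a complete list for any ASN:

```python
import ipaddress

# Hypothetical prefixes copied from bgp.he.net ASN pages; not exhaustive.
HOSTING_PREFIXES = [ipaddress.ip_network(p) for p in (
    "3.0.0.0/9",     # e.g. an Amazon EC2 range
    "34.64.0.0/10",  # e.g. a Google Cloud range
)]

def from_hosting_asn(ip: str) -> bool:
    """Return True if the IP falls inside a known hosting/cloud prefix."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in HOSTING_PREFIXES)
```

For more than a handful of prefixes you'd want a radix tree or the WAF's native ASN matching rather than a linear scan.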
You can try out our free IP-to-country ASN database[0]. You can grep it for IP addresses by looking up the ASN or AS domain, then extract the address ranges and you should be good to go. [1]
The paid databases come with AS type (hosting, ISP, business, etc.), and we have a VPN detection database as well.
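Assuming a CSV layout along the lines of start_ip,end_ip,country,asn,as_name (the real database's exact columns may differ), extracting the ranges for one ASN might look like:

```python
import csv
import io

# Hypothetical rows in the style of a free IP-to-country/ASN dump.
SAMPLE = """\
3.0.0.0,3.127.255.255,US,AS16509,Amazon.com
8.8.8.0,8.8.8.255,US,AS15169,Google LLC
"""

def ranges_for_asn(db_text: str, asn: str):
    """Return the (start_ip, end_ip) ranges recorded for one ASN."""
    out = []
    for row in csv.reader(io.StringIO(db_text)):
        start, end, _country, row_asn, _name = row
        if row_asn == asn:
            out.append((start, end))
    return out
```

The resulting ranges can then be fed into a firewall or WAF rule list.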
While I perfectly understand, I’m a bit worried about my own web browser (Offpunk), which uses python-requests and is thus very often assumed to be a bot.
The browser has the goal of being light and downloading only the text and pictures (no css, no js). So we have the same goal here.
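For what it's worth, a lightweight client can at least send the headers that naive bot filters check for: an Accept header plus an honest, identifying User-Agent. A sketch with Python's stdlib (the version string and header values are placeholders, not what Offpunk actually sends):

```python
import urllib.request

def build_request(url: str) -> urllib.request.Request:
    """Build a request that identifies the client honestly while
    still sending the Accept header that browsers always send."""
    return urllib.request.Request(url, headers={
        "User-Agent": "Offpunk/2.0 (text browser)",  # placeholder version
        "Accept": "text/html,image/*",               # text and pictures only
    })
```

This won't get past ASN or TLS-fingerprint blocks, but it avoids the cheapest missing-header heuristics.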
Shameless plug: if you do not want to spend the time aggregating all datacenter IP addresses, you can use the IPDetective.io API to easily detect if an IP address comes from a datacenter, VPN, proxy or botnet.