I run some forums, some of which are quite large. Recently the big increase in scraping by the search engines (Bing has had the greatest increase) caused me to question why.
It used to be that the cost of scraping came with the benefit of being search engine listed which drove traffic, but that feels less true than it used to (for a lot of reasons).
But now the cost of scraping no longer feels like it's in a website's favour.
Scraping and bots exist for search engine listings, technology tests/experiments, advert/audience measurement, brand protection, IP tracking, copyright enforcement, screenshots for links on other websites (e.g. Facebook), Pinterest linkbacks, training of LLMs (my hypothesis for Bing's massive increase), spam, etc, etc.
With the search engine value lowered by reduced traffic, yet a solid community still growing via word of mouth... the rest of those things offer no value to me or the community. So I asked the community: what do you want to do here? Leave them all? Ban some? Ban all? Some midway thing?
Almost unanimously the community (who fund the costs by donations; at least 30% of all traffic and costs were known to be associated with bots) chose to block every bot.
So that's what we've done.
We've blocked every major hosting and cloud ASN, or put a challenge up for the few known to be proxies (e.g. Google Data Saver). We've blocked hundreds of bot user agents, blocked requests where no Accept header was present where one should be, and blocked TLS cipher suites that modern web browsers don't use. I looked at requests made by Python, Go, curl, Wget, etc., and blocked everything that obviously differed from a valid browser.
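A minimal sketch of that style of request filtering, assuming a simple header-dict interface (the function name and UA substring list are illustrative, not our actual ruleset):

```python
# Hypothetical sketch: flag requests that obviously differ from a browser.
BOT_UA_SUBSTRINGS = ("python-requests", "go-http-client", "curl", "wget", "scrapy")

def looks_like_bot(headers: dict) -> bool:
    """Return True if the request headers look like a non-browser client."""
    ua = headers.get("User-Agent", "").lower()
    # Known scripting/HTTP-library user agents.
    if any(s in ua for s in BOT_UA_SUBSTRINGS):
        return True
    # Real browsers always send an Accept header on page requests.
    if "Accept" not in headers:
        return True
    return False
```

In production this would sit alongside the ASN and TLS-fingerprint rules rather than replace them; header checks alone are easy to spoof.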
In the end we blocked about 40% of our traffic, and so far not a single real human has said (and it's a tight-knit but large community with lots of ways of contacting me) that they've had any issue at all.
We appear to have reduced our traffic and associated costs, with no loss to us at all.
About a year ago I noticed one of my websites going down roughly weekly for an hour or so. The site had one endpoint available in a few hundred thousand versions, meant for users; for the bots it was just a few thousand variants. It was set up in the sitemap and in robots.txt, and included the right meta tags. It was meant to update every few months.
But, well, not with the Bing bot. It ignored my timeouts and queried hundreds of thousands of pages (identical, from its point of view) every single week. Not one connection, not two or three, but about 10 IPs hammering my servers at once. No pause between requests, not even when the server was going down, which even 'bad bots' usually do.
I assumed it was just any bot calling itself Bing. But no, it was their IP ranges.
I blocked nearly all of their IPs, which appears to be the only way to make sure it doesn't DDoS me again. Bing is like 1% of my traffic, not even worth the hassle.
Yeah, Bing has gone completely nuts the last six months or so. They'll happily send the equivalent of a small DDoS at sites hosted on shared hosting, knocking them completely offline for a while. Nuts.
Could have been within the last 6 months for me too. And yes, it's crazy: most of my sites have between 8 and 16 concurrent database connections available. In the real world this works for thousands of daily users, but for the Bing bot it's simply not enough.
Why aren't you caching your forum as static pages for users who aren't logged in at the very least? E.g., rebuild the cache every x time as a cron task, but even then, every page load shouldn't be incurring database overhead if someone isn't even logged in. Equally, you can force a rebuild of cache for x relevant pages when someone posts a new thread or comment.
Otherwise, if someone alt+clicks a bunch of a category's threads as they look interesting then you're going to have a bad time.
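The rebuild-on-write caching scheme described above can be sketched roughly like this (class and parameter names are hypothetical, and a real forum would back this with files or a CDN rather than an in-process dict):

```python
import time

class PageCache:
    """Tiny TTL cache: serve prebuilt HTML to logged-out users and
    rebuild a page only when its entry expires or is invalidated."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.store = {}  # path -> (rendered_html, expires_at)

    def get(self, path, render):
        entry = self.store.get(path)
        now = time.time()
        if entry and entry[1] > now:
            return entry[0]              # cache hit: zero database work
        html = render(path)              # cache miss: one rebuild
        self.store[path] = (html, now + self.ttl)
        return html

    def invalidate(self, path):
        # Call this when someone posts a new thread or comment on `path`,
        # forcing a rebuild on the next request.
        self.store.pop(path, None)
```

The point is that anonymous page views, including a burst of alt+clicked tabs, mostly hit the cache instead of the database.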
I am not the forum guy, and my content is not really static. It's cached, but for this specific sub-page I need at least one small query. Only a few requests rebuild the page, but if every 10th request triggers a rebuild and 9 other bots are hammering mostly cached pages, I still have the same issue.
The bot was never meant to query the same pages thousands of times; the pages were identical from its perspective. There were already bot-specific rules programmed in.
This website is already heavily cached and optimized. Even though it only has 16 database connections, there's maybe one timeout every few months. Users usually don't open tabs much faster than the short requests take; only Bing does.
Honestly, given the time that went into optimizing it, it makes me kind of sad when someone frames this as an effort-vs-payoff question. The bot ignored all the rules I gave it and brought barely any traffic in return. There is no payoff here, only effort.
This happened to me every week or so, not entirely predictable, on a small website that supports only a handful of users. Being an enterprise site with some corporate backing it's on quite a decent server, but it was no match. The DDoS would leave things in a semi-recovered state at best.
And you'll abandon those forums one day, and a month after, I'll search for some niche problem, someone will link to that forum, and there will be no archive(.org) page, no Google cached page, nothing to get the answer from.
I've only ever shuttered one forum, and that was at the behest of the community itself. A decade in and it had become a toxic place, everyone agreed that they wanted to give multiple new things a try and that the place in question should be deleted. Not archived, not available forever, but deleted and nothing kept of it. I obliged.
When the day comes that I shutter another I'll ask the active members at the time what they want to happen to their data. They may desire to leave it as a resource, they may want to delete it, if there's a clear majority in the decision I'll go with whatever they desire. I value the choice of those whose data it is, who contributed to creating it, over anyone else's hypothetical needs.
> When the day comes that I shutter another I'll ask the active members at the time what they want to happen to their data.
You might not have that chance; unless you have a co-admin with full access to everything, the reason for the forum to shut down might be because you're no longer there.
And? In this hypothetical situation where they "aren't around", I don't think that people searching for answers to tech support issues are high on their list of concerns, and almost certainly not on those of the forum members either.
"And" the availability of information is important.
Respectfully, on most forums, I don't care about the community, I care about the content, that's why I'm there, to have discourse and generate meaningful value in the form of knowledge. If someone passes, yes, that sucks, but that's life, we're all snuffing it at some point. However, the world carries on spinning, and that information should continue to be available, especially if the forum is for a niche and frequently generates useful information.
If a forum is becoming "toxic" then that sounds like a moderation problem.
It seems like these kinds of forums are not exactly your area of interest if they're community-focused. Information gets snuffed out all the time; with every death of a person we lose a large piece of information. But hoarding, and especially expecting others to hoard or assist hoarding, is not the correct approach to what is essentially a _you_ problem, so grab that terabyte disk and make a mirror yourself if you are so inclined. Nobody is or should be required to let corpo behemoths in for your convenience and to comply with your questionable opinions.
I'm part of several digital archivism projects. My personal disk array is 54TB of data. That's without even getting into 1PB+ of data on LTO carts.
Last time I checked, Archive.org et al weren't a "corpo behemoth", but consuming server resources is exactly what a normal user does.
Site owners should get with the times and serve up cached static pages to users who aren't logged in. Even then, they should be serving up cached static pages and rebuilding cache for relevant pages when someone posts new content when it comes to forums. Not being able to handle a few crawlers is an administration problem. Why should the community/public suffer for someone's inability to configure a server appropriately?
We live in primarily free societies where individuals have the right to decide upon their own actions. Telling people that there is only one "correct" way of doing things is obnoxious and toxic, and reflects an inability to see your opinion for what it is: an opinion.
My opinion is that "screw crawlers and scrapers" is a valid opinion. If I'm hosting a playground, it's my playground and my rules. If you want to play elsewhere, please do. If you want to preserve data, please do, but not at my expense.
Disagree with that? Feel free to, but don't think that you are somehow in the right, because if you go to court with this, you will be laughed out of the door.
You have no inherent right to other people's data, regardless of how they shared it or the visibility of it at the time they shared it. You are not owed the sum of human knowledge.
If people wished for their content to be available to all forever they'd run a blog and pay to ensure it is available, and would proactively seek to get it archived.
People on forums aren't doing that, and the data of any given individual is a contextless collection of semi-random mumblings on different topics because without the fullness of a conversation involving others none of it makes sense.
It is within that context that a forum admin can decide what to do: they have been granted the right (by the T&Cs) to the collection of all the forum members' comments, which restores the context and gives meaning to the content. Every individual on the forums I operate can obtain their own data, but it would be meaningless by itself.
As the operator of the collection of content, I get to determine what is best to do with it, and sometimes that may be to delete it all. Sometimes that may be to seek to archive it. And on this occasion it is to treat this knowledge as valuable to those already participating in the community, and not to share it beyond that.
Elsewhere you said this:
> Call it what most forums are: an ad-supported business. People generate content for the owner for free because they too derive value from the information that others share. The middleman is just a middleman
But the 300+ forums I run have no adverts, they are not a business, they are non-profit. Their value (if you want to measure everything in a capitalist way) is social, to help those in the community.
The purpose of the forums I run isn't to expand the sum of human knowledge, or to make myself personally wealthy off the back of the efforts of others. The purpose is to be a remedy for adult loneliness by connecting people through shared interests in geographically small areas, such that it builds relationships and forms bonds.
Yes there is a hell of a lot of expertise captured here around those interests... but no-one has any inherent right to it.
This tangent is in relation to my shuttering one forum.
That forum was around a music band in the UK, and the audience of the forum turned out to be younger than expected: university age. They were emotionally immature, over-shared online, slept with each other, had relationships and break-ups... all in public. The music forum did have lots of music info on it, but it was intertwined with a lot of very highly personal information, posted at a time when a reasonable expectation of the internet was ephemerality.
It was totally right to protect those individuals' future selves from their past selves, and I would delete it again.
There are certainly downsides to hoarding data. At the very least, information takes up space. It also tends to suck up mental bandwidth: you have to keep organizing, de-duplicating, and migrating to newer formats. It's much easier to just delete it, just like it's much easier to throw out old ratty t-shirts. IMO, data hoarding is just as much of a mental disorder as hoarding physical stuff.
This idea that all information must be preserved forever is also at odds with privacy. See, e.g., the right to be forgotten.
I think that the reason many people don't put much effort into archiving information is a cultural one. Most people simply haven't given much thought to the question of the fate of information or knowledge they happen to find, and the importance of preserving that knowledge for the health of society's discourse.
Why are forum admins beholden to archive their data in perpetuity in case someone wants free advice or knowledge?
Do you maintain a freely-available repository of all of your knowledge and experience, in case someone else wants to consult it one day?
While the openness of the (now-ending) early days of the internet was liberating and allowed knowledge sharing on an unprecedented scale, the downside is the huge devaluing of that knowledge and skills.
I do actually, but it's up to each person. The main reason for me to encourage it is that if knowledge is reserved for the high priests, it will eventually be lost. How many civilizations have we built by now? No one knows! We don't have the records. Think of all the stuff people must have figured out. Of course many would pretend it wasn't a big deal, but all those deleted forums had plenty of insights to offer, both practical and historically valuable.
The real value of knowledge doesn't change if you duplicate it or make it widely available. In the long term, blocking access and rent-seeking doesn't create value, it destroys it. It seems useful for the individual who wants to pay their bills, or for the one with insatiable greed, but in the end it will make us stupid.
For example: I would like a high-quality UV-B lamp that isn't insanely expensive. They are pretty ordinary lamps, but developing the coating is very expensive. The work has been done though, lots of times, over and over again. Most results are just bad.
About 35% of the US population, and about 1 billion people globally, have a vitamin D deficiency, and 50% have an insufficiency: fatigue, poor sleep, bone pain or achiness, depression or feelings of sadness, hair loss, muscle weakness, loss of appetite, getting sick more easily, etc.
Great loss of economic productivity or more opportunity for me? You decide!
If I contribute time in answering questions or solving problems, like with mailing lists still being available to view, something that I intentionally put into the public domain with the intent of helping people should remain available. Just because a forum exists as a business to someone doesn't mean that the content has no value to the general public. The forum itself has no value; only the content has value, which is what draws in the traffic to make money in the first place.
Call it what most forums are: an ad-supported business. People generate content for the owner for free because they too derive value from the information that others share. The middleman is just a middleman.
To not allow that content to be indexed/cached/archived/mirrored whilst making money off of it is pretty scummy in the long-term. There's tons of forums I used to visit whose information is now forever lost, that included a lot of very useful programs for niche bits of kit, which is now otherwise very expensive e-waste.
> Why are forum admins beholden to archive their data in perpetuity in case someone wants free advice or knowledge?
Because otherwise their work was wasted.
> Do you maintain a freely-available repository of all of your knowledge and experience, in case someone else wants to consult it one day?
I would if I could, I’ve already contributed what knowledge, bandwidth, and money I can to the Internet Archive. What about you?
> While the openness of the (now-ending) early days of the internet was liberating and allowed knowledge sharing on an unprecedented scale, the downside is the huge devaluing of that knowledge and skills.
I cannot even process how wrong this is. Objectively the preservation of knowledge and skills is a good thing, and you cannot devalue knowledge, which is itself priceless.
This argument really makes no sense. If I tell Bob how to fix his transmission down at the local diner, but nobody records the conversation, that wasn't wasted work. Bob fixed his transmission: mission accomplished.
So this data will not be lost forever? Also, do you mean that all data and all posts made by users should belong only to admins, and only admins should decide what to do with it?
They're not, but blocking all bots also blocks others that want to archive all that data forever, be it a private person using wget or a service like archive.org.
Why would I? It's online, I know where to find it... until it's gone from there. Also, that would mean I'd have to archive it before I actually needed it archived. And archiving would have to be done manually. And after it's gone, the only proof of it existing is a text somewhere else saying that the solution to my problem is here -> LINK, and the link is dead, the data gone. Not even on archive.org.
Have we really come to a phase of internet use where every time you see something you have to manually save it, and where on every post (even here, or on Reddit, Facebook or wherever) a link is not good enough, but you have to copy-paste the whole block of text just to make it a bit future-proof?
And there's the perpetual tale of the forum post that says "this has been asked before, use the search", where the first search result is that same person saying to use the search.
Bizarrely, I can't remember the last time DuckDuckGo (basically Bing) gave me a forum as a search result, though it used to do so regularly. Maybe it's the admins blocking crawlers, but it feels more like a conscious decision.
I’ve been wondering how much of it comes down to optimizing for ad impressions. If you search, get a result, and it answers your question they sell one page of keyword ads. If you go back and forth a dozen times, they sell a dozen times as many impressions.
Given that my usual behavior is to check two pages and then add !g, where I check two more pages and decide I don't need more info, I don't think that's a strong move.
I’m not saying it’s smart, just that I could easily imagine someone chasing the wrong metric or trying to balance revenue against the likelihood that you’ll stop using them. For example, in your scenario that’s still twice as many impressions, so unless you make Google your primary, maybe that’s a win.
I am definitely happy you asked your users what they thought and made your decision accordingly. But saying "no human complained" might not be a good metric if people use Google or whatever to discover your site or its info. People don't complain about things they don't know exist.
If you aren’t doing so already, I highly suggest working alongside the Internet Archive to preserve the information on your forums. One day they will close down, and your users will want to see their posts, refer to now broken bookmarks, and generally access the information.
How do you have costs that are directly attributable to scraping? Unless you are using a serverless platform that bills per request, or your pages are large enough that egress bandwidth gets expensive, I’m not convinced most sites would save much by doing this.
I'm not really sure how you arrive at the conclusion that only serverless platforms incur per-request costs. It's not just 40% of CPU or egress; it's 40% of database load, 40% of logging, 40% of APM/instrumentation.
40% is 40%. Maybe 40% of their cost isn't enough to warrant whatever time these efforts cost them, but for many people out there it will be.
Sounds like the original commenter had a reasonable case, but I just don’t think it’s likely to save anything for small sites on traditional stacks.
If you are running on, e.g., EC2 and RDS instances, you’re not saving anything by using 40% less of the CPU unless you can actually downsize the instance as a result. Read-only traffic is also not that hard to scale out, but with forums etc. you can be stuck with some legacy systems, for sure.
It's a multi-tenant platform (about 300 forums, the biggest being around 250K visitors per month). The database is on a vertically scaled box that is now excessive since the traffic reduced, and I was able to delete a few of the Linodes that were horizontally scaling the API and web UI (the web UI is just a client of the API, hence those could be saved too).
I've also noticed that my cache hit rate is extraordinary now, which I assume is because humans read recent stuff and bots read the long-tail of old stuff.
As someone who does targeted scraping of forums, I can say having a good open API and caching is probably the best way to decrease load.
If you use Cloudflare, turn off their anti-bot stuff. It is far more efficient to let them just serve bots from the cache than having scrapers use tricks to bypass them and go directly to your origin server.
I designed most of, and built a chunk of, the WAF and firewall stuff at Cloudflare. That includes wirefilter (a wireshark display filter inspired firewall), and coupled with Cloudflare using maxmind you get to block ASNs in addition to other characteristics of the request.
With that context, I used bgp.he.net to look up the big ones I know and then wrote the rules.
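Outside Cloudflare, the same idea can be approximated by matching client IPs against the prefixes published for an ASN (e.g. copied from its bgp.he.net page). A rough sketch using Python's stdlib; the prefixes below are placeholders, not a complete list for any ASN:

```python
import ipaddress

# Hypothetical prefixes copied from bgp.he.net ASN pages; not exhaustive.
HOSTING_PREFIXES = [ipaddress.ip_network(p) for p in (
    "3.0.0.0/9",     # e.g. an Amazon EC2 range
    "34.64.0.0/10",  # e.g. a Google Cloud range
)]

def from_hosting_asn(ip: str) -> bool:
    """Return True if the IP falls inside a known hosting/cloud prefix."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in HOSTING_PREFIXES)
```

For more than a handful of prefixes you'd want a radix tree or the WAF's native ASN matching rather than a linear scan.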
You can try out our free IP-to-country ASN database[0]. You can grep it for IP addresses by looking up the ASN or AS domain, then extract the address ranges and you should be good to go. [1]
The paid databases come with AS type (hosting, ISP, business, etc.), and we have a VPN detection database as well.
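Assuming a CSV layout along the lines of start_ip,end_ip,country,asn,as_name (the real database's exact columns may differ), extracting the ranges for one ASN might look like:

```python
import csv
import io

# Hypothetical rows in the style of a free IP-to-country/ASN dump.
SAMPLE = """\
3.0.0.0,3.127.255.255,US,AS16509,Amazon.com
8.8.8.0,8.8.8.255,US,AS15169,Google LLC
"""

def ranges_for_asn(db_text: str, asn: str):
    """Return the (start_ip, end_ip) ranges recorded for one ASN."""
    out = []
    for row in csv.reader(io.StringIO(db_text)):
        start, end, _country, row_asn, _name = row
        if row_asn == asn:
            out.append((start, end))
    return out
```

The resulting ranges can then be fed into a firewall or WAF rule list.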
While I perfectly understand, I’m a bit worried about my own web browser (Offpunk), which uses python-requests and is thus very often assumed to be a bot.
The browser has the goal of being light and downloading only the text and pictures (no css, no js). So we have the same goal here.
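For what it's worth, a lightweight client can at least send the headers that naive bot filters check for: an Accept header plus an honest, identifying User-Agent. A sketch with Python's stdlib (the version string and header values are placeholders, not what Offpunk actually sends):

```python
import urllib.request

def build_request(url: str) -> urllib.request.Request:
    """Build a request that identifies the client honestly while
    still sending the Accept header that browsers always send."""
    return urllib.request.Request(url, headers={
        "User-Agent": "Offpunk/2.0 (text browser)",  # placeholder version
        "Accept": "text/html,image/*",               # text and pictures only
    })
```

This won't get past ASN or TLS-fingerprint blocks, but it avoids the cheapest missing-header heuristics.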
Shameless plug: if you do not want to spend the time aggregating all datacenter IP addresses, you can use the IPDetective.io API to easily detect if an IP address comes from a datacenter, VPN, proxy or botnet.