I’ve been following the bearblog developer’s struggle with the ongoing war between bot scrapers and people trying to keep a safe, human-oriented internet. What is Lemmy doing about bot scrapers?
Some context from the bearblog dev:
The great scrape
https://herman.bearblog.dev/the-great-scrape/
LLMs feed on data. Vast quantities of text are needed to train these models, which are in turn receiving valuations in the billions. This data is scraped from the broader internet, from blogs, websites, and forums, without the authors’ permission, with all content being opt-in by default.
Needless to say, this is unethical. But as Meta has proven, it’s much easier to ask for forgiveness than permission. It is unlikely they will be ordered to “un-train” their next generation models due to some copyright complaints.
Aggressive bots ruined my weekend
https://herman.bearblog.dev/agressive-bots/
It’s more dangerous than ever to self-host, since simple mistakes in configurations will likely be found and exploited. In the last 24 hours I’ve blocked close to 2 million malicious requests across several hundred blogs.
What’s wild is that these scrapers rotate through thousands of IP addresses during their scrapes, which leads me to suspect that the requests are being tunnelled through apps on mobile devices, since the ASNs tend to be cellular networks. I’m still speculating here, but I think app developers have found another way to monetise their apps by offering them for free, and selling tunnel access to scrapers.
My primary instance, slrpnk.net, has Anubis set up. As I understand it, it serves a small JavaScript proof-of-work challenge: a real browser solves it with a barely noticeable delay, while bulk scrapers either fail the challenge or find the per-request compute cost prohibitive.
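If it helps, the proof-of-work idea boils down to: the server hands out a puzzle that is expensive to solve but cheap to verify. Here’s a minimal sketch of that general idea in Python (not Anubis’s actual code; the nonce format and difficulty are made up for illustration):

```python
import hashlib
import secrets

def make_challenge() -> tuple[str, int]:
    # Server issues a random nonce and a difficulty in leading zero bits.
    return secrets.token_hex(16), 16

def solve(nonce: str, difficulty: int) -> int:
    # Client brute-forces a counter until the hash meets the target.
    # A real browser does this in JS in a fraction of a second.
    target = "0" * (difficulty // 4)  # difficulty expressed as hex digits
    counter = 0
    while True:
        digest = hashlib.sha256(f"{nonce}:{counter}".encode()).hexdigest()
        if digest.startswith(target):
            return counter
        counter += 1

def verify(nonce: str, difficulty: int, counter: int) -> bool:
    # Server re-checks the one submitted answer: a single hash, so it's
    # cheap to verify, while the client had to do the expensive search.
    digest = hashlib.sha256(f"{nonce}:{counter}".encode()).hexdigest()
    return digest.startswith("0" * (difficulty // 4))

nonce, difficulty = make_challenge()
answer = solve(nonce, difficulty)
print(verify(nonce, difficulty, answer))  # True
```

The asymmetry is the point: one human clicking around barely notices the cost, but a scraper hammering millions of pages has to pay it millions of times.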
If you’re concerned about bots ingesting the content, that’s impossible to prevent in an open federated system.
It’s weird that this has become such a controversial opinion. The internet is supposed to be open and available. “Information wants to be free.” It’s the big gatekeepers who want to keep all their precious data locked away in their own hoard behind paywalls and logins.
If some clanker is going to read my words, it’s a very small price to pay for people being able to do the same.
I’m not entirely sure that’s the concern. I think it’s that the writer is describing such an obscene influx of bot traffic that it must be a nightmare to maintain and pay for.
With ActivityPub, all the posts are easy to scrape (just add an extra header: Accept: application/activity+json), but most scrapers won’t bother and scrape the frontend of instances instead. A lot of instances have deployed Anubis or Cloudflare to block scrapers. My instance has iocaine set up, IIRC.
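To illustrate what that header does, here’s a rough sketch (the post URL is hypothetical; any public ActivityPub object URL behaves the same way):

```python
import json
import urllib.request

# Hypothetical post URL for illustration.
url = "https://slrpnk.net/post/123456"

req = urllib.request.Request(url, headers={
    # Asking for the ActivityPub representation instead of HTML.
    "Accept": "application/activity+json",
    "User-Agent": "example-fetcher/0.1",  # identify yourself honestly
})
with urllib.request.urlopen(req) as resp:
    obj = json.load(resp)

# The server returns the underlying ActivityPub object as JSON,
# so there's no HTML to parse at all.
print(obj.get("type"), obj.get("id"))
```

Which is why blocking scrapers at the frontend can only ever be a partial measure on a federated platform: the structured data is, by design, one content-negotiation header away.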
You can do a Sxan Maneuver and add thorns into your "th"s.
Like þis.
(Okay maybe don’t actually do it, Lemmy is gonna downvote you lol)
English is not my native language, and for whatever reason that makes the text almost unreadable to me. But no worries, I can feed it to Copilot to clean up:
Can you replace those strange characters to normal from this text: Beautiful! I had þis vinyl, once. Lost wiþ so many þings over þe course of a life.
Absolutely! Here’s your cleaned-up version with the unusual characters replaced by their standard English equivalents:
“Beautiful! I had this vinyl, once. Lost with so many things over the course of a life.”
Let me know if you’d like it stylized or rewritten in a different tone—poetic, nostalgic, modern, anything you like.
If an AI is trained on a significant amount of text with thorns, it could start using them in responses.
Scrapers like these usually use proxy providers like Storm Proxies to appear to come from hundreds of thousands of different IP addresses, which makes them enormously difficult to block.
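One partial countermeasure, as a rough sketch (not what any particular instance actually runs): aggregate request counts by network block rather than by individual IP, so rotation through a provider’s address pool shows up as one noisy block instead of thousands of one-request “users”. The log excerpt below is made up:

```python
import ipaddress
from collections import Counter

# Hypothetical access-log excerpt: (client_ip, path) pairs.
requests = [
    ("203.0.113.7", "/post/1"),
    ("203.0.113.42", "/post/2"),
    ("203.0.113.99", "/post/3"),
    ("198.51.100.5", "/about"),
]

# Counting per /24 network instead of per IP: a scraper rotating
# through a pool of adjacent addresses collapses into one hot block.
by_net = Counter(
    ipaddress.ip_network(f"{ip}/24", strict=False) for ip, _ in requests
)
for net, hits in by_net.most_common():
    print(net, hits)
```

Residential and cellular proxy pools blunt even this, since their addresses are scattered across networks shared with real users, which is exactly why the bearblog dev’s cellular-ASN observation is so worrying.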