how do i deal with ai scraper bots overloading a wiki? i'm currently using cloudflare to filter them, but i'd prefer not to use that if possible
step 0 is blocking the netblocks of alibaba, aws, facebook, and microsoft. this is like the "wipe it down with a bleach wipe" step of cleaning. those networks offer you very little value (mostly from users using shitty VPNs) while being a huge source of garbage.
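the netblock filtering can be sketched in a few lines. this is a hedged example: the CIDR ranges below are placeholders (pull the real ones from each provider's published IP range lists, or from BGP data for their ASNs), and how you enforce the block (firewall, web server config, etc.) is up to you — this just shows the membership check.

```python
import ipaddress

# placeholder netblocks only -- substitute the providers' actual
# published ranges; these are illustrative samples
BLOCKED_NETS = [ipaddress.ip_network(n) for n in (
    "47.74.0.0/15",  # sample Alibaba Cloud range
    "3.0.0.0/9",     # sample AWS range
)]

def is_blocked(ip: str) -> bool:
    """Return True if the client IP falls inside a blocked netblock."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETS)

print(is_blocked("3.5.1.9"))      # inside the sample AWS range -> True
print(is_blocked("203.0.113.7"))  # documentation range -> False
```

in practice you'd generate a deny list for your firewall or web server from the same data rather than checking per-request in application code.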
next, look at your server logs. there are a few different types of scrapers you might be dealing with:
* bots that hit recent changes linked, user contributions, what links here: these are explicitly ddos bots.
* bots that hit every url (frequently including printable=yes): these are the shitty scraper bots. very reminiscent of early crawlers that ignored robots.txt, browser hints, and common sense. the kind that will download all 1000 years of a calendar's individual days if you let it.
* bots that just hit article / file pages: a nicer scraper or "download the entire page and all links" tool. one of the tells of these on tcrf is loading image pages for stuff like the stub icon or feature icons: almost nobody legitimately does that, but a bot will get them all in a row.
to deal with the problem, there are a few options:
* use anubis or a similar anti-bot "captcha" measure
* install the CrawlerProtection extension, which blocks access to some special pages, mainly the most abused ones: https://www.mediawiki.org/wiki/Extension:CrawlerProtection
* install the BlockAI extension, which is similar to what i do for tcrf in terms of minor analysis and locking down everything: https://www.mediawiki.org/wiki/Extension:BlockAI
a lot of it will come down to your specific threat profile. tcrf has faced a lot of abuse, from normal scraper bots, to intentionally-targeted ddos waves. for most wikis, restricting the worst special pages is probably enough.
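"restricting the worst special pages" boils down to a simple gate, in the spirit of what CrawlerProtection does: anonymous clients get refused on the expensive special pages while normal article views pass through. this is an illustrative sketch, not the extension's actual logic — the page list and the session-cookie check are my assumptions:

```python
# special pages worth gating for anonymous traffic (illustrative list)
EXPENSIVE = ("RecentChangesLinked", "Contributions", "WhatLinksHere")

def allow_request(path: str, has_session_cookie: bool) -> bool:
    """Deny expensive special pages to clients without a session."""
    if any(f"Special:{p}" in path for p in EXPENSIVE):
        return has_session_cookie  # logged-in users keep full access
    return True  # everything else is unaffected
```

logged-in users are unaffected, so the collateral damage for real readers is close to zero.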