[Feature]: Add poisoned/fake pages to disrupt/track AI scraping #336

Open · 1 task done · jonbarrow opened this issue Sep 7, 2024 · 2 comments
Labels: awaiting-approval (Topic has not been approved or denied), feature (A feature request)

Comments

jonbarrow (Member) commented Sep 7, 2024

Checked Existing

  • I have checked the repository for duplicate issues.

What feature do you want to see added?

Add poisoned pages to the site that disrupt the scraping done by AI companies

Why do you want to have this feature?

AI companies are scraping the web at an ever-increasing rate, taking user information, content, etc., without the consent of users or site owners, while also eating into our bandwidth. While we are not the biggest community/site on the planet, we do have quite a number of users and we create tons of user-generated content. This content is becoming more and more accessible via our website, which could make it a target for this kind of scraping.

By adding fake/poisoned pages to our site we can try to accomplish 2 things:

  • Track which requests are coming from automated scrapers (if there are requests to pages which are not publicly available, we can be reasonably certain they're from a bot). This helps with more accurate visit tracking and could also allow us to block requests from these sources
  • Attempt to disrupt this scraping by giving the bot bad data (see the sketch after this list)

While this won't topple OpenAI or bring Facebook to its knees, it could at least help keep our users' data from being used like this.
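To make this concrete, here is a minimal sketch of the honeypot idea from the first bullet, assuming a Node/Express server in TypeScript. The route path, the in-memory blocklist, and the response body are all hypothetical placeholders, not our actual stack:

```typescript
import express from "express";

const app = express();
const flaggedIPs = new Set<string>(); // hypothetical in-memory blocklist; persist this in practice

// Refuse further requests from any client that previously hit the honeypot.
app.use((req, res, next) => {
  if (req.ip && flaggedIPs.has(req.ip)) {
    res.status(403).send("Forbidden");
    return;
  }
  next();
});

// Honeypot: this page is never linked anywhere a human visitor would see,
// so any request to it is reasonably assumed to come from an automated scraper.
app.get("/internal/archive-2019", (req, res) => { // hypothetical hidden path
  if (req.ip) flaggedIPs.add(req.ip);
  console.log(`Scraper hit: ${req.ip} ${req.headers["user-agent"]}`);
  // Serve plausible-looking but worthless markup to poison the scraped dataset.
  res.send("<html><body><p>Lorem ipsum dolor sit amet…</p></body></html>");
});

app.listen(3000);
```

The same route doubles as the tracking signal (the log line) and the disruption (the junk payload); blocking then becomes a side effect of having been caught once.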

Any other details to share? (OPTIONAL)

The idea comes from this user on Twitter https://twitter.com/Sync1211/status/1831825065937400253 who says, quote:

That's why my site hosts un-indexed pages of poisoned data.

Mostly images edited via Nightshade and archived Reddit posts with the comment section of a completely different post.

If we decide to go through with this idea, this may be a good place to start.
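As a rough illustration of the mismatched-content trick from the quoted tweet, a poisoned page could render one post's body with the comment thread of a completely different post. This is only a sketch; the `Post` and `Comment` shapes and the HTML are hypothetical stand-ins for whatever our archive actually stores:

```typescript
interface Post { id: string; title: string; body: string; }
interface Comment { postId: string; text: string; }

// Render a post together with comments that deliberately belong to other posts,
// so any scraped (post, comments) pair is internally inconsistent.
function renderPoisonedPage(posts: Post[], comments: Comment[], postId: string): string {
  const post = posts.find((p) => p.id === postId);
  if (!post) throw new Error(`unknown post ${postId}`);
  const wrongComments = comments.filter((c) => c.postId !== postId);
  const thread = wrongComments.map((c) => `<li>${c.text}</li>`).join("");
  return `<h1>${post.title}</h1><p>${post.body}</p><ul>${thread}</ul>`;
}
```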

jonbarrow added the awaiting-approval and feature labels on Sep 7, 2024
MatthewL246 (Member) commented:
In case you weren't aware, Cloudflare has a setting to block AI scrapers. While it doesn't accomplish all of the goals presented here, it's a quick one-click method and might be worth turning on if AI scraping is something you're concerned about.

jonbarrow (Member, Author) commented:

> In case you weren't aware, Cloudflare has a setting to block AI scrapers. While it doesn't accomplish all of the goals presented here, it's a quick one-click method and might be worth turning on if AI scraping is something you're concerned about.

I was not aware of this; I'll look into it. Though, as mentioned, it doesn't achieve all of the goals.
