[Feature]: Add poisoned/fake pages to disrupt/track AI scraping #336

Open · 1 task done · jonbarrow opened this issue Sep 7, 2024 · 2 comments
Labels: awaiting-approval (Topic has not been approved or denied), feature (A feature request)

Comments

jonbarrow (Member) commented Sep 7, 2024

Checked Existing

  • I have checked the repository for duplicate issues.

What feature do you want to see added?

Add poisoned pages to the site that disrupt the scraping done by AI companies

Why do you want to have this feature?

AI companies are scraping the web at an ever-increasing rate, taking user information, content, etc., without the consent of users or site owners, while also eating into our bandwidth. While we are not the biggest community/site on the planet, we do have quite a number of users and we create tons of user-generated content. This content is becoming more and more accessible via our website, which could make it a target for this kind of scraping.

By adding fake/poisoned pages to our site we can try to accomplish 2 things:

  • Track which requests are coming from automated scrapers (if there are requests to pages which are not publicly available, we can be reasonably certain they're from a bot). This helps with more accurate visit tracking and could also allow us to block requests from these sources
  • Attempt to disrupt this scraping by giving the bot bad data (see the sketch after this list)

While this won't topple OpenAI or bring Facebook to its knees, it could at least help keep our users' data from being used like this.
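To make this concrete, here is a minimal sketch of the honeypot idea from the first bullet, assuming a Node/Express server in TypeScript. The route path, the in-memory blocklist, and the response body are all hypothetical placeholders, not our actual stack:

```typescript
import express from "express";

const app = express();
const flaggedIPs = new Set<string>(); // hypothetical in-memory blocklist; persist this in practice

// Refuse further requests from any client that previously hit the honeypot.
app.use((req, res, next) => {
  if (req.ip && flaggedIPs.has(req.ip)) {
    res.status(403).send("Forbidden");
    return;
  }
  next();
});

// Honeypot: this page is never linked anywhere a human visitor would see,
// so any request to it is reasonably assumed to come from an automated scraper.
app.get("/internal/archive-2019", (req, res) => { // hypothetical hidden path
  if (req.ip) flaggedIPs.add(req.ip);
  console.log(`Scraper hit: ${req.ip} ${req.headers["user-agent"]}`);
  // Serve plausible-looking but worthless markup to poison the scraped dataset.
  res.send("<html><body><p>Lorem ipsum dolor sit amet…</p></body></html>");
});

app.listen(3000);
```

The same route doubles as the tracking signal (the log line) and the disruption (the junk payload); blocking then becomes a side effect of having been caught once.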

Any other details to share? (OPTIONAL)

The idea comes from this user on Twitter https://twitter.com/Sync1211/status/1831825065937400253 who says, quote:

That's why my site hosts un-indexed pages of poisoned data.

Mostly images edited via Nightshade and archived Reddit posts with the comment section of a completely different post.

If we decide to go through with this idea, this may be a good place to start.
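As a rough illustration of the mismatched-content trick from the quoted tweet, a poisoned page could render one post's body with the comment thread of a completely different post. This is only a sketch; the `Post` and `Comment` shapes and the HTML are hypothetical stand-ins for whatever our archive actually stores:

```typescript
interface Post { id: string; title: string; body: string; }
interface Comment { postId: string; text: string; }

// Render a post together with comments that deliberately belong to other posts,
// so any scraped (post, comments) pair is internally inconsistent.
function renderPoisonedPage(posts: Post[], comments: Comment[], postId: string): string {
  const post = posts.find((p) => p.id === postId);
  if (!post) throw new Error(`unknown post ${postId}`);
  const wrongComments = comments.filter((c) => c.postId !== postId);
  const thread = wrongComments.map((c) => `<li>${c.text}</li>`).join("");
  return `<h1>${post.title}</h1><p>${post.body}</p><ul>${thread}</ul>`;
}
```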

jonbarrow added the awaiting-approval and feature labels on Sep 7, 2024
MatthewL246 (Member) commented:
In case you weren't aware, Cloudflare has a setting to block AI scrapers. While it doesn't accomplish all of the goals presented here, it's a quick one-click method and might be worth turning on if AI scraping is something you're concerned about.

jonbarrow (Member, Author) commented:

> In case you weren't aware, Cloudflare has a setting to block AI scrapers. While it doesn't accomplish all of the goals presented here, it's a quick one-click method and might be worth turning on if AI scraping is something you're concerned about.

I was not aware of this; I'll look into it. Though, as mentioned, it doesn't achieve all of the goals.
