Have you ever wondered if that blog post you wrote years ago, or even a casual chat with an AI assistant, is quietly helping train the next big model? I dug in to get to the root of the issue.
The quiet harvest
Large language models are built on enormous piles of data—text, images, code, forum posts, news articles, basically anything publicly available online. Companies scrape the web at scale because that’s how these systems learn patterns and language. Most of us never explicitly agreed to have our content used this way; it just happened as the default. Lately, though, pushback has grown. Tools are popping up that let individuals check if their work appears in common training datasets and request removal. At the same time, lawsuits are shining a light on the other side: data you actively share with AI services (like prompts in ChatGPT) isn’t always as sealed-off as the privacy policies suggest.
A recent example comes from the ongoing New York Times copyright lawsuit against OpenAI. In January 2026, a federal judge ordered OpenAI to turn over around 20 million anonymized conversation logs—user prompts and model responses—to the plaintiffs’ lawyers during discovery. The data is de-identified, meaning no names attached, but it still means millions of real interactions are now in the hands of opposing counsel, protected by court orders but no longer solely with OpenAI.
Why data flows so freely
A few reasons this keeps happening:
- Scale demands it. Training cutting-edge models still relies heavily on broad web crawls, and until recently there was no standardized “do not use” signal beyond robots.txt (which many crawlers ignored).
- Legal gray zones. Copyright law hasn’t fully caught up, so companies argue fair use for training while publishers and creators fight back—leading to discovery phases where internal data gets exposed.
- Service design. When you chat with a consumer AI tool, your inputs can be used to improve the system unless you opt out (and even then, logs are typically retained for a while for safety and abuse monitoring). Lawsuits can override normal retention rules, forcing preservation and sharing under controlled conditions.
It’s like handing your notebook to a helpful librarian who promises to keep it private—except if someone sues the library, parts of it might end up reviewed by the other side’s attorneys.
The fix
Good news: you have more control than you might think. Here are two straightforward moves that actually work today.
First, tackle the scraped-web side. Visit haveibeentrained.com, a search tool from Spawning that indexes LAION-5B (one of the largest open image datasets), or Spawning's broader opt-out tools at spawning.ai (focused on art and images). Search for your content—your name, website, portfolio—and submit removal requests where supported. More companies are honoring these now, partly because lawsuits have made ignoring them riskier. It’s not retroactive for every model already trained, but it helps prevent future use.
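Preventing future use can also start at your own site: the robots.txt signal mentioned earlier works better than it used to, because several major AI crawlers now identify themselves and honor opt-out rules. Here's a minimal sketch using Python's standard urllib.robotparser to show how such rules behave. The site and URLs are hypothetical, but GPTBot (OpenAI), CCBot (Common Crawl), and Google-Extended (Google's training opt-out token) are real crawler names.

```python
import urllib.robotparser

# Hypothetical robots.txt that blocks known AI training crawlers
# while leaving the site open to everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant AI crawler should skip the whole site...
print(parser.can_fetch("GPTBot", "https://example.com/blog/post"))  # False

# ...while ordinary visitors and search crawlers are unaffected.
print(parser.can_fetch("Mozilla/5.0", "https://example.com/blog/post"))  # True
```

The catch, as noted above, is that this only binds crawlers that choose to respect it; it's a polite signal, not an enforcement mechanism.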
Second, for data you actively send to AI services: read the fine print and choose wisely. OpenAI’s enterprise and API offerings don’t use your inputs for training by default, and consumer ChatGPT has a data-controls setting that opts your chats out of model training. When possible, avoid pasting truly sensitive info—treat it like emailing a stranger. And remember the lawsuit lesson: even anonymized logs can surface in legal battles, so the safest data is the data you never share.
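On the "avoid pasting sensitive info" point, one lightweight habit is scrubbing obvious identifiers before text leaves your machine. Here's a minimal sketch in Python; the regex patterns are illustrative rather than exhaustive, and the prompt text is made up.

```python
import re

# Illustrative patterns for two common identifier types.
# Real-world redaction needs far more than this (names, addresses,
# account numbers), but the habit is what matters.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b(?:\+?\d{1,3}[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each matched identifier with a labeled placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label} removed]", text)
    return text

prompt = "Contact Jane at jane.doe@example.com or 555-867-5309 about the invoice."
print(redact(prompt))
# -> Contact Jane at [email removed] or [phone removed] about the invoice.
```

A few lines like this won't make a chat private, but they shrink what can leak if logs are ever retained, subpoenaed, or breached.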
Awareness plus these small habits go a long way. You’re not powerless; you’re just learning the new rules of the game.