Google Researchers’ Attack Prompts ChatGPT to Reveal Its Training Data

stopthatgirl7 · 2 years ago

TWeaK · 2 years ago

And just the other day I had people arguing with me that it simply wasn’t possible for ChatGPT to contain significant portions of copyrighted work in its database.

@KingRandomGuy@lemmy.world · 2 years ago

Not sure what other people were claiming, but normally the point being made is that it’s not possible for a network to memorize a significant portion of its training data. It can definitely memorize significant portions of individual copyrighted works (as shown here), but the whole dataset is far too large compared to the model’s weights to be memorized.
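
A rough back-of-envelope sketch of that size argument, in Python. Every number here is a hypothetical placeholder chosen for illustration (a 7-billion-parameter model stored at 2 bytes per parameter, a 1-trillion-token corpus at roughly 4 bytes of text per token); none of them are figures for ChatGPT or for any model named in this thread, and the helper names are made up for this example. Comparing raw byte counts is also not a strict bound on memorization, since text compresses, so treat it as an order-of-magnitude intuition only.

```python
def bytes_in_weights(n_params: int, bytes_per_param: int = 2) -> int:
    """Storage taken by the weights (assumes 2 bytes per parameter, e.g. 16-bit floats)."""
    return n_params * bytes_per_param


def bytes_in_corpus(n_tokens: int, bytes_per_token: float = 4.0) -> float:
    """Approximate size of the raw training text (assumes ~4 bytes of text per token)."""
    return n_tokens * bytes_per_token


# Hypothetical figures, for illustration only.
n_params = 7_000_000_000        # assumed model size
n_tokens = 1_000_000_000_000    # assumed training corpus size

ratio = bytes_in_corpus(n_tokens) / bytes_in_weights(n_params)
print(f"corpus is roughly {ratio:.0f}x larger than the weights")
```

Under these assumed numbers the corpus is a couple of hundred times larger than the weights, which is why verbatim recall of the whole dataset is implausible even though individual, often-repeated documents can still be memorized.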

@5BC2E7@lemmy.world · 2 years ago

yea this “attack” could potentially sink closedAI with lawsuits.

@NevermindNoMind@lemmy.world · 2 years ago

This isn’t just an OpenAI problem:

“We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT…”

If a model uses copyrighted work for training without permission, and the model memorized it, that could be a problem for whoever created it, whether the model is open, semi-open, or closed source.