ChatGPT is full of sensitive private information and spits out verbatim text from CNN, Goodreads, WordPress blogs, fandom wikis, Terms of Service agreements, Stack Overflow source code, Wikipedia pages, news blogs, random internet comments, and much more.

  • TWeaK
    link
    fedilink
    English
    322 years ago

    And just the other day I had people arguing to me that it simply wasn’t possible for ChatGPT to contain significant portions of copyrighted work in its database.

    • @KingRandomGuy@lemmy.world
      link
      fedilink
      English
      102 years ago

      Not sure what other people were claiming, but normally the point being made is that it’s not possible for a network to memorize a significant portion of its training data. It can definitely memorize significant portions of individual copyrighted works (like shown here), but the whole dataset is far too large compared to the model’s weights to be memorized.

      • @NevermindNoMind@lemmy.world
        cake
        link
        fedilink
        English
        62 years ago

        This isn’t just an OpenAI problem:

        We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT…

        If a model uses copyrighten work for training without permission, and the model memorized it, that could be a problem for whoever created it, open, semi open, or closed source.