• NotAPenguin
    42 years ago

    The article doesn’t explain how that’s the case at all.

    Aren’t all the big AI models trained on publicly available data?

    • Hot Saucerman
      2 years ago

      Books3 is the definition of “not publicly available” because it’s all pirated material downloaded from the private torrent tracker Bibliotik.

      Books3 is literally why several AI groups are being sued by various authors like Sarah Silverman and George R.R. Martin.

      Books3 was always illicitly obtained material, which calls into question whether an LLM trained on it could really fall under Fair Use. (It most likely does, but it’s still a legal question that hasn’t been answered yet.)

      Books3 Link: https://huggingface.co/datasets/the_pile_books3

      Books3 Description from Link:

      This dataset is Shawn Presser’s work and is part of EleutherAi/The Pile dataset.

      This dataset contains all of bibliotik in plain .txt form, aka 197,000 books processed in exactly the same way as was done for bookcorpusopen (a.k.a. books1). It seems to be similar to OpenAI’s mysterious “books2” dataset referenced in their papers. Unfortunately OpenAI will not give details, so we know very little about any differences. People suspect it’s “all of libgen”, but it’s purely conjecture.