A long-form response to the concerns, comments, and general principles many people raised in the post about authors suing companies that create LLMs.
Most of this stems from a misunderstanding of how LLMs work.
The original work is not stored anywhere. No copy of it has been made. Just tons and tons of statistics used to inform models.
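To make “statistics” a bit more concrete, here is a toy sketch in Python. It assumes a simple word-pair (bigram) counter as a stand-in. Real LLMs learn neural network weights rather than explicit count tables, and the name `bigram_counts` and the two example sentences are made up purely for illustration.

```python
from collections import Counter

def bigram_counts(texts):
    """Count how often each word is immediately followed by each other word."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        # Pair every word with the word that follows it.
        counts.update(zip(words, words[1:]))
    return counts

# Hypothetical two-sentence "training set".
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]
stats = bigram_counts(corpus)

print(stats[("sat", "on")])   # "sat" followed by "on" in both sentences
print(stats[("the", "cat")])  # seen in only one sentence
```

What comes out of this is a table of pair frequencies pooled across all inputs, not the sentences themselves. That is only a rough analogy for the kind of aggregate information I mean.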
Since there is no copy there is no violation of copyright. Again, no copy of the book is getting made. The content of the books is not stored “verbatim”. The book is not copied. I don’t know how many other ways to put this.
Summarizing a book also does not require one to have “read” it, contrary to the complaint. I never read “The Da Vinci Code”, but I can give a summary of it.
With assertions in the complaint being clearly false, it’s hard to take it seriously, and it’ll get chucked the first time a judge has to deal with it.
Maybe Silverman would have a point if it were standard practice to pay royalties to people you get inspiration from. But she doesn’t pay everyone who wrote anything she read, said anything she heard, or other comedians who influenced her. So why should someone influenced by her pay?
If I read 100,000 books, how do you determine “which one” I got inspiration from? Same situation here.
Copyright doesn’t apply just to stuff copied verbatim though, it applies to a lot more. It really doesn’t matter if it is or isn’t stored verbatim. Translations and derivative works are not exact copies and still fall under copyright. Copyright even applies to broad things such as “a concept of a character” and this can result in some pretty strange arguments some copyright holders might use, such as “Sherlock Holmes that doesn’t smile is public domain, but Sherlock Holmes who shows emotion is copyright infringement” as described here.
It doesn’t matter if an exact copy of the book was made. It matters if the core information that book carried was taken as a whole and used elsewhere. And even though the data was transformed into statistical information, the information is still there in that model. The model itself is basically just an “unauthorized translation” of hundreds of thousands of works into a very esoteric format.
The whole argument of “inspiration” is also misleading. Inspiration is purely a human trait. We’re not talking about humans being inspired. We’re talking about humans using copyrighted material to create a model, and about computers using that model to create content. Unless you’d argue that humans should be considered the same thing as machines in the eyes of the law, this argument simply doesn’t work.
Look up the RAM copy doctrine. It is pretty easy to argue they are making a copy.
Aptly put 👏