Apparently, stealing other people’s work to create product for money is now “fair use” as according to OpenAI because they are “innovating” (stealing). Yeah. Move fast and break things, huh?
“Because copyright today covers virtually every sort of human expression—including blogposts, photographs, forum posts, scraps of software code, and government documents—it would be impossible to train today’s leading AI models without using copyrighted materials,” wrote OpenAI in the House of Lords submission.
OpenAI claimed that the authors in that lawsuit “misconceive[d] the scope of copyright, failing to take into account the limitations and exceptions (including fair use) that properly leave room for innovations like the large language models now at the forefront of artificial intelligence.”
I’m gonna say those circumstances changed when digital copies and the Internet became a thing, but at least we’re having the conversation now, I suppose.
I agree that ML image and text generation can create something that breaks copyright. You for sure can duplicate images or use copyrighted characterrs. This is also true of Youtube videos and Tiktoks and a lot of human-created art. I think it’s a fascinated question to ponder whether the infraction is in what the tool generates (i.e. did it make a picture of Spider-Man and sell it to you for money, whcih is under copyright and thus can’t be used that way) or is the infraction in the ingest that enables it to do that (i.e. it learned on pictures of Spider-Man available on the Internet, and thus all output is tainted because the images are copyrighted).
The first option makes more sense to me than the second, but if I’m being honest I don’t know if the entire framework makes sense at this point at all.
The infraction should be in what’s generated. Because the interest by itself also enables many legitimate, non-infracting uses: uses, which don’t involve generating creative work at all, or where the creative input comes from the user.
I don’t disagree on principle, but I do think it requires some thought.
Also, that’s still a pretty significant backstop. You basically would need models to have a way to check generated content for copyright, in the way Youtube does, for instance. And that is already a big debate, whether enforcing that requirement is affordable to anybody but the big companies.
But hey, maybe we can solve both issues the same way. We sure as hell need a better way to handle mass human-produced content and its interactions with IP. The current system does not work and it grandfathers in the big players in UGC, so whatever we come up with should work for both human and computer-generated content.