“They are just watching and learning. Why is it treated so differently?”
Because it isn’t human. It isn’t watching and learning; it is being fed my creative content as data, which I have not allowed and have not been compensated for, and which is then turned around and sold as a service. My work is being consumed for commercial use by something non-human that does not have fair use education rights, with the sole intent of creating a profitable product, and I’m getting nothing. I have legal rights, no matter where I post my work, to retain my copyrights, and I have the right not to consent to improper uses of my works that do not align with the licenses I have chosen to give them. Websites ask for a license in their ToS to be able to even just display and share my artwork when I upload it. When I create an image, I am given ownership of its copyright to control its use, its distribution, and the right to create derivatives. This isn’t a fuzzy area, it’s very clear. If an artist did not consent to their artwork being used as training data for a non-fair-use reason, it is stealing their works.
I feel like this move has nothing to do with investors and everything to do with setting the terms under which big corps like Microsoft and Google can scrape Reddit’s massive amount of data to train next-gen AIs. They know they have a HUGE amount of data, both current and going back years and years. Content created by others, then sold for enormous profit.
I mean, AI is already stealing all art and images on the web without paying anything. They could literally just scrape and pay nothing. Web scraping isn’t illegal and they already do it, so why would they pay anyone? Unless the law catches up on the rights to manufacture AI content based on ill-gotten data, why would they pay what they don’t have to?
It’s not even considered stealing to make training data. You can disagree, but by law an opt-out has to be machine-readable. Everyone knows that it’s algorithms scraping the web for data; if they see an image that says don’t scrape, they don’t take it, but most images don’t have such a tag attached.
Could you please point me to legal definitions, in court or otherwise, that say it is not violating my copyright license to directly use my artwork in any shape or form for a non-fair-use product? As in, a service you pay money for to create things based on the training data it has taken from me is not fair use. Or point me to the legal definitions where I lose my copyright by posting things online? Allowing scraping is not the same thing as giving derivative copyright license permissions. You aren’t disagreeing with me, you’re disagreeing with my legal rights.
https://www.gesetze-im-internet.de/urhg/__44b.html
It’s German law, and many of the data mining companies are German.
hmm. That talks of data mining not of derivative work from the mined data. … I’ll leave the discussion to others, though.
What effect would a Creative Commons “ND” clause have? Is it reserving the right to make derivatives? And would machine generated stuff even count as a derivative?
DuckDuckGo translate:
The stuff the algorithm makes is considered a new thing because the impact of each individual piece of training data is basically immeasurably small.
Where has that been proven legally?
Translation:
Copyright and Related Rights Act (Copyright Act) § 44b Text and Data Mining
(1) Text and data mining is the automated analysis of one or more digital or digitized works in order to obtain information, in particular about patterns, trends and correlations.
(2) Reproduction of lawfully accessible works for text and data mining is permitted. The reproductions must be deleted when they are no longer required for text and data mining.
(3) Uses pursuant to subsection (2) sentence 1 shall only be permitted if the rights holder has not reserved them. A reservation of use for works accessible online is only effective if it is made in machine-readable form.
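For anyone wondering what a “machine-readable” reservation could look like in practice, here is a minimal sketch. It assumes robots.txt is the mechanism, which the statute text above does not say; the bot name `ExampleTrainingBot`, the file contents, and the URL are all made up for illustration. It only shows how a well-behaved crawler could check for an opt-out before copying anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish to reserve their rights
# against one specific crawler while still allowing everyone else.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from text, no network access
    return parser.can_fetch(user_agent, url)


URL = "https://example.com/gallery/painting.png"  # placeholder URL

if __name__ == "__main__":
    print(may_fetch(ROBOTS_TXT, "ExampleTrainingBot", URL))  # the opted-out crawler
    print(may_fetch(ROBOTS_TXT, "SomeSearchBot", URL))       # any other crawler
```

Whether a file like this actually counts as a valid reservation under § 44b (3) is a legal question the code cannot answer.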
None of that says anything about creating profitable derivative work. In fact, it specifies patterns, trends and correlations, which does not lead me to believe it is protecting visual works created from this data. Those kinds of things are only used to inform things, like information science.
Algorithms trained with that training data are considered to make new works, so your only chance to stop your work getting into an algorithm is by going against the ones that make the training data. And they will always pull the science card.
Where are algorithms considered as being new and legally allowed derivative works in relation to visual works of art?
Germany.
And the algorithm mushes up the training data so extremely that less than 0.1% of a training data picture is inside the result, therefore it’s legally a new work. However, that work itself doesn’t fall under copyright protection, because that is only a thing for human art. And as already stated, the training data is harvested according to the law I mentioned.
What do you mean by stealing? The data remains; all they do is learn from something which is public.
How is it different from Google’s approach? They are just watching and learning. Why is it treated so differently when it essentially does nothing new, but uses the data in a different way?
And no, it’s not fair use under education. Copyright exists for human protection and uses. It isn’t being used for ‘learning’; it’s used as data to be repackaged and sold. Google’s use of it in search results is to link back to posts that contain my work, which retain my copyright and are not derivatives. If you mean captchas, yeah, captchas are pretty bullshit.
I feel the bigger problem with these AIs is that they are solely being used to improve profits and productivity, which only benefits the capital owners. None of that is going to improve things for the laborer (i.e., the artist, the coder, the writer, the people who create value from capital). This is only going to get worse. We are being normalized to automation and AI with the use of self-checkout.
Also, about Reddit training data, I think they are too late to the party. The weights that data was needed for have already been made. I do not think Reddit is the exclusive source of specialized information, and (I hope) they are going to find that out. They are just going to further show how silly the free market and the stock market are. The people who require the data will probably have other ways of getting it; r/datahoarders and people like that come to mind. Reddit is only making new data hard to access, and it is not (and hopefully never will be) the exclusive source of that data.
Yeah, AI can totally exist and be useful, but currently it’s in the hands of tech dudes and admins who have a terrible track record of developing things responsibly, and who overhype and mask flaws. It’s used to make a profit at the colossal detriment of humans. It’s used to hurt us currently, not help at all.
I think whoever took training data from Reddit probably only used the API because it was easier and free. And if it’s no longer free, there’s nothing pointing to them actually paying for it. It’s not like Reddit is the only data source; they very likely already have web scrapers for other uses that they can just tune for Reddit.
I think the biggest problem is that the license we have all chosen to give our artwork doesn’t explicitly cover this. Your work isn’t being copied by AI, it’s training AI and sometimes being emulated by AI, but there are literally 0 laws about reading copyrighted work unless you break down a barrier to do so, like a paywall, and there are no laws on derivative work, otherwise we wouldn’t have Pokémon knock-offs and such. Lastly, artists post a lot of content online freely to entities who do in turn claim control over its distribution.
Ultimately, I think Reddit, as the owner of the distribution of our data (yuk), did the right thing by making API access paid, but it was stupid of them to do it at normal human scales. They should have charged only at bot-scraping scales and then used their ToS to give themselves the ability to sue if they aren’t paid for training data.
That’s an interesting but definitely plausible take on the whole thing.
$12,000 for 50 million requests is B2B pricing. For a company like OpenAI/Microsoft that’s not even worth thinking about if you get all of that precious training data for it…
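To put the quoted price in perspective, here is the arithmetic as a small sketch. The $12,000 per 50 million requests comes from the comment above; the one-billion-request crawl is a made-up volume used only for illustration, not a figure anyone in the thread reported.

```python
from fractions import Fraction

PRICE_USD = 12_000               # price quoted in the thread
REQUESTS_INCLUDED = 50_000_000   # requests covered by that price
HYPOTHETICAL_CRAWL = 1_000_000_000  # assumed volume, for illustration only


def cost_per_request(price_usd: int = PRICE_USD,
                     requests: int = REQUESTS_INCLUDED) -> Fraction:
    """Exact price of a single request in dollars."""
    return Fraction(price_usd, requests)


def cost_for(n_requests: int) -> Fraction:
    """Exact price in dollars of n_requests at the quoted rate."""
    return cost_per_request() * n_requests


if __name__ == "__main__":
    print(f"per request: ${float(cost_per_request()):.5f}")
    print(f"for {HYPOTHETICAL_CRAWL:,} requests: ${float(cost_for(HYPOTHETICAL_CRAWL)):,.0f}")
```

That works out to 0.024 cents per request, or $240,000 for the assumed billion requests.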
The thing I worry about whenever someone mentions this angle: what about Lemmy content? As the community moves away from the commercial platforms in favor of Lemmy, Bluesky, Mastodon etc., does that lower the legal barrier for AI companies to train on all this content for free? Is that shift in the legal vulnerability of public content something that users consider? Is that desirable to most users? Are people thinking about that?
Open source and federation mean open source and federation; I don’t see why it shouldn’t be free and legal to scrape Lemmy and Mastodon. However, maybe the servers could apply rate limits and block lists for suspicious clients so they don’t go down due to scraping.
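A rough sketch of what “rate limits plus a block list” could mean on the server side, assuming a token bucket per client. Nothing here is taken from Lemmy or Mastodon; `TokenBucket`, `handle`, `BLOCKLIST`, and the user agent string are hypothetical names, and the clock is passed in explicitly so the behaviour is easy to follow.

```python
class TokenBucket:
    """Allow short bursts up to `capacity`, refilled at a steady rate."""

    def __init__(self, capacity: float, refill_per_second: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity  # start with a full bucket
        self.last = now

    def allow(self, now: float) -> bool:
        # Add the tokens earned since the last call, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


BLOCKLIST = {"BadScraper/1.0"}  # hypothetical user agents an admin has banned


def handle(user_agent: str, bucket: TokenBucket, now: float) -> int:
    """Return the HTTP status a server might answer with."""
    if user_agent in BLOCKLIST:
        return 403  # blocked outright
    return 200 if bucket.allow(now) else 429  # 429 = Too Many Requests
```

With a bucket of capacity 2 refilled at one token per second, two requests at the same instant get 200, a third gets 429, and a request one second later gets 200 again.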
What I don’t understand is why Reddit didn’t institute the following: all API requests are free up to 100,000 per month per user token. Also, in our terms of service, you cannot use us to train AI models without paying this fee.
I’m with you on that. AI is the future. Just because xxx big corp is doing AI training for their closed source product doesn’t mean that open source models won’t also benefit. If you post to a public space you should expect it to be read.
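The proposed free tier is simple enough to sketch. This is only an illustration of the idea in the comment above, not anything Reddit implemented; `MonthlyQuota` and the way months are keyed as strings are assumptions made for the example.

```python
from collections import defaultdict

FREE_REQUESTS_PER_MONTH = 100_000  # figure proposed in the comment above


class MonthlyQuota:
    """Count requests per (token, month) and report when the free tier runs out."""

    def __init__(self, free_limit: int = FREE_REQUESTS_PER_MONTH):
        self.free_limit = free_limit
        self._used = defaultdict(int)  # (token, "YYYY-MM") -> requests so far

    def record(self, token: str, month: str) -> bool:
        """Count one request; return True while the token is within the free tier."""
        self._used[(token, month)] += 1
        return self._used[(token, month)] <= self.free_limit
```

With a tiny limit of 3 for demonstration, the first three calls for a token in a given month return True, the fourth returns False, and the count starts over in the next month.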
Interesting. I wonder if they already got an offer that matches their new API pricing, and they decided to up everything to match that cost and avoid being sued later.
Like, there seems to be some urgency between them announcing and upping the price. What was it? Is this the reason? A confirmed, extremely wealthy and extremely naive buyer?
The API pricing likely has something to do with the revenue-per-user calculations. Reddit is aiming for an IPO, but their valuation tanked over the past two years. That might be why the admins have decided to strong-arm this on such short notice.
If big tech trains AI using Reddit interactions, our species is doomed 🤣
If he thinks locking down the API is going to stop them, he’s bumped his head. These companies have more than enough manpower to write and maintain an HTML scraper for Reddit.
That opens them up to massive legal problems if they do. AI companies are going to need to prove their training data was legitimately obtained.
Man, even I can do that and I’m hardly a programmer.
Creating a web scraper and actually maintaining one that is effective and works are two different things. It’s very easy to fight web scraping if you know what you are doing.
Right, but these are big companies with lots of talented programmers on hand. If anyone can overcome such an obstacle, it’s them.
Also, Google and Microsoft already have a search index full of Reddit content to scrape.
You are right. You would need a team of skilled scrapers and network engineers though, who would know how to get around rate limiters with some kind of external load balancer or something along those lines.
Rate limiters work on source IP. This is easily bypassed with a rotating proxy. There are even SaaS providers that offer this. The trick is to not use large subnets that can be easily blocked. You have to use a lot of random /32 IPs to be effective.
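To make the point about keys and subnets concrete from the defender’s side, here is a toy per-key request counter. It is a sketch, not how any real rate limiter is configured: `allowed_count`, the limit of 3, and the addresses (all from documentation-only ranges) are assumptions for illustration. It shows why counting per single address (/32) lets traffic spread over many addresses through, and why counting per /24 only helps when the addresses are neighbours.

```python
import ipaddress
from collections import Counter


def allowed_count(ips, limit: int, prefix: int) -> int:
    """Count how many requests pass a limiter that keys on the /prefix network."""
    seen = Counter()
    allowed = 0
    for ip in ips:
        # strict=False masks the host bits, so 203.0.113.5/24 -> 203.0.113.0/24
        key = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        seen[key] += 1
        if seen[key] <= limit:
            allowed += 1
    return allowed


single_ip = ["203.0.113.5"] * 6
same_subnet = ["203.0.113.5", "203.0.113.6", "203.0.113.7"] * 2
scattered = ["192.0.2.10", "198.51.100.23", "203.0.113.5"] * 2

if __name__ == "__main__":
    for name, ips in [("single", single_ip), ("same /24", same_subnet), ("scattered", scattered)]:
        print(name, allowed_count(ips, limit=3, prefix=32), allowed_count(ips, limit=3, prefix=24))
```

With a limit of 3, the single address gets 3 of 6 requests through either way; the three neighbours get all 6 through a per-address limiter but only 3 through a per-/24 one; the scattered addresses get all 6 through both.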
That problem is already solved. Google and Microsoft are already fetching every single page on Reddit for search engine indexing.
Could they be doing that already because of the still open API of Reddit and that will soon change? I just feel like it’s easier for them currently and it will be tougher once the API changes are implemented.
No. Search engines fetch pages using plain old HTTP GET requests, same as how browsers fetch pages. There is some difficulty in parsing the HTML and extracting meaningful content, but it’s too late: the HTML is already stored on Google/Microsoft servers, ready for extraction, and there’s nothing Reddit can do to stop them.
Reddit can make future content harder to extract, but not without also making it invisible to search engines, which would cause Reddit to disappear from Google Search and Bing.
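As a rough illustration of the “parsing the HTML and extracting meaningful content” step, here is a sketch using only the Python standard library. The markup is invented: real Reddit pages do not necessarily use `<div class="comment">`, and `CommentExtractor` is a hypothetical helper. The page is a hard-coded string, so nothing is fetched.

```python
from html.parser import HTMLParser


class CommentExtractor(HTMLParser):
    """Collect the text inside <div class="comment"> elements (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self._depth = 0   # > 0 while inside a comment div (tracks nested divs)
        self._buf = []
        self.comments = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth > 0:
            self._depth += 1            # nested div inside a comment
        elif ("class", "comment") in attrs:
            self._depth = 1             # entering a comment
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:        # leaving the comment: store its text
                self.comments.append(" ".join("".join(self._buf).split()))

    def handle_data(self, data):
        if self._depth > 0:
            self._buf.append(data)


SAMPLE_HTML = """
<html><body>
  <nav>Home | Popular</nav>
  <div class="comment"><p>First <b>comment</b> text.</p></div>
  <div class="sidebar">Ads</div>
  <div class="comment"><p>Second comment.</p></div>
</body></html>
"""


def extract_comments(html: str) -> list:
    parser = CommentExtractor()
    parser.feed(html)
    parser.close()
    return parser.comments
```

Calling `extract_comments(SAMPLE_HTML)` keeps the two comment bodies and drops the navigation and sidebar text. The fragile part in practice is that the selector has to be kept in sync with whatever markup the site actually serves.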
That’s why I say trying to charge money for AI training data is a fool’s errand. These facts make it impossible. That doesn’t mean Spez won’t try, but it does mean he won’t succeed.
It would be better to just leave the API running as it is now, because if you change nothing, the data flow will not significantly change either.