I know there are other plausible reasons, but thought I’d use this juicy title.
What does everyone think? As someone who works outside of tech I’m curious to hear the collective thoughts of the tech minds on Lemmy.
Yeah, I’ve done a tiny bit of AI stuff for what I do (biology), and I think it’s very sus that they can build such a strong model out of data that costs lots of money. The reason the algos in my field of biology are so strong is that the NCBI has the genomes of everything that’s been sequenced FOR FREE, because obviously you don’t want people patenting genomes and it should all be free for science, etc.
Which raises the question: how did a startup that started out as a non-profit get that much user data and keep costs low? I know you can buy user data, and I’m not sure how much it costs to buy a bunch of Google Docs from a data broker, but if you buy from hackers who just pulled off a data breach or used some illegal crawler, you can probably cut that to prices a non-profit could afford.
It doesn’t have to be nefarious. The API changes at Twitter and Reddit were ostensibly about the fact that OpenAI et al. pretty much downloaded all their content for free.
Throw in the fact that you can ingest all of Wikipedia for free and you have a shitload of knowledge at your disposal.
I was under the impression that they crawled web sources but it seems like lots of copyrighted work was used.
I hadn’t heard of getting “illegal” data sets before so I looked into it and it sounds like they might have done that. Wow.
Link for the curious: https://www.theverge.com/2023/7/9/23788741/sarah-silverman-openai-meta-chatgpt-llama-copyright-infringement-chatbots-artificial-intelligence-ai