you literally have access to all the code in the world
I’d like to believe that they were honorable enough not to secretly train on code without people’s permission. Realistically, though, they did exactly that and still made the AI model this incompetent through some other engineering blunder.
Also, random side thought: training only on public repos probably yields way higher code quality than training on both public and private repos? I assume we all have some very messy private repos that we’re too embarrassed to publish because the code quality is absolute shit … right?
They didn’t check licenses in any way: it reproduced the famous Quake fast inverse square root function, comments included. And Quake, like many GitHub projects, is published under the GPL, which requires distributed copies and modifications to be licensed under the GPL as well. After that, all sane enterprises banned Copilot usage.
Though, we’re not living in sane times anymore. ChatGPT, Gemini, DeepSeek, Claude, all reproduce copylefted code left and right. Realistically, Stallman should’ve been rolling in cash by now…
This brings up an interesting question with AI.
If I, as a human, read a piece of open source code that solves a problem in a unique and new way, and then write my own closed source code that solves the problem in the same way, I have not violated a license. The license covers the code itself; it is not a patent on the specific way the code solves the problem. And since the code in the closed source product was written by me and not copy-pasted from the open source project, I have not violated the license per se.
So what about AI? If you train the AI on a piece of code, and it outputs the same or similar code, do we treat that as if the human copy pasted the code? Or do we treat it as if the human used what they learned from the first program and wrote something similar?
There is already an AI company taking advantage of this. They advertise that if you want to use open source code in a closed source product, you hire them: their AI will parse through the open source code and spit out a list of specifications that is specifically not code. Another AI on a completely different system, one that has never had access to the open source code, will then take that specification and spit out program code that is functionally identical but is a completely new creation. The result is that you essentially rewrite the open source code, but without the copyleft restrictions.
This is going to be an issue that laws and courts will have to address, especially if, as in your example, the code produced by the AI is actually identical to the GPL Quake code. A human copying the functionality is never going to write the exact same code line by line, but the machine might.
There is a bit of an underlying problem, though. Case in point: I was recently asked to write a JS function that converts any string into a hex color, which I promptly copied off Stack Overflow, and I noticed they now intercept Ctrl-C to add a CC-BY-SA banner. I’d usually include a link whenever I copy something from somewhere anyway, if not for licensing’s sake then at least for ease of future code navigation. But this got me curious, so I asked ChatGPT the same question and it provided the exact same snippet, but without any attribution. So did Gemini and Qwen 3.5. You can tell it’s a copy because the original uses a very specific bit-shift by 5 to mix the numbers up a bit, but it doesn’t have to be exactly 5, or even a shift at all for that matter.
That got me even more curious, so I went to GitHub, searched for “string to hex”, and found the exact same snippet in projects under a variety of licenses. I’m pretty sure those are all just careless copies of that Stack Overflow snippet, though it’s not improbable that one of them is the one that got copied into that answer in the first place. Now, for that particular snippet, I doubt it reaches the threshold of originality needed to hold up in court, but it highlights an issue that makes the double AI conversion trick you described dubious and not so bulletproof: the second AI would probably give the same result anyway, simply because it was almost definitely trained on the copyleft-infringing copies… unless they actually clean-roomed an enormous dataset for it themselves.
Oh, and as a final experiment, I tried coding it up myself. Despite having already seen an implementation, and without much room for variety, I ended up, without even trying, with something that bears no resemblance to that snippet whatsoever, and I most likely would’ve ended up with the same code without ever seeing that snippet in the first place. And from my experience interviewing developers and live-coding with them, no two people write the same code, even for the simplest of tasks. So… yeah, nah, I don’t buy “AI does the same thing humans do” any more than “a forklift does the same thing humans do”. I hope the courts see it the same way, or at least that this AI craze finally leads to the abolition of copyright; it doesn’t really make sense for both to exist…
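To make the bit-shift point concrete, here is a rough sketch of what a string-to-hex-color function can look like. It is not the Stack Overflow snippet and not what any of the chatbots returned: the function names and the `shift` and `factor` parameters are made up for illustration, and it assumes the shift appears in the common `(hash << shift) - hash` form.

```javascript
// Hypothetical sketch: NOT the Stack Overflow snippet discussed in the thread.
// All names and parameters here are invented for illustration.

// Turn a 32-bit hash into a "#rrggbb" string using its low 24 bits.
function hashToHexColor(hash) {
  const rgb = (hash >>> 0) & 0xffffff;
  return "#" + rgb.toString(16).padStart(6, "0");
}

// Shift-based mixing. Assumes the `(hash << shift) - hash` form;
// with shift = 5 this is hash * 31 in 32-bit arithmetic.
function colorFromShift(str, shift = 5) {
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    // `| 0` wraps the running value back into a signed 32-bit integer.
    hash = (((hash << shift) - hash) + str.charCodeAt(i)) | 0;
  }
  return hashToHexColor(hash);
}

// No shift at all: multiply by a small odd constant instead.
function colorFromMultiply(str, factor = 31) {
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    // Math.imul performs a 32-bit integer multiplication.
    hash = (Math.imul(hash, factor) + str.charCodeAt(i)) | 0;
  }
  return hashToHexColor(hash);
}

// Any of these yields a usable, deterministic color for the same input.
for (const shift of [3, 5, 7]) {
  console.log(shift, colorFromShift("hello world", shift));
}
console.log("multiply", colorFromMultiply("hello world"));
```

With `shift = 5` the shift version and the multiply-by-31 version return the same color for any input, because `(hash << 5) - hash` is just `hash * 31` in 32-bit arithmetic. Other shift values generally give different but equally valid colors, which is why the exact constant 5 turning up everywhere points to copying rather than independent invention.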
I’d like to see the end of software patents, which IMHO are a much bigger problem than copyright.
You make a good point with the exact text bit though. It would certainly put the AI company on the defensive if the author of that original GPL snippet decided to go after OpenAI etc. for copyright infringement and license violation. I’d actually kinda like to see that happen.
I’m always so extremely confused about the trope of the personal project having shit quality… Like, if I’m doing something for myself, that’s exactly the place where I wanna do something amazing, like literally all my private projects have much higher quality than my work ones - because in the work ones I’m forced to use stupid conventions, old tools, am not supposed to touch “legacy” code, etc etc etc
As such, since companies have their private code on GitHub, that’s where I would expect the shittiness to come from, not personal private projects.
Like, if I’m doing something for myself, that’s exactly the place where I wanna do something amazing,
That’s always my intention with my personal projects too! But that always results in “Wow, I just learned how to do this thing much better, let me refactor the whole project to do it perfectly everywhere”, followed by my Adderall running out. So there are just so many half-done refactors I either forget about or abandon because I get a new idea the next day, but that’s totally just a skill issue.
You’re right though, the code I write at work is much worse, but my company hosts its own GitLab instance, so the code we write can’t even be used to poison Copilot :(
I would love my personal projects to be of the highest quality, but unfortunately I need to pay bills, so I have to prioritize the work projects that get me paid.
Maybe they meant abandoned projects that never quite got through the todo list, but you’re right. Even my abandoned projects are generally better than the legacy code I’ve seen lol
Lol. Lmao