Copilot is the perfect machine for clean room design and license/copyright laundering. It is unethical and unfair to the open source community.
I do not care if it breaks code into bits and recomposes them, regurgitated by <YOUR-LATEST-AI-TECHNIQUE-HERE> in a way that is untraceable: it would not work without learning from our open source code. Code produced by this method should be automatically licensed under the most restrictive license of the inputs it learned from.
I wholeheartedly agree. It's just obvious that this is harvesting work done by the free software community. It will be very obvious in retrospect, but it is hard to see now. If you consider that people will code more abstractly from here on, using AI code generation and understanding and automating workflows, the real value is in the way software is _used_, over the original source code. This is what GitHub has stolen (the representation of software as defined by its usage). Just as a function can be defined by a formula or by its domain and range, so software has multiple representations. That the representation of how software is used is just as important as how it was written will become obvious in the future. GitHub should start by serving a model trained on ONLY free software, because right now, in order to remain pure and keep separate from SaaS and Copilot, we are losing productivity. It's not fair on open source!
That is NOT the point. You are allowed to learn whatever you want. What is horribly unethical is not recognizing the life-long effort of the people who wrote the original code and designed the original algorithms. Programmers are not machines. The human *knows* the open source that she/he is reading and she/he can acknowledge it in their own code (either public or private).
What is the copyright of code written with copilot? Copilot learns the code and forgets authors.
Would you agree if I take your open source project, learn piece by piece, rewrite it from scratch and put my name on it without a single word about your work?
> Would you agree if I take your open source project, learn piece by piece, rewrite it from scratch and put my name on it without a single word about your work?
If it was indeed written from scratch, I see no reason (although it’d feel nice) to credit my original work. Having multiple implementations of an idea is always a great thing.
How do you separate the implementation from the algorithm/idea? I do not believe that you'd be fine if you invested a significant period of your life in some idea that someone else then copied without at least some credit (i.e., replacing your name with theirs). Nobody works like this unless their time is worth nothing or their idea is trivial. Open source would be ruined if everyone believed that copying smart code without recognizing the authors is ethical.
Would this kind of copying be fine in software but not in scientific papers or other industrial processes? Would it be fine if I trained Copilot on a patent database and started creating new patents (at a rate at which it would be impractical to determine that it is regurgitating prior art)?
> Open source would be ruined if everyone believed that copying smart code without recognizing the authors is ethical.
Open source would be ruined if it were easier to build upon past works with lower barriers to research and licensing?
> Would this kind of copying be fine in software and not in other scientific papers or other industrial processes?
Scientific papers are more about collecting and experimenting with novel data, and referencing an explicit paper trail of past results. It's not really comparable. Fiction is a better match.
> Would it be fine if I train copilot on a patent database and start creating new patents (at a rate in which is would be unpractical to determine that it is regurgitating prior art)?
This is a problem with the patent system, not Copilot, and it also isn't a capability that Copilot actually has. You're describing a different system entirely.
> Open source would be ruined if it were easier to build upon past works with lower barriers to research and licensing?
Why is recognizing someone else's work so much pain?
The whole point is that Copilot forgets who wrote the code and who is the author of the whole idea (unfortunately, few programmers write this down, but sometimes it is there if you are patient enough to read the documentation). Thus a Copilot user cannot know who deserves the credit.
This whole discussion is as if you trained an AI to pick apples from a supermarket and leave them on the street for someone else to take home, while pretending that nobody is stealing anything.
> Why is recognizing someone else's work so much pain?
Because it's basically impossible to completely and accurately attribute the origin of all your knowledge. And it is impossible to verify that the source you think is the originator of your knowledge is the original creator of that knowledge. Odds are they learned it from someone else. It really doesn't matter, at all.
> This whole discussion is like if you train an AI to pick apples from a supermarket and leave them on the street waiting for someone else to take them home, and pretending that nobody is stealing anything.
No, because in this case the supermarket has lost apples. This is more like accusing street performers singing popular songs without permission of the songwriter of being thieves. Or an engineer studying a bridge and leveraging techniques used in that bridge.
> Because its basically impossible to completely and accurately attribute the origin of all your knowledge. And it is impossible to verify that the source you think is the originator of your knowledge is the original creator of that knowledge. Odds are they learned it from someone else. It really doesn't matter, at all.
Honestly? It has happened many times to me, and others. See: all the various code hosting sites. It's not worth the stress/getting worked up over it. People "steal" ideas from each other all the time, and people come to the same conclusion and ideas independently all the time too. I have more important stuff to worry about than "someone took my idea for a game and reimplemented it from scratch!"
This is a pretty stupid hill to die on. Humans read code and forget authors too. Nobody cites 100% of the origin of their knowledge when writing new code. Most people don't cite anything. You could write a script that says "this repo is similar to these repos" based on Copilot's embedding space, and it would be far superior to any typical human attribution.
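For what it's worth, the ranking part of such a script is trivial once you have embeddings; a toy sketch, assuming repo embeddings are available as plain vectors (the repo names, dimensions, and numbers here are invented for illustration):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similar_repos(query_vec, repo_embeddings, top_k=3):
    # Rank repos by embedding similarity to the query repo's vector.
    scores = [(name, cosine_similarity(query_vec, vec))
              for name, vec in repo_embeddings.items()]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_k]

# Hypothetical 3-dimensional embeddings; real models use hundreds of dimensions.
repos = {
    "repo-a": [0.9, 0.1, 0.0],
    "repo-b": [0.1, 0.9, 0.1],
    "repo-c": [0.8, 0.2, 0.1],
}
print(similar_repos([1.0, 0.0, 0.0], repos))
```

The hard part, of course, is obtaining the embeddings themselves, which is exactly the model internals that aren't exposed.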
Computers don't have a private life, and applying the word "learning" to what they do is a convenient metaphor.
Computers read, process according to predefined algorithms, and output. A computer "learns" code when it comes over a wire in pieces over a bus, and writes code when it transmits it over a bus to another device.
> Copilot is the perfect machine for clean room design and license/copyright laundering.
How's that? The entire point of a cleanroom re-implementation is that the, er, entities (historically, human programmers) writing the code have provably not seen the code being copied. Which is rather contrary to how copilot has seen approximately all the code.
From my point of view, the 'clean' part of 'cleanroom' means erasing all traceability to the original product. Not seeing the code is a good way to do that. But if you are a machine, it is easy to unsee whatever you want (e.g., identifiers, copyright notices, authors). Copilot (or any other system that learns from code) has one of three evolution paths to move forward:
- includes some sort of traceability (not usable for laundering then)
- becomes very good at "unseeing" the origin
- does not learn from code but rediscovers algorithms (without crediting humans)
If you have a verbatim nontrivial snippet of a codebase, does it matter whether it was copy-pasted or copiloted? It can't confer "deniability" just because it looks like a black box.