The position of the FSF is severely misrepresented by the title. Open the full article and you'll see that all the FSF says is that GitHub Copilot is proprietary software and SaaS, and that all forms of proprietary software and SaaS are unacceptable and unjust. What about the copyright issues of machine learning, then? The FSF says it's a new area with many open questions; they are not really sure, and right now they are calling for whitepapers from the public to hear your comments [0].
I think it's a reasonable position to take. Reducing the scope of fair use to strengthen copyleft is a double-edged sword: it simultaneously makes copyright law more restrictive, and such a ruling could potentially be used by proprietary software vendors against the FOSS community in various ways. It's an issue that requires careful consideration.
> it simultaneously makes copyright law more restrictive, and such a ruling could potentially be used by proprietary software vendors against the FOSS community in various ways.
Could it? Copyright law is FOSS's only protection. That's what makes copyleft witty: it turns copyright law against copyright. Weakening copyright law in an ad hoc way is absolutely not good for FOSS. It's fine to rewrite copyright in a way that explicitly allows things like Copilot, as long as FOSS gets to copy bits of proprietary code, too.
Otherwise, after some appeals court judgment that the FOSS community failed to participate in (or even worse, that parts of it participated in on the wrong side), we're going to end up with a copyright practice that looks like the NFL's exemption in antitrust law.
> It's fine to rewrite copyright in a way that explicitly allows things like Copilot, as long as FOSS gets to copy bits of proprietary code, too.
This is exactly what I was thinking about. If Copilot is fair use, it means that all proprietary source code, as long as it's publicly available to read, will be free to use as training material for a hypothetical free and open source machine learning project, which I think would be a good thing. An example is a proprietary program released under a restrictive "source available" license: you can read it but not reuse it under any circumstances (and I believe such projects are already included in Copilot's training data). This is why I said fair use can be a good thing, and a ruling to reduce the scope of fair use could potentially be used by proprietary software vendors against the FOSS community.
It would be even better if training on all forms of available proprietary binary code were fair use, too. It could allow the creation of powerful static binary analysis or code generation tools that learn from essentially all free-to-download proprietary software without copyright restrictions. However, the situation with proprietary binary code is more complicated. Reverse engineering proprietary binary code is explicitly permitted by US copyright law, but a "no reverse engineering" clause in an EULA overrides it, and this can be a bad thing: it makes FOSS's fair use rights meaningless while giving proprietary software vendors a free pass to ignore FOSS licenses.
Thus the outcome is unclear and it may go either way, which is why I said such an issue requires careful consideration.
> This is exactly what I was thinking about. If Copilot is fair use, it means that all proprietary source code, as long as it's publicly available to read, will be free to use as training material for a hypothetical free and open source machine learning project, which I think would be a good thing. An example is a proprietary program released under a restrictive "source available" license: you can read it but not reuse it under any circumstances (and I believe such projects are already included in Copilot's training data). This is why I said fair use can be a good thing, and a ruling to reduce the scope of fair use could potentially be used by proprietary software vendors against the FOSS community.
FWIW this seems to be the current interpretation of copyright law when it comes to machine learning, at least in the US. The only questions I've really seen about the legality of Copilot are about it reproducing code and whether that reproduction is fair use or not. But few are arguing that training the model itself on any available source violates fair use.
> FWIW this seems to be the current interpretation of copyright law when it comes to machine learning, at least in the US.
I think this is a sensible take. An AI should be able to learn to program from any source code it can see, just like a human.
> But few are arguing that training the model itself on any available source violates fair use.
People argue this all the time on HN.
But these same people seem to believe it is just pasting bits of code it has seen before together, so I suspect they don't have the technical or legal understanding to comment sensibly.
I disagree that copyright is FOSS's only protection.

But it is true that this proprietary product extracts its value exclusively from open source software.
Yes, it would be nice to have the source of Copilot in exchange, but I think it would be far more important for third parties to have the same access to the code so they can provide similar tools.
Otherwise, I hope Copilot makes it big. It'll create a new generation of developers who are dependent on these tools to do their work. It'll also lower the barrier for non-software engineers to participate in writing code. So, copy-pasting on steroids.
The resulting mediocre spaghetti will break at record-breaking rates; cleaning up the mess will be highly lucrative!
As a freelancer, every time a client opts for a cheaper alternative, I make it very clear that I'd be delighted to work with them again in the future anyway. It rarely fails: a year or more later, the client calls me back because their cheap alternative turned out to suck and ended up being expensive. Last month, a client from Luxembourg called after 6 years of total silence; they still had me in their contact list. 3 years ago, one called me because, 2 years prior, the 50k quote they had rejected from me had turned into a 400k bill from my competitor, with still no release in sight.
My rates have been steadily increasing for years thanks to this. Before, geeks were at a disadvantage because people didn't know better, and teams with good marketing would destroy us. But now, clients have been burned so many times. And it pays, because more and more devs coming onto the market are dependent on their tooling. Now, more often than not, I work with teams that have been copy/pasting git commands without knowing what they do, that have never, ever looked at the source code of their framework, or that don't know how to use a debugger. The HN bubble tends to blind us to the reality of the corporate world.
Yesterday I did a deployment but wasn't allowed to touch the machine. Instead, they made me call a guy who shared the screen of a Vista machine while he SSH'd into prod using cmd.exe, and I had to dictate the instructions to him to debug the deployment on their custom Linux setup. A nearly retired sysadmin who couldn't type with ten fingers, pressing the up arrow 30 times to find a command in his bash history every time. He could click around WinSCP very well, though.
This 20-minute job turned into an afternoon of billing.
Though I suppose that's what I look like, as a Python expert, to an old-timer from the 80s who can code in assembly, debug using strace, and understands the L1 cache :)
People are scared we are going to be automated away by AI.
I am preparing for the most lucrative decade I ever worked in.
> A nearly retired sysadmin who couldn't type with ten fingers.
Hey, I feel called out here. I type with 2-3 fingers and I'm quite fast, even if not as fast as a full-fledged ten-finger typist. At my age I don't think I'll ever learn to type with ten fingers, but I don't think I need to either.
There's a fluency effect that happens when you're able to touch type without looking, similar to when you master a language enough to speak without thinking. You become more efficient because you dissolve the barrier between your thoughts and their expression.
I know there are lots of arguments intending to counter the importance of touch-typing in programming ("most of my time is spent thinking"), but I think those miss the point. Faster typing is just as valuable whether you're programming, or writing an email, or responding to a message.
Where in their reply did they say they weren't a touch typist? They most likely touch type with 2-3 fingers. Not that shocking if you learned to use computers without typing classes.
While I guess it's theoretically possible to memorize the positions of all the keys using only a few fingers (so that you're typing purely by touch, without looking), I haven't ever seen that in the field.
Usually when someone says "touch typing" they refer to the standard "home row" approach, using all fingers. I could have made that more explicit.
Well, I do know all the key positions and I can write without looking at the keyboard, typing with just 2 fingers. (I just switched keyboards last week from a 70% to a 100%, so I'm getting used to it.)
I type at roughly 150-160wpm depending on what I'm typing and for the length I'm typing (this range is for 30s and 60s tests).
I touch type.
I use 4 fingers on my left hand and 3 fingers on my right to type. (This is counting thumbs on both hands)
I've never really understood the home row because if my hands are on the keyboard I'm actively typing (or playing video games), in which case all my fingers are busy actually doing things, so I basically just put my hands down where I'm going to start typing anyway. When I'm actively typing I'm not wasting time moving my hands back to the home row, either.
Do it. After 20 years of coding and thinking I was fast I was embarrassed into it after watching a teenager type faster than me (they had just completed a typing course). It won't speed up your programming but it will make your communication and documentation a lot smoother. It really only takes a few weeks of forcing yourself to touch type for emails etc. (Don't start with programming, it will be too frustrating).
Maybe it was just because they were a teenager. I'm not far past teenager myself, and I started typing before getting typing classes in school. On random excerpts (typeracer.com) I get 80-100wpm with 98% accuracy, typing with about 4 fingers.
Haha! I have 10 fingers: 5 of them almost always near my chin as part of a misoptimized supporting structure, 4 of the remaining 5 usually spend most of their time holding/transporting a mug, leaving only 1 left... the busiest and most important one, of course: I use it to poke the mouse/touchpad so the screen stays on.
Only on the rarest occasions will those 10 fingers give up their self-designated posts and come to this massive array of buttons for their exercise of pressing stuff, something like "sudo apt install", "docker run", "cd ..", etc.
I'd say typing fast != working fast, so I don't mind if someone is a slow button presser :)
The curse of having access to a computer years before your first typing class and never being able to un-learn your self-taught method. I know the pain.
My self-taught typing method uses “whichever finger is currently closest to the relevant key”, and has a lot of hand movement – might have something to do with my being a pianist.
Honestly, if I had taken a typing class, I would probably have carpal tunnel by now. My hands attack the keyboard at an angle over the left Shift/Ctrl and right arrow keys. Typists, at least when I was in school, are taught to attack straight up through the ZXC and M,. keys, which keeps your wrists at a bad angle.
I made a conscious effort to unlearn bad habits in my 20s and it was pretty easy. I definitely press some keys with the wrong fingers, though. (1 is especially annoying to press with your pinky, so I don't do it.)
I’ve met hunt-and-peckers who can rival my 90wpm touch-typing (and I use the term loosely - I use all 10 fingers and don’t need to look at the keyboard, but also don’t keep them on the home row nor do I follow any formal typing methodology) . Never underestimate the speed that a lot of practice can give.
Have you ever taken the plunge though and really tried to go through with learning touch typing? You may not find it impossible after all. Then again typing isn’t really a huge bottleneck when it comes to coding.
I find touch typing with all fingers is overrated.
I type with only two/three fingers. I don't hunt and peck, but I'm no touch typist either. I can type faster just with my index fingers than some people with all of their fingers.
That said, typing speed is not critical. I mean, if you're really slow I guess it matters, but it's no measure of the quality of your work. The brain is the bottleneck here, and all the slowness happens in the design/troubleshoot/think space anyway.
I type in this weird hybrid .. thing. I find it much more comfortable, though more error prone, than traditional home-row on traditional keyboards.
I use nearly all my fingers, but where I really suffer is key combinations: I find those involving Alt in particular to be really awkward, because my hands sit at a steep angle relative to a straight keyboard.
I live in Kakoune (vim-like), so "touch typing" is my bread and butter, but home-row just feels so bad to me.
I keep meaning to try a split keyboard with home-row. I suspect that's the root of my issue, and that my odd typing pattern is a result of trying to manually replicate a split keyboard. /shrug
I can recommend a small split ortholinear keyboard. I built one as a weekend project and it was pretty fun, doable with basic microcontroller and solder skills (except for the ~50 SMD LEDs which I couldn’t be bothered to do). Pretty happy with it, it’s comfortable and I do feel that I type better than on a regular keyboard. You can buy kits containing everything you need, I got this one:
With small keyboards, you’re trading off physical distance between keys for having to press more buttons simultaneously. Might fix your problems with reaching key combinations, but the combinations themselves do become more complicated as well.
I've wanted to pursue them, but my hope was to find software to do stateful/modal transitions. As a fake example, instead of pressing Alt+Z you'd press Alt, then Z. It becomes a lot like modal editors, which is my favorite style of editing; I add a lot of user-mode stuff to Kakoune to avoid key combinations.
So far I haven't felt I could get software to do the modal editing I'm describing reliably in all of my environments. I'm on NixOS right now, and I didn't want to manage the software. It's definitely interesting, though!
I did. I've probably put ~50 hours total into (free) typing courses over the last year or so.
I'm still painfully slow. Maybe 50wpm tops for natural language, and embarrassingly less for programming.
Thing is, I'm now even slower than I was before "taking the plunge", and I can't even go back to the old loose method I'd nurtured for 20 years!
On the flipside, as you mentioned, typing in itself isn't a huge bottleneck, especially with autocomplete, and I'm much faster navigating the IDE. So maybe it's a net positive after all.
Hey, whatever works. If you do want to make the jump though, my suggestion would be to get a keyboard with no key labels (like a Das Keyboard), you'll learn touch typing pretty quick. Unless you are somehow touch typing with 2-3 fingers, which in that case hats off to you!
>Now, more often than not, I work with teams that have been copy/pasting git commands not knowing what they do, that have never, ever looked the source code of their framework or don't know how to use a debugger.
Thank you for reminding me that it's about time for my yearly reread of Pro Git. It's amazing how many people look at you like you're a wizard when you just...read the documentation.
It is amazing to see how many so-called programmers can make a living without even reading the freaking documentation. And when I cite the docs on some topic, they look at me as if I were performing a magic trick.
This has also become my strategy lately. I got fed up with price wars against tons of cheaper software dev shops. I am Indian and have run such shops myself, but I am an engineer first, the really curious kind. I can't do price wars; I don't like writing casual code just for bucks. So I started simply keeping good relations with anyone who was attracted to a "low-cost development shop". Eventually I found myself helping multiple founders with their "low-cost teams". I think gradually they will start paying me to guide these low-cost teams. (Although that might not happen, since I am also planning to go into founder mode.)
And it's a never-ending cycle. I get older but the CTOs stay the same age. They kick out the old IT team and hire some foreign Salesforce squad to rebuild everything. It ends up costing five years' worth of the old team's salary in just one year. Nothing ever gets past the finish line and the existing software just stays in place. The CTO moves on with millions in his pocket, the owners are left with nothing and end up trying to get the old team back to maintain the existing software. A few years later, another hotshot shows up.
Reminds me of the offshoring craze about a decade ago. Everyone was worried that there would be few software jobs left in America. Now it seems every company is trying to re-onshore development, desperately searching for people and driving up wages.
It sounds like your customer is dealing with low-maturity staff, and no amount of consultation is going to fix that, even if they decided to pick up Copilot, which is also unlikely if they haven't picked up Ctrl-R.
> The resulting mediocre spaghetti will break at record-breaking rates; cleaning up the mess will be highly lucrative!
Maybe, maybe not. From the perspective of a non-tech enterprise organisation, we've moved to more and more standardised software that is "good enough", to avoid the delays, budget overruns, not-quite-what-we-wanted results, and expensive support that come with specialised software companies.
Office 365 has basically replaced half our software suite, and while we do still buy some extensions for it from 3rd-party companies, Microsoft is simply getting more and more of our business by being good enough at a low enough cost.
I'm not going down some conspiracy path here, by the way. If anything, Microsoft is simply using this project to get free research for its Azure Automation services, which are currently taking over the RPA business from much more expensive competitors. That needs janitors, but not well-paid ones.
Yes, this happens all the time: the client asks for the world but really just wanted an improvement.
The client has a problem and asks us for a solution. We suggest a simple, cost-effective solution; the client insists on custom software developed to the spec they have "perfected". The client lists all their nice-to-haves as must-haves so they get their money's worth, not realising I just charge more for more work.
The software is delivered to spec, and then the client realises that their spec doesn't work in the real world, because they just assumed the best and forgot about edge cases.
Non-tech companies just don't get tech: instead of seeing building software as being like building a house, they view it more as a wizard doing magic and then a website appears.
Funnily enough, I just used Copilot to write a reasonably huge PR (it did like 95% of the work), which was indeed mostly a copy-paste job (the whole library is a SIMD library with lots of similarities between the different types and operations). Copilot made zero mistakes, except when it suggested something completely different, whereas the human-copy-pasted code that was already in there had tons of mistakes that I noticed as I went through the library. So, interestingly, when it comes to code that is mostly copy-paste but requires some subtle changes here and there (based on the type, the operation, ...), Copilot is much better at it than humans.
Funnily enough, the only thing you can say is that you can copy and paste more successfully with Copilot than some unnamed, possibly unknown, person did.

The truth of it, whatever that may be, will shake out.
Interesting. Isn't that a good fit for macros, though? Particularly for future maintainers' sake: with macros, future maintainers will only need to tweak macro invocations or macro bodies, instead of having to redo a huge copy-paste job all over again.
Even when I have to look up all the weird syntax for a macro_rules! macro, they take like 10 minutes, tops. If you're taking much longer than that, you're probably trying something too ambitious: it should either be proper code generation (in Rust, a build script or proc macro), a const fn, or a trait with proper generics.
C's even easier; you just write the C code with parens around everything, then run it through cpp, then correct any divergences from the expected code. (It does take a bit longer, though, because the compiler waits until the last minute to shout at you if you make a syntax error.)
I can see complicated C++ templates taking hours, but they're not really macros. (They're probably the correct tool for this, though.)
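To make the macro suggestion concrete, here is a minimal macro_rules! sketch of how the repetitive per-type SIMD impls could be collapsed into one macro body. Everything here (the SimdAdd trait, the impl_simd_add name, the plain loop standing in for real intrinsics) is hypothetical, just to show the shape:

```rust
// Hypothetical sketch: one macro body generates the near-identical
// per-type implementations, so a maintainer edits the macro (or its
// invocation) instead of redoing a copy-paste pass across the library.

trait SimdAdd {
    fn simd_add(a: &[Self], b: &[Self]) -> Vec<Self>
    where
        Self: Sized;
}

macro_rules! impl_simd_add {
    ($($t:ty),* $(,)?) => {
        $(
            impl SimdAdd for $t {
                // A plain element-wise loop stands in for the real
                // intrinsic-based body; the point is that the repetition
                // lives in exactly one place.
                fn simd_add(a: &[$t], b: &[$t]) -> Vec<$t> {
                    a.iter().zip(b).map(|(x, y)| x + y).collect()
                }
            }
        )*
    };
}

// One invocation covers every element type; supporting a new type is
// one extra token here, not another pasted-and-tweaked impl block.
impl_simd_add!(u8, u16, u32, i32, f32, f64);

fn main() {
    let sum = <f32 as SimdAdd>::simd_add(&[1.0, 2.0], &[3.0, 4.0]);
    assert_eq!(sum, vec![4.0, 6.0]);
}
```

The subtle per-type differences the parent comment mentions would go into extra macro arguments (e.g. passing the operation or intrinsic name per invocation), which is exactly where this approach beats both hand copy-pasting and reviewing generated suggestions.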
> Otherwise, I hope Copilot makes it big. It'll create a new generation of developers who are dependent on these tools to do their work.
There's definitely other scenarios, like my preferred one of Copilot being legal itself but devs being responsible for using code generated from it, same as if they were using a more direct copy-paste or search tool.
How could they, when they don't know the origin of the code? Copilot would have to provide them with all the licenses of all of the projects that went into producing that particular snippet of code, and I'm pretty sure that's impossible beyond just listing every license every time.
Well, it's all from public repos that you can search, and I imagine they could also include a tool that analyses how close the produced code is to which repos, so you can check for yourself.
Morals/Ethics of code ownership aside, I think there's an interesting challenge there in developing a tool like Copilot but without developers falling into the blind "copy-paste" mode.
I think there's utility there, but the execution doesn't quite sit right with me when it comes to encouraging the right behaviours.
> It'll also lower the barrier for non-software engineers to participate in writing code. So, copy-pasting on steroids.
It always amazes me how many people think that in the future everyone will know how to code. Why? Everyone has a car, but not everyone knows how to build a car, and why would they want to?
I agree 100%. I can't wait to hear the cries of failing businesses because their no-code/low-code developers can no longer understand how the system works. And while waiting, I'm broadening my knowledge into the field of devops too, so I can understand the whole stack from top to bottom.
That being said, if you are a reasonably skilled coder, Copilot can help you __a lot__. I started using it a few weeks ago and it has typed __a lot__ for me. The sheer amount of time I don't have to spend typing amounts to hours per week once you get up to speed with the tech.
People said the same about GUI web tools; look where we are now. I think Copilot works for some stages and cases, but for well-established software projects something extra will be required of us.
From someone who has never worked professionally with web development: where are we now?
I tried my hand at Dreamweaver back in the day (approx. 2005) and didn't like what it generated. I've written a few pages manually (first when learning HTML ca 2004, then for a personal website in 2014) and it felt much nicer.
I have made Windows applications with GUIs as part of my job and for that I've mainly used WPF written as a mix of XAML and C#, written by hand and inspected in the editor. There are graphical tools, but I've mostly found it more efficient to write what you mean directly.
But how are things actually done in the web development business, nowadays?
Wix and Shopify are both multi billion dollar companies that give casual users access to websites that would previously require dedicated devs. This is done via elaborate WYSIWYG website editing. There used to be decent business in low-end web dev, banging out high volumes of simple websites and storefronts - but that market has been in large part taken over by WYSIWYG tools.
There's still decent business in sharecropping Wix/Shopify/Salesforce/etc. Very quickly companies find that the out of the box stuff doesn't do 100% of what they want, so they need to pay to get that last bit and develop custom components..which yeah, anyone can then drag onto the appropriate pages, instead of having to pay someone who happens to know how to FTP things onto a server.
You basically open your terminal, run dozens of npm commands, install dozens of libs, and start writing JavaScript, CSS, and JSX. Then you run commands to build and deploy.
I guess it's tongue-in-cheek. But you made me imagine the scenario for a movie: once humans start going down that rabbit hole, code becomes more and more like nature: no over-arching "design" that can be reasoned about, just a sprawling mess of stochastically created spaghetti that has been progressively patched.
> code becomes more and more like nature: no over-arching "design" that can be reasoned about, just a sprawling mess of stochastically created spaghetti that has been progressively patched
If one were to "decompile" an existing artificial neural network model, is this basically what it'd look like inside? Or is it too crude of an analogy / a category mistake?
What happens when one all-coder needs to understand or debug another all-coder's code? The definition of "readable" may change, but I imagine there would still be ways of organizing the code that would make it easier or harder. The halting problem would seem to imply that, for any given all-coder, it's possible to obfuscate hard enough to frustrate that all-coder.
Science. The universe has no source code for us to read, so we tinker and investigate, take our best guess, see if it holds up under the test cases we're capable of running, and update that guess whenever we encounter edge cases the last guess can't explain. We've gotten quite far without perfect understanding.
I have a feeling that the distribution of complexity in the laws of the universe is likely to be very, very different from the distribution of complexity in code created by an extremely intelligent being (with machinelike memory) that isn't optimizing for simplicity.
Verifiability. I have no source for this, but on a basic level it makes sense that a clean implementation will be easier to verify than ad hoc spaghetti code doing the same thing.
I fully agree, yet I see this tool as a mere prelude to a world where we developers are obsolete. At first, the AI will produce a mess and fixing it will be lucrative. In the long run, the AI will create software based on paradigms we humans can't understand.
I've been in the industry for 20 years, and in my spare time I'm gradually learning skills in house renovation. I think that in the next 10-15 years I'll lose my job as a software developer due to AI and will resort to some manual labor. Hoping I survive until retirement.
I find that very unlikely. I think the result will be similar to what has happened to the electronics industry.
For those who aren't aware, PCB design used to be an automated task, done by software with minor tweaks. The thing is, complexity had a positive payoff, so soon we had trained technicians doing layout. Right now most PCB layout requires so much technical knowledge that most people working in layout are engineers with master's degrees.
Of course, there's also a lot of cheap electronics where complexity doesn't pay off and cutting development cost is what matters, but that's not most of the market.
As long as you keep learning and improving, you are likely to see an increase of demand, not a decrease, although the job will be quite different.
Now that you mention it: I have a master's in electronics, and I've done a lot of layout work in the last 3 years. Nothing that big or complicated, but it is becoming a significant portion of my engineering time. It is perceived to be cheaper to just do the layout in-house, because our system is small and benefits from fast iteration.
No way. To replace software engineers one would have to have AGI, which we are light-years away from.
The hard part of writing code is not "transform this logic into code", but coming up with the logic in the first place, which is pretty much transforming this and that requirement into logic first. That often requires domain-specific knowledge, and possibly interaction with the client.
Requirements logic, and the interaction with the customer to shape it, is the domain of the Business Analyst (or a similar position). I can imagine that the BAs in our company, equipped with a slightly better version of Copilot, could prepare a lot of code. The BAs in our company have limited knowledge of coding, yet I'm quite certain that, in most cases, they are capable of selecting the right implementation proposed by Copilot. Without resorting to a developer's help, they would just click to prepare a routine. Like automated checkouts in supermarkets, these tools won't make the jobs disappear completely, but they'll substantially reduce the need for them.
+1. Sorry you were downvoted for a reasonable position. I would predict that most knowledge work will be automated, including understanding business needs and doing design, coding, and maintenance. However, I think we will reach the same conclusion with AGI, but in ways we can't predict right now. Deep learning alone won't get us there, but that is a different conversation.
It still amazes me when people doubt or underestimate what can happen in future tech.
I think a lot of people massively overestimate the current state of AI. Unless there is some fundamental breakthrough in computation, I just don’t see how complex knowledge based jobs will be replaced any time soon. Maybe in 50 years. Maybe.
I don't really understand this point of view. The day demand for software developers diminishes, but demand for manual laborers remains, I will start to automate manual labor.
Copilot is the perfect machine for clean room design and license/copyright laundering. It is unethical and unfair to the open source community.
I do not care if it breaks code into bits and recomposes them again, regurgitated by <YOUR-LATEST-AI-TECHNIQUE-HERE> in a way that is untraceable: it would not work without learning from our open source code. Code produced by this method should automatically be licensed under the most restrictive license of the inputs used for learning.
I wholeheartedly agree. It's just obvious that this is harvesting work done by the free software community. It will be very obvious in retrospect, but it's hard to see now. If you consider that people will code more abstractly from here on, using AI code generation and understanding to automate workflows, the real value is in the way software is _used_, over the original source code. This is what GitHub has stolen (the representation of software as defined by its usage). Just as a function can be defined by a formula or by its domain and range, software has multiple representations. That the representation of how software is used is just as important as how it was written will become obvious in the future. GitHub should start by serving a model trained ONLY on free software, because right now, in order to remain pure and keep separate from SaaS and Copilot, we are losing productivity. It's not fair on open source!
That is NOT the point. You are allowed to learn whatever you want. What is horribly unethical is not recognizing the life-long effort of the people who wrote the original code and designed the original algorithms. Programmers are not machines. The human *knows* the open source code that she/he is reading, and she/he can acknowledge it in their own code (whether public or private).
What is the copyright of code written with copilot? Copilot learns the code and forgets authors.
Would you agree if I take your open source project, learn piece by piece, rewrite it from scratch and put my name on it without a single word about your work?
> Would you agree if I take your open source project, learn piece by piece, rewrite it from scratch and put my name on it without a single word about your work?
If it was indeed written from scratch, I see no reason (although it’d feel nice) to credit my original work. Having multiple implementations of an idea is always a great thing.
How do you separate the implementation from the algorithm/idea? I do not believe that you'd be fine if you invested a significant period of your life in some idea and someone else copied it without at least some credit (i.e., replacing your name with theirs). Nobody works like this unless your time is worth nothing or your idea is trivial. Open source would be ruined if everyone believed that copying smart code without recognizing the authors is ethical.
Would this kind of copying be fine in software and not in other scientific papers or other industrial processes? Would it be fine if I train copilot on a patent database and start creating new patents (at a rate at which it would be impractical to determine that it is regurgitating prior art)?
> Open source would be ruined if everyone believed that copying smart code without recognizing the authors is ethical.
Open source would be ruined if it were easier to build upon past works with lower barriers to research and licensing?
> Would this kind of copying be fine in software and not in other scientific papers or other industrial processes?
Scientific papers are more about collecting and experimenting with novel data, and referencing an explicit paper trail of past results. It's not really comparable. Fiction is a better match.
> Would it be fine if I train copilot on a patent database and start creating new patents (at a rate at which it would be impractical to determine that it is regurgitating prior art)?
This is a problem with the patent system, not copilot, and it also isn't a capability that copilot actually has. You're describing a different system entirely.
> Open source would be ruined if it were easier to build upon past works with lower barriers to research and licensing?
Why is recognizing someone else's work so much pain?
The whole point is that copilot forgets who wrote the code and who authored the underlying idea (unfortunately few programmers write that down, but sometimes it is there if you are patient enough to read the documentation). Thus a copilot user cannot know who deserves the credit.
This whole discussion is as if you trained an AI to pick apples from a supermarket and leave them on the street waiting for someone else to take them home, and then pretended that nobody is stealing anything.
> Why is recognizing someone else's work so much pain?
Because it's basically impossible to completely and accurately attribute the origin of all your knowledge. And it is impossible to verify that the source you think is the originator of your knowledge is the original creator of that knowledge. Odds are they learned it from someone else. It really doesn't matter, at all.
> This whole discussion is like if you train an AI to pick apples from a supermarket and leave them on the street waiting for someone else to take them home, and pretending that nobody is stealing anything.
No, because in this case the supermarket has lost apples. This is more like accusing street performers singing popular songs without permission of the songwriter of being thieves. Or an engineer studying a bridge and leveraging techniques used in that bridge.
> Because it's basically impossible to completely and accurately attribute the origin of all your knowledge. And it is impossible to verify that the source you think is the originator of your knowledge is the original creator of that knowledge. Odds are they learned it from someone else. It really doesn't matter, at all.
Honestly? It has happened many times to me, and others. See: all the various code hosting sites. It's not worth the stress/getting worked up over it. People "steal" ideas from each other all the time, and people come to the same conclusion and ideas independently all the time too. I have more important stuff to worry about than "someone took my idea for a game and reimplemented it from scratch!"
This is a pretty stupid hill to die on. Humans read code and forget authors too. Nobody cites 100% of the origin of their knowledge when writing new code. Most people don't cite anything. You could write a script that says "this repo is similar to these repos" based on Copilot's embedding space, and it would be far superior to any typical human attribution.
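The script imagined above could be sketched roughly like this. It's a toy illustration, not Copilot's actual embedding space; `repo_vecs` and its precomputed vectors are entirely hypothetical inputs:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def most_similar(query_vec, repo_vecs, top_n=3):
    """Rank repo names by similarity of their (hypothetical, precomputed) embeddings."""
    ranked = sorted(repo_vecs, key=lambda name: cosine(query_vec, repo_vecs[name]), reverse=True)
    return ranked[:top_n]
```

In practice the embeddings would come from the model itself, and "similar" would only ever mean "near in the model's learned space", which is attribution by proximity, not provenance.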
Computers don't have a private life, and applying the word "learning" to what they do is a convenient metaphor.
Computers read, process according to predefined algorithms, and output. A computer "learns" code when it comes over a wire in pieces over a bus, and writes code when it transmits it over a bus to another device.
> Copilot is the perfect machine for clean room design and license/copyright laundering.
How's that? The entire point of a cleanroom re-implementation is that the, er, entities (historically, human programmers) writing the code have provably not seen the code being copied. Which is rather contrary to how copilot has seen approximately all the code.
From my point of view, the 'clean' part of 'cleanroom' means erasing all traceability to the original product. Not seeing the code is a good way to do that. But if you are a machine, it is easy to unsee whatever you want (e.g., identifiers, copyright notices, authors). Copilot (or any other system that learns from code) has one of three evolution paths to move forward:
- includes some sort of traceability (not usable for laundering then)
- becomes very good at "unseeing" the origin
- does not learn from code but rediscovers algorithms (without recognition of humans)
If you have a verbatim nontrivial snippet of a codebase, how does it matter whether it was copy pasted or copiloted? It can’t give “deniability” just because it looks like a black box.
I'm kind of wondering if this controversy might not end up being a storm in a teacup.
From what I've seen, copilot really lowers the barrier to writing buggy code. If it does indeed turn out to be a tool that lends itself to machine-gunning rather than merely shooting yourself in the foot, it almost doesn't matter who owns what IP.
The relentless attempts at developer commodification will, of course, continue, but I can already sense this one ending up like the developer outsourcing craze of the mid-2000s that the Economist also got a little too excited about.
If the code ends up being non-IP infringing on the code used to train it, that would be a big win for open source / free software.
Now you can just grab any leaked code of a closed source program, feed it into your AI and get back code you can license under the GPL and nobody can do anything about it.
An easy application I can think of is ZFS; simply feed the AI all CDDL licensed code, then ask it to reproduce ZFS. Probably will have some bugs but it would be licensable under GPL if the AI is considered a whiteroom.
I think you’re missing that the law considers intent. If the devs of copilot were not trying to set up infringement, then their algorithm’s output is likely not considered infringement [1]. However, if you set out to “launder” copyrighted material then the law will take that into consideration and likely find that you violated copyright. This intent can be demonstrated in court either via your statements, or your actions (such as constructing a meaninglessly tiny training set).
Would it not be categorically intended as infringement regardless of the copyright status of the material?
It seems to me that the licensing part is the part you can't throw into a big markov chain, legally. Even if they aimed only at open-source licensed material without exception, the point where they discard all the licenses and export a 'generic' slurry is the point where they infringe by definition. If they trained on more restrictive licenses that's just doubling down: what's needed is annotation and maintenance of what bits of code came from what licensing pool. You could well have a giant pool of GPL, a giant pool of MIT (which I would be in, all the more since I maintain a very automatable code style that's easy to import from). You could accumulate a list of sources for anything you did, at whatever level of granularity is desired.
The purpose of throwing away this attribution is intent to infringe. It's constructing a machine for the explicit purpose of grinding code into sludge of intentionally small enough pieces that, if you reconstruct copyrighted code in your markov-chainy way, you've got grounds for pretending you didn't build your machine to do exactly that.
> you've got grounds for pretending you didn't build your machine to do exactly that.
I believe all laws about intent have to deal with determining who is pretending and who isn't. But these laws still exist, because there are ways to prove such things.
I don't think that is so easy. A tiny training set would obviously defeat the point, but the other part is that the AI can't commit copyright infringement, and I don't have to ask it to produce anything. I merely fed it copyrighted code and released it to other developers without documenting that fact. I could possibly open source the entire bot, as no part of the AI would be under the restrictions of the training set.
Again, the law isn’t enforced by robots and is able to adapt such that “clever legal hacks” don’t typically work. Us programming nerds tend to think in terms of rigid, unambiguous rules that treat inputs as black boxes, but the law does not work like this.
If the AI could be shown to have copied the code it would likely to be found to be infringement.
If it was found to have generated new unique code, and merely leant how to program from the code it was trained on it likely wouldn't.
In either case, this is different to a clean-room implementation (which I think is what you said by "white room").
Clean-room implementations are supposed to protect against trade secret infringement, and are mostly used when building interop with hardware (where compatibility has special carve-outs).
If a person or AI had seen copyright code used in the project it would never be considered clean room.
But CDDL code is fine for a person or AI to learn from when building a new, incompatible implementation that doesn't share any code.
If you hire a programmer who has worked on said closed source and ask them to recreate that code in your GPL-licensed program, at what point will that be considered infringing on the original? Can you relicense code if you feed it through a programmer?
It’s called a transpiler; you don’t need AI for this, but it’s obviously still licensed the same as the original, because it is the original, only translated.
I'm not talking about a transpiler, I'm talking about feeding massive amounts of non-GPL code into an AI and then ask it to produce new code based on that. A transpiler would simply take a single codebase and translate it into a new format, the obvious difference being that such a tool has an obvious and introspectable transformation function.
If the resulting code works the same way, it’s still a transpiler. If the resulting code works in a different way… then I have to ask what exactly does it do and how’s that supposed to be useful.
Most of the relevant ZFS patents have already expired, so I don't think there is anything for the Oracle lawyers. Plus I live in a country where software patents aren't recognized, so double good luck to Oracle.
Sure. Regardless of how wacky your definition of "functional" gets, it is possible and relatively easy to write bugs in Python, Scheme, Haskell, or OCaml; all of these languages will happily accept `x - y` where you meant `x + y`. Idris, Agda, or Coq can catch that mistake, but still suffer from "Boolean blindness" and other traditional problems.
There are plenty of bug classes which are trivial in any language; plan interference is a good example. Languages provably cannot avoid these bugs entirely, just make them less easy.
I’ve used it/do use it and it helps to fill out obvious stuff - it didn’t make me much quicker.
The part that takes the longest is working out the tests and what the code should do, the actual internals of the implementation are simple, boring, and obvious.
Automate that and it makes developing even more fun than it is today.
I tend to find if it's that obvious you're probably already using a library.
Or, if you're not, you should be.
But, if copilot instead suggests just writing out the contents of the library directly into your code base a lot of people will do just that. That'll be lots of fun when you're trying to track down obscure bugs in huge piles of murky "copilot assisted" code.
It'll be especially bad in environments where developers feel either extrinsic or intrinsic pressure to always write more SLOC and churn out more PRs because it will allow developers to create a very compelling illusion of productivity.
I have a feeling this will be one of the long term side effects of copilot. I'm actually suspicious that this dynamic will blow away all of the productivity gains and then some and might lead to companies banning its use when they realize the true costs of sifting through the GPT spew.
I think we are using “obvious” in a different way, I mean like if I want to write an if statement or something that is easy to write, it does it for me.
I run through this point with other developers a lot. There are hard technical problems out there but a great deal of difficulty in programming is in reasoning about a domain. If Copilot is good enough that it can solve problems in any domain, is it close enough to AGI that we can call it a day?
> relentless attempts at developer commodification
LOL, it's been happening since the beginning of software. So many things reduce or replace developer work - compilers, libraries, templates, free/open tools. Desire is always going to expand to contain the whole space of what's possible and then overflow.
Copilot is a fancy autocomplete tool for code. I think the controversy comes from it being trained on public repos without adhering to licensing. I used copilot and thought the best part was when it would autocomplete based on other code I was writing. Sometimes the Copilot would help me see places where I had repetitive code which could be turned into a function.
I think MS knows damn well that they've forfeited the ethics of their code generation. There's a reason they've trained the model on GitHub repositories instead of, say, the Windows kernel driver tree. They know their model arbitrarily copy/pastes other people's code, so they train it almost exclusively on other people's code that they don't care about getting stolen. Their assumption seems to be "if Bing can find it, it's up for grabs, no matter the license". Good luck getting the same treatment from MS if you upload the leaked XP kernel to GitHub to make your own fork.
I'll accept the ethics of copilot when they add the source code for Windows, Azure and Office to their training set, because only then will MS truly reflect that their model doesn't cross the spirit or even letter of any licensing.
> I think MS knows damn well that they've forfeited the ethics of their code generation. There's a reason they've trained the model on GitHub repositories instead of, say, the Windows kernel driver tree. They know their model arbitrarily copy/pastes other people's code
Microsoft can of course create Copilot using the GitHub code. It’s not publishing any derived work on its own - and this type of access to the code is likely a large part of the reason for buying GitHub in the first place.
The only ethical issue for Microsoft here is if Microsoft sells this service (they don’t, yet) and risks including nontrivial code without attribution (which seems likely, given the behavior of the preview; but if MS, for example, limits output to a few lines or prevents generating too-large chunks verbatim, the issue almost disappears).
Ethical/legal issues and risks for users of Copilot are much larger, such as if they use it to conjure up a nontrivial snippet and then not research the origin of it. It’s no better than copying it from the original location.
Microsoft could probably throw in parts of their closed source in copilot - but not even Microsoft controls that. Third parties have copyrights that prevent it too.
But people who keep code in public GitHub repos (I assume) let GitHub do things like train neural nets on it, and Microsoft obviously don’t keep much of the windows or office sources in public GitHub repos.
I don't think selling the service or giving it away for free makes any difference. They're not creating services out of the goodness of their hearts, and these projects rack up a lot of server costs. Even if their service is free, they're getting a return on their investments somehow.
The fast inverse square root is the most nontrivial code I can think of and it's already been found to appear in suggested snippets, with attribution nowhere to be found.
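For context, the snippet in question is the famous Quake III routine. A Python rendition of the same bit trick (the magic constant plus one Newton-Raphson step; a sketch for illustration, not the original C) shows why it's anything but trivial:

```python
import struct

def fast_inverse_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) using the famous bit-level hack."""
    # Reinterpret the float32 bits of x as an unsigned 32-bit integer
    i = struct.unpack('<I', struct.pack('<f', x))[0]
    # The "magic constant" turns a shifted exponent into a first guess at 1/sqrt(x)
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack('<f', struct.pack('<I', i))[0]
    # One Newton-Raphson iteration refines the estimate
    return y * (1.5 - 0.5 * x * y * y)
```

The point stands on its own: nobody independently "autocompletes" 0x5F3759DF. When that constant appears in a suggestion, it was memorized, not derived.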
If we accept Copilot as merely a tool, we'd need to consider any developer using that tool to be immoral. There's no discernible difference between shamelessly copy/pasted code and Copilot output, so why consider the tool more than an automated clipboard?
No, I think the tool is built wrong, setting users up to fail. It's a copyright footgun to produce buggy, vulnerable, often even completely wrong code.
As for the copyrights, all code with a license has the same copyright as any private code hosted on their own servers. You can't just plug some GPL code into your project and sell it, even if you can find the code itself on Google. There is no copyright difference between the projects, it's merely a matter of availability to the scanner.
Adding Microsoft's own, proprietary, quality code to the network would be the gesture of good faith that would make me believe that the developers never intended to break any licenses and that it all just got out of hand.
I can’t see how developer ethics comes into it at all here. Either the code is trivial boilerplate and not a license issue, in which case there is zero ethical issue with using it, in my opinion. Just like I copy two lines of code to open a file from any repository with any license, without either ethical or IP worries. If the code is nontrivial, like the fast inverse sqrt, then it’s on the user to realize they have been fed a landmine by Copilot, and it’s on them to avoid or attribute as appropriate. That is a license issue, though, not an ethical one. I fail to see a situation where it’s ethical to violate a license, or unethical to use code that doesn’t violate a license.
Note though that all such examples of nontrivial regurgitation that have been presented yet have been deliberately “triggered” (as far as I know) knowing they would likely show up if copilot was fed the function header. It’s also important to remember that this is still preview software. The final version hopefully has more restricted output since this is obviously the big weakness of the system.
I agree it’s a license footgun 100%. But as I said this is the developers problem. Which is why few of us will ever be able to use it in its current form.
As for the MS sources argument: the reason MS bought GitHub is to have this kind of access to a lot of code. It’s their code to use in this way. People who committed code gave GitHub (and its future owners) that right. Microsoft (as far as I understand) can sell the right to view this code, for example, through GitHub fees. It’s not against the license of a GPL repo to do so. So Microsoft isn’t violating a license by mangling the code into snippets and charging for the pleasure of downloading those snippets. What’s against the license terms is for me to download the snippet and accidentally use it in my proprietary software.
Does that make the tool bad to the point of being useless? Perhaps. Is it illegal or unethical? I don’t think so.
> You can't just plug some GPL code into your project and sell it, even if you can find the code itself on Google.
Although some people seem to think copilot can be used to “wash” licenses by giving users a black box “excuse”, I think that idea is dead in the water. Anyone who has a nontrivial-enough GPL snippet in their proprietary code has violated the license.
> "There's a reason they've trained the model on Github repositories instead of, say, the Windows kernel driver tree."
At least part of the reason has to be because only a tiny percentage of developers use C++, particularly the flavor of C++ that Visual Studio speaks, as opposed to Javascript, Python, etc. Moreover, kernel and driver code doesn't resemble boilerplate code used in desktop applications. Is this not obvious to the people who keep repeating this?
No code resembles other kinds of code. A Python data processing script resembles nothing of a Django Web server, but both are considered Python. A dotnet MVC server has completely different architectures, standard types and behaviours than a Windows Forms application.
The C and C++ boilerplate Microsoft uses is very much relevant to any driver development or native application development (if that still exists) for their platform. Their example code, MSDN snippets, and documentation are very influential for anyone using C++ for Windows applications. Their COM+ libraries are even more relevant because they all live in user land.
There's also plenty of MS code that's written in other languages for platforms like Azure or UWP.
Then there's the C and C++ code that's out there on Github. The C style of the Linux kernel, forked over and over, is completely useless for anyone developing network tools. The GTK or Qt C++ files are useless for anyone writing wxWidgets code. The conventions, behaviours, and style of the source code of libcurl and Linux are as distant from each other as Windows Explorer is from the NT kernel. Yet both have been taken into account by the mighty Algorithm.
How is my shitty early Android app, still written in Java, with clearly C#-inspired naming and almost PHP-like class structure more relevant to anyone than Microsoft's own code base? At least theirs is functional and useful.
"Nobody programs like Microsoft, so the code examples are useless" is not an excuse, because you can apply it to almost every project on Github in some way. The machine learning is supposed to distinguish all of that; that's the entire point.
1. they can check to see if the generated code is an exact copy of an example in the training set
2. when the code matches, they can discard it, they got many predictions for each prompt anyway.
3. My preferred option - they can display the URL of the source page together with the code, acting like a regular search engine at this point; this also solves the problem of not knowing the copyright status of the code
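Option 1 above could be approximated even without a search engine. A minimal sketch (whitespace-normalized token shingling; the window size `k` is an arbitrary choice, not anything Copilot is known to use) that flags verbatim overlap between a generated snippet and an indexed corpus:

```python
def shingles(code: str, k: int = 8) -> set:
    """Hash every k-token window of a snippet, after whitespace normalization."""
    tokens = code.split()
    return {hash(tuple(tokens[i:i + k])) for i in range(len(tokens) - k + 1)}

def verbatim_overlap(generated: str, corpus_snippets, k: int = 8) -> bool:
    """True if any k-token window of `generated` appears verbatim in the corpus."""
    index = set()
    for snippet in corpus_snippets:
        index |= shingles(snippet, k)
    return bool(shingles(generated, k) & index)
```

Note this only catches exact token-level copies; a suggestion with renamed identifiers would slip through, which is why real detection would need normalization beyond whitespace.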
I'm curious if they took only the whole of Github after acquiring Github, or if they've taken 'all publically visible code everywhere'. Y'know, as an open source dev who's continued to use Github after Microsoft took it over. I'm curious if I walked right into that one or if it would've made no difference whatsoever.
I’ve been using copilot for the past couple of months, and it’s seriously becoming a part of my daily coding workflow.
The majority of suggestions are not quite what I want, but I’ve found that the more I comment my code, the more personalised the suggestions get. Consequently (as a solo founder in my own startup), copilot finishing my code for me during late nights, trying to ship features for customers before the following day, is something I have become grateful for.
It’s a double-edged sword: it’s enabling me to grow my business and remain self-employed, but I also understand the concerns. At the end of the day it’s not something I need to do my job (unlike version control or an IDE, for example), but more of a nice-to-have…
It feels like the majority of my coding consists of translating extremely complex business requirements that neither the business people nor me understand 100% into highly specific code that appears to do what we want it to do. How can Copilot help me here?
I can only speak for Tabnine, which is like a single-line mini-Copilot, but I find it just saves me keystrokes. It doesn't help me design things.
For example - code quality aside; just for the sake of demonstration - if I have a line `name = data.get("name")` and then press enter and write "ad", it'll likely suggest `address = data.get("address")`, so I can type "[Enter]ad[Tab]" and save myself a few seconds.
Repeat this for every line in a program, and those seconds add up. I'm a fast typist, but it's still nice to have intelligent autocomplete that can infer my intentions with pretty good accuracy.
I'm guessing Copilot will largely be similar, but with support for multiple lines. It'll probably be especially helpful for imperative, somewhat repetitive languages like Go, where boilerplate is common.
In my experience, copilot is good at boilerplate and common patterns, but it will not help you with the novel stuff (where it will predict the boilerplate and not the interesting transformation that you actually intend).
MS will just retrain the model on a different input. They could not care less, and will actually be happy that they get an external statement on the license situation and the ethics.
I think it's a fantastic tool to work with, though. I didn't think so after seeing the demo; I basically brushed it off. But using it is probably one of the most productive things to happen in the past decade.
I have my own GPL software out there. Most of the time I think it doesn't really get used, so it's not that much of a concern to me; I imagine it's like that for other devs too.
I suppose if you're MongoDB (similar to GPL/used to be) or some big company you care more.
Could this become something people can't program without? Like imagine being stuck recycling the same programs and paradigms, not being able to move to something new, because Copilot hasn't seen it before.
I'd guess yes, for some people. Others, though, will refuse to use copilot out of sheer obstinacy if nothing else. They will produce the new paradigms for copilot to then consume.
So, I'm reading the linked article by RMS about Service as a Software Substitute (SaaSS) [1] which is one of the reasons why they object against GitHub Copilot.
The key argument for why SaaSS is ethically wrong is that it denies control over a computation that I could do on my own.
> "The clearest example is a translation service, which translates (say) English text into Spanish text. Translating a text for you is computing that is purely yours. You could do it by running a program on your own computer, if only you had the right program. (To be ethical, that program should be free.) The translation service substitutes for that program, so it is Service as a Software Substitute, or SaaSS. Since it denies you control over your computing, it does you wrong. (emphasis mine)"
I don't find that argument very convincing because it implicitly assumes that there is no alternative translation program that I could run on my own computer.
However, if there is an alternative, then a SaaS offers me choice. I can run a program on my own computer, e.g., if I am concerned about data privacy, or service reliability. The downside is that I have to install and maintain the software on my computer. Or, I could use an external service. The upside is that the barriers of use are minimal.
Of all the articles by RMS I have read so far, I find this one the least convincing.
They haven't made up their minds yet on the licensing problem:
>With all these questions, many of them with legal implications [..] there aren't many simple answers. To get the answers the community needs, and to identify the best opportunities for defending user freedom in this space, the FSF is announcing a funded call for white papers to address Copilot, copyright, machine learning, and free software.
Their "unacceptable and unjust" verdict stems just from the licensing of GitHub Copilot / Visual Studio Code itself:
>We already know that Copilot as it stands is unacceptable and unjust, from our perspective. It requires running software that is not free/libre (Visual Studio, or parts of Visual Studio Code), and Copilot is Service as a Software Substitute. These are settled questions as far as we are concerned.
Copilot is an autocompletion framework the way Google is an autocompletion framework.
Both are driven by vast amount of data processing that can't be done locally both because you don't have the horsepower and you don't have the bandwidth or pragmatic data access to terabytes of source code.
So instead of vacuous appeals to emotion, it's better to justify our opinions with objective reasons other than "but I want to have this". We all want things, but we're not entitled to them.
Aside from the fact Copilot literally can only be offered as a service due to its nature (unless you want to sound like one of those jokes where "you downloaded the internet to your USB stick"), everyone is free to offer a service precisely how they decide.
They're not obligated to give you anything they don't want to. They don't have to listen to you, or FSF, or anyone else about what they consider, arbitrarily, an "absolute disgrace". You use it or you don't use it. Simple as that.
P.S.: I consider it an absolute disgrace that ice cream is not free, but this argument never seems to work in practice.
> They're not obligated to give you anything they don't want to. They don't have to listen to you, or FSF, or anyone else about what they consider, arbitrarily, an "absolute disgrace". You use it or you don't use it. Simple as that.
No, it's not. The discussion that we're having is over whether this is permissible or not, and the lobbying that groups such as the FSF are doing is in support of a different set of rules to be enforced.
FSF considers many things not permissible, that thrive despite their harsh judgment. So what's the purpose of saying "permissible" then? According to whom and why?
FSF is an ideological organization, they're a bit like the religious equivalent of some clergy in the far East.
Yes I know a circle of people respect FSF a lot and pay attention every time they wave their fingers at someone. The same is also true of the ayatollah when he issues a fatwa. And then the world keeps spinning and nothing changes.
Their objection to it has nothing to do with the use of GPL source code. They object to it because:
> We already know that Copilot as it stands is unacceptable and unjust, from our perspective. It requires running software that is not free/libre (Visual Studio, or parts of Visual Studio Code), and Copilot is Service as a Software Substitute.
On the question of the use of source code released under the GPL, they do not have a position yet:
> With all these questions, many of them with legal implications that at first glance may have not been previously tested in a court of law, there aren't many simple answers.
They most likely are rebuilding the engine without GPL code and then doing a bunch of functional tests to see how bad it is. If it's not significantly worse, they will probably just not include GPL code anymore.
That's a fair and valid complaint. The thought that learning from copyrighted content makes new output copyrighted is pretty far fetched and wouldn't apply to a human.
I think people value their code snippets way, way too much. A 10-line function to post to Twitter is not worth anything. It's an entire codebase that has value.
> The thought that learning from copyrighted content makes new output copyrighted is pretty far fetched and wouldn't apply to a human.
It literally applies to a human. Copyright is about reproducing the same work. "Transforming" the work means copyright doesn't apply.
Most of your brain is trained on ideas that come from someone else's proprietary IP, whether you realize it or not. Think about that next time you're unintentionally humming that catchy tune from a Coca Cola commercial.
The copy/transform distinction isn't just about fair use parodies or commentary, but things like writing music or drawing paintings or writing fictional books in a similar style to someone else (and using some of the same ideas).
The crux here is that we can't accept that machine learning is "learning". We think of it as copying, therefore subject to copyright.
It doesn't help that Copilot in edge cases will copy. But in many cases the resulting snippet is substantially a new work.
But AI is inevitable, and therefore we'll have to start treating machines like human agents. It'll be really weird.
> someone putting their code under BSD for instance do so shouldn't be bothered by copilot regurgitating their code
I don’t agree in general with this. Remember that the BSD family of licenses still require that a copyright notice and the terms of the license be reproduced in distribution of the software and derivatives of the software, both in source and in binary forms.
Just because I make software that I release under the ISC license, and I want proprietary software to be able to build on my work, does not mean I am OK with someone stripping away the copyright notice and the license terms from my code and claiming it as their own. Quite the opposite.
However, at the same time, if what is being reproduced is only a small snippet or some generic code, as I understand is what Copilot will usually do, I don’t personally mind. But I still think it needs to be tried in courts and that we get some rulings on it.
And I remain skeptical towards Copilot because I think it will be able to reproduce non-trivial portions of code as well, depriving people of credit for a lot of hard work that they put in. At the same time, it is cool tech, and it looks to have the potential to save a lot of time for a lot of its users by automating a lot of menial work in typing out the same old lines of code again and again. So it’s not like I am directly opposed to Copilot either. But I think we need to acknowledge the issues and that Microsoft and GitHub should work to address these kinds of things. And I am happy that the FSF is challenging them on these things, even though they are doing so from the point of view of a family of licenses that is more restrictive than the type of license I personally put on the code that I myself produce.
Do we owe all our professors and textbook makers compensation when we make money off our brain neural networks that they trained? Everyone also keeps talking about how bad copilot is. It’s the first step! It’s only going to improve and probably fast, given the potential value creation.
Copilot looks very cool, but if people end up using it a lot, it probably means their programming language is not expressive enough; after all, programming languages were invented to be accessible to humans.
What I'd like to see is a copilot for scientific papers. There's so much duplication out there that it would be easy to train, and it would save tons of time from the chore of writing and referencing the same things over and over.
I think Copilot is a hard problem, maybe it isn't even solvable.
Sometimes it blatantly copies GPL code without my knowledge.
Sometimes I myself write code that could be part of a GPL code-base, without knowing.
Funny thing is, the difference here isn't the actual code that's written, but that Copilot has seen many GPL code bases and I didn't.
Sometimes I really have the feeling Copilot understands my code base and suggests code that seems to be custom tailored to it. Albeit in most of the cases it doesn't fit 100%.
I think the latter cases are when Copilot shines and doesn't violate GPL code at all, but can I be safe? Probably never.
To be perfectly honest, I think people will realise it's just not that useful and forget about it pretty quickly.
Even at my place of work, there were some expressing interest in it, and after playing for an hour or two, haven't touched it since. I get the impression there are more people discussing it than actually using it.
These language models are not being utilised very well by tools such as Copilot because they map very few functions from the editor to the language model. The more functions you map, the more you get out of it. If Copilot's workings were completely open and configurable, you would find that people collectively could work together to map many functions to the language model. These models are capable of far greater wonders with deep integration and collaboration. I have tried to demonstrate this with Emacs.
What seems useful to me is the ability to type in "function that takes the path to an image file and returns a new image file with rounded corners".
These are not groundbreaking problems; I'm generally looking for a solution out there that uses a popular library. This is especially useful in a language where I'm not up to date on what the de facto library of choice is for various use cases. In most cases, especially while prototyping, I'm not going to write it myself, nor care about which library; I'm far more concerned with some big-picture goal.
If someone builds a product that can do the work of Googling a solution for me, that's the draw of the product. The code is freely available anyway.
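A minimal sketch of what such a prompt might yield. A real completion would more likely reach for a popular imaging library like Pillow, but to stay self-contained this hypothetical version just computes the rounded-corner alpha mask in pure Python (the function name and approach are illustrative assumptions, not anything Copilot actually emits):

```python
def rounded_corner_mask(width, height, radius):
    """Build an alpha mask for rounding an image's corners:
    255 inside the rounded rectangle, 0 outside the corner circles."""
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            # Only the four radius-by-radius corner squares can be clipped.
            in_corner_zone = ((x < radius or x >= width - radius) and
                              (y < radius or y >= height - radius))
            if not in_corner_zone:
                row.append(255)
            else:
                # Centre of the nearest corner circle.
                cx = radius - 1 if x < radius else width - radius
                cy = radius - 1 if y < radius else height - radius
                inside = (x - cx) ** 2 + (y - cy) ** 2 <= (radius - 1) ** 2
                row.append(255 if inside else 0)
        mask.append(row)
    return mask
```

Applying the mask as the alpha channel of an RGBA image (e.g. via Pillow's `Image.putalpha`) would then produce the rounded-corner output file the prompt asks for.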
I’ve been using Copilot for weeks now. It’s definitely useful for building upon what you already wrote. It’s very effective for single lines, but I don’t trust it to come up with entire functions. I tried, but obviously YMMV.
The licensing is definitely a problem, but I think that Copilot only highlighted the issue - it didn’t create it.
The concept of software license looks pretty fragile to me. You can own software but you can’t really own PL statements.
You can own the whole but you can’t really own the atomic parts that make the whole.
If so, closed-source is just a way to make you work really hard to achieve a result that someone else already achieved by means of obfuscation and secrecy. I’m not sure where open-source stands. Maybe it’s just a social contract.
IANAL, but until it is settled whether software produced with the aid of Copilot, thus potentially containing LGPL'd, GPL'd, or even AGPL'd code fragments (you never know, really, AIUI), is subject to these or other copyleft licenses, I think customers are well advised to steer clear of Copilot. To the best of my knowledge, GitHub won't provide legal shelter if customers get sued for xGPL violations; the GPL, OTOH, has sufficient case law to make using Copilot very risky.
My biggest problem with Copilot is not how it's trained, but its targeting of Microsoft coding tools.
I don't use visual anything, and I don't know anyone who does. I code a lot of python, html and JS, and I use neovim. If I need a smart 'crutch' I'll whip out pycharm.
Mostly I don't feel the need for such things, but it would be fun and interesting to see just how good copilot is.
Areas of interest
While any topic related to Copilot's effect on free software may be in scope, the following questions are of particular interest:
- Is Copilot's training on public repositories infringing copyright? Is it fair use?
- How likely is the output of Copilot to generate actionable claims of violations on GPL-licensed works?
- How can developers ensure that any code to which they hold the copyright is protected against violations generated by Copilot?
- Is there a way for developers using Copilot to comply with free software licenses like the GPL?
- If Copilot learns from AGPL-covered code, is Copilot infringing the AGPL?
- If Copilot generates code which does give rise to a violation of a free software licensed work, how can this violation be discovered by the copyright holder on the underlying work?
- Is a trained artificial intelligence (AI) / machine learning (ML) model resulting from machine learning a compiled version of the training data, or is it something else, like source code that users can modify by doing further training?
- Is the Copilot trained AI/ML model copyrighted? If so, who holds that copyright?
- Should ethical advocacy organizations like the FSF argue for change in copyright law relevant to these questions?
While I do believe that the topic is definitely worthy of discussion, my question would be a bit different.
If the tooling is already pretty capable, wouldn't just ignoring all of the ethical questions lead to a market advantage? Say some company doesn't necessarily care about how the tool was trained and the implications of that, but just uses it to have their developers write software at 1.25x the speed of the competition, knowing that no one will ever examine their SaaS codebase and they won't care about license compliance. Wouldn't that mean that they'd also be more likely to beat their competition to market? Ergo, wouldn't NOT using Copilot or tools like Tabnine put most others at a disadvantage?
Personally, I just see that as the logical and unavoidable progression of development tooling, the other issues notwithstanding, very much like IDEs became commonplace with their refactoring tooling and autocomplete.
I've worked with Visual Studio Code on large Java codebases, as I've also used Eclipse, NetBeans, and in the past few years IntelliJ IDEA; with every next tool I found that my productivity increased bunches. Now it's to the point where the IDE suggests not only a variety of fixes for the code itself, but also for the tooling, such as installing Maven dependencies, adding new Spring configurations, and so on. It would be hard to imagine going back to doing things manually, and it feels like in time it'll be very much the same way in regards to language syntax or looking at documentation for trivial things. After all, I'm paid to solve problems, not sit around and ponder how to initialize some library.
The actionable-claims question is the hot one; the rest is sort of answered by it indirectly. It's mainly interesting because a positive answer could cause commercial entities to ban the usage of Copilot (and similar tools) in their organizations to avoid such claims. So it could potentially be very damaging. Stack Overflow would be a nice example where people learn from each other, and no doubt bits of IP from companies and OSS repositories get mingled there as well.
My impression is that these claims would not be actionable for a few simple reasons:
- The generated code is pretty small.
- The generated code is adapted to the context (i.e. not a verbatim copy).
- The generated code would be common to many repositories and not just one.
Because of all of the above, tracing any code fragment to a specific repository and then defending a claim would probably be very hard/impossible. Copyright is about the form of things and if it's not a verbatim copy of something really unique, it's hard to make the case for an infringement.
> knowing that noone will ever examine their SaaS codebase and won't care about license compliance
Everyone thinks this until they become the next Linksys, and have to crack open their entire tech stack because someone reverse engineered the text of the GPL in their firmware...
Frankly, I doubt that most software projects out there get that sort of attention. Aside from that, it's also very likely that management and the legal departments of most orgs don't inspect the licenses of all their libraries that closely.
Not saying that I condone it or anything like that. However, it does feel like these things will often be ignored because of the lack of a regulatory body that would inspect all codebases for compliance (even the idea of which doesn't feel feasible).
Because of that, cases where someone has both the skills to decompile a codebase and also has an axe to grind seem like the exception, rather than the norm.
In the linksys case no decompiling was even necessary. The plain text of the GPL license was present in the firmware image. Grep is a great tool for this sort of thing that everyone has access to :)
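The check really is that simple; this hypothetical helper does in Python what the grep one-liner does, scanning a binary image for the license's telltale text (the function name and marker string are illustrative assumptions):

```python
def find_gpl_marker(firmware_path):
    """Return the byte offset of the GPL license's header text inside a
    binary blob, or -1 if it is absent -- the same check `grep -a` performs."""
    marker = b"GNU GENERAL PUBLIC LICENSE"
    with open(firmware_path, "rb") as f:
        return f.read().find(marker)
```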
Tangential, but isn't it kind of weird that Copilot is a code generator and not a StyleGAN-like code refactorer? That feels like a much easier task, because you get to infer the intent of code from an existing example rather than from context alone.
I'm an open source audio coder. I'm not any great shakes as a programmer but I make my living by regularly coming up with novel ideas, and my codebase is on Github and MIT licensed. Over the course of hundreds of DSP plugins, some key parts are very repetitive.
This means that there are audio processing algorithms I do which NOBODY ELSE is doing, because they're unusual and in some ways arbitrarily wrong. They're chosen to produce a particular sound rather than the textbook-correct algorithm output. Example: interleaved IIR filters, to make the audio interact differently in the midrange and produce a lower Q factor at the cost of producing some odd artifacts near the Nyquist frequency.
Nobody out there in the normal world or commercial DSP or academia would intend to do that, because there are significant reasons not to (which I work around, in context). But if that stuff appears in Copilot output, they are jacking my INTENT but violating the very lenient MIT license by stripping my credit. They'd also be misleading hapless audio programmers who didn't intend to adopt my techniques, but that's a side issue.
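As a hypothetical illustration of the kind of technique being described (not the author's actual code), here is a minimal interleaved one-pole IIR lowpass: two independent filter states serve alternating samples, so each filter effectively runs at half the sample rate, shifting the midrange behaviour and lowering the effective Q at the cost of artifacts near Nyquist:

```python
def interleaved_onepole_lowpass(samples, coeff):
    """Run two independent one-pole IIR lowpass filters on alternating
    samples. Each filter sees every other sample, so it effectively
    operates at half the sample rate -- the deliberately 'wrong' choice
    described above, made for its sound rather than textbook correctness."""
    state = [0.0, 0.0]          # one filter state per interleave slot
    out = []
    for i, x in enumerate(samples):
        k = i % 2               # even samples -> filter 0, odd -> filter 1
        state[k] += coeff * (x - state[k])   # y[n] = y[n-1] + c*(x - y[n-1])
        out.append(state[k])
    return out
```

The point is not that this exact snippet is special, but that the interleaving choice itself is a distinctive, traceable design decision.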
I'm interested in who else out there has a substantial codebase subject to Copilot reprocessing, who is demonstrating intent that isn't 'normal' and doesn't exist in the 'normal' world of whatever domain's being coded for.
The point is, can it be demonstrated that Microsoft is taking SPECIFIC things from specific open source developers that can be clearly traced back to one source of distinct intentions, and then stripping the licensing? I feel like said intentions cannot be 'normal and industry-standard and correct'. It's gotta be things like my IIR interleaving, where it's a quirky choice you wouldn't automatically do, very likely with costs and consequences in its own right. Something you could choose to adopt if you liked the trade-offs (or in my case, the sound).
> The reason is that Copilot requires running software that is not free, such as Microsoft’s Visual Studio IDE or Visual Studio Code editor the FSF contends, and constitutes a “service as a software substitute” meaning it’s a way to gain power over other people’s computing.
Hold up a second. So if people have already made the choice to run software that is not free... enhancing their chosen tool set is unjust? (Besides, VS Code is free.)
I'm honestly interested in understanding their perspective, but I'm not following the leap from using an extension in VS code to gaining power over other people's computing.
Free as in beer. The built-in tracking Microsoft adds to the editor isn't freedom. There's VSCodium, which compiles the MIT-licensed project without telemetry, but at that rate I'd use a different editor.
I’d imagine that GitHub will end up re-training Copilot, excluding any “copyleft” licensed code. Not because what they do is legally tainted, but to avoid being berated by the FSF and the bad press that ensues.
Once again though, the FSF makes “free software” less relevant and harder to use. Who will want to use such software for anything when being threatened with costly litigation and bad press?
When tech such as Copilot truly comes into its own, it should be a productivity silver bullet. I hope at that point I will have access to it. Once we have senior-software-engineer coding as a service, if I had it just for myself I would hoard it and not be quick to share.
> The FSF said there are legal questions pertaining to Copilot [...]
There have always been lots of untested legal questions about GPL & co. Why hasn't the FSF figured out what it is they do and don't want? Shouldn't knowing what the licenses actually mean and communicating that to people be their number one job? Why else do they exist? To spread feelings and confusion?
>> We already know that Copilot as it stands is unacceptable and unjust, from our perspective. It requires running software that is not free/libre (Visual Studio, or parts of Visual Studio Code), and Copilot is Service as a Software Substitute.
So they don't know / aren't sure on the question of GPL usage in Copilot. But they have a problem with SaaS and products that are not open source?
The entire purpose of the Free Software Foundation is that any product you offer to a user should be owned fully by the user, which means they should be able to take it apart, modify it to suit their specific needs, and put it back together. At minimum, that means they need to be able to see the source code and be able to build it themselves and run it on their own hardware.
So yes, closed-source software as a service is inherently unethical.
You don't have to agree with them, but they've been pretty consistent in this position for nearly 40 years. It's not exactly coming out of left field.
I don't know. I guess I am not well versed in the FSF. I thought they were there to promote Free Software; I didn't know their worldview was that any non-open-source software is "unacceptable and unjust".
They promote Free Software (and specifically copyleft over permissive licenses) because they view proprietary software as morally wrong and something that should not exist.
That sounds like the completely wrong word for that. The free software foundation advocates for free software, does not like non-free software, big surprise.
Similarly fair trade does consider non-fair trade unethical.
You might disagree with that but having an opinion does not make one a "bigot".
There is a difference between advocating for free software and saying that all non-free software is evil. To compare to the original meaning of "bigotry", there is a difference between talking about how great your religion is and talking about how all other religions are evil.
Non-fair trade is considered unethical based on ethics. FSF considers closed source unethical based on their particular agenda, which has nothing to do with ethics or morality.
> Non-fair trade is considered unethical based on ethics.
No, it is not. If it were, fair trade wouldn't be niche. It adds lots of other obligations on companies "based on their particular agenda". I think it is worthwhile goal and something I like to support, but won't pretend that it is somehow pure, self-evident goodness, just like non-fair trade is not pure evil.
Again, you might disagree or have different ideas what "free" is supposed to mean but you should be better than throwing around phrases like "agenda" or "nothing to do with ethics or morality".
People and societies have no problem with ignoring ethics when it benefits them. Otherwise we wouldn’t be waging wars and sucking profits off poorer societies.
Fair trade is based on pretty fundamental ethics, such as fighting slavery. Can you point any such concept being foundational to Free Software?
> Fair trade is based on pretty fundamental ethics, such as fighting slavery.
Not really, no. Again, lots of non-"fair trade" products exist and those are not against fighting slavery.
> Can you point any such concept being foundational to Free Software?
Seriously? Their whole shtick is fighting software practices they consider unethical. You might disagree about whether that is fundamental ethics, but to them it is, and it is rude and dishonest to pretend that is not the case.
For example, I am not vegetarian, but understand that there are people who feel strongly about that. That is fine and if you go "vegetarians are pushing their agenda which has nothing to do with ethics" that says more about you than about them.
Vegetarians do have a valid point: eating meat involves animal suffering. That’s ethics, not an entirely arbitrary, unfounded belief. Again, how does Free Software relate to ethics?
[0] https://www.fsf.org/blogs/licensing/fsf-funded-call-for-whit...