Copilot (Probably) Won The Copyright Case

Captions
The developers suing over GitHub Copilot just got dealt a major blow in court. This is going to be a fun one. Before we go any further, quick disclosure: I am invested in Microsoft. It's not a lot of money, but it's enough that I legally should probably let you guys know.

If you're not already familiar, there's a bunch of devs who are very upset about Copilot. If somehow you don't know what Copilot is, Copilot is a code autocomplete tool. It's your AI pair programmer, made by GitHub. The thing that's sketchy about it being made by GitHub is that they have access to a ton of code data, like an unfathomable amount of existing source code that many developers have created. And while those developers might have open sourced their code and given it a license for free use for whatever you want to do with it, most also require credit. That's where things start to get really sketchy, because if you're training an AI on some open-source code that requires credit, and then somebody else uses that AI to generate code, they probably don't even know what source that code came from, so they have no way to credit. And we're not even talking about the chaos that is the potential for private code in private repos. Terrifying. These are just real complications that exist once we get into this world where we have a box in the middle that's obfuscating the inputs and outputs. And more important points coming from chat already: GPL code requires changes to be sent back upstream, so if it's trained on GPL, you're kind of screwed.

So what's going on with this court case? Because this has impact possibly far outside of software dev. If it's determined that you can train an AI on existing data and then whatever comes out the other side doesn't have to follow the same rules, that's a scary problem. And it's important for us to understand what the courts are saying here, because there is basically no precedent here yet. We don't have an official court ruling for whether or not things created by AI can even be
copyrighted on the other side, much less how the copyright rules apply to them on the other side. Very useful link, Kush, thank you for that.

One other notable thing that just happened is that Microsoft dropped their observer seat on the OpenAI board, which is interesting in and of itself, but Microsoft's also the owner of GitHub, so even more interesting. It seems like the relationship between these things is more complex than ever. The potential legal case here is also more complex than ever. Let's dive in to what the law and what the courts have said so far.

A California judge dismissed nearly all claims laid out in a lawsuit that accuses GitHub, Microsoft, and OpenAI of copying code from developers. A judge has tossed nearly all of the claims that a group of devs brought against GitHub, Microsoft, and OpenAI in a copyright lawsuit filed in 2022. We should take a quick look at the lawsuit. This is a lawsuit, yeah. As I said, this isn't just for devs. This could rewrite AI copyright rules entirely. The belief of the people who are doing this lawsuit is that Microsoft, GitHub, and OpenAI are all violating copyright law by reproducing open-source code using AI. That's fair.

In a court order unsealed last week, a California judge left only two claims standing: one that accuses the companies of an open-source license violation and another that alleges breach of contract. The open-source violation bit is also really, really interesting, because open-source licenses haven't really meaningfully been tried in court. If we go to the UploadThing repo and we take a look at the license, it's the standard boring MIT license. Got to update that copyright. This license is relatively open, and how this would be interpreted in court is still up for interpretation. We haven't really tried the MIT license in court, much less the GPL stuff, so we don't really know what would happen if somebody was to violate the terms of MIT or GPL or any of these other licenses. How much do those hold up in court? Usually the goal
of these licenses isn't to outright prevent any types of things from happening. It's a more general protection. Like, the MIT license especially calls out that the software is being distributed without restriction, including without limitation the rights to use, copy, yada yada yada. But the more important point is this all caps at the bottom: the software "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages or other liability. This is the goal of the license, for the most part: to make it so you can't sue an open-source maintainer for their code, which I think is really fair and important, and I would be surprised if anybody tried to challenge this in court. If an open-source project caused someone to lose money, I don't think they can sue, due to the nature of this license. But everything above, all of this stuff here, hasn't really been tried in court yet, and it's going to be interesting to see how that goes, on top of the fact that the AI copyright question still hasn't been resolved.

The original lawsuit made 22 claims against the trio, the trio being Microsoft, GitHub, and OpenAI, accusing them of violating copyright laws by allowing the AI-powered GitHub Copilot coding assistant to train on developers' work. Microsoft, the owner of GitHub, uses OpenAI's technology to power the tool. All three companies asked the court to throw out the lawsuit in January, but the judge, Jon Tigar, denied that request. However, Judge Tigar's latest ruling deals a blow to the accusation that GitHub Copilot violates the DMCA by suggesting code without proper attribution.

That's what I was emphasizing before, that we don't really have the knowledge of. We don't know if reproducing code that has an MIT license on it, if that thing in the middle... I did a diagram about this. Let's say we're using AI to design an app.
This is a diagram I drew to discuss Figma's AI stuff, but it applies the same here. So we have these different apps that the AI uses to train. The AI gets given these example apps, these example screenshots, this example data, and this might be copyrighted, it might not be, but it exists already. And if I was a designer or a developer and I looked at some code and I used it as inspiration to write my own unique thing, how much I have to attribute the inspiration is up for debate. If I am using things that I've played with in the past and they're just sitting somewhere in my brain and then become references when I'm designing things, it's hard to know how much that violates copyright law. It's legitimately quite difficult to know if something inspired you or not as a human being. So if I have used hundreds of different apps and then I go make my own and it happens to look like one of those ones that I wasn't even thinking of, did I violate copyright?

Where things get even more complex now is if it wasn't my brain. Because if this was just, like, my brain's references, where my brain has in it all these different apps that I've looked at, all these different designs just floating around, and then when I go to make my own unique app it might be heavily inspired by some or all of these things, that's a complex problem, and for the most part the courts have opinions. But the complexity comes when this gets replaced with AI training data. So now that this AI has been trained on this data, if the app I get out on the other side is basically just a clone of this app, if I'm effectively using the AI as a black box to obfuscate the fact that I'm copying this app, that's skeevy, that's questionable. This is the thing that we have to figure out now: what level of obfuscation, what level of black box between the design I asked for and the design that inspired it, is great enough to justify copyright law being different here? Because if this wasn't AI, if this was just my brain and I made something that was similar, we
would be judging based on the similarity of the output and the input. But once the AI is in the middle, things get more complex. I would argue that the thing we should be discussing here isn't whether or not this box is big enough or whether or not the things being trained on are unique enough. The only thing that matters, in my opinion, is whether or not the output and the inputs can be considered by the average person to be unique works. If this unique app and the app that I make through the AI are similar enough that a reasonable person could mistake the two, that feels like the violation, not the use of AI in the middle to obfuscate it. So this is where things are going to become more complex, because I feel like we might be focusing on the technology too much and the reality of how these things work a little bit too little. But let's keep reading to see what the court decided thus far.

So, I'm quite curious. As we said, Judge Tigar has ruled that the DMCA stuff probably isn't the case here. Although the court previously ruled that Copilot's suggested code wasn't close enough to its original source, an amended version of the complaint takes issue with GitHub's duplication detection filter, which users can toggle on to detect and suppress Copilot suggestions matching public code found on GitHub. Very interesting that they actually have a feature for this. I didn't know they did that.

Establishing trust in using GitHub Copilot. What technical safeguards are in place? GitHub has created a duplication detection filter to detect and suppress GitHub Copilot suggestions that contain code snippets that match public code on GitHub. Your enterprise administrator can choose to enable this filter for all organizations within your enterprise, or they can defer control back to the individual organizations.

So again, as I was saying before, if we were to take this design, and instead of thinking of these as designs we think of these as source code, so this is like A B C B A, and we have this source code.
This is just in this pile of training data. So we have this training data that has this code, ABCBA, and this is coming from some public MIT repo, so it's publicly licensed. Well, we'll be more strict, we'll do GPL, because then it has to be credited and any changes have to be ported back. So now that we have this public GPL repo that has this code in it, and it's one of the things being trained on, let's say that our shell is inspired by app B. We'll even put app A's text above here. So we have app A's text, app B's, like, frame, and then we take not even all of the code, we'll just take the B and C from here. Is taking this snippet from this public code... first off, is it actually taken? Or did we happen to read a bunch of other projects that do ABC over and over again and establish a pattern that we got this from? Like, is this code actually from this public repo, or is it from any of the hundreds of others that have the same code in them?

So if we just have a ton of these repos around and all of them have similar code, maybe this one's MIT licensed, maybe this one's actually owned by GitHub. We have all of these different sources with very similar code, and maybe all of these are part of the training data. Maybe none of them are. Maybe the one that's trained on is actually GitHub's code, but then there's a public repo that happens to have very similar looking code. Can the person who made this app get sued? Well, if the owner of the GPL code can make a case in the courts that this code is clearly their code that happens to be here, then they could say it's a violation of their license and their copyright, because this code is identical. But if the owner of this code could prove that there's prior art, that there are other codebases with the same code that are not from that owner, it gets messy. But now this is a debate over both ownership and reproduction, where it's hard to know how unique a given line of code is, and the AI is only making it harder. This is also why things like patent law are so complex, and why
you'll often get advice, especially from people like ThePrimeagen, to never look at patents, because as soon as you do, that is now admissible in court as evidence that you violated somebody's patent. Because if they can say, "Oh, this is the kind of person who looks at patents, so they have the potential to have copied my patent," you're screwed.

So that's why this gets so complex, because now you don't have to have looked at this code to write the exact same code. If GitHub looked at it, that's in violation. But if GitHub happens to have similar looking code, even if this code's inspiration is different from this public repo, there is still enough overlap that this public GPL repo could sue you. And that's what makes this deduping stuff so interesting, because this isn't as simple as "don't train on GPL code." Because again, if there's a GPL version that we didn't train on, and then something that GitHub owns that is similar enough, you still might be seen as violating the source code here, because GitHub doesn't know how to source where this code came from. GitHub isn't capable of saying directly, "Oh, that code you have there, that had to have come from this codebase here." None of the sourcing of "why the AI made this decision" stuff is anywhere near good enough to draw those types of clean paths. So because of that, there is now some room in the court to convince a judge or a jury that there's plausible believability that the code that you trained on, that is resulting in this output, might be GitHub's code, but it might also be this public GPL code.

And that's why this reverse engineering thing is so interesting, because the duplication detection filter is actually training on public code, but not to make their AI better at generating code. Quite the opposite. They want to make sure that even if this code came from GitHub's own codebase, if it is close enough to this public code, if the relationship between this code and this publicly licensed GPL code is close enough, they just won't give it to you. They
won't generate this code. Even if the generation result comes from things they trained on, their code, even if they own the copyright to this, the possibility that your business might get screwed because code they own was used in your thing through AI is enough risk that businesses are scared. So what they've done is they've built a patch on top that specifically looks out for things you might be able to be sued for and cuts those out of the results, which is a really, really interesting angle, because it both confirms the fact that they're probably not just blindly training on all of this licensed code, but also that the concern doesn't really matter. It doesn't matter if the code came from this public GPL thing or if it came from GitHub, because in the end, if the output is close enough, you might still get sued. That's just starting to touch on the complexity of this case, and I want to keep reading what these decisions were.

It's also interesting that the amended version of the complaint takes issue with this detection filter. I'm curious what their issue is. The amended lawsuit argues that GitHub gives users the option to receive identical code when the filter is turned off. I hope that this helps showcase why this argument is weird, because if the code being trained on just happens to be identical to this GPL code, it doesn't matter which I trained on, because the owner of this code is still going to try and sue. So it doesn't actually mean that GitHub's tool is spitting out code that is violating copyright, so much as GitHub and copyrighted code have overlap and the output result could be mistaken for something copyrighted.

Another way to think of this is, imagine you bought a book of melodies. So you're a songwriter and you buy a book of melodies, and this book has a whole bunch of different melodies in it for writing songs, and on page 76 you find a melody you really like. So page 76 has a melody, let's say it's like A C C A F A. Just making up random letters, not even using
sharps and flats, whatever. So you have this melody that you found in this book and you really like it, so you go and you make a song using this melody, and then it turns out that Marvin Gaye's estate has a song with a melody that's A C A G A A. So if Marvin Gaye's estate has a song that it owns that has one note that's different, but it's close enough, or even worse, let's say it's transposed, they go down a note, so G B... this isn't a proper, like, change, but I'm just doing it for the sake of it. So this is all the same notes but shifted down one note, is the way to think of it. So if you have this song that you wrote based on a melody from this book that you purchased, but transposed, it could be a different melody. So if Marvin Gaye's estate can claim that these melodies are similar enough, even if you got this melody from something else, and even if the author of this book of melodies has never listened to a Marvin Gaye song and literally could not possibly have violated their copyright, it doesn't matter, because the court's determination is going to be based on: are they similar enough to be reasonably mistaken, and is this a market alternative to this? Marvin Gaye's estate probably couldn't have sued the book of melodies for putting this melody in, but now if somebody produces a song based on this melody from the book of melodies that they purchased, that person could get sued.

The only way to prevent what we're describing here would be to have a separate book, a book of illegal-to-use melodies. And this book of illegal-to-use melodies isn't a reference for you for what songs to write. It's the opposite. It's actually going to let you know, if you have a melody and then you check the book, "Hey, this melody can't be used." So maybe in this book they also have page 76, and it says "do not use this melody," and it has that melody in there. So now I, as a music writer, maybe I own both of these books. Maybe I own the book of melodies and I have this melody I really like, so I start working on a song with it, and then I
remember, "Oh, I don't want to get sued. I'm going to go check this book of illegal-to-use melodies and see if there's one that's similar enough." So I go through this book, I see one that's kind of similar, I X it out, and then I go and do something else instead.

That's what's actually happening with this: they are using a separate book, a separate source of data, to see if the output they generated is close enough to something you might get sued for. That's why they would block it. Not because by default it's generating things that are violating copyright, but because of the liability of what's being generated. Neither the book of melodies nor the person using this melody are violating Marvin Gaye's copyright, but what they are doing is creating a work that could be seen as similar enough in court, given enough money and time spent by the group that's fighting this suit, that that's a liability. And the thing being worked on here isn't preventing the tool from generating copyrighted works. It's much more specific. What they're trying to do here is make it so you're less likely to be sued. They're trying to lower the liability, and lowering the liability of a copyright suit is a very different thing than making it so your tool doesn't generate copyrighted work. And I want to draw that line, because I might not end up agreeing with the results of this case now that I know they're exploring this option, because I do not think GitHub's changes are to prevent copyright from being violated. I think those changes are much more specific: it's to lower the risk profile of using those things. This was never about whether or not the thing coming out is illegal. It's about whether or not your company is willing to fight the case in court to defend it, and the liability around using these things in the first place.

Apparently they also cited a study that shows how AI models can memorize and regurgitate parts of their training data, which could potentially include copyrighted code. I bet you
could do a very similar study where, if you took a person that was a professional pianist and you put them in front of a piano and said, "Compose me five melodies," at least three of those melodies are going to be very similar to ones that already exist in the public. It would be hard for me to believe that anything somebody composes on a piano is going to be truly unique and not referencing a bunch of things that they've experienced in the past themselves. It is genuinely really hard for me to see a world where creative output isn't inspired enough to potentially be seen as violating copyright. So yes, the AI model might memorize parts of code and regurgitate them, but so does a human. It's very common. That's just how knowledge works: you reuse parts of things you already know. What level of reuse and what level of knowledge matters and should be considered copyright violation is a very up-for-discussion thing that a lot of people are going to disagree on. I'm already terrified of what the comment section of this video is going to look like. That aside, this is a complex problem.

Okay, that's a relief to hear. Apparently this did not hold up in court, as Judge Tigar determined that the code GitHub allegedly copied from developers wasn't similar enough to their original work. Oof, where was this person during all the Marvin Gaye lawsuits? He also mentions a part of the cited study that says GitHub Copilot rarely emits memorized code in benign situations. Judge Tigar dismissed this allegation with prejudice, meaning that the developers can't refile the claim. W. Huge baller.

I'm going to drop a really spicy take. I think the biggest mistake that these devs made was filing this lawsuit in California and not in Texas. It's actually become a meme in Texas, especially for weird patent lawsuits and patent trolling. There's a couple courthouses in Texas, one in particular... "Texas courthouse Samsung ice rink." One of my favorite random stories: "Why a small town in Texas has Samsung's ear." This small town
had an ice rink sponsored by Samsung. Why does a tiny town with 20,000 people have an ice rink sponsored by Samsung? The reason is because Samsung gets sued by random patent trolls all the time in Marshall, Texas, because the Marshall, Texas courthouse has a bunch of people in it that aren't particularly aware of technology, and as a result you can sue a big company in this courthouse, and you can convince the jury or even the judge of some really stupid stuff and then win the lawsuit. The reason Samsung bought an ice rink in Texas, by the way one of the hottest states in the country: they bought this because they're trying to win some good sentiment from both the people who work at the courthouse and also the potential jury, because they need whatever they can do to potentially not lose these lawsuits. And it's so funny that this random town has been discovered as a hotbed of dumb tech people to sue other companies through. I would be surprised if this lawsuit didn't go entirely differently if they went through that courthouse in Texas instead. But yeah, if you wanted to understand how these copyright laws and their enforcement work: they're trying to socially engineer a random small town in Texas because their courthouse is so technically incompetent that it's resulted in a bunch of rulings like that. That's the state of these things. It's nuts.

Anyways, the court also dismissed requests for punitive damages as well as monetary relief in the form of unjust enrichment. Nice. These devs got squashed. This doesn't mean the lawsuit is over. Litigation will likely continue with the developers' claims regarding breach of contract and open-source license violations.

One more thing on copyright, just to emphasize how vague the DMCA is and also how weird the Southern District of Texas's rulings are. There are some contradictory rulings being cited in here. The first comes from this Southern District of Texas case from 2023, which concludes that the DMCA is not limited to copyright management information conveyed in connection
with identical copies of a work. So this court ruling says that the work being an identical copy is not necessary to violate copyright, which is scary for all of us reaction channels. But thankfully there have been additional rulings on this from the Ninth Circuit that continue to compel us to research, because we reach different conclusions in these other cases. The one that was reached in this case in 2020 is that courts have found that no DMCA violation exists where the works are not identical, and then another case in 2023 is that the plaintiff has not plausibly alleged that defendants distributed identical copies of plaintiff's comparisons.

For example, say, something that would violate DMCA law according to these rulings: something like, um, Internet Archive distributing books that are still for sale. So they have a PDF of a book that I could go get through Amazon as an ebook. Those are identical, or at the very least functionally identical, products. One is being given to me illegally for free, the other's being given to me legally for purchase. That's a thing that would be violating the DMCA. But if I was to make a movie based on that book, that would be a little more vague. So that's what's interesting about these rulings: we don't have a strong answer yet. What would it take to violate the DMCA?

Let me know what you guys think in the comments. Is all AI violating copyright, or is this a more vague thing? Till next time, peace.
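
As a rough illustration of the duplication detection filter idea discussed above (check the generated output against known public code and suppress anything that matches, regardless of where the suggestion actually came from), here is a minimal Python sketch. It is not GitHub's implementation, which the video does not describe. The class name `DuplicationFilter`, the tokenizer, and the 8-token window are all assumptions made up for illustration.

```python
import re


def tokenize(code: str) -> list[str]:
    """Split code into identifiers, numbers, and punctuation, ignoring whitespace."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)


def shingles(tokens: list[str], size: int) -> set[tuple[str, ...]]:
    """Return every run of `size` consecutive tokens (empty if the input is shorter)."""
    return {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


class DuplicationFilter:
    """Hypothetical filter: the 'book of melodies you should not use'.

    It never asks where a suggestion came from. It only asks whether the
    output shares a long enough token run with known public code.
    """

    def __init__(self, public_snippets: list[str], size: int = 8):
        self.size = size  # assumed window length, chosen arbitrarily
        self.index: set[tuple[str, ...]] = set()
        for snippet in public_snippets:
            self.index |= shingles(tokenize(snippet), size)

    def matches_public_code(self, suggestion: str) -> bool:
        """True if any token window of the suggestion appears in the public index."""
        return not self.index.isdisjoint(shingles(tokenize(suggestion), self.size))

    def filter(self, suggestions: list[str]) -> list[str]:
        """Drop suggestions that match public code and keep the rest."""
        return [s for s in suggestions if not self.matches_public_code(s)]
```

With the filter off, every suggestion is returned. With it on, a suggestion that differs from a public snippet only in whitespace is still suppressed, which mirrors the point above: the check is about how close the output is to something you could be sued over, not about what the model was trained on.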
Info
Channel: Theo - t3․gg
Views: 5,404
Keywords: web development, full stack, typescript, javascript, react, programming, programmer, theo, t3 stack, t3, t3.gg, t3dotgg
Id: IhnN_qKwDus
Length: 24min 11sec (1451 seconds)
Published: Sat Jul 13 2024