2 years ago

#128 Copyright & Machine Learning Models

Does training AI violate copyright laws?

Transcript

David Kopec

Modern, sophisticated machine learning models are often trained on large amounts of copyrighted material. Is that legal? Welcome to Kopec Explains Software, the podcast where we make computing intelligible. In this episode, we're going to discuss a matter that crosses the lines between the legal and the technical. Sophisticated machine learning models, like large language models, otherwise known as LLMs, have made a huge splash the last few years. Oftentimes they're trained on large amounts of copyrighted material. For example, the large language model behind ChatGPT is trained on many copyrighted books. In this episode, we're going to discuss the legal theory that makes that possible. Or is it? Should it be possible? That's open for debate. We'll discuss both sides today. So we're going to assume that listeners have a familiarity with what copyright is. Copyright is a type of intellectual property that protects creative works. For example, copyright gives exclusive rights of redistribution to the author of a novel or the creator of a movie or the singer of a song and its audio recording. So copyright is somewhat fundamental to the creative economy. We're in a new world now, though, with machine learning models that can generate creative works that are near or on a par with the work of many humans. This has upset many content creators, including writers, artists, and even filmmakers. And these models that can create these new works are built using data from their works. So why is that allowed? A lot of the ideas we're going to discuss today were summarized by a University of California, Berkeley School of Law student named Jenny Quang, and I hope I pronounced that correctly. I'll link to her article in the show notes. So before we can talk about why this may or may not be legal, we need to talk about the idea of copyright infringement. Now, certain kinds of uses of somebody else's copyrighted work are actually permissible. 
That is the core of the argument around machine learning models: what kinds of reuses are permissible and what kinds are not.

Rebecca Kopec

To quote Quang, the seminal Supreme Court copyright case Baker v. Selden distinguished a copyrighted work from its material form and showed that not all uses of a work's material form are acts of copyright infringement. Copyright infringement requires not just copying of a work's material form, but also the unauthorized use of the work for its expressive purpose. Merely technical or noncommunicative uses are not uses of a work for its expressive purpose, and therefore are not copyright infringement.

David Kopec

So we need to have an expressive purpose. We need to be using the copyrighted work for the exact same reason that it was originally created. In other words, if I have a piece of art and I copy it exactly and have the same piece of art, well, then I've infringed on the original piece of art. But if I use the piece of art to maybe train a model, and the model no longer contains any of the expression of the original piece of art, but just maybe some ideas from the original piece of art, then that may be permissible, according to this quote. So that is a basis to maybe say that all machine learning models are allowed. But in practice, this is not how courts often analyze the issue. Instead, they go to another bit of US copyright law related to the idea of fair use. Fair use is not just something that was invented in a court, it is actually codified in law. Basically, there are four criteria that are used by a court based on this established US law to determine whether or not the reuse of a copyrighted material is, quote, fair use. And if it is fair use, then it's allowed to be reused and the reuse is not considered infringement. Here are the four criteria. Number one, the purpose and character of the use, including whether such use is of commercial nature or is for nonprofit educational purposes. Number two, the nature of the copyrighted work. Number three, the amount and substantiality of the portion used in relation to the copyrighted work as a whole. And number four, the effect of the use upon the potential market for or value of the copyrighted work. So a court will go and weigh all four of these criteria, and one particular use might actually fail one of those criteria, but pass some of the other criteria. For example, as a teacher, maybe I go and I want to reuse a portion of a book to distribute amongst my students without asking them to purchase the whole book. 
So I literally go and take a PDF, and maybe it's just a chapter out of, let's say, a 30 chapter book, and I just distribute it to them. Normally, you'd think that's probably copyright infringement, right? Because a chapter might be ten pages or something, and that sounds like a lot of somebody else's written work. But let's look at the four criteria. Criteria number one, the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes. Well, if I'm a teacher, then it's for nonprofit educational purposes that I'm distributing the work to my students. So that might be a criteria in favor of it being fair use and me being allowed to reuse it. Criteria two, the nature of the copyrighted work. Well, written works, books, are generally very well protected. So that might be a criteria against it, I think. Not sure. I'm not a lawyer. And by the way, everything we say in this episode, please take with a grain of salt. We are not lawyers and this is not legal advice. Criteria three, the amount and substantiality of the portion used in relation to the copyrighted work as a whole. Well, if it's a 30 chapter book and I'm only using one chapter, they might say, well, that's not so much of the book. And number four, the effect of the use upon the potential market for or value of the copyrighted work. I think that one's a little questionable. Like if I wasn't distributing the chapter, maybe the students would have to buy the whole book. So this would really be hurting maybe a few sales of the book. So perhaps you could argue that one either way. You might say, well, the students are never going to purchase the whole book just for one chapter out of 30. So I think you can make arguments both ways. 
But anyway, a court would go and weigh these four criteria to determine whether my use of that chapter was infringement or fair use. In the same way, this is how a lot of machine learning model cases are being handled. They're being weighed by courts who determine whether or not the use of the original copyrighted data within the data set that the machine learning model is trained on was fair use or not. As you might imagine, this creates a lot of uncertainty.

Rebecca Kopec

It really relies on the individual's judgment about what it could be, and there's just lots of confusion about it.

David Kopec

Right. And it has to constantly be tested in a court. So that uses a lot of legal resources, both on the part of the public court system as well as legal fees for the corporation. And who can handle all of those fees? Well, large corporations can. Somebody like an OpenAI that just had a $10 billion investment from Microsoft. They can afford all kinds of fair use trials. Somebody like a Google. But what if you're a small company and you're innovating in the artificial intelligence space with some really great machine learning models? You might really be afraid of a copyright lawsuit. You might not be able to pay the legal fees to defend your fair use of some of the copyrighted data that you trained your model on. So this whole situation seems to favor large corporations over small innovators. And that's one of Quang's main arguments in her paper.

Rebecca Kopec

I think an important or interesting point to me in all of this too, is what was the original intent of copyright in general?

David Kopec

Yeah, if you go back, the US actually codified the idea that we should have copyright all the way in the Constitution. So very insightful of the drafters of the Constitution back in 1789 to include a provision for developing copyright law. And the original reason they included it was for the public good. So the idea was, yes, we're going to give authors a way to make money on their works, but we're doing that so that they create more works. We're going to give them an exclusive monopoly for a limited time on their works to encourage them to make more works, which is good for the public. The public having more artistic works, more scientific works to learn from, to enjoy, is good for them. So we want to actually create this incentive for people to create more for the public. So the intention of copyright laws is for the public good. It is not there just to make a profit for Disney or to make a profit for me. I'm a book author, right? And so I benefit from copyright law. But the reason we have copyright law is not for me to benefit. That's not the first reason. The main reason we have it is for the public to benefit from the books that I'm incentivized to write. And to be honest, I would be much less likely to write books if copyright law didn't exist. Copyright law says to me, well, nobody else is going to be able to rip off my books, and therefore I'm going to be able to profit from them. And I like writing books. I enjoy doing it. Would I have written as many as I did if it wasn't for the fact that I have copyright protecting my ability to make money from them? No, I absolutely wouldn't have. And I think that's true for a lot of other writers, artists, large corporations who make media, whether they choose to admit it or not.

Rebecca Kopec

I think that also is an important argument for allowing these models to utilize all the different data, because they are presumably, or hopefully, opening up things to the public, right? They're allowing this new resource, this new technology that really does align, I think, with the original goals of copyright.

David Kopec

Right. So why are these models a public good? Well, let's say a new model comes out that makes people a lot more productive, right? Let's say a new model comes out that really does something good for healthcare. Let's say a new model comes out that is able to give new communicative powers to small businesses that only large businesses used to have. Those all seem like really good causes, and we as a public need to weigh the good of all those causes against the good of the original authors and artists being able to make money. At the same time, of course, it depends on the particular niche and how the model is being used, whether it's really a public good or not.

Rebecca Kopec

In the Jenny Quang article, she actually argues that there should be a U.S. data mining safe harbor law. What does she mean by that?

David Kopec

So instead of having all this ambiguity and having all these court challenges, she's saying, why don't we just pass a law that says that this use is allowed? I want to also mention one other way that machine learning models are allowed to be trained: the particular way that the first clause of fair use is often interpreted. Remember, the first clause says the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes. But let's just take the first part of that, the purpose and character of the use. If the purpose and character of the use is transformational, it's often considered fair use. So obviously a machine learning model is completely different from a book, right? And so that would be a transformational use of the copyrighted work, the book. So that is often allowed. Any kind of transformational use is often considered fair use if it's truly transformational. There are several famous cases that Quang cites. One that I know from outside of her article is the use of collecting copyrighted material into databases for queries and lookups of metadata. Even though the original metadata might be under some kind of copyright, when it's collected into a database, we've transformed that metadata into another form. If you think about it, that's not that different from copyrighted material being transformed into a machine learning model which is also used for doing queries to gain insights. Anyway, going back to the idea of just having a law and just codifying this, instead of having to constantly test fair use, use tons of legal resources and put big companies at an advantage, we could just have a law that says you can use copyrighted material to train machine learning models. In fact, several other countries have already done this. There are laws that allow for either the noncommercial or full use, including in commercial applications, in Japan, the UK, and the European Union. 
Quang argues that laws that only allow for noncommercial use don't go far enough because again, they don't create a level playing field for small innovators against large corporations.

Rebecca Kopec

Well, what would, or what has, an artist been saying about, you know, this all makes sense when we're sitting here talking about it. But if I was on the other side, if I was the artist creating work and then seeing AI create something that seems very similar or is really good, what might I be saying?

David Kopec

Yeah, a lot of artists and writers are angry, and in fact, they're suing OpenAI right now. There are several ongoing lawsuits from organizations like the Authors Guild and even from individual artists, and we're yet to see how these play out in court, but we can understand why they're angry. Imagine you're an artist who spent ten years perfecting your craft, and you have a very distinctive style, and now an AI image generator has been trained on a lot of your works, and you can ask it to produce a new image in the style of your work. Well, that might mean that the market for your work completely disappears. And based on the different criteria of fair use, I think maybe there's some room here for the courts to find in favor of the artist. But I really don't know. Again, I'm not a lawyer, but I think it really depends on the context and whether or not the new use is competing with the original copyrighted work. That is the fourth criteria of fair use, the effect of the use upon the potential market for or value of the copyrighted work. But again, fair use is about weighing all four criteria, not just the fourth criteria.

Rebecca Kopec

And the Jenny Quang article really focuses on the training, the way that these models are learning and how they're accessing or needing so much information. That's what the lawsuit or the lawsuits have been also really focusing on. Is that correct?

David Kopec

Well, Quang, in her article, specifically says that she doesn't want what she's writing to be applied to the type of generative AI that we were just mentioning these court cases about. So things like DALL-E 3 or ChatGPT, where we're actually creating new creative works. So of course, there's been machine learning models around for a long time, and machine learning models are used for things beyond just LLMs and diffusion models that create images. So we might need to have a distinction between those that are creating competing works to existing copyrighted works and those that are being used for totally different purposes. And it's up to the courts. We'll see where the courts land. One thing that there already is precedent for relating to these type of generative AI models and copyright law is the idea that non-human created works cannot be copyrighted. For example, there's a famous case where a monkey, believe it or not, pressed the shutter button on a camera, and that picture was ruled by a court to not be copyrightable because the monkey is non-human. So there has to be some kind of human input that goes into a copyrighted work. Is giving a prompt to DALL-E 3 or ChatGPT enough human input? There already has been a ruling by a judge that said that machine learning generated art is not copyrightable. Now, will that hold up as it goes through more layers of the court system? I don't know, but there certainly seems to be precedent for that. So it's possible that these machine learning generated works of art may not be protectable. So to answer our overall question, how is it legal that machine learning models are trained on copyrighted works? The answer is twofold right now. One is the idea that the use of them is functional rather than expressive, and secondly, that the use of them is fair use according to the fair use doctrine enshrined in US law. 
But courts are having to make a decision on a case by case basis right now, and it seems it would probably be a lot more efficient to just have a generalized law that established a doctrine about machine learning training on copyrighted works.

Rebecca Kopec

This is one of the challenges with new technologies. The courts are always kind of playing catch up as things get introduced, and we have to adapt and apply our laws and understand them in a whole new way.

David Kopec

Absolutely, and I think that maybe there should be a distinction, this is my personal opinion, between machine learning models that generate competing works to the original copyrighted works and machine learning models that are so transformational that they're really unrelated in their output to what the original works were like.

Rebecca Kopec

I agree with that.

David Kopec

We'll see where this lands as the court cases progress, and Congress hopefully considers this as well. And again, I just want to say that we're not lawyers, but we do recommend the article by Jenny Quang, which we link to in the show notes if you want to find out more about this topic. Thanks for listening to us this week. Rebecca, how can people get in touch with us on X?

Rebecca Kopec

We're @KopecExplains, E-X-P-L-A-I-N-S.

David Kopec

Thanks for listening. Hope you have a great holiday, and we'll see you in a couple of weeks. Bye.

Many large sophisticated machine learning models, like those employed in generative AI, are trained on immense amounts of copyrighted images or text. How is that legal? In this episode we delve into the exceptions to copyright law that enable such uses to not be seen by courts as infringement. This includes expressive vs functional uses of a copyrighted work, fair use, and the possibility of a data mining safe harbor law. We also discuss whether such interpretations are to the benefit or detriment of society as a whole.

A note: as mentioned in the episode, we are not lawyers, and this episode should not be considered legal advice. It is just a discussion of the issue based on our somewhat limited understanding of the legal arguments and expanded to consider the societal implications. Also as mentioned in the episode, we based much of our understanding on the article "Does Training AI Violate Copyright Law?" by Jenny Quang which is linked below in the show notes.

Show Notes

Does Training AI Violate Copyright Law? by Jenny Quang via Berkeley Technology Law Journal

Find out more at http://kopec.live