Before they gobbled up headlines everywhere, large language models ingested truly staggering amounts of training data. That training data didn’t emerge from the ether: Some of it came from other people’s creativity and work. So it’s unsurprising that copyright concerns are front and center in the early days of this technological boom—we’ve been here before.
Hello! I’m Nabiha Syed, the CEO of The Markup, and I spent the last decade as a media lawyer obsessed with emerging technology. I frequently joke that artificial intelligence is Soylent Green—it’s made of people, specifically the nuggets of data we’ve been leaving all over the internet—and we should probably pay more attention to what that means for our future. (Look, I didn’t say it was a good joke.) The point is, especially now that lawyers have entered the mix, we’re going to have to find the right model to understand the technological upheaval we find ourselves in. So I turned to Katherine Forrest, former federal judge for the Southern District of New York and partner at Paul, Weiss, to help us break down how we should be thinking about copyright—a doctrine designed, as she points out, to encourage innovation!—and generative AI.
Keep reading for a mention of one of my favorite Supreme Court cases, Campbell v. Acuff-Rose Music; the human ingenuity of prompts; and why this moment is “as if Alexander Graham Bell invented the iPhone.”
(This interview has been edited for brevity and clarity.)
Syed: Some copyright owners, like artists and Getty Images, argue that the training data for large language models included copyrighted materials, and they are bringing litigation accordingly. What do you make of their arguments?
Forrest: I would like to move away from the expression “large language model” because sometimes that causes people to get their heads stuck into a language-only kind of model space. These kinds of models are really what are called foundation models, where the same model can be used to do a multiplicity of things. Foundation models are trained on tremendous amounts of content that are not just language-based but also photographic-based, video-based, audio-based….
Understanding the kind of breadth that these foundation models can have is so important when you’re talking about what kinds of training data get captured. It’s everything: songs, art, or what have you. So it is, in my view, logical and reasonable for the content owners to question the wholesale and large-scale copying of their data to train these models.
We know that, in OpenAI’s case, copyrighted works are copied in toto, because OpenAI said as much in a statement to the Patent and Trademark Office and then defended the practice on fair use grounds.
Earlier in my career, before I went on the bench, I did a lot of work in the internet music space, including the LimeWire case on behalf of the RIAA, the MP3.com case, and Chambers v. Time Warner. That was a case about recording artists who, before the internet, had relinquished the rights to their recordings to a record company. And the question in that case was whether the internet, an unanticipated technology at the time those recording contracts were signed, was one of the places where the record companies had actually acquired the rights. And the answer was yes. What happened then was a lot of copying by MP3.com and LimeWire, on the view that internet music was not really copyrighted music because it was not in a, quote, “tangible medium of expression”: that somehow the digital transformation had produced a kind of ephemeral music, just 0s and 1s, rather than the kind of sound recording anybody had obtained as a copyrighted work.
For me, it’s copyright infringement—there can be a lot of technical words around it, but it’s copyright infringement. There is an entire song that was being copied. And when you played it, it sounded like the song. So we prevailed. And as you now know, music over the internet is considered to be copyrightable. That’s the major way of distributing music these days, through licensed uses, whether it be Spotify or iTunes or Pandora. The world evolved to accept that this material was in fact copyrightable and protectable.
With the current environment with generative AI, there’s a new and different question about copyrighted works, which is—whether it be music or a written work or a photographic work or something else—does making a copy of the copyrighted work infringe on the author’s copyright? (And we’ll come back to who an author is in a moment.) There’s no doubt that the entirety of the work is being copied, because that’s the point of training a foundation model; you need the entirety of the work in order to determine the probabilistic relationship of one part of the work to another. In other words, if you chopped it up so all you had were a few words here and a few words there, it wouldn’t give you the information you need, which is how the work is constructed. So we know 100 percent is copied.
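To make that intuition concrete, here is a minimal sketch, assuming nothing about any lab’s actual training pipeline: a toy bigram model estimates the probability of each word given the word before it, so the intact sequence of a work carries the signal, and shuffled fragments teach the model relationships the original never had.

```python
# A toy illustration (not any vendor's actual training code): a bigram
# model estimates the probability of each next word given the current
# word. Training on the intact text captures the work's structure;
# training on shuffled fragments captures something else entirely.
from collections import Counter, defaultdict

def bigram_probabilities(text):
    """Estimate P(next word | current word) from raw pair counts."""
    words = text.split()
    pair_counts = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        pair_counts[current][nxt] += 1
    return {
        word: {nxt: count / sum(nexts.values()) for nxt, count in nexts.items()}
        for word, nexts in pair_counts.items()
    }

lyric = "hold me close and hold me fast"
print(bigram_probabilities(lyric)["hold"])               # {'me': 1.0}: structure intact
print(bigram_probabilities("me hold fast close")["me"])  # shuffled text yields wrong relationships
```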
Then you get into the fair use question, which is complicated because none of the four fair use factors is determinative on its own. The fourth factor asks: Will this impinge on the commercial opportunity for the work? That factor, I think, is answered by saying, “Absolutely.” Because it is a replacement of that work forever.
So the ultimate difference between this situation and say, Google Books, is that the point of Google Books was to allow you to search for the book, and it gave you a portion of the copy for you to access. The point was discovering the underlying work. For generative AI, the value is not necessarily the discovery of the underlying creative work but rather a replacement of the work with something else. It is 100 percent copy leading to 100 percent replacement. And it’s absolutely commercial. This will all be fought out in the courts.
Syed: That captures the live debate over whether these innovations help you discover more, or instead erase and replace what came before. That matters for the Google Books-type analysis.
Forrest: I would ask you: Has anybody who suggests this technology is additive said, “Please lead me to where I can buy this work”? And the answer is no.
The architecture of the past had a pathway back to the underlying work. That replacing and erasing is critical, because part of what this technology does is obscure its sources, intentionally so. Part of the beauty of the tool is its ability to come up with nuanced and innovative answers, but it’s erasing the underlying language as it goes.
There’s a real difference in terms of copyright between the input moment and the output moment. The input moment is the training that you and I have been talking about, and that’s where the copying is occurring. That’s where the training of the tool happens; that’s going to become the replacement of the human forever.
The output is something different. For Getty Images, for instance, the output actually still had the little image of the Getty logo. Hence, there are several different trademark claims that they’ve got embedded in that suit. So the input moment and the output moment, I believe, need to be analyzed separately. Both face different copyright issues.
Syed: Let’s focus on the “output” for a minute. Computer-generated art has been around for decades, and the U.S. Copyright Office has been convening and commenting on AI and art issues for some time. Its recent guidance says it is open to registering works that incorporate AI-generated material on a “case-by-case” basis. Can you walk us through what’s happening here?
Forrest: I think that this is an area that is in flux. We know the Copyright Office has issued guidance that it does not find AI alone to be an author; an author is human. What’s happening now is there’s a recognition that there are human beings involved in the process throughout. We know that these tools don’t just spontaneously, with no prompt at all, start creating things randomly. They are query machines, if you will. And how do you, as the Copyright Office, make distinctions about when you’ve got enough human involvement to check that box? If I give a prompt, “Give me a woman in a field drinking tea with a lion,” is that enough human input? Am I participating in that art?
The Copyright Office now gives that human inspiration some value. And so the question isn’t settled by a fixed point on a spectrum; it’s decided case by case, based on the amount of human involvement.
One of the points of copyright in the Constitution is to incentivize human creativity. It’s a recognition that certain protections are given to people to support innovations that will have a broader public benefit. So when you move into the world of copyrighting artificial-intelligence-generated content, you’re moving into a different place in terms of the role of the U.S. Constitution. Now, that might not be the logical place to go if human ingenuity is still involved in the prompts. If human ingenuity is still involved in, “You’ve made me a picture of the moon. I now want the moon to be a half moon,” and the AI generates that, that is a kind of synergistic involvement, the kind of artistic creation that is still within the four corners of what the founders could have placed there, even if it’s certainly not what they anticipated. The Copyright Office is trying to balance our constitutional principles with the realities of where technology is moving us.
Syed: On that note, an AI-generated song called “Heart on My Sleeve” emerged over the weekend, and it sounds like it’s by Drake and The Weeknd … but it’s not. Neither of them made this song, but that didn’t stop it from becoming wildly popular with more than nine million listens on TikTok before it was removed. There’s some early suspicion that this was part of a marketing campaign from a tech company unrelated to either artist. Can you break down the legal issues at the input and output stages of this development?
Forrest: There are a lot of different pieces there. First of all, you’d want to understand how much of the music was made with the kinds of music tools that have existed outside of generative AI. For at least two decades, there has been music created with a variety of electronic tools. That has not been AI; it’s been people thinking about how to construct and put together new and different sounds. That is not new.
Then there’s the piece of how it sounds like a particular artist. I don’t want to comment on how this particular song [“Heart on My Sleeve”] was made. But often that can be done by feeding the tool a training set consisting of a large amount of a particular artist’s songs. Was that tool trained on exact copies of the artist’s music that were intact at the moment they were being analyzed—in other words, not transformed at the moment of analysis, but transformed later into this other work?
That raises the question: Is it transformative? Is it the kind of thing that falls under Campbell v. Acuff-Rose Music? That’s a 1994 Supreme Court case that looked at fair use and what should be considered sufficiently transformative.
There’s a separate copyright and fair use analysis on the output side, if it sounds like other artists. There’s the possibility of unfair competition claims. These are the kinds of issues that will be litigated: whether a song that sounds like an artist, and is being passed off as that artist, is interfering with the commercial opportunities for that artist.
Syed: What if the artists use it themselves as sort of a creativity prosthetic right there in the studio?
Forrest: There’s a base set of training that comes in the tool, and that base set of training is not limited to that artist. It’s going to have maybe some John Lennon, maybe some Madonna, maybe some Taylor Swift, probably all of the above, plus a whole lot more. So when the artist is playing with it in the studio, there’s a risk analysis that they have to undertake, because some of the interesting new sounds may come from the works of other artists. And on the input side there’s a further possibility, depending on how these cases come out: if the training of the tool is eventually found to be an infringement, and the artist has an output that’s a hit, that output could have a problem, potentially even an injunction against its use, if it was found to be based on an input that was copied without fair use. These are all absolutely unsettled questions right now in the law.
So I certainly understand the desire to play with what is extraordinarily interesting technology, and we are at a transformative moment. In my view, this is one of the most transformative moments in our lifetime, across all industries. If you’re in live theater, work in a hair salon, or, you know, are a carpenter, there are certain things that are not impacted. But my job and your job are impacted by this in ways that we have not even begun to understand. And so as the dust settles, we will figure out how to use these tools in a way that is safe, that actually appreciates, compensates, validates, and attributes the right kind of artist input into the tools. But these are all things that have to be worked out.
Syed: Stepping back, it really feels like the iPhone moment.
Forrest: It’s not just the iPhone. It is as if Alexander Graham Bell invented the iPhone.
If you think about it, the human brain has something like 70 billion neurons. GPT-2 had roughly 1.5 billion parameters. GPT-3 had 175 billion, more than 100 billion beyond the brain’s neuron count. And GPT-4 has extraordinary additional computing power layered on top of that. These are incredibly innovative pieces of technology.
For those of us involved in determining the right legal frameworks, the right legal boxes to put things in, the right kinds of potential licensing schemes, and all of that, this is a today issue.
This certainly gives me much to chew over. Will I stop using ChatGPT to write bespoke bedtime stories for my toddler? Too soon to say. (Although I have been thinking about my timing, given last week’s insights on the water footprint of AI.) But at a high level, over here at The Markup we’ve been engaging with all kinds of innovators, critics, and thinkers around artificial intelligence—including events like Sisi Wei’s panel at ISOJ last week and my upcoming chat at Stanford on April 26. Stay tuned for much more to come!
Thanks for reading!
Always,
Nabiha Syed
Chief Executive Officer
The Markup