The developer OpenAI has said it would be impossible to create tools like its groundbreaking chatbot ChatGPT without access to copyrighted material, as pressure grows on artificial intelligence firms over the content used to train their products.
Chatbots such as ChatGPT and image generators like Stable Diffusion are “trained” on a vast trove of data taken from the internet, with much of it covered by copyright – a legal protection against someone’s work being used without permission.
Dan Milmo for the Guardian
I can’t become a billionaire without robbing banks, therefore robbing banks should be legal.
Like many other things OpenAI says, this should be taken with a grain of salt.
If your aim is to build a “language” model, there is definitely sufficient information available in the public domain. Yes, it won’t “speak” very modern English, but it will have the basic functionality.
If your aim is to be able to answer questions, it is possible to “purchase” datasets, or integrate open sources, like Wikipedia (CC BY-SA licensed).
If you really need the most up-to-date information on everything, you can include a “web search” module, which ChatGPT and Google’s Bard already do. Basically you’d do a web query as an agent of the user, download those pages, and summarize, again at the user’s request. This will obviously be slower than “memorizing” that information, but it’s not much different than a “very advanced” screen reader.
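As a rough illustration of that “web search” pipeline (fetch pages as the user’s agent, extract the text, summarize on request), here’s a minimal sketch. The HTML stripping and the extractive “summarizer” are deliberate stand-ins: a real system would use a proper HTML parser and hand the text to the language model instead.

```python
import re

def extract_text(html: str) -> str:
    # Strip tags and collapse whitespace -- a stand-in for a real HTML parser.
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

def summarize(text: str, max_sentences: int = 2) -> str:
    # Placeholder "summarizer": keep the first few sentences.
    # A real system would hand `text` to the language model here.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return " ".join(sentences[:max_sentences])

def answer_with_web_search(query: str, fetched_pages: list[str]) -> str:
    # 1. (elsewhere) run `query` against a search engine as the user's agent
    # 2. download the result pages -- here passed in as `fetched_pages`
    # 3. summarize them for the user on request
    corpus = " ".join(extract_text(p) for p in fetched_pages)
    return summarize(corpus)

page = "<html><body><p>Rust 1.0 shipped in 2015. It emphasizes memory safety.</p></body></html>"
print(answer_with_web_search("when did rust ship", [page]))
```

The point of the sketch is that nothing is “memorized” in the model: the copyrighted page only exists transiently, for the duration of the user’s request.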
In any case, OpenAI seems to be building public support to get preferential treatment from the government. Don’t get me wrong, they produce very valuable scientific output and a useful tool (which I pay the monthly fee for). But that does not mean they are acting like any other business.
Spicy take: human artists and writers also learn by digesting copyrighted works.
Hexadecima,
You hit the nail on the head. Every single one of us with academic training takes away knowledge from copyrighted works, be it textbooks, news articles, etc. Even though it’s nearly all copyrighted, we have the right to remember it, talk about it, and even profit from it, without the author’s permission. Like it or not, copyright law allows this. Storing this knowledge in our brains has never been vilified before. I believe the real criticism of using copyrighted works to train artificial neural nets (as opposed to biological ones) isn’t that regurgitating knowledge violates copyright law, but that artificial NNs are becoming more scalable and effective.
* I understand that there are cases of actual copyright infringement, where the NN reproduces a work verbatim. That obviously needs to be fixed. But even once verbatim reproduction is fixed to fully respect copyright law, people are still going to get upset over the AI; that’s just the truth.
I suspect some would favor amending copyright law with more blatantly discriminatory terms…
I can imagine how these discriminatory terms could end up causing their own new problems once we inevitably start treating human ailments like dementia and Parkinson’s with machine implants.
It all depends how that copyright material is presented and what (if any) value is added.
At the end of the day, OSAlert is a compilation of others’ copyrighted work with some editorial notes and a comments section.
Adurbe,
No doubt. I think osnews is fair use, but I do wonder what would happen if somebody asked osnews to take down an article with a takedown notice. Some of those takedown notices are kind of abusive in nature, but legal threats are legal threats.
“without the author’s permission”
You could argue that the author has explicitly granted permission by selling a copy of a textbook which is designed to teach. If they didn’t want people to learn from it, they wouldn’t have published a textbook.
bert64,
That’s not what “explicitly” means, so I’m going to take the liberty of interpreting your comment with “implicitly” there instead. Personally I don’t really see how transitioning to textbooks changes the nature of the debate at all. Let me ask you this, does it make any difference to you that AI is trained on material published in print versus online?
Here is the problem I have. This isn’t really AI, and I think it’s a shame to label it as such. These are not systems approaching sentience in any way. They are difference engines. The difference between humans and AI is that humans do not remember everything exactly. Our experiences change our thoughts and color how we remember things. Computers store things exactly. Humans are also capable of originality; AI isn’t. As AI isn’t capable of originality, by definition everything it creates is a copy. It doesn’t matter to me if the source was one work or millions. Note that when a human copies something and there isn’t any originality in it, they get sued/punished/called-out.
TechGeek,
Most of us don’t believe the technology has reached sentience or general AI quite yet. However it’s an interesting topic in its own right, since how can we really prove anybody else is sentient? Anyway, that’s a different topic.
I wanted to discuss the fact that computers can make exact copies. This doesn’t automatically imply that’s what a neural net is doing. With neural nets there’s such a thing as overfitting the data: an overfitted net sometimes produces exact copies, but that’s typically unintentional. It’s neither the goal nor ideal for neural nets to overfit their source data, since it interferes with their ability to generalize. Generalization is the goal.
Take a self driving car for example. The network could perfectly recognize every “stop sign” in the input data set, and yet fail to recognize a stop sign at a slightly different angle or in slightly different lighting. As such, overfitting gives us poor results and being able to recognize patterns without overfitting is usually an important goal for training a NN. We don’t want to record perfect copies, but rather record the traits that a stop sign has. If the NN contains perfect copies, it’s unintentional and likely means the NN training can be further optimized in order to create a more generalized abstraction.
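The stop-sign point can be made concrete with a toy sketch. This is emphatically not how a real vision network works; the (redness, octagonality) features, numbers, and thresholds are invented purely to contrast memorizing training data with learning its traits:

```python
# Toy illustration: a "memorizing" model vs. a "generalizing" one.
# Each sample is (redness, octagonality); label 1 = stop sign.
train = [((0.90, 0.95), 1), ((0.10, 0.20), 0),
         ((0.85, 0.90), 1), ((0.20, 0.15), 0)]

def memorizer(sample, table={x: y for x, y in train}):
    # Perfect on the training data, useless on anything even slightly different
    # -- the behavior of an overfitted network.
    return table.get(sample, 0)

def generalizer(sample):
    # Learned "trait": stop signs are red and octagonal.
    redness, octagonality = sample
    return 1 if redness > 0.5 and octagonality > 0.5 else 0

# The same stop sign seen at a new angle / in new lighting:
new_sign = (0.88, 0.93)
print(memorizer(new_sign), generalizer(new_sign))  # the memorizer misses it
```

Both models score 100% on the training set, but only the one that abstracted the traits handles the slightly different input, which is why perfect recall of the training data is a failure mode rather than the goal.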
Both of these are…questionable. I don’t believe typical humans have many original thoughts. That’s not meant to be condescending, but rather a statistical observation. With every passing generation, more and more people will have already thought the same things. With so many humans having come and gone, there’s a very high likelihood that almost every idea we’re thinking of has already been thought. It doesn’t matter if we’re writing stories, paintings, songs, code… originality is like a land grab: it helps to be first, because everything that follows faces more and more overlap.
Pre-existing work has a huge influence on all human creativity. Our brains are trained by soaking up tons of input from our environment and from other artists/musicians/etc. Originality isn’t the hard part for a computer; after all, a random pattern can be original. The key to creativity is triggering our brain’s pattern-recognition neurons in a mix of both familiar and new ways. I believe that even today’s neural nets are passing this creative process with flying colors.
Given millions (or billions) of examples of human works, you’ll find tons of overlap. This is actually creating something of a mathematical dilemma for musicians. Especially as copyright holders become more aggressive at enforcing their land grabs. Ironically enough, one of the best defenses against a copyright infringement case is to show that the copyright holder infringed on an even earlier work.
The term “artificial intelligence” was originally used for very basic decision-making systems. Some of the groundbreaking artificial intelligence systems in the knowledge representation space, like MYCIN, would appear to us as being little more than computerized “choose your own adventure” novels; a hardcoded decision tree built by experts to simplify the labour involved in decision-making about a niche problem.
Equating AI with the capacity for creativity is a pop culture fallacy. The term you want is artificial *general* intelligence, or AGI, also known as “strong” AI. No one involved in the area professionally is under any delusion that today’s models meet that criterion, except perhaps the noted case of Blake Lemoine. There are, however, many people who think that human-level problem solving may arise within the decade if the breakthroughs at OpenAI proceed at the current pace. That’s probably quite unlikely to be the case, as technological innovation tends to follow a sigmoidal curve, but there have always been overoptimistic futurists.
Regarding originality in human works—it simply isn’t true that people get “sued/punished/called-out” for unoriginal creative works. Painted portraits are not expected to be original works; neither are individual frames of in-between animation (they’re quite literally derived from the adjacent keyframes, after all.)
Likewise there are many forms of widely tolerated unoriginality—Hollywood blockbusters, Bollywood mockbusters, genre films, pornography, decorative art, Christmas cards… We simply don’t expect these to be new or innovative products, even though they are created using the same tools as avant garde cinema, sculpture, and illustration. Sometimes, artists are just tradespeople, performing work with transactional goals. These artists are basically tools, working for their patron’s vision.
The person designing the prompt and selecting images is expected to bring the creativity to the table. If you see a soulless piece of art being promoted somewhere, blame the human who commissioned it, not the illustrator—be they meat or machine—that executed it. The AI is only a tool serving its patron.
And that, really, is the problem: the folks most predisposed to use generative tools are the least capable of utilizing them effectively. Many of them are outsiders to the art world, with no talent or training, so we see countless anime waifus with big tits, six fingers, and lopsided faces. The same thing happened when synthesizers became available in music—a few veteran musicians with real talent used their new instruments to emulate previous sonic textures (which mostly just sounded uncanny); a large group of outsiders tried to innovate with new sounds from scratch (but they had no knowledge of music theory, so the results were terrible); and it took a generation or longer before synthesizers were really “here to stay” as far as mainstream radio was concerned. The new crowd doesn’t fiddle with programming patterns into TB-303s or TR-808s, though. They have better tools in the form of proper digital audio workstations.
Text-to-image prompting is the AI art equivalent of an analogue synthesizer. They’re primitive tools, designed in isolation from how artists actually create; right now an experienced artist with a clear vision is more likely to resort to their tablet or paintbrush than to waste time trying to tease Stable Diffusion or Midjourney into producing comparable results. But that doesn’t make them any less valid as instruments of creation; they just make different mistakes than a pencil.
sukru,
That’s a great point: if we limit training material to the public domain and things that are out of copyright, it fundamentally changes the AI’s level of expertise. The public domain offers a great representation of human knowledge from over half a century ago, but it opens up humongous knowledge gaps.
Just take the field of medicine, for example: AI has a great opportunity to improve patient services.
https://www.businessinsider.com/chatgpt-more-empathetic-than-doctors-study-save-time-bedside-manner-2023-5?op=1
I fully understand the need for quality controls, but everyone including Thom should agree that depriving the AI of modern texts and making it dependent on antiquated ones isn’t a productive path…there needs to be a better solution.
Alfman, there is a problem here. What the maker of ChatGPT wrote is true but deceptive.
https://huggingface.co/Mitsua/mitsua-diffusion-one
This is an example of something trained on legal sources. Do note that it is not public domain sources only.
“No Rights Reserved” is a status the author of a new work made today can declare on it. Yes, “No Rights Reserved” is still a valid copyright status; for documents like this, you should still cite your source.
Public domain is another kettle of fish: works whose copyright has expired or been forfeited, and works that could not be copyrighted in the first place.
–A public-domain book is a book with no copyright, a book that was created without a license, or a book where its copyrights expired or have been forfeited.–
As per Wikipedia.
https://www.archives.gov/research/still-pictures/permissions
–Materials created and produced by United States federal agencies, or by an officer or employee of the United States Government as part of that person’s official duties, are considered works of the United States Government. These works are not eligible for copyright protection, in the United States, and are treated as though they are in the public domain. —
Yes, there are lots of documents produced every year that are “No Rights Reserved” because, due to some legal restriction by a government, they are not eligible for copyright protection.
The reality is there are a lot of modern texts that are legally usable. Think about it: if you are training AI on current government recommendations for medical treatment, the majority of that data falls under “No Rights Reserved” around the world, due to government laws meant to make sure these documents could get to human doctors in the first place. Yes, medical process documents paywalled by copyright turn out not to be that useful.
There is a lot more legal training data when you combine the public domain with “No Rights Reserved” works.
Also, there are wacky things that cannot be copyrighted. You know all those forms you fill in by law? In most countries the blank form cannot be copyrighted, and this can include all the text providing directions on how to fill in the form. This leads to some really wacky cases where, in some countries, a blank exam has no copyright but the answer sheet for the exam is copyrighted.
Would working out how to train AI using only the public domain, “No Rights Reserved” works, and items that cannot be copyrighted be a pain in the butt? Absolutely yes. But examples of AI trained that way exist. Why is it a pain in the butt? Because it requires filtering sources and excluding works with more restrictive copyright.
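The filtering step described above can be sketched in a few lines. The record format and license strings here are invented for illustration; a real pipeline would need far more careful license normalization and provenance checking:

```python
# Keep only training records whose license metadata marks them as
# public domain / "No Rights Reserved". Anything restrictive or unknown
# is excluded -- when in doubt, leave it out.
ALLOWED_LICENSES = {"public-domain", "cc0", "us-government-work"}

def filter_training_set(records):
    # A record with no license metadata at all is treated as restricted.
    return [r for r in records
            if r.get("license", "").lower() in ALLOWED_LICENSES]

corpus = [
    {"text": "An 1890s novel...", "license": "public-domain"},
    {"text": "A CDC treatment guideline...", "license": "US-Government-Work"},
    {"text": "A 2021 news article...", "license": "all-rights-reserved"},
    {"text": "A scraped forum post..."},  # no license metadata at all
]
print(len(filter_training_set(corpus)))  # only the first two survive
```

The design choice worth noting is the default: unlabeled data is dropped rather than kept, which is exactly the opposite of how scrape-everything training pipelines behave.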
Also, parties like OpenAI really should be using the USA copyright timeframe, since the USA has the longest copyright term before a work falls into the public domain.
The statement by OpenAI annoys me. It’s true but deceptive. OpenAI does not want to go through the process of legally challenging copyright terms, or the process of filtering their inputs down to “new works without copyright” + “No Rights Reserved” + public domain.
Also, they don’t want to have to make their AI able to cite the documents it has used, because that would mean admitting they had used copyrighted work.
Yes, legally, because what AI produces is a stack of who-knows-what, you cannot legally put a new copyright on anything AI-generated. Welcome to a future legal nightmare for those who use AI to assist themselves in writing code. A lot of AI-generated output today has who-knows-what copyright status; all you know for sure is that it’s not yours, so you cannot be sure it’s compatible with your copyright.
If you have an AI that is fully trained on public domain + “No Rights Reserved” material and you place what it generates in your own copyrighted work, it’s not going to nuke your work’s copyright.
Yes, there have been examples of comic books that used AI to enhance their graphics whose authors were told by judges that they cannot copyright the work under their own copyright, because they cannot prove the copyright status of what the AI used, so the work is tainted. If they want to release the comic book with copyright, they must release the non-AI-improved version.
oiaohm,
I’m not quite following your point here. Limiting AI neural networks to public domain works implies depriving AI from learning the exact same copyrighted materials that humans use to learn.
Even to the extent that this is true, that isn’t where the controversy is at.
I understand you’d like them to challenge the copyright terms, I don’t have an issue with that. However the point remains this is placing a new burden on AI that doesn’t exist for human readers.
It’s pretty clear they ARE using copyrighted works. The debate is really about whether they should be allowed to or not. When a human doctor reads a copyrighted work in a medical journal, the knowledge is obviously retained in their brain, where it can be recalled. The knowledge learned by the doctor from the copyrighted work can be used to change the doctor’s opinions, improve the doctor’s practice, even to publish a new book on the subject in their own words. Copyright permits humans to do this. The obvious question for me, and probably many others, is whether AI should have the right to do the exact same thing.
Do you mind sharing sources for this?
https://arstechnica.com/information-technology/2023/02/us-copyright-office-withdraws-copyright-for-ai-generated-comic-artwork/
https://www.klgates.com/Federal-Court-Rules-Work-Generated-by-Artificial-Intelligence-Alone-Is-Not-Eligible-for-Copyright-Protection-8-30-2023
Alfman, the above is a different ruling, but when you dig around there are multiple rulings saying AI-generated work is a problem.
–When a human doctor reads a copyrighted work in a medical journal, the knowledge is obviously retained in their brain where in can be recalled. The knowledge learned by the doctor from the copyrighted work can be used to change the doctor’s opinions, improve the doctor’s practice, even to publish a new book on the subject using their own words. Copyright permits humans to do this this. The obvious question for me and probably many others is whether AI should have the right to do the exact same thing.–
There is something important here. Depending on your country, copyright law puts other requirements on you. Australian fair dealing requires you to provide relevant attribution for the copyrighted works you have used, or you are legally dead in the water. Under USA fair use, if you have not included attribution, you will be up for higher damages if it turns out you did not have the legal right to use the copyrighted work, because lacking attribution, the person reading your work could not have gone and bought the source work, so you have stolen more money.
This is a big problem: tools like ChatGPT don’t normally generate valid attribution even when asked.
You cannot ask ChatGPT whether its answer is based on public domain material and get anywhere near a correct answer. Yes, a human might get this wrong some of the time for their own works, but 90 percent of the time a human author will make the right call about what works their work is based on.
The big thing about these big AI models is that they cannot do something humans can do: fairly correctly attribute the work they have generated. A human citing the documents their writing is based on normally has a better chance of pulling it off correctly.
Yes, the relevant-attribution rule of Australian copyright law allows for minor errors in attribution (not attributing some works that were used, and attributing some works that were not used), but the majority of the attribution has to be right.
It is horrible to ask ChatGPT to attribute its sources and watch it make up URLs that don’t exist and have never existed, and then proceed not to attribute any valid source of anything it used.
AI tools like ChatGPT are getting away with doing things that, if a human did them and went into a copyright court, would get them ripped a new one.
From my point of view, if an AI bot is to be treated like a human under copyright law, it should be able to somewhat correctly attribute what it used to generate its output. Somewhat means at least 50 percent correct, though since it’s a computer system, demanding 100 percent correct should not be off the cards. A human will get away with their attribution being up to 50 percent wrong, but that also means it has to be 50 percent right.
With medical advice, attribution is very important. What if the advice you are giving is based on a paper that has been proven absolutely bogus, and following it leads to dead humans?
The reality here is that these AI systems are getting away with doing something that, if a human did it and got caught, could completely ruin their life, with copyright laws around the world demanding millions of dollars because what they did was not fair use or fair dealing, since there was no credit to sources.
There is a requirement on human authors that AI authors are not living up to. This ends up with courts ruling that AI works cannot be copyrighted, because there is zero attribution for what the work is based on. A human author questioned about their work will be able to answer with some attribution for where its ideas came from.
Alfman, put yourself in the judge’s seat for three cases.
Case 1: A human author is having his copyright questioned. The court has asked him to provide all works that contributed to the current work coming into existence, and he has. Now the court can go through and show without question that this is a majority new work, with inspiration taken from other material. For me, case 1 is absolutely a valid copyright, and most judges have agreed.
Case 2: A human author is again having his copyright questioned. This time he will not provide any attribution, instead claiming this is a 100 percent new idea with no source materials to attribute. You find case after case of this being laughed out of court and the copyright not upheld, because the work is most likely stolen.
Case 3: The AI-generated case. There is no attribution to provide to the court. The courts have been ruling the same as case 2: there is no copyright, because the work cannot be proven not to be stolen.
The reality here is that the courts are asking AI systems to provide the same things a human author before the court could provide. Current AI fails to do this, so copyright protection on AI-generated work does not hold up. Without being able to attribute sources, or to ask someone to attribute sources, the work is tainted and a legal problem to use.
oiaohm,
Thank you very much for the link.
Historically it was necessary to provide evidence to prove copyright infringement, but it appears the copyright office is revising copyright policy by assuming infringement in the absence of evidence. I understand the controversy here, but I have to wonder if this is setting a rather dangerous precedent. Withholding copyright protections from those who cannot prove there wasn’t copying is a total reversal, which could have drastic consequences not only for comic book authors but for news and software authors as well. I wonder if they really thought this through.
Not too long ago I predicted that government AI prohibitions wouldn’t actually stop AI; instead they would incentivize lying about its use, and this case perfectly demonstrates why. Does the copyright office intend to start challenging all copyrights now? If so, it’s only a matter of time before the copyright system collapses because there’s no proof.
Attribution applies to quotes and expressions, but ideas themselves cannot be copyrighted.
The problem with your view is that human authors/reporters/artists/software devs/etc fail to meet that bar too. It might help to provide more details if one’s work is contested, but copyright law doesn’t require preemptive disclosure and I bet most authors would not really be able to prove the originality of their ideas if someone comes up and shows it’s similar to a preexisting work – even if the author thought it up independently.
“I’m not quite following your point here. Limiting AI neural networks to public domain works implies depriving AI from learning the exact same copyrighted materials that humans use to learn.”
This is exactly it, many works are licensed for the explicit purposes of training humans, they are _NOT_ licensed for training AI models. Copyright allows the author to decide how and for what their work is used.
bert64,
I don’t think the use of “explicit” fits here though. The vast majority of sources have never stipulated this and at best you could say it’s “implied”, but even then it’s only presumptuously implied.
As kurkosdr was pointing out, the concepts of fair use and transient copies mean that intermediary copies are typically not counted as infringing because they’re not redistributed that way. These sorts of transient transformations happen all the time with software, web browsers, DVD players, screen readers, etc. Traditionally these transient forms are not considered infringing, and in principle a NN only creates transient copies. If no permanent copies are distributed, the case for infringement becomes weak.
When humans do the same thing, it’s considered fair use. Things will become quite awkward for legal purposes if we start applying double standards, one for humans and another for computers. And I concede there are people who want to penalize AI and don’t care about consequences, but I have serious doubts as to whether penalties can work long term…Prohibitions on AI are unlikely to stop AI development, rather it will change jurisdiction and move underground. We’ll end up with a legal system that rewards the liars, is this really a better outcome?
And why is this a necessity to humankind in the first place?
+1
I really don’t understand why copyright is a necessity to humankind in the first place.
Consider this: I don’t give a fuck about copyright.
j0scher,
Is it a tool for old creators, or is it a tool for the wealthy to own something more? I’m not sure if artists who have been dead for 70 years give a fuck about their rights.
Thom, please take some time to acquaint yourself with the concept of fair use and transient copies. Or are you seriously trying to claim that weights on the synapses of a neural net are “copyright infringement”? Because that would require such a wide definition of copyright and such a narrow definition of fair use, that it would be even beyond the MPAA’s wildest dreams.
My take is that artists, record labels, and studios have their 95 years of copyright (or 70 years after the author’s death), “because that’s what the law says”. If LLMs are not infringing copyright “because that’s what the law says”, so be it. We are beyond the point of pretending copyright law is reasonable in any way. It’s all about what the law says.
Some things are forbidden by law. Stealing (as in: from a shop). Experimenting with human embryos. Plagiarism.
If that stops certain advances in either growing organs, or in making some kind of IT applications, so be it.
Humans aren’t allowed to plagiarize. Why should an application be allowed to?
DannyBackx,
Yes, that’s a simple concept, but you’re applying it to a nuanced topic with unfair presumptions of innocence and guilt. It’s unfair to place humans on a pedestal that perhaps isn’t deserved. Creating something completely new and non-derivative across all of humanity is exceptionally rare. Humans draw inspiration from other works all the time. Hopefully it’s not verbatim, but everyone copies ideas, and it’s not their fault because overlap is inevitable. So with this in mind, IMHO these language models have become quite good at mixing ideas into new expressions, like we do.
So we’re one court decision away from having a major gen AI industry meltdown?
dsmogor,
Some people may wish this, but I don’t think people should get too hopeful; AI will succeed one way or another, there’s too much money to gain. The only question is who controls it and which countries will allow them to host it. In other words, there is no scenario in which a worldwide crackdown is successful. The countries that may want to restrict AI have too many adversaries who’d be willing and able to take AI money instead.