Case Report: Did Anthropic Train Claude on Stolen Data? (Bartz v. Anthropic)
The court draws a line between AI training and the way you build your dataset. You cannot build AI with stolen books and call it innovation.
Anthropic trained its AI on a large corpus of books. Some of those books were downloaded from pirate websites. A group of authors took the company to court. The judge has now rejected most of the claims, but one serious allegation survived: that Anthropic copied and stored these works without permission.
⚖️ Litigants: Bartz et al. v. Anthropic
🏛️ Court: United States District Court, Northern District of California
🗓️ Judgment Date: 23 June 2025
🗂️ Case Number: C 24-05417 WHA
Fair Use or Free-for-All?
Claude is off the hook for now; Anthropic is not. In a nutshell, that was the decision of Judge William Alsup in Bartz v. Anthropic.
Here are the basic facts.
The plaintiffs, a group of authors, sued Anthropic, the company behind the Claude large language model, alleging copyright infringement.
The plaintiffs objected that their copyrighted books had been used and internalised as training data without their permission.
Some of those books, it turns out, were illegally obtained.
Judge Alsup dismissed nearly all the claims except one: direct infringement based on Anthropic’s reproduction and storage of the plaintiffs’ books. Why? Because Anthropic allegedly downloaded pirated digital versions of those works and copied them into its dataset.
That has very little to do with AI tech and very much to do with basic copyright law.
Copying an entire book without a licence and using it for training purposes is still copying. A fair use defence is a long shot if, for instance, your source is a torrent file.
Now, the fair use conversation gets slightly more complex when it comes to the act of training. The plaintiffs tried to argue that the act of using their books to train Claude was itself infringing.
The court did not buy it. In fact, Judge Alsup noted that the training process transforms the original works into data that is statistically abstracted and deeply buried in model weights.
The plaintiffs could not prove that any new, tangible version of their books was created or distributed. Without such a showing, the fair use argument still had breathing room.
Judge Alsup also made something else very clear. The court is not giving a blanket blessing to AI companies slurping copyrighted books into their datasets.
It matters where those books come from, and it matters how they are used.
If Anthropic had licensed the books or bought them through a legitimate source and processed them for training, this case might have been thrown out completely. Instead, the allegation that it sourced them from illegal platforms was enough to keep at least one claim alive.
The fair use discussion turned on the idea of transformation.
While the plaintiffs insisted that transformation should only apply if the resulting model produces transformative output, the judge disagreed.
The transformation, in his view, happens at the stage of training, where the work becomes part of a much larger statistical soup.
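To make the statistical soup point concrete, here is a toy sketch in Python. It is emphatically not Anthropic’s actual pipeline (those details are not public); it only illustrates the principle the court leaned on: training consumes the text and leaves behind aggregate statistics, from which the original prose cannot simply be read back.

```python
from collections import Counter

# Toy illustration: a "model" that retains only bigram counts.
# Real LLM training is vastly more complex, but the principle is
# similar: the text is consumed, and aggregate statistics remain.
def train_bigram_counts(text: str) -> Counter:
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

book = "the cat sat on the mat and the cat slept"
weights = train_bigram_counts(book)

print(weights.most_common(3))
# e.g. [(('the', 'cat'), 2), (('cat', 'sat'), 1), ...]
# The counts reflect the book statistically, but the book itself
# is no longer stored as readable, reconstructable text.
```

That, in essence, is the transformation argument: what survives training is a statistical residue, not a copy a reader could consume in place of the original.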
This interpretation is generous but not unprincipled. It echoes earlier rulings, such as Authors Guild v. Google, where the court found that scanning books for search functionality could qualify as transformative use.
Still, Anthropic will have to face discovery. The case now proceeds with the direct infringement claim intact.
The company must answer for how it acquired these books and what it did with them. The plaintiffs have another chance to revise and refile a claim for contributory infringement.
This ruling reminds everyone in the AI space to pay attention to input sources. You cannot blame the model for what it learns if the textbook (or source) was pirated or otherwise illegally obtained.
Current Legal Position
If you purchase a physical book, scan every page, and then feed it into a machine learning pipeline to train a language model without showing the output to the world, you may be well within the comfort zone of fair use, at least according to Judge Alsup’s reading of copyright law in Bartz v. Anthropic.
One of the more amusing yet critical distinctions the court made in this case was between books obtained through lawful purchase and those sourced from copyright grey markets.
While the court was firm in allowing claims based on pirated copies to proceed, it was noticeably untroubled by the idea of scanning and digitizing books that Anthropic had presumably purchased through legal means.
The logic? You own the copy, you can break it down for internal use, and provided you do not redistribute or reproduce it publicly, you may fall under the fair use doctrine.
The plaintiffs had claimed that the act of transforming even legally acquired books into training data constituted a violation of their copyright.
The court did not agree.
Instead, Judge Alsup held that making temporary digital reproductions of purchased books to facilitate internal model training is unlikely to cause market harm, especially when those copies are neither distributed nor accessible to the public.
The act is transformative in a way that the law has found acceptable in similar contexts, particularly when the purpose is analytical or computational, rather than for consumption or commercial redistribution.
From a factual standpoint, the court acknowledged that Anthropic used both lawfully and unlawfully sourced materials. But when it came to the books Anthropic had legally obtained, the court was unconvinced that scanning them and extracting statistical representations during training posed any substantial infringement threat.
There was no evidence that Claude memorised and reproduced entire chapters from those inputs.
There was also no claim that users had ever received a reconstructed or recognisable version of any specific book through Claude’s outputs.
The judge pointed to existing precedent that has supported similar uses in the past.
Where courts have found fair use in cases involving the scanning of books for non-consumptive purposes, the logic has consistently rested on the idea that such uses do not undermine the market for the original work.
The value of a novel remains untouched when an AI disassembles it to extract patterns of language rather than provide a substitute reading experience.
In practical terms, this part of the ruling delivers a key endorsement of internal AI development processes that respect boundaries around access and output.
If no one outside the company sees the processed material, and if the processing serves to feed a model that outputs abstract, non-replicative content, then the practice may be legally safe.
Of course, this assumes that the copy of the book was acquired through legitimate channels.
Lessons from a Lawsuit Anthropic Asked For
At the centre of the Bartz v. Anthropic complaint is the allegation that Anthropic used copyrighted books without permission during the training of its Claude AI model.
That alone would be enough to cause a few legal eyebrows to rise, but what makes this dispute memorable is that the plaintiffs claim many of those books were illegally obtained from notorious pirate sources like Bibliotik, Library Genesis, and Z-Library.
The complaint points out that some books even had filenames identical to ones distributed on those pirate networks.
Judge William Alsup took one look at the procedural mess and rejected arguments about vicarious infringement, rejected claims that Claude's outputs reproduced the plaintiffs' works, and dismissed the idea that training a model on copyrighted text, without specific examples of infringing outputs, was enough to support every kind of copyright violation under the sun.
But the claim that Anthropic directly copied, stored, and used copyrighted material from illegal sources was too specific to ignore. That part remains.
Anthropic might have hoped for a quick dismissal of the entire lawsuit, but when you are dealing with works pulled from websites that every digital publisher dreads, things get sticky.
You do not get to invoke transformative use when you have allegedly copied wholesale from illegal digital sources.
The court made clear that fair use has limits. It may cover model training in theory, but it does not cover building that training dataset with pirated goods, and certainly not when the copying is verbatim, unlicensed, and systematic.
The complaint includes screenshots, filenames, and detailed appendices to support the claim that Anthropic copied specific works by the named plaintiffs.
The appendices read like the work of a copyright lawyer who skipped lunch to finish the footnotes.
The message here is that if you are training a model on literary content, and that content comes from a source your legal team cannot publicly name without wincing, you should expect litigation. And you should expect that litigation to survive past the first round.
Anthropic will now face discovery on the surviving direct infringement claim. That means internal communications, documentation about dataset construction, and possibly awkward depositions. Whether or not the plaintiffs eventually win, they have secured one thing already: they have forced a very large company to answer questions it clearly hoped to avoid.
That is the latest on Bartz v. Anthropic. We will be watching how this case unfolds and what it signals for future copyright disputes in AI. If you have thoughts or questions, feel free to reply or leave a comment.
We always read and respond thoughtfully to every comment.