Meta trained Llama on copyrighted material, new filing claims

10 Jan 2025


Meta requested that a ‘large portion’ of the new filing be redacted, but the judge denied the request.

In a new filing, counsel for a trio of authors suing Meta claims that the company allowed its artificial intelligence (AI) large language model Llama to commit copyright infringement on pirated data and upload it for commercial gain.

The lawsuit was initially filed in July 2023 in the US District Court for the Northern District of California by authors Richard Kadrey, Christopher Golden and Sarah Silverman, and is just one of many ongoing AI copyright-related legal battles against Big Tech.

The fresh document, filed on 8 January, cites “newly discovered evidence”. The new allegations include that Meta was aware that its training dataset for Llama contained copyrighted material used without permission, and that the company removed the copyright management information (CMI) from the works before processing the data for its AI model.

The discovery, the filing adds, “suggests” that the company strips CMI “not just for training purposes, but also to conceal its copyright infringement”.

It also claims that Meta went out of its way to include “supervised samples of data” to ensure that Llama’s output “would include less incriminating answers” when responding to prompts regarding the source of its training data.

The filing also claims that Meta downloaded content for Llama’s training from Libgen, a torrent website that provides access to copyrighted works from publishers including Cengage Learning, Macmillan Learning, McGraw Hill and Pearson Education, acting as a “leech” on pirated data.

The filing further notes that a Meta corporate representative, who testified under oath in November 2024, “admitted” to uploading pirated files that contained the plaintiffs’ works to torrent websites.

Vince Chhabria, the judge presiding over the case, who rejected Meta’s request to redact “large portions” of the new filing, said: “It is clear that Meta’s sealing request is not designed to protect against the disclosure of sensitive business information that competitors could use to their advantage … rather, it is designed to avoid negative publicity.”

However, in November 2023, Chhabria granted Meta’s motion to dismiss all claims in the lawsuit except for the one alleging that Meta, without authorisation, copied the plaintiffs’ books to train Llama.

In granting the motion to dismiss, the judge said that the other allegations, including the claim that Llama “cannot function without” information extracted from the plaintiffs’ books, were “nonsensical”.


Suhasini Srinivasaragavan is a sci-tech reporter for Silicon Republic
