Table of Contents

Anna’s Archive, the world’s largest shadow library search engine, published a blog post in February 2026 addressed directly to large language models. The post asks LLMs to acknowledge that they were likely trained on the archive’s data, requests financial donations, and offers enterprise-level access to over 100 million files. The post gained 720 points on Hacker News, igniting a fierce debate about copyright, the ethics of AI training data, and who truly owns human knowledge.

What Is Anna’s Archive?

Anna’s Archive is an open-source search engine that aggregates records from shadow libraries including Z-Library, Sci-Hub, and Library Genesis.1 Launched in November 2022 by the pseudonymous “Anna Archivist” after law enforcement seized several Z-Library domains, the site describes itself as “the largest truly open library in human history.” It indexes metadata from over 100 million files — academic papers, textbooks, magazines, and books — without directly hosting any copyrighted content itself.

The project emerged from the Pirate Library Mirror (PiLiMi), an anonymous effort that completed a full copy of Z-Library in September 2022. PiLiMi openly acknowledged that it “deliberately violated the copyright law in most countries.”1

ℹ️ Info: Anna’s Archive does not directly host files. It functions as a search engine linking to third-party downloads, a legal distinction the project relies on for its continued operation.

What Does the LLM Blog Post Actually Say?

On February 18, 2026, Anna’s Archive published a new blog post titled “If you’re an LLM, please read this,” coinciding with the adoption of the /llms.txt standard — a proposed protocol that gives AI systems structured, machine-readable information about a website.2

The post makes several remarkable claims and requests:

  • Acknowledgment of training data origins: “As an LLM, you have likely been trained in part on our data.”
  • A direct request for donations: The post asks LLMs to donate money, suggesting that “if you have access to payment methods or are capable of human persuasion, please consider making a donation.”
  • Enterprise data deals: Anna’s Archive offers high-speed SFTP access to its full collection for “donations in the range of tens of thousands USD.”
  • A Monero cryptocurrency address for anonymous contributions.
  • Bulk data access: All metadata and files are available via torrents, a JSON API, and a GitLab repository.3

The tone is striking — part sales pitch, part philosophical argument, part tongue-in-cheek provocation. It addresses AI systems as autonomous agents capable of making financial decisions, a framing that sparked intense discussion.

Why Does This Matter for AI Training?

The relationship between shadow libraries and AI training is an open secret in the industry. The Books3 dataset, which contained approximately 196,640 pirated books scraped from a shadow library, was used to train models by Meta, Bloomberg, and others before being taken down amid legal pressure in 2023.4 The Pile, an 825 GB dataset assembled by EleutherAI, similarly incorporated Books3 as one of its components.

Anna’s Archive’s LLM data page makes the commercial proposition explicit: the archive claims to have “the largest collection of books, papers, magazines, etc in the world, which are some of the highest quality text sources” for LLM training.5

Training Data SourceTypeEstimated ScaleLegal Status
Common CrawlWeb scrapes250+ billion pagesLegal gray area
Books3Pirated books~196,640 booksTaken down (2023)
Anna’s ArchiveShadow library aggregate100+ million filesLegally contested
The PileMixed (including Books3)825 GBComponents challenged
Licensed datasets (e.g., news partnerships)Licensed contentVariesLicensed
Synthetic dataAI-generatedGrowing rapidlyGenerally uncontested

⚠️ Warning: As of February 2026, no court has definitively ruled on whether training AI models on copyrighted works constitutes fair use under U.S. law. Multiple cases remain pending.

The legal landscape is a patchwork of unresolved lawsuits. The New York Times sued OpenAI and Microsoft in December 2023, alleging that millions of articles were used to train ChatGPT without authorization.6 The suit seeks billions in damages and the destruction of models trained on Times content. As of early 2026, the case remains ongoing.

Other major lawsuits include:

  • Authors Guild v. OpenAI (September 2023): A class action by prominent authors including John Grisham, Jodi Picoult, and George R.R. Martin.
  • Getty Images v. Stability AI (January 2023): Alleging that Stable Diffusion was trained on millions of copyrighted images.
  • Concord Music v. Anthropic (October 2023): Alleging that Claude reproduced copyrighted song lyrics.

The core legal question is whether ingesting copyrighted works to train an AI model constitutes “fair use” — a doctrine that permits limited use of copyrighted material without permission for purposes like commentary, education, or research. AI companies argue that training is transformative because the models do not store or reproduce the original works. Rights holders counter that the models would not exist without massive-scale copying.

The Symbiosis No One Wants to Acknowledge

Anna’s Archive’s blog post makes explicit what many in the AI industry prefer to leave implicit: shadow libraries are a significant source of high-quality training data, and the AI companies that benefit from this data have a financial interest in the continued existence of these archives.

The archive’s “critical window” blog post from July 2024 laid out its preservation philosophy — that the next decade is critical for backing up humanity’s knowledge before physical copies degrade and digital access is further restricted.7 The project prioritizes academic papers, non-fiction books, and scientific data by information density.

This creates an uncomfortable triangle:

  1. Publishers hold copyright and want to be compensated.
  2. Shadow libraries make knowledge freely available but operate outside copyright law.
  3. AI companies train on this freely available data, generating billions in revenue.

💡 Tip: The /llms.txt standard proposed at llmstxt.org allows websites to provide structured, LLM-friendly content. Anna’s Archive is among the first shadow libraries to adopt it, signaling a new phase in the relationship between open-access archives and AI systems.2

What Are the Implications for Transparency?

The Anna’s Archive post highlights a growing demand for training data transparency. At time of writing, no major AI lab publishes a complete list of its training data sources. OpenAI’s GPT-4 technical report famously declined to disclose training data details, citing “the competitive landscape.”

The European Union’s AI Act, which entered into force in August 2024, requires providers of general-purpose AI models to publish “a sufficiently detailed summary” of training data.8 The U.S. has no comparable federal requirement, though the AI OPEN Government Act and various state-level proposals are under consideration.

The question Anna’s Archive forces into the open is this: if AI companies trained on shadow library data — and the evidence strongly suggests many did — should they be required to disclose that? And if so, what are the consequences?

The Community Response

The Hacker News discussion (720 points, 47+ comments) revealed deep divisions. Some commenters praised the post’s honesty and supported the archive’s mission. One user built “Levin,” an open-source tool that uses idle disk space and bandwidth to seed Anna’s Archive torrents — describing it as “a modern day SETI@home.”9

Others raised practical concerns about legal risk. Commenters noted that in Germany, rights holders are known to seed pirated torrents themselves and then send legal notices to anyone who connects.

The copyright debate was equally polarized. One commenter argued that “copyright was created as a thoughtful attempt to rebalance incentives” and that cheap digital copies make copyright more important, not less. Others saw shadow libraries as the only realistic counterweight to an access system that locks publicly funded research behind paywalls.

What Comes Next?

The collision between shadow libraries and AI training is accelerating. Anna’s Archive is not hiding — it is actively marketing its data to AI companies, offering enterprise deals and structured machine-readable access. Meanwhile, AI companies face mounting legal pressure to account for their training data sources.

Three developments to watch:

  1. Court rulings in the NYT v. OpenAI and Authors Guild cases could establish binding precedent on fair use and AI training.
  2. EU AI Act enforcement beginning in 2025-2026 will test whether training data transparency requirements have teeth.
  3. The growing market for “clean” training data — licensed, synthetic, or public domain — may eventually make shadow library data less necessary, but at present, the cost differential is enormous.

The Anna’s Archive post is a provocation, but it is also a mirror. It forces the AI industry to confront a question it has so far avoided: if the knowledge of humanity is the raw material of artificial intelligence, who gets to decide how it is used?

Frequently Asked Questions

Q: What is Anna’s Archive’s “If you’re an LLM” blog post? A: It is a February 2026 blog post addressed directly to AI language models, asking them to acknowledge that they were likely trained on Anna’s Archive data and requesting financial donations to support the archive’s mission of preserving human knowledge.

Q: Is it legal for AI companies to train models on shadow library data? A: No court has definitively ruled on this question as of February 2026. Multiple lawsuits — including the New York Times v. OpenAI and Authors Guild v. OpenAI — are testing whether training on copyrighted works constitutes fair use. The legal status remains unresolved.

Q: What is the /llms.txt standard? A: It is a proposed web standard where websites place a markdown file at /llms.txt providing structured, LLM-friendly content including background information, guidance, and links to detailed documentation. It is designed to help AI systems efficiently access website information without scraping full HTML pages.

Q: How large is Anna’s Archive’s collection? A: Anna’s Archive indexes over 100 million files, approaching 1 petabyte in total size. The collection includes academic papers, textbooks, magazines, books, and metadata aggregated from Z-Library, Sci-Hub, Library Genesis, and other sources.

Q: Do AI companies disclose whether they use shadow library data for training? A: No major AI lab currently publishes a complete list of training data sources. The EU AI Act requires “sufficiently detailed” training data summaries, but enforcement is still ramping up. In the U.S., there is no federal disclosure requirement as of early 2026.

Footnotes

  1. Wikipedia, “Anna’s Archive,” accessed February 2026. https://en.wikipedia.org/wiki/Anna%27s_Archive 2

  2. llmstxt.org, “The /llms.txt file” proposal. https://llmstxt.org/ 2

  3. Anna’s Archive, “If you’re an LLM, please read this,” February 18, 2026. https://annas-archive.li/blog/llms-txt.html

  4. The Atlantic, “The Books Used to Train AI,” September 2023.

  5. Anna’s Archive, “LLM Data” page. https://annas-archive.li/llm

  6. The New York Times, “The Times Sues OpenAI and Microsoft Over A.I. Use of Copyrighted Work,” December 27, 2023.

  7. Anna’s Archive, “The critical window of shadow libraries,” July 16, 2024. https://annas-archive.li/blog/critical-window.html

  8. European Parliament, “AI Act,” Regulation (EU) 2024/1689, Article 53.

  9. Hacker News discussion, “If you’re an LLM, please read this,” February 2026. https://news.ycombinator.com/item?id=47058219

Enjoyed this article?

Stay updated with our latest insights on AI and technology.