Publishers vs. the Past: Is AI Hysteria Erasing Online History?
Major news publishers, including **The New York Times** and **The Guardian**, are blocking the **Internet Archive** from crawling their sites, citing concerns over AI scraping. This move threatens to erase a crucial historical record relied upon by journalists, researchers, and the public for nearly three decades.
Imagine a newspaper publisher announcing it will no longer allow libraries to keep copies of its paper.
That's effectively what's begun happening online in the last few months. The **Internet Archive**, the world's largest digital library, has preserved newspapers since it went online in the mid-1990s. The Archive's mission is to preserve the web and make it accessible to the public. To that end, the organization operates the **Wayback Machine**, which now contains more than one trillion archived web pages and is used daily by journalists, researchers, and courts.
But in recent months **The New York Times** began blocking the Archive from crawling its website, using technical measures that go beyond the web's traditional robots.txt rules. That risks cutting off a record that historians and journalists have relied on for decades. Other newspapers, including **The Guardian**, seem to be following suit.
For nearly three decades, historians, journalists, and the public have relied on the **Internet Archive** to preserve news sites as they appeared online. Those archived pages are often the only reliable record of how stories were originally published. In many cases, articles get edited, changed, or removed, sometimes openly, sometimes not. The **Internet Archive** often becomes the only source for seeing those changes. When major publishers block the Archive's crawlers, that historical record starts to disappear.
**The Times** says the move is driven by concerns about AI companies scraping news content. Publishers seek control over how their work is used, and several, including the Times, are now suing AI companies over whether training models on copyrighted material violates the law. There's a strong case that such training is fair use.
Whatever the outcome of those lawsuits, blocking nonprofit archivists is the wrong response. Organizations like the **Internet Archive** are not building commercial AI systems. They are preserving a record of our history. Turning off that preservation in an effort to control AI access could essentially torch decades of historical documentation over a fight that libraries like the Archive didn't start and didn't ask for.
If publishers shut the Archive out, they aren't just limiting bots. They're erasing the historical record.
### Archiving and Search Are Legal
Making material searchable is a well-established fair use. Courts have long recognized it's often impossible to build a searchable index without making copies of the underlying material. That's why when **Google** copied entire books in order to make a searchable database, courts rightly recognized it as a clear fair use. The copying served a transformative purpose: enabling discovery, research, and new insights about creative works.
The **Internet Archive** operates on the same principle. Just as physical libraries preserve newspapers for future readers, the Archive preserves the web's historical record. Researchers and journalists rely on it every day. According to Archive staff, **Wikipedia** alone links to more than 2.6 million news articles preserved at the Archive, spanning 249 languages. And that's only one example. Countless bloggers, researchers, and reporters depend on the Archive as a stable, authoritative record of what was published online.
The same legal principles that protect search engines must also protect archives and libraries. Even if courts place limits on AI training, the law protecting search and web archiving is already well established.
The **Internet Archive** has preserved the web's historical record for nearly thirty years. If major publishers begin blocking that mission, future researchers may find that huge portions of that historical record have simply vanished. There are real disputes over AI training that must be resolved in courts. But sacrificing the public record to fight those battles would be a profound, and possibly irreversible, mistake.