Last fall, I wrote about how the fear of AI was leading us to wall off the open internet in ways that would hurt everyone. At the time, I was worried about how companies were conflating legitimate concerns about bulk AI training with basic web accessibility. Not surprisingly, the situation has gotten worse. Now major news publishers are actively blocking the Internet Archive—one of the most important cultural preservation projects on the internet—because they’re worried AI companies might use it as a sneaky “backdoor” to access their content.
This is a mistake we’re going to regret for generations.
Nieman Lab reports that The Guardian, The New York Times, and others are now limiting what the Internet Archive can crawl and preserve:
When The Guardian took a look at who was trying to extract its content, access logs revealed that the Internet Archive was a frequent crawler, said Robert Hahn, head of business affairs and licensing. The publisher decided to limit the Internet Archive’s access to published articles, minimizing the chance that AI companies might scrape its content via the nonprofit’s repository of over one trillion webpage snapshots.
Specifically, Hahn said The Guardian has taken steps to exclude itself from the Internet Archive’s APIs and filter out its article pages from the Wayback Machine’s URLs interface. The Guardian’s regional homepages, topic pages, and other landing pages will continue to appear in the Wayback Machine.
The Times has gone even further:
The New York Times confirmed to Nieman Lab that it’s actively “hard blocking” the Internet Archive’s crawlers. At the end of 2025, the Times also added one of those crawlers — archive.org_bot — to its robots.txt file, disallowing access to its content.
“We believe in the value of The New York Times’s human-led journalism and always want to ensure that our IP is being accessed and used lawfully,” said a Times spokesperson. “We are blocking the Internet Archive’s bot from accessing the Times because the Wayback Machine provides unfettered access to Times content — including by AI companies — without authorization.”
I understand the concern here. I really do. News publishers are struggling, and watching AI companies hoover up their content to train models that may then compete with them for readers is genuinely frustrating. I run a publication myself, remember.

This is a total lie. This has nothing to do with AI. They've hated archive sites for years because forums like this one hate their paywalls, and we prefer to actually read their articles and discuss them instead of hitting a paywall every time.
NYT is one of the worst offenders, and as a company it has taken a turn for the worse in the last 5-10 years, maybe even more so than Amazon Post. None of the old media companies really understand how to adapt to the Internet age, so they are slowly dying. It's like they're stuck in a perpetual economic bubble that hasn't figured out how to pop itself. There's so much damn news, and so many outlets copying and regurgitating each other's stories a hundred times over, that we're forced to aggregate it and have YouTubers hawk shit like Ground News just to process it all.
I have thoughts about this, but I’ll be civil. Let’s just say I was in a budget meeting with Len Downie.