
Perplexity AI has been caught using undisclosed "stealth" crawlers to bypass website restrictions and access content that site owners explicitly blocked from AI training, according to a new report from Cloudflare.
The discovery raises serious questions about consent and transparency in AI data collection—and shows how some AI companies are willing to break established web protocols to feed their systems.
Cloudflare's investigation began after customers complained that Perplexity was still accessing their content despite blocking the company's official crawlers (PerplexityBot and Perplexity-User) through robots.txt files and firewall rules.
The robots.txt file is a simple text file placed in a website's root directory that tells web crawlers (automated bots) which pages they can or cannot access.
Think of it as a "No Trespassing" sign for bots—it's based on an honour system where well-behaved crawlers voluntarily respect these directives.
The file uses straightforward commands like "Disallow: /private/" to block access to specific folders or "User-agent: *" to apply rules to all bots. While robots.txt has governed web crawling etiquette for decades, it's not legally binding, and as recent AI controversies show, not all crawlers respect these polite requests.
To test this, Cloudflare created brand-new domains that were never publicly indexed and explicitly banned all automated access. Yet when researchers queried Perplexity about these secret test sites, the AI provided detailed information about their restricted content.

The smoking gun? Perplexity wasn't just using its declared crawlers. When blocked, it switched to a generic browser user agent designed to impersonate Google Chrome on macOS, processing 3-6 million daily requests through this disguised identity.
Even more troubling, the company rotated through multiple undeclared IP addresses spread across different networks (autonomous system numbers, or ASNs) to evade detection, behaviour that violates the decades-old Robots Exclusion Protocol now formalised in RFC 9309.
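To see why the disguise works, compare the kind of identifying string a declared crawler sends with a generic desktop browser string; the values below illustrate the format rather than quote Cloudflare's logs:

```
# A declared crawler announces itself in its User-Agent header, e.g.:
User-Agent: Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)

# A stealth request instead looks like an ordinary Chrome browser on macOS, e.g.:
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36
```

A site that filters only on the declared crawler names sees the second request as ordinary browser traffic, which is exactly what makes the tactic hard to block with simple rules.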
Technical Trickery vs. Transparency
This stands in stark contrast to how responsible AI companies operate. OpenAI, for example, clearly identifies its crawlers, respects robots.txt directives, and stops crawling when blocked. When Cloudflare ran identical tests with ChatGPT, the bot properly fetched the robots.txt file, respected its restrictions, and made no follow-up attempts to circumvent the blocks.
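That polite pattern is straightforward for any developer to reproduce. Here is a minimal sketch in Python using the standard library's urllib.robotparser; the bot name and URLs are placeholders, not any vendor's real crawler:

```python
import urllib.robotparser
import urllib.request

BOT_NAME = "ExampleBot/1.0"  # placeholder identity, declared honestly
TARGET = "https://example.com/private/report.html"

# Fetch and parse the site's robots.txt before requesting any page.
robots = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
robots.read()

if robots.can_fetch(BOT_NAME, TARGET):
    # Identify the crawler in the User-Agent header instead of posing as a browser.
    request = urllib.request.Request(TARGET, headers={"User-Agent": BOT_NAME})
    with urllib.request.urlopen(request) as response:
        print(f"Fetched {len(response.read())} bytes")
else:
    # A compliant crawler stops here and makes no follow-up attempts.
    print(f"robots.txt disallows {TARGET} for {BOT_NAME}; skipping")
```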
"The Internet as we have known it for the past three decades is rapidly changing, but one thing remains constant: it is built on trust," Cloudflare stated in their report. They've now de-listed Perplexity as a verified bot and added detection rules to block this stealth crawling behaviour.
What This Means for You
If you run a website, this revelation highlights why basic robots.txt files aren't enough anymore. Cloudflare customers with existing bot management rules are already protected, but others should consider implementing more robust blocking measures. The company offers both blocking and challenge rules (which let real humans pass while stopping bots) to help site owners maintain control.
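Cloudflare's own rules are configured through its dashboard rather than in code, but the same idea can be approximated at the origin server. As a rough sketch (the server name and bot list are illustrative, and this is not Cloudflare's rule syntax), an nginx config can refuse requests that identify as known AI crawlers:

```nginx
server {
    listen 80;
    server_name example.com;

    # Refuse requests whose User-Agent matches known AI crawler names.
    # Note: this only catches bots that declare themselves; a stealth crawler
    # presenting a browser User-Agent, as described above, will slip through,
    # which is why behaviour-based detection still matters.
    if ($http_user_agent ~* "(PerplexityBot|Perplexity-User|GPTBot|CCBot)") {
        return 403;
    }

    location / {
        root /var/www/html;
    }
}
```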
For the broader internet, this incident exemplifies growing tensions over AI training data consent. As more websites restrict AI access—over 2.5 million sites now block AI crawlers through Cloudflare—some companies appear willing to use deceptive tactics rather than respect those boundaries.
Cloudflare expects this cat-and-mouse game to continue evolving, with both evasion techniques and detection methods becoming more sophisticated as the AI industry grapples with data access rights.