Allegations of Unauthorized Web Scraping
AI startup Perplexity is facing scrutiny for allegedly scraping content from websites that explicitly forbid such activity. Internet infrastructure provider Cloudflare reported that Perplexity ignored website restrictions and attempted to hide its crawling activities. This behavior raises serious questions about ethical AI data usage and the responsibilities of companies leveraging web content for large language models (LLMs).

Cloudflare’s analysis revealed that Perplexity was altering its crawlers’ identity to bypass restrictions. Websites use a robots.txt file to tell automated crawlers which pages they may or may not access. Perplexity reportedly circumvented these rules by modifying its user agent, the string that identifies the browser or bot making a request, and by changing its autonomous system number (ASN), a network-level identifier. These tactics allowed the company to scrape content even from sites that explicitly blocked its access.
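The mechanics are easy to see with Python’s standard-library robots.txt parser. The rules below are an illustrative file, not any real site’s policy: a crawler that honestly declares itself is blocked, while the same request dressed up as a generic browser sails through.

```python
import urllib.robotparser

# Illustrative robots.txt: block the declared crawler, allow everyone else.
rules = """User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# The declared bot is denied access...
print(rp.can_fetch("PerplexityBot", "https://example.com/article"))  # False
# ...but a request claiming to be an ordinary browser is allowed,
# which is why swapping the user agent defeats robots.txt on its own.
print(rp.can_fetch("Mozilla/5.0", "https://example.com/article"))    # True
```

This is exactly why robots.txt is an honor system: enforcement depends on the crawler truthfully identifying itself.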
Scale and Methods of Scraping
According to Cloudflare, Perplexity’s activities spanned tens of thousands of domains and involved millions of requests daily. By combining machine learning techniques with network signal analysis, Cloudflare was able to identify and fingerprint the Perplexity crawler. The report indicated that the startup also fell back to a generic browser user agent impersonating Google Chrome on macOS when its declared bot was blocked.
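Cloudflare has not published its exact detection logic, but the kind of signal combination it describes (a browser-like user agent paired with crawler-grade network behavior) can be sketched as a simple heuristic. Every value and field name below is a hypothetical assumption, not Cloudflare’s actual method:

```python
# Hypothetical sketch of a network-signal heuristic; the ASN values,
# rate threshold, and field names are illustrative assumptions only.
def looks_like_stealth_crawler(request, known_crawler_asns):
    claims_browser = "Chrome" in request["user_agent"]        # declares itself a browser
    crawler_network = request["asn"] in known_crawler_asns    # but originates from crawler infrastructure
    machine_rate = request["requests_per_minute"] > 600       # or requests far faster than any human
    return claims_browser and (crawler_network or machine_rate)

# A request claiming to be Chrome on macOS but arriving from a known
# crawler ASN (64512 is drawn from the private-use range) gets flagged.
suspect = {
    "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Chrome/126.0",
    "asn": 64512,
    "requests_per_minute": 5000,
}
print(looks_like_stealth_crawler(suspect, known_crawler_asns={64512}))  # True
```

The point of combining signals is that a user agent alone proves nothing: the network-level identifiers and request patterns are much harder for a crawler to disguise.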
This scale of data extraction highlights the challenges of enforcing content usage rights in the age of AI. LLMs rely on vast datasets to generate accurate and coherent outputs, but the methods by which these datasets are acquired can raise legal and ethical concerns, especially when scraping ignores explicit prohibitions.
Perplexity’s Response
Perplexity has dismissed Cloudflare’s claims. A spokesperson stated that the report was a “sales pitch” and claimed that the screenshots presented in Cloudflare’s blog post did not show any accessed content. The company further denied ownership of the specific bot Cloudflare identified, creating ambiguity around the allegations.
Despite these denials, Cloudflare said the behavior was first noticed after its customers reported repeated scraping by Perplexity. The company then conducted its own tests to verify that Perplexity was circumventing blocks set via robots.txt and bot-specific exclusion rules. Cloudflare subsequently removed Perplexity’s bots from its verified list and implemented additional blocking measures.
Industry Context and Previous Allegations
This is not the first time Perplexity has faced accusations of scraping without authorization. Last year, Wired and other outlets alleged that Perplexity was plagiarizing content from news organizations. At the TechCrunch Disrupt 2024 conference, its CEO Aravind Srinivas was unable to clearly define the company’s approach to plagiarism, raising questions about its compliance with content-usage norms.
The broader AI industry faces ongoing challenges in obtaining data responsibly. LLMs and AI tools frequently rely on scraping public web content, including text, images, and videos, often without explicit consent. Websites and publishers have attempted to limit this practice using robots.txt, paywalls, or bot-blocking mechanisms, with varying levels of success. Cloudflare itself has taken steps to protect publishers, including launching a marketplace that allows site owners to charge AI scrapers for access.
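In practice, a publisher’s robots.txt can single out AI crawlers by their declared user-agent tokens while leaving ordinary traffic alone. The file below is an illustrative sketch rather than any real site’s policy; GPTBot and PerplexityBot are the crawler names OpenAI and Perplexity have publicly documented:

```
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
```

As the Cloudflare report underscores, rules like these bind only crawlers that identify themselves honestly, which is why publishers increasingly pair robots.txt with server-side bot blocking.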
Ethical and Legal Implications
The Perplexity case highlights ethical concerns regarding AI training and the unauthorized use of web content. Scraping against explicit website restrictions undermines trust between AI developers and content creators. Publishers argue that unrestricted scraping devalues their work and threatens their business models. Cloudflare CEO Matthew Prince has previously warned that AI systems could disrupt the traditional economics of the web, particularly for media and publishing industries.
From a legal standpoint, bypassing site restrictions can expose companies to liability under copyright law, terms-of-service agreements, and digital-property rules. As AI adoption grows, companies face increasing scrutiny from both regulators and the public over how training data is sourced and whether consent has been obtained.
Future Outlook
The controversy surrounding Perplexity underscores the tension between AI innovation and responsible data use. While AI companies require large datasets for product development, there is mounting pressure to respect website owners’ preferences and intellectual property. Industry observers expect stricter regulations, more robust enforcement of scraping limitations, and greater transparency in AI training practices.

For AI startups like Perplexity, the incident may lead to reputational risks, legal challenges, and increased operational oversight. The company will need to demonstrate compliance with ethical standards and establish clear policies for web content usage. Meanwhile, infrastructure providers like Cloudflare are likely to continue refining their monitoring tools and enforcement mechanisms to prevent unauthorized scraping and safeguard publishers.
The Perplexity case serves as a cautionary example for the AI industry, emphasizing that rapid growth and technological capability must be balanced with responsible data practices and respect for intellectual property. It also illustrates the evolving landscape where ethical AI development and publisher rights intersect, creating a pressing need for both innovation and accountability.