Web scraping, plagiarism raps hound Perplexity AI

BY Sophia Feona Cantiller

Published 4 Jul 2024


Perplexity AI faces another challenge as allegations of web scraping and plagiarism pile up against the startup and threaten to slow its momentum.

The artificial intelligence (AI) company offers a hybrid “answer engine” powered by a large language model and chatbots. Unlike its top rival Google, it generates direct answers with detailed information for users rather than just a list of links.

One major difference that sets Perplexity AI apart from other generative AI tools, like ChatGPT, is that it does not train foundational AI models itself but uses open or commercially available AI models to convert information into responses.

However, this difference has also put it on the hot seat following a series of claims that question the ethics of its approach.

Summarization or Scraping?

On June 19, Wired published a story accusing Perplexity of illegally scraping its news site and other publications owned by its parent company, Condé Nast.

In simple terms, web scraping is the process by which automated software, often called crawlers or bots, visits websites to gather data. Well-behaved crawlers are expected to follow the Robots Exclusion Protocol by first checking the “robots.txt” file at the root of a site to determine which areas, if any, they are allowed to access.
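To make the robots.txt check described above concrete, here is a minimal sketch in Python using the standard library's `urllib.robotparser`. The sample rules, the `ExampleBot` and `OtherBot` names, the `example.com` URLs, and the `is_allowed` helper are all hypothetical and chosen for illustration; they are not taken from Wired, Perplexity, or any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: one named bot is kept out of /private/,
# while every other bot may access the whole site.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleBot", "https://example.com/private/report"),
        ("ExampleBot", "https://example.com/news/"),
        ("OtherBot", "https://example.com/private/report"),
    ]:
        print(agent, url, is_allowed(SAMPLE_ROBOTS_TXT, agent, url))
```

A real crawler would fetch the file from the site itself before requesting any page. The check is voluntary: nothing in the protocol technically prevents a bot from skipping it.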

However, according to the report, the Jeff Bezos-backed business has been able to provide its service by ignoring this protocol to “surreptitiously scrape areas of websites that operators do not want bots to access, despite claiming that it won’t.”

In response, Perplexity’s head of business, Dmitry Shevelenko, insisted in an interview with TechCrunch that summarizing a uniform resource locator (URL) is different from crawling and that Perplexity only accesses a website with a “robots.txt” file when its user enters the site’s URL as a prompt.

“We’re just responding to a direct and specific user request to go to that URL,” Shevelenko stated, explaining that their AI acts as an aid that retrieves the information when asked and not as a web crawler.

Fair Use or Plagiarism?

In addition to web scraping accusations, Wired and Forbes have also raised plagiarism concerns about Perplexity, citing several articles that contained strikingly similar content.

Among these was the Wired article that alleged the answer engine’s web scraping behavior. The answer engine produced “a six-paragraph, 287-word text closely summarizing the conclusions of the story and the evidence used to reach them” that included one complete sentence copied word for word.

Likewise, Forbes flagged Perplexity after its investigation into former Google Chief Executive Eric Schmidt’s involvement in developing AI-powered drones for military use was republished in one of Perplexity’s features. The feature contained “nearly identical wording” and a slightly modified image from the Forbes scoop, with no explicit mention of the publications beyond “small, easy-to-miss logos that link out to them.”

Perplexity CEO Aravind Srinivas responded by saying the company intends to cite sources more clearly in the future. Still, the company has reiterated its right to use the content for summarization within the bounds of fair use.

“Nobody has a monopoly on facts,” Shevelenko said. “Once facts are out in the open, they are for everyone to use.”

Moving Forward

Confronted with these allegations, Perplexity says it is “full speed ahead” on various media and revenue-sharing deals with publishers to train its models.

While the company has yet to make an announcement, it is expected to include advertisements with its query responses, which would give publishers the chance to earn a portion of the revenue when their content is included in an answer.

Further, Perplexity is looking into opening its technology to publishers to pave the way for Q&A experiences within their sites and products.