YouTubers outraged as AI titans exploit their content without consent

Published 18 Jul 2024

YouTubers cried foul after their videos were used by large artificial intelligence (AI) companies to train models without consent.

According to an investigation conducted by Proof News, subtitles from 173,536 YouTube videos across more than 48,000 channels were scraped and compiled into a large training dataset called YouTube Subtitles, which has been used by Anthropic, Nvidia, Apple, Salesforce, and other AI titans.

Released in 2020, YouTube Subtitles was created by nonprofit research lab EleutherAI as part of a compilation known as the Pile. The dataset contains text from more than 12,000 videos that have since been removed from the video-sharing platform. Proof News added that even the works of creators who deleted their accounts remain in the dataset.

The dataset contains no actual video clips, but it does include translations of subtitles into various languages, such as German, Arabic, and Japanese.

Theft, so say content creators

YouTube Subtitles comprises video transcripts from a wide range of YouTube channels, whether educational, news, or entertainment.

Online learning channels like Khan Academy, MIT, Harvard, Crash Course, and “Professor Dave Explains” had their videos scraped, along with The Wall Street Journal, NPR, and BBC. Likewise, videos from “The Late Show with Stephen Colbert,” “Last Week Tonight with John Oliver,” and “Jimmy Kimmel Live” did more than make viewers laugh; they were also used to train AI.

The most subscribed YouTuber, MrBeast, had two of his videos taken to train AI models, while fellow creator PewDiePie had 337 taken. Interestingly, even flat-Earth conspiracy videos were found in the dataset.

“It’s theft,” stated Dave Wiskus, the CEO of streaming platform Nebula, emphasizing that it was “disrespectful” to use creators’ work without their permission at a time when generative AI can be used “to replace as many of the artists along the way as they can.”

David Pakman, who hosts a popular politics channel, shared the same sentiment. “No one came to me and said, ‘We would like to use this,’” said the host after around 160 of his videos from “The David Pakman Show” were used to develop AI.

Pakman further argued that he and his team should be compensated if AI companies are paid for the services that they create using his data. “This is my livelihood, and I put time, resources, money, and staff time into creating this content. There’s really no shortage of work,” he added.

YouTube creators can check whether their videos were lifted using a tool built by Proof News, which lets them enter keywords and search the content of YouTube Subtitles without requiring computer expertise.

AI giants and the art of washing hands

YouTube Subtitles belongs to a larger compilation called the Pile, which was created by EleutherAI. The Pile also includes materials from the European Parliament, English Wikipedia, and emails from Enron Corporation employees that were made public during a federal investigation.

Most of this data can be accessed by anyone with an internet connection, enough storage, and computing power. Judging by the goal statement on EleutherAI's website, the Pile may be one of the lab's approaches to extending AI development beyond Big Tech.

However, besides academics and startup developers, many wealthy AI giants have also been found reaping the benefits of the dataset.

Apple has indicated in its research papers and publications that its models, including OpenELM, were trained using the Pile, as have Bloomberg, Databricks, Anthropic, Nvidia, and Salesforce.

When asked about the unlicensed use of web content associated with the dataset, these companies were quick to wash their hands.

“The Pile includes a very small subset of YouTube subtitles. YouTube’s terms cover direct use of its platform, which is distinct from the use of The Pile dataset. On the point about potential violations of YouTube’s terms of service, we’d have to refer you to The Pile authors,” said Anthropic spokesperson Jennifer Martinez.

Similarly, Salesforce officials insisted that the Pile was “publicly available” and that their models were for “academic and research purposes.”

Meanwhile, Nvidia has refused to comment, and Apple, Databricks, and Bloomberg haven’t responded yet.

Presently, the Pile is no longer available on its official download site, though it can still be accessed on file-sharing services.

“Technology companies have run roughshod. People are concerned about the fact that they didn’t have a choice in the matter. I think that’s what’s really problematic,” noted Amy Keller, a consumer protection lawyer at DiCello Levitt, which has recently filed lawsuits against AI firms for lifting works without permission from creators.