Over 30 million videos scraped by Nvidia to train AI, report finds

Written by Aaron Infante

Published 7 Aug 2024

Fact checked by

Sophia Feona Cantiller

Nvidia is caught in the spotlight once more after leaked internal Slack conversations, emails, and documents revealed that it scraped around 30 million videos monthly for artificial intelligence (AI) training.

Following a recent controversy involving unlicensed access to over 170,000 YouTube video subtitles, an investigation from 404 Media exposed that the tech company instructed its employees to acquire full-length game footage and movies from YouTube, Netflix, and other sources to build the datasets used to train its AI model.

Nvidia reportedly uses the model for many services, including its Omniverse 3D world generator, self-driving car systems, a “digital human” AI avatar product, and the Cosmos deep learning model. This puts potential clients at risk of copyright infringement.

Nvidia’s how-to guide to data heist

According to 404 Media’s full report, high-quality gameplay videos were illegally scraped from sources without their consent using Nvidia’s GeForceNow cloud service.

“We’ll work closely with GeForceNow and related engineering teams to set up live game data capture, scale up the pipeline, and process them for training,” senior research scientist Jim Fan wrote in a Slack message.

Moreover, an internal project codenamed Cosmos had staff use 20 to 30 virtual machines on Amazon Web Services to download 80 years’ worth of YouTube videos daily, accumulating more than 30 million URLs after only a month.

The virtual machines also helped Nvidia hide its activity from the video-sharing platform and avoid detection.

Additionally, the company’s vice president of research, Ming-Yu Liu, asked for volunteers to download movies, which were identified to be “a good source of data to get gaming-like 3D consistency and fictional content but much higher quality.”

YouTube is hurt the most

While content from Netflix and GitHub was also used, most of the millions of videos lifted every day came from YouTube. This threatened YouTube’s relationship with partner game developers and their parent companies, whose copyrights were also infringed.

In multiple leaked Slack conversations, several YouTube channels were identified for data scraping, with the message: “If you are still open to suggestions about YouTube channels that we could download, here are a couple of channels that might be interesting to consider.”

The message also attached links and remarks about the channels, which included individual content creators like Enes Yilmazer and Marques Brownlee (MKBHD) and brands such as Blacktail Studio, Expedia, Architectural Digest, The Critical Drinker, and Pick Up Limes.

Silenced concerns

Nvidia’s scraping tactics did not go unnoticed by its personnel. Employees raised concerns about the legality and ethics of their assignments but were often silenced by project managers.

“This is an executive decision. We have an umbrella approval for all of the data,” Liu wrote in a chat to appease one worker’s worries. He also stated that there could not be any “negative sentiment” since what they were doing would lead to zero publications.

When 404 Media asked Nvidia about its leaked actions, Nvidia answered that it is operating “in full compliance with the letter and spirit of copyright law. […] Fair use protects the ability to use a work for a transformative purpose, such as model training.”

With existing copyright laws still outdated, data scraping remains in a legal gray zone, in part because it can be challenging to prove. “The best [company] policy in terms of incentives is to not tell people what you’ve trained on. So as long as you don’t tell anybody, it’s going to be really hard to prove,” Massachusetts Institute of Technology law student Robert Mahari explained.