Nvidia says scraping 80 years' worth of videos daily to train its AI models is in "the spirit of copyright law"

midian182

A hot potato: Once again, it's been revealed that a company has been scraping data from the internet to train its AI models using a questionable interpretation of copyright law. On this occasion, Nvidia has been downloading videos from YouTube, Netflix, and other platforms to gather data for its commercial AI products.

According to internal Slack chats, emails, spreadsheets, and several other sources obtained by 404 Media, Nvidia asked workers to download videos from various online platforms to compile data to train its Omniverse, autonomous vehicle, and digital human products.

Codenamed Cosmos, the project involved using between 20 and 30 virtual machines on Amazon Web Services to download the equivalent of 80 years of videos every day. Nvidia was downloading so much that it managed to accumulate over 30 million URLs in the space of one month.
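To put that figure in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes that "80 years" refers to total playback duration, that a year is 365 days, and that the work was split evenly across the 20 to 30 virtual machines. None of these details are confirmed by the report, and the function names are purely illustrative.

```python
# Rough estimate only. Assumptions (not from the report): 365-day years,
# "80 years" means playback duration, and an even split across VMs.
HOURS_PER_YEAR = 365 * 24


def footage_hours_per_day(years_per_day: int = 80) -> int:
    """Hours of video downloaded per day, given the reported years-per-day figure."""
    return years_per_day * HOURS_PER_YEAR


def hours_per_vm(total_hours: int, vm_count: int) -> float:
    """Hours of video each VM would handle per day under an even split."""
    return total_hours / vm_count


total = footage_hours_per_day()
print(f"Total: {total:,} hours of footage per day")
for vms in (20, 30):
    print(f"{vms} VMs: {hours_per_vm(total, vms):,.0f} hours of footage per VM per day")
```

Under those assumptions, the total comes to roughly 700,000 hours of footage per day, or somewhere between about 23,000 and 35,000 hours per virtual machine.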

In addition to Netflix and YouTube, Nvidia workers were told to train the AI models on movie trailer database MovieNet, internal libraries of video game footage, and the GitHub video dataset WebVid, which has since been taken down. It also used InternVid-10M, a dataset containing 10 million YouTube video IDs.

Copyright issues are always at the forefront of discussions when it comes to companies scraping data from the web. This was reportedly discussed by Nvidia employees, who used several methods to try to circumvent any potential legal blowback, including using datasets licensed for academic or non-commercial purposes only.

HD-VG-130M was one of the datasets Nvidia used. This library of 130 million YouTube videos states in its license that it's for academic use only, something Nvidia appears to have ignored. Employees also used Google's cloud service to download the YouTube-8M dataset, as directly downloading the videos isn't allowed under the terms of service.

"We cleared the download with Google/YouTube ahead of time and dangled as a carrot that we were going to do so using Google Cloud," wrote one person in a Slack channel. "After all, usually, for 8 million videos, they would get lots of ad impressions, revenue they lose out on when downloading for training, so they should get some money out of it."

Nvidia also reportedly used VMs with rotating IP addresses in some cases to avoid YouTube detecting what it was doing and banning the users.

In April, it was reported that in 2021, OpenAI researchers looking for more reputable English-language text on the internet created a speech recognition tool called Whisper. It was designed to transcribe audio from YouTube videos, giving the company a trove of data to train its LLMs. Why didn't Google object? Possibly because it also transcribed YouTube videos for its AI models, potentially violating creators' copyrights.

YouTube previously said that scraping data to train AI models was a "clear violation" of its terms. Nvidia told 404 Media that its actions were "in full compliance with the letter and the spirit of copyright law."

If you were wondering whether Nvidia used gameplay footage from its own GeForce Now service to train its AI – no, it didn't, though it sounds like such a thing could happen at some point. "We don't yet have statistics or video files yet, because the infras is not yet set up to capture lots of live game videos & actions," a senior Nvidia research scientist told other employees. "There're both engineering & regulatory hurdles to hop through."

Many AI firms engaging in data scraping practices defend their actions by claiming it's fair use under copyright law. Music-generating AI startups Udio and Suno are using this defense in the copyright lawsuits filed against them by major record companies.

Google/Youtube should be the last group of people to complain about Nvidia scraping their stuff. Google has literally sent statements to national governments saying they think AI scraping should be covered under fair use. Google has also done stuff like mass-scan copyrighted books for their own book search system; they got taken to court over that, in the case of Authors Guild v. Google, and Google won that court case.


https://www.tomshardware.com/news/google-ai-scraping-as-fair-use
 
Google/Youtube should be the last group of people to complain
Yes, not to mention scan and scrape the whole web including daily news articles so as to present summaries to its users, while selling the ads against those impressions.

If anyone's got a leg to stand on here it is the individual content creators on YouTube, who even though they have chosen to post their content publicly, might appreciate the option to clarify that means to humans only.

As to what public policy in this area should be, I haven't settled on a firm opinion yet. It is important to remember that copyright is not a property right; it is an explicit compromise enacted to ensure both sufficient incentive to create new content, and sufficient public access so that we all benefit in the long run (some lawmakers seem to be forgetting that last part, as with near-perpetual Disney copyrights; it was originally supposed to be lifetime of the creator). I think we'll want a similar compromise on ensuring that AI one day becomes the amazingly helpful tool it is being marketed as now, while not putting content creators out of business as it does so.
 
I never thought I would ever support YouTube in anything but here you go, Nvidia total pricks at all levels.

Yes, I do also agree with what the others have said. Google, as I have said many times, are the lowest of the low along with Meta, Microsoft, X, Adobe, and now Nvidia, and just a tad less bad are Apple.
 
These big corporations continue to ruin the internet. Like Google, Meta, and MS: let's break the law (and know full well we have broken the law), then deal with the consequences afterwards, because we can just throw our weight around if needs be.
 
I think if it's in plain view to the public it should be fair use, but if it's behind a paywall it's not. Seems simple to make it work the same way it works off the internet. So public videos should be fair use in AI training.
 
These big corporations continue to ruin the internet.
Please explain how your Internet experience is being ruined by NVidia scraping Youtube videos. I'll grab some popcorn while I wait.

Nor is NVidia "breaking the law", as recent court decisions have made clear, Kadrey v. Meta Inc, among them. As long as your models aren't outputting copyrighted material, they're not infringing.

I think if it's in plain view to the public it should be fair use, but if it's behind a paywall it's not. Seems simple to make it work the same way it works off the internet. So public videos should be fair use in AI training.
The distinction in law isn't between 'paid' vs. 'free' but rather 'copyrighted' vs. 'public domain'. Copyright owners have certain rights to limit how their material is being used, but those rights are not unlimited.
 
Please go troll somebody else - you are so boring!
 