AI models from major tech companies including Apple, Salesforce, and Anthropic were trained on tens of thousands of YouTube videos without the creators’ consent, potentially in violation of the platform’s terms, according to a new report appearing in both Proof News and Wired.
The companies trained their models in part on “the Pile,” a collection assembled by the nonprofit EleutherAI. It was put together to provide a useful dataset to individuals and companies that don’t have the resources to compete with the big tech companies, but it has since been used by those larger companies as well.
The Pile includes books, Wikipedia articles, and much more. It also includes YouTube captions collected through YouTube’s captions API, extracted from 173,536 YouTube videos across more than 48,000 channels, including videos from big-name YouTubers like MrBeast, PewDiePie, and popular tech commentator Marques Brownlee. On X, Brownlee called out Apple’s use of the dataset but acknowledged that assigning responsibility is complicated when Apple did not collect the data itself. He wrote:
Apple sources data for its AI from multiple companies
One of them scraped a ton of data and transcripts from YouTube videos, including mine.
Apple technically avoids “fault” because it doesn’t do the scraping.
But this will be an evolving issue for a long time.
The dataset also includes the channels of numerous mainstream and online media brands, including videos written, produced, and published by Ars Technica and its staff, as well as by many of Condé Nast’s other brands, such as Wired and The New Yorker.
Coincidentally, one of the videos in the dataset was a short film produced by Ars Technica in which the joke was that the film had already been written by an AI. The Proof News article also notes that the dataset includes videos of a parrot, so AI models are imitating a parrot imitating human speech, as well as imitating other AI imitating humans.
As AI-generated content continues to proliferate on the internet, it will become increasingly difficult to put together datasets to train AI that don’t include content already generated by AI.
To be clear, some of this is not new. The Pile is used and referenced frequently in AI circles, and tech companies are known to have used it for training in the past. It has been cited in multiple lawsuits filed by intellectual property rights holders against AI and tech companies. Defendants in those lawsuits, including OpenAI, argue that this kind of scraping is fair use. The lawsuits have not yet been resolved in court.
However, Proof News dug into the specifics of how YouTube captions were used and built a tool that can search the Pile for individual videos or channels.
The investigation shows just how extensive the data collection is and draws attention to how little control intellectual property owners have over how their work is used once it is on the open web.
However, it is important to note that this data was not necessarily used to train models to create competitive content that would reach end users – for example, Apple may have trained on the dataset for research purposes or to improve the autocomplete feature for text input on its devices.
Creators’ reactions
Proof News reached out to some of these creators, as well as the companies that used the dataset, for comment. Most creators expressed surprise that their content had been used in this way, and those who commented criticized EleutherAI and the companies that used the dataset. For example, David Pakman of The David Pakman Show said:
No one comes to me and says, “I want to use this”… This is my livelihood and I put time, resources, money, and staff time into creating this content. There’s really no shortage of work.
Julia Walsh, CEO of Complexly, the production company responsible for Hank and John Green’s educational content, said:
We are outraged to learn that the educational content we carefully produced has been used in this way without our consent.
There are also questions about whether scraping this content violates YouTube’s terms, which prohibit videos from being accessed by “automated means.” EleutherAI founder Sid Black said he downloaded the subtitles via YouTube’s API using a script, just like a web browser would.
Anthropic, one of the companies that trained models on the dataset, says it has not committed any violation. The company said:
The Pile contains a small portion of YouTube subtitles… YouTube’s terms cover direct use of its platform, which is separate from use of The Pile dataset. Any concerns about possible violations of YouTube’s terms of service should be directed to the authors of The Pile.
A Google spokesperson told Proof News that Google has “taken steps for many years to prevent unauthorized scraping,” but did not provide a more specific response. This is not the first time that AI and technology companies have come under fire for training models on YouTube videos without permission. Notably, OpenAI (the company behind ChatGPT and video generation tool Sora) is believed to have used YouTube data to train models, although not all of these allegations have been confirmed.
In an interview with The Verge’s Nilay Patel, Google CEO Sundar Pichai suggested that using YouTube videos to train OpenAI’s Sora would violate YouTube’s terms of service. That usage, to be sure, is different from scraping captions via an API.