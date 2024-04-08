Modern AI systems need training data–great huge piles of it–to make their models work. Heaps of information. As much as the AI makers can get their hands on, and as fast as possible. These massive amounts of raw inputs ensure that an AI system can answer a question about which actor was in such-and-such a TV show, but also play a vital role in how AI makers ensure their generative AI models are actually able to churn out useful information in coherent human-like sentences when users put questions to them.

There’s a heated, sometimes litigious debate over how AI companies find this training data, and whether it’s ethical to “scrape” as much of it as possible from online information sources that may not approve of their content being used for this purpose. A high-profile legal case by The New York Times against OpenAI is a classic example, where the newspaper accuses OpenAI of stealing its intellectual property. Now the Times has leveled another accusation against the maker of ChatGPT: It’s been scraping YouTubers’ content for training purposes.

In an article published over the weekend, the Times says tech giants have been “cutting corners” to harvest training data for their AI systems. Specifically, the paper calls out OpenAI for creating a speech recognition tool called Whisper that can transcribe audio files from YouTube videos into plain text documents, effectively creating a source of data on spoken conversations to help train its next-generation GPT-4 text-based AI algorithm. The data wasn’t accidentally collected, the newspaper takes pains to note: OpenAI employees actually discussed how the data scraping may violate YouTube’s rules, which ban use of its video content for applications that are “independent” of YouTube. But the team decided to press on anyway, ultimately grabbing over a million hours’ worth of transcribed video, presumably from millions of videos uploaded by millions of YouTube users, some of whom derive part or all of their income from creating content for the video platform.

The report, which is said to be based on insiders with knowledge of the data scraping, also accuses Google, alleging that some at Google knew of OpenAI’s process but didn’t prevent it from happening because Google, too, was harvesting YouTube content to train its own AI system. That act “potentially violated the copyrights to the videos,” the Times notes.

There’s some irony there, given how aggressively YouTube is known to act when taking down user-uploaded content that the system deems to have violated someone else’s copyright. Google updated its terms of service last year, the Times said, to allow AI training on data from publicly available documents written in Google Docs, and other material like restaurant reviews that are uploaded to Google Maps. Both tech firms are having to resort to such measures, the Times says, because they’re facing a data “crunch,” using up data to train AIs faster than new material is being uploaded to the internet. So much data is needed that licensing deals, like a recent multimillion-dollar move by Google to license content uploaded to social sharing site Reddit to train Google’s Gemini AI, aren’t enough to feed these voracious systems. And when high-quality material, such as carefully crafted text written by professionals who’ve verified its content, is lacking, data scraped from what YouTubers upload becomes an acceptable second-best source of grist for the AI training mills.

The potential violation of YouTubers’ intellectual property is at the heart of the controversy. Millions of people use YouTube daily, some for fun, some for educational purposes, and there are plenty of users who form part of the influencer economy–people whose livelihoods depend on the content they create and upload to the platform under their personal brand.