Big tech's trapped in a glass house on AI data snatching

Large tech players racing to build more capable AI models have reached a point where they have fewer places to look for data on the public web, and taking text from the transcripts of YouTube videos suggests OpenAI has been digging between the proverbial couch cushions, even at the risk of breaking someone’s rules.
Bloomberg Opinion
Google logo and AI Artificial Intelligence words are seen in this illustration.

Credit: Reuters File Photo

By Parmy Olson

A few weeks ago, the chief technology officer of OpenAI was asked if her company had used YouTube videos to train its AI systems. First, she gave a blank stare. Then there was a grimace. Finally, Mira Murati gave an answer that avoided the messy and furtive world she and other tech companies were operating in: “Actually, I’m not sure about that.”

According to a New York Times report, OpenAI in fact had trained its AI on “more than one million hours of YouTube videos,” using a speech recognition tool called Whisper. All the conversational text from the transcriptions was used to train GPT-4, the flagship large language model that underpins ChatGPT.

Large tech players racing to build more capable AI models have reached a point where they have fewer and fewer places to look for data on the public web, and taking text from the transcripts of YouTube videos suggests OpenAI has been digging between the proverbial couch cushions, even at the risk of breaking someone’s rules. There’s a decent chance it did. YouTube Chief Executive Officer Neal Mohan told Bloomberg News last week that if OpenAI had used YouTube videos to refine its AI, that would be a “clear violation” of YouTube’s terms of use. When asked about the possibility that OpenAI had violated those rules, a spokeswoman for the AI company said it used “publicly available information that is freely and openly available on the Internet.”

Still, it’s hard to see the tension ratcheting up between OpenAI and Google over this. Google, for one, can hardly complain about a data violation when its entire business has been built on collecting the private data of billions of consumers, often at a startling and surprising scale. Google has also scraped transcription data from some YouTube videos to train its AI models, Mohan told Bloomberg.

So ingrained is data harvesting in the business models of firms like Google and Meta Platforms Inc. that the ethics of using people’s creative work without consent or compensation seems to have become an elephant in the room that simply isn’t discussed. When a lawyer at Meta recently pointed out the ethical concerns of scraping artists’ intellectual property, they were met with silence, according to the Times. The newspaper added that Meta executives considered buying a book publisher like Simon & Schuster to get access to more high-quality data, but decided that securing licenses would take too long.

In the end, a Meta executive pointed out that, “The only thing that’s holding us back from being as good as ChatGPT is literally just data volume,” the Times reported. Since OpenAI appeared to be taking copyrighted material, Meta could simply follow this “market precedent,” he added.

Of course, Meta itself established the precedent well before OpenAI did, by harvesting vast amounts of personal data from consumers and sharing it with a Byzantine network of third parties. That’s why Mark Zuckerberg himself recently talked up the mountain of Facebook and Instagram data he’s sitting on as an advantage in the AI race. “The next key part of our playbook is learning from unique data,” he told investors in February. “On Facebook and Instagram, there are hundreds of billions of publicly shared images and tens of billions of public videos.”

A spokesman for Meta said it was “transparent about the ways we collect and use people’s information to build products and features.” Google didn’t respond to a request for comment.

Has Google tried grabbing some of Meta’s data in the same way OpenAI scraped YouTube? Has Meta tried scraping any of Google’s user data to add to its AI training mountain? We may never know, but it’s plausible that the snatch-and-grab style of data gathering happening in the AI business right now goes beyond OpenAI and YouTube. Mining data is, after all, how these firms became multi-trillion-dollar businesses.

That’s also why it’s hard to see Google or Meta making much of a public fuss about their user data becoming a target for exploitation. That would not only be the ultimate example of throwing stones in glass houses, it would also remind people of how much their personal lives — and now their creative work — are being turned into someone else’s product.

(Published 10 April 2024, 09:57 IST)