How AI models are grabbing the world's data

Inside AI's data frenzy: The controversial practices fueling artificial intelligence | GZERO AI

In this episode of GZERO AI, Taylor Owen, host of the Machines Like Us podcast, examines the scale and implications of the historic data land grab happening in the AI sector. According to researcher Kate Crawford, AI is the largest superstructure ever built by humans, requiring immense human labor, natural resources, and staggering amounts of data. But how are tech giants like Meta and Google amassing this data?

So AI researcher Kate Crawford recently told me that she thinks AI is the largest superstructure that our species has ever built. This is because of the enormous amount of human labor that goes into building AI, the physical infrastructure that's needed for the compute of these AI systems, and the natural resources, the energy and the water, that go into this entire infrastructure. And of course, because of the insane amounts of data that are needed to build our frontier models. It's increasingly clear that we're in the middle of a historic land grab for this data, essentially for all of the data that has ever been created by humanity. So where is all this data coming from and how are these companies getting access to it? Well, first, they're clearly scraping the public internet. It's safe to say that if anything you've done has been posted to the internet in a public way, it's inside the training data of at least one of these models.

But it's also probably the case that this scraping includes a large amount of copyrighted data, or data that isn't necessarily publicly available. They're probably also getting behind paywalls, as we'll find out soon enough as the New York Times lawsuit against OpenAI works its way through the system. And they're scraping each other's data. According to the New York Times, Google found out that OpenAI was scraping YouTube, but they didn't reveal it to the public because they too were scraping all of YouTube themselves and just didn't want this getting out. Second, all these companies are purchasing or licensing data. This includes news licensing, entering into agreements with publishers, purchasing data from data brokers, purchasing companies, or getting access to companies that have rich data sets. Meta, for example, was considering buying the publisher Simon & Schuster just for access to their copyrighted books in order to train their LLM.

The companies that have access to rich data sets themselves are obviously at an advantage here. And in particular, this is Meta and Google. Meta uses all the public data that's ever been inputted into their system. And it's said that even if you aren't on their products or don't use them, your data could be in their systems, either from data they've purchased outside of their products, or, if you've just appeared, for example, in an Instagram photo, your face is now being used to train their AI. Google has said that they use anything public that's on their platform. So an unrestricted Google Doc, for example, will end up in their training dataset. And they're also acquiring data in creative ways, to say the least. Meta has trained its large language model on a dataset called Books3, which contains over 170,000 pirated and copyrighted books. So where does this all leave us, citizens and users of the internet?

Well, one thing that's clear is that we can't opt out of this data collection and data use. The opt-out tool Meta provides is hidden and complicated to use, and it requires you to provide proof that your data has been used to train Meta's AI system before they'll consider removing it from their data sets. This is not the kind of user tool that we should expect in democratic societies. So it's pretty clear that we're going to need to do three things. First, we're going to need to scale up our journalism. This is exactly why we have investigative journalism: to hold powerful governments and actors and corporations in our society to account. Journalism needs to dig deep into who's collecting what data, how these models are being trained, and how they're being built on data collected on our lives and our online experiences. Second, the lawsuits are going to need to work their way through the system, and the discovery that comes with them should be revealing. The New York Times' lawsuit, just to take one of the many against OpenAI, will surely reveal whether paywalled journalism sits within the training data of these AI systems. And finally, there is absolutely no doubt that we need regulation to provide transparency and accountability for the data collection that is driving AI.

Meta recently announced, for example, that they were going to use data they'd collected on EU citizens in training their LLM. Immediately after the Irish Data Protection Commission pushed back, they announced they were going to pause this activity. This is why we need regulations. People who live in countries or jurisdictions that have strong data protection regulations and AI transparency regimes will ultimately be better protected. I'm Taylor Owen and thanks for watching.
