As I previously warned, artificial intelligence companies are running out of data. A Wall Street Journal piece from this week sounds the alarm: some believe AI models will run out of "high-quality text-based data" within the next two years, in what one AI researcher called "a frontier research problem."
Modern AI models are trained by feeding them "publicly available" text from the internet, scraped from billions of websites (everything from Wikipedia to Tumblr to Reddit), which the model then uses to discern patterns and, in turn, answer questions by predicting, word by word, the most probable response.
Theoretically, the more training data these models receive, the more accurate their responses will be, or at least that's what the major AI companies would have you believe. Yet AI researcher Pablo Villalobos told the Journal he believes that GPT-5 (OpenAI's next model) will require at least five times the training data of GPT-4. In layman's terms, these machines require tons of information to discern what the "right" answer to a prompt is, and "rightness" can only be derived from seeing lots of examples of what "right" looks like.
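To make the "probability" point concrete, here's a toy sketch of my own (nothing from OpenAI or any real lab): the simplest possible language model just counts which words follow which, and its estimates only get sharper the more text it sees.

```python
"""A toy next-word predictor, my own illustration rather than anything a real
AI lab ships: count which words follow which in a corpus, then "answer" by
picking the statistically most likely continuation."""
from collections import Counter, defaultdict

def train_bigrams(text: str) -> dict:
    """For every word, count the words that immediately follow it."""
    following = defaultdict(Counter)
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following

def predict_next(model: dict, word: str) -> tuple[str, float]:
    """Return the most probable next word and its estimated probability."""
    counts = model[word]
    best, n = counts.most_common(1)[0]
    return best, n / sum(counts.values())

corpus = "the cat sat on the mat and the cat slept on the rug"
model = train_bigrams(corpus)
print(predict_next(model, "the"))  # ('cat', 0.5): 'cat' follows 'the' in 2 of 4 cases
```

Real models do this across billions of parameters and trillions of tokens rather than a dozen words, but the underlying logic is the same: better probability estimates require more examples, which is why the data shortage matters.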
While the internet may feel limitless, Villalobos told the Journal that only a tenth of the most commonly used web dataset (the Common Crawl) is of high enough quality to train models on. Yet I can find no clear definition of what "high-quality" even means, or proof that any of these companies are being picky about what they train their models on, only that they have an insatiable hunger for more data, relying on thousands of underpaid contractors (some abroad making less than $2 an hour, a growing human rights crisis in and of itself) to teach their models how to say and do the right thing when asked.
In essence, the AI boom requires more high-quality data than currently exists to progress past the point we're at now, which is one where the outputs of generative AI are deeply unreliable. The amount of data these models need is many multiples of what exists, at a time when algorithms are happily promoting and encouraging AI-generated slop, thousands of human journalists have lost their jobs, and others are being forced to churn out generic search-engine-optimized filler. One (very) funny idea posed by the Journal's piece is that AI companies are creating their own "synthetic" data to train their models, a "computer-science version of inbreeding" that Jathan Sadowski calls Habsburg AI.
This is, of course, a terrible idea. A research paper from last year found that feeding model-generated data to models creates "model collapse" — a "degenerative learning process where models start forgetting improbable events over time as the model becomes poisoned with its own projection of reality."
AI models, when fed content from other AI models (or their own), begin to forget (for lack of a better word) the meaning and information derived from the original content, which two of the paper's authors describe as "absorbing the misunderstanding of the models that generated the data before them."
In far simpler terms, these models infer rules from the content they're fed, identifying meaning and conventions from the commonalities in how humans structure things like code or language. Because generative AI doesn't "know" anything, a model fed reams of content generated by other models begins learning rules from the output of a machine that was guessing at what it should write rather than communicating meaning, making that output somewhere between useless and actively harmful as training data.
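To see why, here's a toy simulation of my own (a sketch of the intuition, not code from the model collapse paper): a "model" that is just a table of word frequencies, retrained each generation on the previous generation's output. Watch the estimated probability of the rare word wander and, whenever it hits zero, stay gone for good.

```python
"""A toy illustration of model collapse (my own sketch, not code from the
paper): a "model" retrained each generation on its own output gradually loses
rare events."""
import random
from collections import Counter

random.seed(42)

# The "real" world: a tiny vocabulary where one word is genuinely rare.
vocab = ["the", "cat", "sat", "on", "mat", "axolotl"]
true_probs = [0.30, 0.25, 0.20, 0.15, 0.09, 0.01]  # "axolotl" is the improbable event

def train(samples):
    """Here, "training" just means estimating each word's probability from counts."""
    counts = Counter(samples)
    return {word: counts[word] / len(samples) for word in vocab}

def generate(model, n):
    """Sampling from the trained model, like a model producing synthetic text."""
    return random.choices(list(model), weights=list(model.values()), k=n)

# Generation 0 trains on human-made data; every later generation trains only on
# the previous generation's synthetic output.
data = random.choices(vocab, weights=true_probs, k=100)
for generation in range(51):
    model = train(data)
    if generation % 5 == 0:
        print(f"gen {generation:2d}: estimated P('axolotl') = {model['axolotl']:.2f}")
    data = generate(model, 100)
```

It's crude, with six words standing in for a trillion-token dataset, but the failure mode is the one the paper describes: each generation trains on a slightly distorted projection of the last, and improbable events quietly vanish.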
Sidebar: I spoke with Zakhar Shumaylov, a PhD student in the Department of Mathematics at the University of Cambridge and one of the authors of the original model collapse paper. He believes that model collapse is already happening, and that training on this generated data compounds a model's existing biases and, unfortunately, its errors. One interesting point he made was that human-made training data, by virtue of being written by humans, contains errors, and models need to be robust to them. So what do we do when models are trained on content created without those errors? Do we introduce errors ourselves? How many kinds of error are there, and how would we introduce them all?
It's tough to express how deeply dangerous this is for AI. Models like ChatGPT and Claude are deeply dependent on training data to improve their outputs, and their very existence is actively impeding the creation of the thing they need to survive. While publishers like Axel Springer have cut deals to license their companies' data to OpenAI for training purposes, this money isn't flowing to the writers who create the content that OpenAI and Anthropic need to grow their models much further. It's also worth considering that these AI companies may have already trained on this data. The New York Times sued OpenAI late last year for training its models on "millions" of articles, and I'd bet money that ChatGPT was trained on multiple Axel Springer publications along with anything else OpenAI could find publicly available on the web.
This is one of many near-impossible challenges for an AI industry that's yet to prove its necessity. While one could theoretically make bigger, more powerful chips (I'll get to that later), AI companies face a Kafkaesque bind: they can't improve a tool for automating the creation of content without human beings creating more content than they've ever created before. Paying publishers to license their content doesn't actually fix the problem, because it doesn't increase the amount of content they create; it just helps line the pockets of executives and shareholders. Ironically, OpenAI's best hope for survival would be to fund as many news outlets as possible and directly incentivize them to do in-depth reporting, rather than proliferating a technology that unquestionably harms the media industry.
These problems compound when generating visual media. The consistent hallucinations (i.e., visual errors) you see in OpenAI's DALL-E (image) and Sora (video) products can only be improved (note that I didn't say "fixed") with a great deal more data, and there's much, much less of it online. "Publicly available" video data (which OpenAI CTO Mira Murati said Sora is trained on) heavily suggests sources like YouTube, which can include everything from a cute video of a dog to every trailer of every Avengers movie, and if OpenAI's desperation has led it to train its model on any Disney property, Disney's $15m-a-year legal chief will personally send OpenAI to Hell.
Data desperation has led to a boom of synthetic data hype, including an incredible Fortune piece that suggests you could train a facial recognition model to be less racist by feeding it AI-generated pictures of "every ethnicity." Cognitive scientist Abeba Birhane recently warned in WIRED that online datasets have already fueled racial stereotypes and other biases in generative models, and that synthetic data will only exacerbate these problems by shoving the same biases back into the machine, citing a Stanford paper that found a 57% increase in synthetic content on mainstream websites between January 1, 2022 and May 1, 2023.
Yet the risk of model collapse may not be limited to the tech industry, with an IDC report saying that 40% of insurers will use synthetic data in AI to "guarantee fairness" and "enhance AI accuracy." Healthcare executives are excited about generating synthetic data to train their models, doing their best to wave off the obvious privacy concerns of generating things based on personally identifiable information, and the Department of Homeland Security is awarding contracts to those who can generate synthetic data to prepare for things like cybersecurity risks.
There is, of course, one blatantly obvious problem: generative AI models are prone to hallucinations even when trained on human data. How does synthetic data, created by the very models that need to improve, improve the situation? What happens when the majority of a dataset is synthetic, and what if that synthetic data carries within it some sort of unseen bias, or problem, or outright falsehood?
Are you seriously telling me that the solution to models giving the wrong answer to questions is to feed them more information from models with the same problem? Are you fucking kidding me? I don't care that there are allegedly ways to make smaller models trained on limited synthetic datasets. That doesn't fix the overall problem that we are running out of data to feed the models, and that the solution in many cases is to use the broken tool to make more data.
I am, of course, conflating two problems — the deliberate creation of synthetic data by AI companies and AI-generated content filling the internet with synthetic data that models are then trained on. Yet the end result is the same — forcefully teaching autocomplete typos in the hopes that it'll be able to work out how to write America's next great novel.
And as I hinted above, this isn't the only existential threat facing AI.
Hey, real quick, the latest episode of Better Offline is a real banger. I sat down with Molly White to talk about why Wikipedia is so important, how it's kept its credibility, and why you should become a Wikipedia editor. You can listen to it on Apple Podcasts, Spotify, Google Podcasts, or you can plug the RSS feed in wherever you need it.
Cloud Control
I've been hesitant to outright say that we're in an artificial intelligence bubble, or at least I was until I read that Microsoft and OpenAI are currently planning to build a $100 billion supercomputer dubbed "Stargate." The six-year-long project is contingent on "OpenAI's ability to meaningfully improve the capabilities of its AI," according to The Information's sources, and would require "as much as five gigawatts of power." For context, one gigawatt is enough to power 750,000 homes, which would put Stargate's appetite at roughly 3.75 million homes' worth of electricity. The Information's story is vague about the progress required to launch Stargate: it describes a multi-phase operation currently "in the middle" of its third phase, adds that the datacenter would potentially cost $10 billion to build, and notes that half of the $100 billion total cost could be made up of specialist AI chips.
The article is vague, likely because building Stargate is at this point somewhat theoretical. Microsoft allegedly has to find new ways to fit more GPUs (the graphics processing units used for AI compute) into a single rack (the frame that servers are mounted in within a datacenter), and an entirely new way of cooling them, to make Stargate run. It would also need to source enough energy to power an entire city.
Incidentally, Microsoft started recruiting a nuclear power specialist last September, and the company has agreed to purchase energy from Helion — a nuclear fusion startup where Sam Altman is an investor.
In short, OpenAI and Microsoft are working on a theoretically massive computer to power the future of AI, contingent on whether Sam Altman can "meaningfully improve" ChatGPT, something that OpenAI claims is possible only with more compute. OpenAI has already failed to do so once, killing off its cost-saving "Arrakis" model — one specifically developed to impress Microsoft — after it failed to run efficiently enough to matter.
While sources talking to Business Insider claim that GPT-5, OpenAI's next model, is coming "mid-summer" and will be "materially better," there is a growing consensus that GPT-3.5 and GPT-4 have gotten worse over time, and the nagging question of profitability, both for companies like OpenAI and for customers integrating models like ChatGPT, lingers.
The only companies currently profiting from the AI gold rush are those selling shovels. Nvidia's Q1 2024 earnings were astounding, with revenues increasing more than 300% and profits more than 580% year-over-year thanks to its AI-focused chips, with its next-generation "Blackwell" chips sold out through the middle of 2025. Although Nvidia is yet to announce pricing for the Blackwell generation of GPUs, company CEO Jensen Huang has suggested they may cost between $30,000 and $40,000.
Similarly, Oracle's profits have soared after its cloud infrastructure products grew to meet generative AI demand (and it secured several new contracts to build data centers for Microsoft). Oracle has now partnered with Nvidia to offer access to its "supercluster" supercomputer of more than 32,000 GPUs. In fact, Microsoft has become Oracle's largest customer, largely because Microsoft can't build out the number of data centers it needs to support the (expected) growth of AI on its own.
Nakedly evil data-analytics company Palantir saw its stock pop after it mentioned the possibility of selling AI software to armies, and it announced a month later that it had won a $178 million deal with the U.S. Army to build "AI-enabled ground stations." Drill down, though, and it's mostly been contracted to create prototypes of a vehicle that could run AI applications, rather than to build AI that actually does anything.
If you believe the hype, selling AI infrastructure helped Amazon Web Services — Amazon's cloud compute division — grow 12% year-over-year, Microsoft's Azure cloud services division increase revenue by 30% year-over-year, and Google's Cloud division grow revenue by 28% year-over-year in their most recent (Q4 2023) earnings. Yet under the hood, outside of naked self-dealing, these firms may not be making much money from AI.
Google and Amazon have invested billions in ChatGPT competitor Anthropic (and both claim to be Anthropic's primary cloud provider), and in doing so have guaranteed — according to a source of mine — $750 million a year (Google Cloud) and $800 million a year (AWS) in revenue by mandating that Anthropic use their services to power its "Claude" model. This is similar to the "$10 billion" investment that Microsoft made in OpenAI, most of which was made up of credits for Microsoft's Azure Cloud, and I imagine that Microsoft could count every penny of credit as revenue for its AI services.
In any case, it's also not obvious how much AI actually contributes to the bottom line. In Microsoft's Q4 2023 earnings report, CFO Amy Hood reported that six points of revenue growth in its Azure and cloud services division were attributed to AI services. I spoke with Jordan Novet, who covered Microsoft's earnings for CNBC, and he confirmed that this means six percentage points of Azure's 30% year-over-year growth.
As Microsoft doesn't disclose Azure's actual revenue, it's unclear how much money this really is — but six points out of thirty isn't exactly exciting. Elsewhere, Amazon CEO Andy Jassy said in February 2024 that generative AI revenue was still "relatively small," but that it would drive "tens of billions of dollars of revenue over the next several years," adding that "virtually every consumer business Amazon operated already had or would have generative AI offerings."
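A quick back-of-the-envelope calculation (my arithmetic, not Microsoft's disclosure) shows both what that figure means and how little it tells us:

```python
# Back-of-the-envelope arithmetic (mine, not Microsoft's): Azure grew 30% year
# over year, and six percentage points of that growth came from AI services.
azure_growth_points = 30   # total year-over-year Azure growth, in percentage points
ai_growth_points = 6       # points of that growth attributed to AI services

# AI drove a fifth of Azure's growth, i.e. new AI revenue equal to roughly 6%
# of the prior year's Azure revenue. Without the base figure, which Microsoft
# doesn't disclose, there's no way to turn that into dollars.
print(ai_growth_points / azure_growth_points)  # 0.2
```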
The AI boom has driven global stock markets to their best first quarter in five years, yet I fear that said boom is driven by a terrifyingly specious and unstable hype cycle. The companies benefitting from AI aren't the ones integrating it or even selling it, but those powering the means to use it — and while "demand" is allegedly up for cloud-based AI services, every major cloud provider is building out massive data center capacity to capture further demand for a technology yet to prove its necessity, all while saying that AI isn't actually contributing much revenue at all. Amazon is spending nearly $150 billion over the next 15 years on data centers to, and I quote Bloomberg, "handle an expected explosion in demand for artificial intelligence applications," even as it tells its salespeople to temper their expectations of what AI can actually do.
For all this hype, I am seeing little evidence of actual adoption. While there are seemingly hundreds of startups integrating AI on some level, it's unclear how deeply they integrate it, how essential it is to their operations, and whether it's actually making them any money. A joint study between Foundry and Searce found that fewer than 40% of organizations have successfully deployed an AI project. A cnvrg.io study from January found that enterprise AI adoption is extremely slow, with only 10% of organizations successfully launching an AI product in 2023, and a report from data consultancy Carruthers and Jackson from December 2023 said that 87% of data leaders report few or no employees using AI in the workplace. A Forrester Research study from January found that fewer than 20% of retailers are leveraging "advanced AI," largely because they're struggling to find actionable insights in the data they collect. And according to data from data.ai (formerly known as App Annie), ChatGPT's downloads have dropped from a high of just over 700,000 a week to a plateau of around 450,000 to 500,000 a week since early 2023.
As I've discussed before, today's models are rife with hallucinations that make integrating them actively dangerous for many businesses, and the only way to reduce those hallucinations is to give the models more compute and more data to process and analyze, and data is a dwindling resource that AI is actively helping to destroy.
Another hallmark of a bubble is a high-revenue, negative-profit industry. Glossy stories about OpenAI and Anthropic's multi-billion-dollar revenues never seem to discuss profits, only endless growth. Scale AI, a company that specializes in turning raw data into high-quality training data, is in the process of raising a round at a $13 billion valuation, and reportedly has $675 million in revenue with no sign of profitability. C3.ai, a publicly-traded company that sells "a comprehensive Enterprise AI application development platform," has posted a loss of at least $60 million in each of the last four quarters while making no more than $78 million in revenue in any given quarter. SoundHound AI, an AI voice and speech recognition company, hasn't fared much better, losing $18 million on $17.15 million of revenue in Q4 2023 — an improvement over three straight quarters of losing over $20 million.
I realize I've thrown a lot of numbers at you, but the fundamental problem is fairly simple: I worry that the stock market (and the tech industry) is building vast castles on foundations of sand. Cut through the hype, and the AI industry is investing hundreds of billions of dollars in infrastructure for a future that may never arrive, on the strength of the vague promises of hallucination-prone models.
I feel like a crazy person every time I read glossy pieces about AI "shaking up" industries only for the substance of the story to be "we use a coding copilot and our HR team uses it to generate emails." I feel like I'm going insane when I read about the billions of dollars being sunk into data centers, or another headline about how AI will change everything that is mostly made up of the reporter guessing what it could do. I don't know why so many people are choosing to fill in the gaps, other than the fact that considering the reality of the situation may be a little too terrifying. They seem far more willing to dream of the future than reckon with the consequences of a present where AI-generated code hallucinates references to non-existent software packages, allowing bad actors to create packages matching the made-up names to trick developers into integrating malware into their code.
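To make that last risk concrete, here's a small sketch of my own (the package name is made up for illustration, and this is my suggestion rather than anything from the reporting): before installing a dependency an AI assistant suggests, it's worth checking that the package actually exists, because a hallucinated name is exactly what an attacker can register.

```python
"""A minimal defensive check (my own sketch, not from any cited source): verify
that packages suggested by an AI coding assistant actually exist on PyPI before
installing them, since hallucinated names can be registered by attackers."""
import json
import urllib.error
import urllib.request

def exists_on_pypi(package: str) -> bool:
    """Return True if `package` is a real PyPI project, False if the lookup 404s."""
    url = f"https://pypi.org/pypi/{package}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            json.load(response)  # confirm a valid project record came back
        return True
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise

# "requests" is real; "fastjson-utils-pro" is a hypothetical, hallucination-style name.
for candidate in ["requests", "fastjson-utils-pro"]:
    print(candidate, "exists on PyPI:", exists_on_pypi(candidate))
```

Existence alone isn't proof of safety, of course: the whole point of the attack is that the malicious package does exist by the time a developer installs it, which is why suggested dependencies deserve the same scrutiny as any other code copied off the internet.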
Generative video models like OpenAI's Sora are dead on arrival, requiring massive amounts of compute and significantly more training data — which is itself both complex and scarce, and carries with it massive legal implications. Running Sora is likely deeply unprofitable, and reducing its generative hallucinations — monkeys with multiple left arms, for example — will require such a remarkable amount of resources that I'm not sure if it'll measurably improve in the next year.
In the event that adoption slows — or, more accurately, demand never arrives — Microsoft, Amazon and Google only have so long before the markets sour on their "it's the future, we swear!" messaging, demanding real revenue to match their multi-billion dollar infrastructural bets. While all three firms can likely tread water pointing at the "growth" from handling models like ChatGPT and Claude, one has to wonder why investing in one of the fastest-growing consumer apps of all time hasn't made Microsoft any money from its profit-share.
Perhaps it's because, despite the hype, there really isn't much to sell. Artificial intelligence isn't new, these models aren't new, and after just over a year of hype, generative AI is yet to prove it can do anything substantially new beyond helping tech company CEOs raise money and create headlines.
What I fear is a cascade effect where artificial intelligence fails to provide businesses with anywhere near enough value to justify its costs, vastly reducing the number of consumers and enterprises using or integrating models like ChatGPT, or running their own models on cloud providers like AWS or Azure. This scenario is likely if the next generations of ChatGPT or Claude fail to make significant leaps in their capabilities, and as I've said above, their ability to do so is predicated on more training data and compute power than currently exist.
While I hope I'm wrong, the calamity I fear is one where the massive over-investment in data centers is met with a lack of meaningful growth or profit, leading to the markets turning on the major cloud players that staked their future on unproven generative AI. If businesses don't adopt AI at scale — not experimentally, but at the core of their operations — the revenue is simply not there to sustain the hype, and once the market turns, it will turn hard, demanding efficiency and cutbacks that will lead to tens of thousands of job cuts.
Chipping Away
Jensen Huang's recent keynote at Nvidia's GTC conference terrified me with its sheer volume of nonsense: endless, effusive hype promising endless growth, endlessly faster chips and endless possibilities, babbling on about how you'll "assemble a team of AIs" led by a "super AI," with each AI somehow knowing how things like SAP work and capable of building and running its own software. This is, of course, nonsense. The CEO of a $2.2 trillion company spent two hours spouting a mixture of tech specs and fan fiction to a crowd of sycophants, and the world lapped it up, ignoring the fact that triple-digit percentages of year-over-year growth are equal parts disturbing and indicative of a bubble. While Nvidia is actually building things and selling them (GamersNexus has a full rundown here), much of the meat of Huang's AI-related sales pitch boiled down to "if you have a lot of data, paying Nvidia will somehow help you make copilots out of it."
Nvidia is likely to be the canary in the coal mine for the AI industry, and if AI fails to drive meaningful profits for both the customer and the provider, orders for Nvidia chips will dramatically slow. If Nvidia is the first to tumble, I expect far worse things to happen to the cloud giants that bet tens (or hundreds) of billions of dollars on our generative future.
This is why Sam Altman is so desperate to raise trillions to make a new chip company, or get Microsoft to build him a $100 billion supercomputer — because what he's promised isn't possible with today's technology, and may not be possible at all. He has to keep spinning the plates long enough to show enough progress to raise more money, but the attention of the public markets has led to far more hype (and eventual scrutiny) than he ever prepared for.
How do you solve all of these incredibly difficult problems? What does OpenAI or Anthropic do when they run out of data, and synthetic data doesn't fill the gap, or worse, massively degrades the quality of their output? What does Sam Altman do if GPT-5 — like GPT-4 — doesn't significantly improve its performance and he can't find enough compute to take the next step? What do OpenAI and Anthropic do when they realize they will likely never turn a profit? What does Microsoft, or Amazon, or Google do if demand never really takes off, and they're left with billions of dollars of underutilized data centers? What does Nvidia do if the demand for its chips drops off a cliff as a result?
I don't know why more people aren't screaming from the rooftops about how unsustainable the AI boom is, and about the impossibility of some of the challenges it faces. There is no way to create enough data to train these models, and little that we've seen so far suggests that generative AI will make anybody but Nvidia money. We're reaching the point where physics — things like heat and electricity — is getting in the way of progressing much further, and it's hard to stomach investing more when what we have right now is, once you cut through the noise, fairly goddamn mediocre. There is no iPhone moment coming, I'm afraid.
I hope I'm wrong. I really, really do. I do not like writing that things will be bad, or that I see the symptoms of a horrifying crash. Yet the last few years have shown that the tech industry has become disconnected from reality, from good business, and from actually delivering value to real people outside of the investor class.
I am sick and tired of watching billions of dollars burned on shitty ideas that don't work, and of those who continue to prop up the vague dreams of untouchable, ultra-rich technocrats.