CrowdStruck

Edward Zitron 11 min read

Soundtrack: EL-P - Tasmanian Pain Coaster (feat. Omar Rodriguez-Lopez & Cedric Bixler-Zavala)

When I first began writing this newsletter, I didn't really have a goal, or a "theme," or anything that could neatly characterize what I was going to write about other than that I was on the computer and that I was typing words.

As it grew, I wrote the Rot Economy, and the Shareholder Supremacy, and many other pieces that speak to a larger problem in the tech industry — a complete misalignment in the incentives of most of the major tech companies, which have become less about building new technologies and selling them to people and more about capturing monopolies and gearing organizations to extract things through them.

Every problem you see is a result of a tech industry — from the people funding the earliest startups to the trillion-dollar juggernauts that dominate our lives — that is no longer focused on the creation of technology with a purpose, or on building organizations driven toward one. Everything is about expressing growth, about showing how you will dominate an industry rather than serve it, about providing metrics that speak to the paradoxical notion that you'll grow forever without any consideration of how you'll live forever. Legacies are now subordinate to monopolies, current customers are subordinate to new customers, and "products" are considered a means to introduce a customer to a form of parasite designed to punish the user for even considering moving to a competitor.

What's happened today with Crowdstrike is completely unprecedented (and I'll get to why shortly), and on the scale of the much-feared Y2K bug that threatened to ground the entirety of the world's computer-based infrastructure once the Year 2000 began.

You'll note that I didn't write "over-hyped" or anything dismissive of Y2K's scale, because Y2K was a huge, society-threatening calamity waiting to happen, and said calamity was averted through a remarkable, $500 billion industrial effort that took a decade to manifest, because such a significant single point of failure, left unaddressed, would have likely crippled governments, banks and airlines.

People laughed when nothing happened on January 1 2000, assuming that all that money and time had been wasted, rather than being grateful that an infrastructural weakness was taken seriously, that a single point of failure was identified, and that a crisis was averted by investing in stopping bad stuff happening before it does.

As we speak, millions — or even hundreds of millions — of different Windows-based computers are now stuck in a doom-loop, repeatedly showing users the famed "Blue Screen of Death" thanks to a single point of failure in a company called Crowdstrike, the developer of a globally-adopted cyber-security product designed, ironically, to prevent the kinds of disruption that we’ve witnessed today. And for reasons we’ll get to shortly, this nightmare is going to drag on for several days (if not weeks) to come.

The product — called Crowdstrike Falcon Sensor — is an EDR system (which stands for Endpoint Detection and Response). If you aren’t a security professional and your eyes have glazed over, I’ll keep this brief. An EDR system is designed to identify hacking attempts, remediate them, and prevent them. They’re big, sophisticated, and complicated products, and they do a lot of things that are hard to build with the standard tools available to Windows developers.

And so, to make Falcon Sensor work, Crowdstrike had to build its own kernel driver. Now, kernel drivers operate at the lowest level of the computer. They have the highest possible permissions, but they operate with the fewest guardrails. If you’ve ever built your own computer — or you remember what computers were like in the dark days of Windows 98 — you know that a single faulty kernel driver can wreak havoc on the stability of your system.

The problem here is that Crowdstrike pushed out an evidently broken update to its kernel driver that locked whatever system installed it into a permanent boot loop. The system would start loading Windows, encounter a fatal error, and reboot. And reboot. Again and again. It, in essence, rendered those machines useless.

It's convenient to blame Crowdstrike here, and perhaps that's fair. This should not have happened. On a basic level, whenever you write (or update) a kernel driver, you need to know it’s actually robust and won’t shit the bed immediately. Regrettably, Crowdstrike seemingly borrowed Boeing’s approach to quality control, except instead of building planes where the doors fly off at the most inopportune times (specifically, when you’re cruising at 35,000ft), it released a piece of software that blew up the transportation and banking sectors, to name just a few.  

It created a global IT outage that has grounded flights and broken banking services. It took down the BBC’s flagship kids TV channel, infuriating parents across the British Isles, as well as Sky News, which, when it was able to resume live broadcasts, was forced to do so without graphics. In essence, it was forced back to the 1950s — giving it an aesthetic that matches the politics of its former owner, Rupert Murdoch. By no means is this an exhaustive list of those affected, either.

The scale and disruption caused by this incident is unlike anything we’ve ever seen before. Previous incidents — even virulent ransomware outbreaks like WannaCry — simply can’t compare, whether you’re looking at the disruption caused or the sheer scale of the problem.

Still, if your day was ruined by this outage, at least spare a thought for those who’ll have to actually fix it. Because the machines affected are now locked in a perpetual boot loop, it’s not like Crowdstrike can release a software patch and call it a day. Undoing this update requires someone to go to each affected computer individually, boot it into safe mode (a limited version of Windows with most non-essential software and drivers disabled), and manually remove the faulty code. And if you’ve encrypted your computer, that process gets a lot harder. Servers running on cloud services like Amazon Web Services and Microsoft Azure — you know, the way most of the internet's infrastructure works — require an entirely separate series of actions.
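To give a sense of what that per-machine fix actually involves, here is a minimal sketch of the cleanup step, written in Python purely for illustration. The directory and the C-00000291*.sys filename pattern come from the publicly circulated workaround rather than from anything in this post, and in practice the deletion is done by hand in safe mode or with recovery tooling rather than a script, so treat this as a rough picture of the work, not a recipe.

```python
from pathlib import Path

# Assumed, per public reporting of the workaround: where Crowdstrike keeps its
# channel files on Windows, and the filename pattern of the faulty update.
CROWDSTRIKE_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
FAULTY_PATTERN = "C-00000291*.sys"


def remove_faulty_channel_files(driver_dir: Path = CROWDSTRIKE_DIR) -> list[Path]:
    """Delete files matching the reported faulty pattern and return what was removed."""
    removed = []
    for candidate in driver_dir.glob(FAULTY_PATTERN):
        candidate.unlink()  # remove the broken update file so Windows can boot normally
        removed.append(candidate)
    return removed


if __name__ == "__main__":
    for path in remove_faulty_channel_files():
        print(f"Removed {path}")
```

And even that only covers an unencrypted machine you can physically touch: a BitLocker-protected workstation needs its recovery key before safe mode gets you anywhere, and a cloud VM typically needs its disk detached and mounted elsewhere before a step like this can run at all.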

If you’re on a small IT team and you’re supporting hundreds of workstations across several far-flung locations — which isn’t unusual, especially in sectors like retail and social care — you’re especially fucked. Say goodbye to your weekend. Your evenings. Say goodbye to your spouse and kids. You won’t be seeing them for a while. Your life will be driving from site to site, applying the fix and moving on. Forget about sleeping in your own bed, or eating a meal that wasn’t bought from a fast food restaurant. Good luck, godspeed, and God bless. I do not envy you. 

The significance of this failure — which isn't a breach, by the way, and in many respects is far worse, at least in the disruption caused — lies not in the damage to individual users, but in the amount of technical infrastructure that runs on Windows, and in the fact that so much of our global infrastructure relies on automated enterprise software that, when it goes wrong, breaks everything.

It isn't about the raw number of computers affected, but about how many of them underpin things like the security checkpoints and systems that run airlines, banks, and hospitals, all running as much automated software as possible so that costs can be kept down.

The problem here is systemic: a company that the majority of people affected by this outage had no idea existed until today was trusted by Microsoft to the extent that it was able to push an update that broke the back of a huge chunk of the world's digital infrastructure.

Microsoft, in particular, really screwed up here, failing to apply the kind of rigorous security protocols that would, say, rigorously test something that connects to what seems to be a huge proportion of Windows computers. As pointed out by Wired, the company vets and cryptographically signs all kernel drivers — which is sensible and good, because kernel drivers have an incredible amount of access, and thus can be used to inflict serious harm — with this testing process usually taking several weeks.

How then did this slip through its fingers? For this to have happened, two companies needed to screw up epically. And boy, they did. 

What we're seeing today isn't just a major fuckup, but the first of what will be many systemic failures — some small, some potentially larger — that are the natural byproduct of a growth-at-all-costs ecosystem where any opportunity to save money by outsourcing major systems simply must be taken to please the shareholder.

The problem with the digitization of society — or, more specifically, the automation of once-manual tasks — is that it introduces a single point of failure. Or, rather, multiple single points of failure. Our world, our lifestyle, and our economy are dependent on automation and computerization, and these systems are, in turn, dependent on other systems to work. And if one of those systems breaks, the effects ricochet outwards, like ripples when you cast a rock into a lake.

Today’s Crowdstrike cock-up is just the latest example of this, but it isn’t the only one. Remember the SolarWinds hack in 2020, when Russian state-linked hackers gained access to an estimated 18,000 companies and public sector organizations — including NATO, the European Parliament, the US Treasury Department, and the UK’s National Health Service — by compromising just one service — SolarWinds Orion? 

Remember when Okta — a company that makes software that handles authentication for a bunch of websites, governments, and businesses — got hacked in 2023, and then lied about the scale of the breach? And then do you remember how those hackers leapfrogged from Okta to a bunch of other companies, most notably Cloudflare, which provides CDN and DDoS protection services for pretty much the entire internet?

That whole John Donne quote — “No man is an island” — is especially true when we’re talking about tech, because when you scratch beneath the surface, every system that looks like it’s independent is actually heavily, heavily dependent on services and software provided by a very small number of companies, many of whom are not particularly good.     

This is as much a cultural failing as it is a technological one, the result of management geared toward value extraction — building systems that build monopolies by attaching themselves to other monopolies. Crowdstrike went public in 2019, and its stock immediately popped on its first day of trading thanks to Wall Street's appreciation of Crowdstrike's move away from a focused approach to serving large enterprise clients and toward building products for small and medium-sized businesses sold through channel partners — in effect outsourcing both product sales and the client relationship that would tailor a solution to a particular business's needs.

Crowdstrike's culture also appears to fucking suck. A recent Glassdoor entry referred to Crowdstrike as "great tech [with] terrible culture," citing nonexistent work-life balance and "leadership that does not care about employee well being." Another from June claimed that Crowdstrike was "changing culture for the street,” with KPIs (as in metrics related to your “success” at the company) “driving behavior more than building relationships,” and a serious lack of public-sector experience in senior management. Others complain of micromanagement, with one claiming that “management is the biggest issue,” with managers “ask[ing] way too much of you…and it doesn’t matter if you do what they ask since they’re not even around to check on you,” and another saying that “management are arrogant” and need to “stop lying to the market on product capability.”

While I can’t say for sure, I’d imagine an organization with such powerful signs of growth-at-all-costs thinking — a place where you “have to get used to the pressure” that’s a “clique that you’re not in”  — likely isn’t giving its quality assurance teams the time and space to make sure that there aren’t any Kaiju-level security threats baked into an update. And that assumes it actually has a significant QA team in-house, and hasn’t just (as with many companies) outsourced the work to a “bodyshop” like Wipro or Infosys or Tata. 

And don’t think I’m letting Microsoft off the hook, either. Assuming the kernel driver testing roles are still being done in-house, do you think that these testers — who have likely seen their friends laid off at a time when the company was highly profitable, and been denied raises while their well-fed CEO took home hundreds of millions of dollars for doing a job he’s eminently bad at — are motivated to do their best work?

And this is the culture that’s poisoned almost the entirety of Silicon Valley. What we’re seeing is the societal cost of moving fast and breaking things, of Marc Andreessen considering “risk management the enemy,” of hiring and firing tens of thousands of people to please Wall Street, of seeking out every possible way to make as much money as possible to show shareholders that you’ll grow, even if doing so means growing at a pace that makes it impossible to sustain organizational and cultural stability. When you aren’t intentional about the people you hire, the people you fire, the things you build and the way that they’re deployed, you’re going to lose the people who understand the problems they’re solving, and thus lack the organizational ability to understand the ways they might be solved in the future.

This is dangerous, and also a dark warning for the future. Do you think that Facebook, or Microsoft, or Google — all of whom have laid off over 10,000 people in the last year — have done so in a conscientious way that means that the people left understand how their systems run and their inherent issues? Do you think that the management-types obsessed with the unsustainable AI boom are investing heavily in making sure their organizations are rigorously protected against, say, one bad line of code? Do they even know who wrote the code of their current systems? Is that person still there? If not, is that person at least contracted to make sure that something nuanced about the system in question isn’t mistakenly removed? 

They’re not. They’re not there anymore. Only a few months ago Google laid off 200 employees from the core of its organization, outsourcing their roles to Mexico and India in a cost-cutting measure the quarter after the company made over $23 billion in profit. Silicon Valley — and big tech writ large — is not built to protect against situations like the one we’re seeing today, because its culture is cancerous. It values growth at all costs, with no respect for the human capital that empowers organizations or the value of building rigorous, quality-focused products.

This is just the beginning. Big tech is in the throes of perdition, teetering over the edge of the abyss, finally paying the harsh cost of building systems as fast as possible. This isn’t simply moving fast and breaking things, but doing so with no regard for what that speed costs, all while firing the people that broke them, the people who know what’s broken, and possibly the people who know how to fix them.

And it’s not just tech! Boeing — a company I’ve already shat on in this post, and one I’ll likely return to in future newsletters, largely because it exemplifies the short-sightedness of today’s managerial class — has, over the past 20 years or so, spun off huge parts of the company (parts that, at one point, were vitally important) into separate companies, laid off thousands of employees at a time, and outsourced software development work to $9-an-hour bodyshop engineers. It hollowed itself out until there was nothing left.

And tell me, knowing what you know about Boeing today, would you rather get into a 737 Max or an Airbus A320neo? Enough said. 

As these organizations push their engineers harder, said engineers will turn to AI-generated code, poisoning codebases with insecure and buggy code as companies shed staff to keep up with Wall Street’s demands in ways that I’m not sure people are capable of understanding. The companies that run the critical parts of our digital lives do not invest in maintenance or infrastructure with the intentionality that’s required to prevent the kinds of massive systemic failures you see today, and I need you all to be ready for this to happen again.

This is the cost of the Rot Economy — systems used by billions of people held up by flimsy cultures and brittle infrastructure maintained with the diligence of an absentee parent. This is the cost of arrogance, of rewarding managerial malpractice, of promoting speed over safety and profit over people. 

Every single major tech organization should see today as a wakeup call — a time to reevaluate the fundamental infrastructure behind every single tech stack. 

What I fear is that they’ll simply see it as someone else’s problem - which is exactly how we got here in the first place. 
