Major IT outage affecting banks, airlines, media outlets across the world

@rxxrc@lemmy.ml · edit-2 9 months ago

Major IT outage affecting banks, airlines, media outlets across the world

@jedibob5@lemmy.world · 9 months ago

Reading into the updates some more… I’m starting to think this might just destroy CloudStrike as a company altogether. Between the mountain of lawsuits almost certainly incoming and the total destruction of any public trust in the company, I don’t see how they survive this. Just absolutely catastrophic on all fronts.

@RegalPotoo@lemmy.world · 9 months ago

Agreed, this will probably kill them over the next few years unless they can really magic up something.

They probably don’t get sued - their contracts will have indemnity clauses against exactly this kind of thing, so unless they seriously misrepresented what their product does, this probably isn’t a contract breach.

If you are running crowdstrike, it’s probably because you have some regulatory obligations and an auditor to appease - you aren’t going to be able to just turn it off overnight, but I’m sure there are going to be some pretty awkward meetings when it comes to contract renewals in the next year, and I can’t imagine them seeing much growth

@jedibob5@lemmy.world · edit-2 9 months ago

Don’t most indemnity clauses have exceptions for gross negligence? Pushing out an update this destructive without it getting caught by any quality control checks sure seems grossly negligent.

@rozodru@lemmy.ca · edit-2 8 months ago

deleted by creator

@Revan343@lemmy.ca · 9 months ago

explain to the project manager with crayons why you shouldn’t do this

Can’t; the project manager ate all the crayons

@candybrie@lemmy.world · 9 months ago

Why is it bad to do on a Friday? Based on your last paragraph, I would have thought Friday is probably the best week day to do it.

Lightor · edit-2 9 months ago

Most companies, mine included, try to roll out updates during the middle or start of a week. That way if there are issues the full team is available to address them.

@rozodru@lemmy.ca · edit-2 8 months ago

deleted by creator

@debil@lemmy.world · 9 months ago

And hence the term read-only Friday.

@skittle07crusher@sh.itjust.works · 9 months ago

Was it not possible for MS to design their safe mode to still “work” when Bitlocker was enabled? Seems strange.

@catloaf@lemm.ee · 9 months ago

I’m not sure what you’d expect to be able to do in a safe mode with no disk access.

@corsicanguppy@lemmy.ca · 9 months ago

rolling out an update to production that there was clearly no testing

Or someone selected “env2” instead of “env1” (#cattleNotPets names) and tested in prod by mistake.

Look, it’s a gaffe and someone’s fired. But it doesn’t mean fuck ups are endemic.

@ThrowawaySobriquet@lemmy.world · 9 months ago

I think you’re on the nose, here. I laughed at the headline, but the more I read the more I see how fucked they are. Airlines. Industrial plants. Fucking governments. This one is big in a way that will likely get used as a case study.

@Cryophilia@lemmy.world · 9 months ago

The London Stock Exchange went down. They’re fukd.

@Wooki@lemmy.world · edit-2 9 months ago

Testing in production will do that

This is fine🔥🐶☕🔥 · 9 months ago

Not everyone is fortunate enough to have a seperate testing environment, you know? Manglement has to cut cost somewhere.

@IsThisAnAI@lemmy.world · 9 months ago

What lawsuits do you think are going to happen?

@Cryophilia@lemmy.world · 9 months ago

Forget lawsuits, they’re going to be in front of congress for this one

@IsThisAnAI@lemmy.world · 9 months ago

For what? At best it would be a hearing on the challenges of national security with industry.

@Bell@lemmy.world · 9 months ago

Don’t we blame MS at least as much? How does MS let an update like this push through their Windows Update system? How does an application update make the whole OS unable to boot? Blue screens on Windows have been around for decades, why don’t we have a better recovery system?

@sandalbucket@lemmy.world · 9 months ago

Crowdstrike runs at ring 0, effectively as part of the kernel. Like a device driver. There are no safeguards at that level. Extreme testing and diligence is required, because these are the consequences for getting it wrong. This is entirely on crowdstrike.

@wizardbeard@lemmy.dbzer0.com · edit-2 9 months ago

This didn’t go through Windows Update. It went through the ctowdstrike software directly.

@bdonvr@thelemmy.club · 9 months ago

The amount of servers running Windows out there is depressing to me

Franklin · 9 months ago

The four multinational corporations I worked at were almost entirely Windows servers with the exception of vendor specific stuff running Linux. Companies REALLY want that support clause in their infrastructure agreement.

@Avatar_of_Self@lemmy.world · 9 months ago

I’ve worked as an IT architect at various companies in my career and you can definitely get support contracts for engineering support of RHEL, Ubuntu, SUSE, etc. That isn’t the issue. The issue is that there are a lot of system administrators with “15 years experience in Linux” that have no real experience in Linux. They have experience googling for guides and tutorials while having cobbled together documents of doing various things without understanding what they are really doing.

I can’t tell you how many times I’ve seen an enterprise patch their Linux solutions (if they patched them at all with some ridiculous rubberstamped PO&AM) manually without deploying a repo and updating the repo treating it as you would a WSUS. Hell, I’m pleasantly surprised if I see them joined to a Windows domain (a few times) or an LDAP (once but they didn’t have a trust with the Domain Forest or use sudoer rules…sigh).

Boomer Humor Doomergod · edit-2 9 months ago

The issue is that there are a lot of system administrators with “15 years experience in Linux” that have no real experience in Linux.

Reminds me of this guy I helped a few years ago. His name was Bob, and he was a sysadmin at a predominantly Windows company. The software I was supporting, however, only ran on Linux. So since Bob had been a UNIX admin back in the 80s they picked him to install the software.

But it had been 30 years since he ever touched a CLI. Every time I got on a call with him, I’d have to give him every keystroke one by one, all while listening to him complain about how much he hated it. After three or four calls I just gave up and used the screenshare to do everything myself.

AFAIK he’s still the only Linux “sysadmin” there.

@Hotzilla@sopuli.xyz · 9 months ago

“googling answers”, I feel personally violated.

/s

To be fare, there is not reason to memorize things that you need once or twice. Google is tool, and good for Linux issues. Why debug some issue for few hours, if you can Google resolution in minutes.

@Avatar_of_Self@lemmy.world · edit-2 9 months ago

I’m not against using Google, stack exhange, man pages, apropos, tldr, etc. but if you’re trying to advertise competence with a skillset but you can’t do the basics and frankly it is still essentially a mystery to you then youre just being dishonest. Sure use all tools available to you though because that’s a good thing to do.

Just because someone breathed air in the same space occasionally over the years where a tool exists does not mean that they can honestly say that those are years of experience with it on a resume or whatever.

@uis@lemm.ee · 9 months ago

Just because someone breathed air in the same space occasionally over the years where a tool exists does not mean that they can honestly say that those are years of experience with it on a resume or whatever.

Capitalism makes them to.

@uis@lemm.ee · 9 months ago

Companies REALLY want that support clause in their infrastructure agreement.

RedHat, Ubuntu, SUSE - they all exist on support contracts.

@Rinox@feddit.it · 9 months ago

I dunno, but doesn’t like a quarter of the internet kinda run on Azure?

atocci · 9 months ago

And 60% of Azure is running Linux lol

@Rinox@feddit.it · 9 months ago

I guess Spotify was running on the other 40%, as many other services

@corsicanguppy@lemmy.ca · 9 months ago

doesn’t like a quarter of the internet kinda run on Azure?

Said another way, 3/4 of the internet isn’t on Unsure cloud blah-blah.

And azure is - shhh - at least partially backed by Linux hosts. Didn’t they buy an AWS clone and forcibly inject it with money like Bobby Brown on a date in the hopes of building AWS better than AWS like they did with nokia? MS could be more protectively diverse than many of its best customers.

@neosheo@discuss.tchncs.de · 9 months ago

I know i was really surprised how many there are. But honestly think of how many companies are using active directory and azure

YTG123 · 9 months ago

>Make a kernel-level antivirus
>Make it proprietary
>Don’t test updates… for some reason??

@CircuitSpells@lemmy.world · 9 months ago

I mean I know it’s easy to be critical but this was my exact thought, how the hell didn’t they catch this in testing?

@grabyourmotherskeys@lemmy.world · 9 months ago

I have had numerous managers tell me there was no time for QA in my storied career. Or documentation. Or backups. Or redundancy. And so on.

The Quuuuuill · 9 months ago

Push that into the technical debt. Then afterwards never pay off the technical debt

@Voroxpete@sh.itjust.works · 9 months ago

Completely justified reaction. A lot of the time tech companies and IT staff get shit for stuff that, in practice, can be really hard to detect before it happens. There are all kinds of issues that can arise in production that you just can’t test for.

But this… This has no justification. A issue this immediate, this widespread, would have instantly been caught with even the most basic of testing. The fact that it wasn’t raises massive questions about the safety and security of Crowdstrike’s internal processes.

Midnight Wolf · 9 months ago

most basic of testing

“I ran the update and now shit’s proper fucked”

@madcaesar@lemmy.world · 9 months ago

I think when you are this big you need to roll out any updates slowly. Checking along the way they all is good.

@Voroxpete@sh.itjust.works · 9 months ago

The failure here is much more fundamental than that. This isn’t a “no way we could have found this before we went to prod” issue, this is a “five minutes in the lab would have picked it up” issue. We’re not talking about some kind of “Doesn’t print on Tuesdays” kind of problem that’s hard to reproduce or depends on conditions that are hard to replicate in internal testing, which is normally how this sort of thing escapes containment. In this case the entire repro is “Step 1: Push update to any Windows machine. Step 2: THERE IS NO STEP 2”

There’s absolutely no reason this should ever have affected even one single computer outside of Crowdstrike’s test environment, with or without a staged rollout.

@madcaesar@lemmy.world · 9 months ago

God damn this is worse than I thought… This raises further questions… Was there a NO testing at all??

@kayos@lemmy.world · 9 months ago

Tested on Windows 10S

@elrik@lemmy.world · 9 months ago

My guess is they did testing but the build they tested was not the build released to customers. That could have been because of poor deployment and testing practices, or it could have been malicious.

Such software would be a juicy target for bad actors.

@Voroxpete@sh.itjust.works · 9 months ago

Agreed, this is the most likely sequence of events. I doubt it was malicious, but definitely could have occurred by accident if proper procedures weren’t being followed.

@kaffiene@lemmy.world · 9 months ago

Yes. And Microsoft’s

@wizardbeard@lemmy.dbzer0.com · 9 months ago

How exactly is Microsoft responsible for this? It’s a kernel level driver that intercepts system calls, and the software updated itself.

This software was crashing Linux distros last month too, but that didn’t make headlines because it effected less machines.

@kaffiene@lemmy.world · 9 months ago

My apologies I thought this went out with a MS update

@jaemo@sh.itjust.works · 9 months ago

You left out > Profit

Oh… Wait…Hang on a sec.

@areyouevenreal@lemm.ee · 9 months ago

Lots of security systems are kernel level (at least partially) this includes SELinux and AppArmor by the way. It’s a necessity for these things to actually be effective.

@uis@lemm.ee · 9 months ago

You missed most important line:

>Make it proprietary

@sasquash@sopuli.xyz · 9 months ago

never do updates on a Friday.

@rozodru@lemmy.ca · edit-2 8 months ago

deleted by creator

Trailblazing Braille Taser · 9 months ago

And especially now the work week has slimmed down where no one works on Friday anymore

Excuse me, what now? I didn’t get that memo.

@rozodru@lemmy.ca · edit-2 8 months ago

deleted by creator

@corsicanguppy@lemmy.ca · 9 months ago

I changed jobs because the new management was all “if I can’t look at your ass you don’t work here” and I agreed.

I now work remotely 100% and it’s in the union contract with the 21vacation days and 9x9 compressed time and regular raises. The view out my home office window is partially obscured by a floofy cat and we both like it that way.

I’d work here until I die.

@sasquash@sopuli.xyz · 9 months ago

Actually I was not even joking. I also work in IT and have exactly the same opinion. Friday is for easy stuff!

@merc@sh.itjust.works · 9 months ago

You posted this 14 hours ago, which would have made it 4:30 am in Austin, Texas where Cloudstrike is based. You may have felt the effect on Friday, but it’s extremely likely that the person who made the change did it late on a Thursday.

@spyd3r@sh.itjust.works · 9 months ago

Never update unless something is broken.

@Passerby6497@lemmy.world · 9 months ago

That’s advice so smart you’re guaranteed to have massive security holes.

@Hotzilla@sopuli.xyz · 9 months ago

This is AV, and even possible that it is part of definitions (for example some windows file deleted as false positive). You update those daily.

Encrypt-Keeper · 9 months ago

Yeah my plans of going to sleep last night were thoroughly dashed as every single windows server across every datacenter I manage between two countries all cried out at the same time lmao

@jayandp@sh.itjust.works · 9 months ago

This is fine🔥🐶☕🔥 · 9 months ago

How many coffee cups have you drank in the last 12 hours?

@Cryophilia@lemmy.world · 9 months ago

I work in a data center

I lost count

This is fine🔥🐶☕🔥 · 9 months ago

What was Dracula doing in your data centre?

kingthrillgore · 9 months ago

Because he’s Dracula. He’s twelve million years old.

THE WORMS

@jj4211@lemmy.world · 9 months ago

I work in a datacenter, but no Windows. I slept so well.

Though a couple years back some ransomware that also impacted Linux ran through, but I got to sleep well because it only bit people with easily guessed root passwords. It bit a lot of other departments at the company though.

This time even the Windows folks were spared, because CrowdStrike wasn’t the solution they infested themselves with (they use other providers, who I fully expect to screw up the same way one day).

Encrypt-Keeper · 9 months ago

There was a point where words lost all meaning and I think my heart was one continuous beat for a good hour.

Boomer Humor Doomergod · 9 months ago

Did you feel a great disturbance in the force?

Encrypt-Keeper · 9 months ago

Oh yeah I felt a great disturbance (900 alarms) in the force (Opsgenie)

@rottingleaf@lemmy.world · 9 months ago

How’s it going, Obi-Wan?

kadotux · edit-2 9 months ago

Here’s the fix: (or rather workaround, released by CrowdStrike) 1)Boot to safe mode/recovery 2)Go to C:\Windows\System32\drivers\CrowdStrike 3)Delete the file matching “C-00000291*.sys” 4)Boot the system normally

@StV2@lemmy.world · 9 months ago

It’s disappointing that the fix is so easy to perform and yet it’ll almost certainly keep a lot of infrastructure down for hours because a majority of people seem too scared to try to fix anything on their own machine (or aren’t trusted to so they can’t even if they know how)

@thehatfox@lemmy.world · edit-2 9 months ago

Might seem easy to someone with a technical background. But the last thing businesses want to be doing is telling average end users to boot into safe mode and start deleting system files.

If that started happening en masse we would quickly end up with far more problems than we started with. Plenty of users would end up deleting system32 entirely or something else equally damaging.

@Ookami38@sh.itjust.works · 9 months ago

I do IT for some stores. My team lead briefly suggested having store managers try to do this fix. I HARD vetoed that. That’s only going to do more damage.

@Grandwolf319@sh.itjust.works · 9 months ago

I wouldn’t fix it if it’s not my responsibly at work. What if I mess up and break things further?

When things go wrong, best to just let people do the emergency process.

@nyarla@lemmy.dbzer0.com · edit-2 9 months ago

deleted by creator

@cheeseburger@lemmy.ca · 9 months ago

I’m on a bridge still while we wait for Bitlocker recovery keys, so we can actually boot into safemode, but the Bitkocker key server is down as well…

@WagnasT@lemmy.world · 9 months ago

Man, it sure would suck if you could still get to safe mode from pressing f8. Can you imagine how terrible that’d be?

@a_postmodern_hat@lemmy.world · 9 months ago

You hold down Shift while restarting or booting and you get a recovery menu. I don’t know why they changed this behaviour.

@Ookami38@sh.itjust.works · 9 months ago

That was the dumbest thing to learn this morning.

CaptainBasculin · 9 months ago

A driver failure, yeesh. It always sucks to deal with it.

@resin85@lemmy.ca · 9 months ago

Not that easy when it’s a fleet of servers in multiple remote data centers. Lots of IT folks will be spending their weekend sitting in data center cages.

@boaratio@lemmy.world · 9 months ago

CrowdStrike: It’s Friday, let’s throw it over the wall to production. See you all on Monday!

@jayandp@sh.itjust.works · edit-2 9 months ago

^^so ^^hard ^^picking ^^which ^^meme ^^to ^^use

verity_kindle · 9 months ago

Good choice, tho. Is the image AI?

@PythagreousTitties@lemm.ee · 9 months ago

It’s a real photograph from this morning.

@jayandp@sh.itjust.works · 9 months ago

Not sure, I didn’t make it. Just part of my collection.

verity_kindle · 9 months ago

Fair enough!

@frezik@midwest.social · 9 months ago

When your push to prod on Friday causes a small but measurable drop in global GDP.

LustyArgonian · 9 months ago

Actually, it may have helped slow climate change a little

@iamtrashman1312@lemmy.world · 9 months ago

The earth is healing 🙏

For part of today

@merc@sh.itjust.works · 9 months ago

With all the aircraft on the ground, it was probably a noticeable change. Unfortunately, those people are still going to end up flying at some point, so the reduction in CO2 output on Friday will just be made up for over the next few days.

@lagomorphlecture@lemm.ee · 9 months ago

Definitely not small, our website is down so we can’t do any business and we’re a huge company. Multiply that by all the companies that are down, lost time on projects, time to get caught up once it’s fixed, it’ll be a huge number in the end.

@frezik@midwest.social · edit-2 9 months ago

GDP is typically stated by the year. One or two days lost, even if it was 100% of the GDP for those days, would still be less than 1% of GDP for the year.

LustyArgonian · 9 months ago

I know people who work at major corporations who said they were down for a bit, it’s pretty huge.

@merc@sh.itjust.works · 9 months ago

Does your web server run windows? Or is it dependent on some systems that run Windows? I would hope nobody’s actually running a web server on Windows these days.

@lagomorphlecture@lemm.ee · 9 months ago

I have a absolutely no idea. Not my area of expertise.

Jesus · 9 months ago

They did it on Thursday. All of SFO was BSODed for me when I got off a plane at SFO Thursday night.

@Riccosuave@lemmy.world · 9 months ago

@merc@sh.itjust.works · 9 months ago

Was it actually pushed on Friday, or was it a Thursday night (US central / pacific time) push? The fact that this comment is from 9 hours ago suggests that the problem existed by the time work started on Friday, so I wouldn’t count it as a Friday push. (Still, too many pushes happen at a time that’s still technically Thursday on the US west coast, but is already mid-day Friday in Asia).

@trolololol@lemmy.world · 9 months ago

I’m in Australia so def Friday. Fu crowdstrike.

@merc@sh.itjust.works · 9 months ago

Seems like you should be more mad at the International Date Line.

@richtellyard@lemmy.world · 9 months ago

This is going to be a Big Deal for a whole lot of people. I don’t know all the companies and industries that use Crowdstrike but I might guess it will result in airline delays, banking outages, and hospital computer systems failing. Hopefully nobody gets hurt because of it.

@RegalPotoo@lemmy.world · 9 months ago

Big chunk of New Zealands banks apparently run it, cos 3 of the big ones can’t do credit card transactions right now

@oderus@lemm.ee · 9 months ago

It was mayhem at PakNSave a bit ago.

@index@sh.itjust.works · 9 months ago

cos 3 of the big ones can’t do credit card transactions right now

Bitcoin still up and running perhaps people can use that

@I_Miss_Daniel@lemmy.world · 9 months ago

Bitcoin Cash maybe. Didn’t they bork Bitcoin (Core) so you have to wait for confirmations in the next block?

@whotookkarl@lemmy.world · 9 months ago

Several 911 systems were affected or completely down too

@iAvicenna@lemmy.world · 9 months ago

@invisiblegorilla@sh.itjust.works · 9 months ago

Ironic. They did what they are there to protect against. Fucking up everyone’s shit

@StaySquared@lemmy.world · 9 months ago

CrowdStrike has a new meaning… literally Crowd Strike.

@uis@lemm.ee · 9 months ago

They virtually blew up airports

@AnUnusualRelic@lemmy.world · 9 months ago

An offline server is a secure server!

@Hotzilla@sopuli.xyz · 9 months ago

There is nothing unsafer than local networks.

AV/XDR is not optional even in offline networks. If you don’t have visibility on your network, you are totally screwed.

@recapitated@lemmy.world · 9 months ago

Clownstrike

@lando55@lemmy.world · 9 months ago

Crowdshite haha gotem

@WhatAmLemmy@lemmy.world · 9 months ago

CrowdCollapse

@Damage@feddit.it · 9 months ago

The thought of a local computer being unable to boot because some remote server somewhere is unavailable makes me laugh and sad at the same time.

@rxxrc@lemmy.ml · 9 months ago

I don’t think that’s what’s happening here. As far as I know it’s an issue with a driver installed on the computers, not with anything trying to reach out to an external server. If that were the case you’d expect it to fail to boot any time you don’t have an Internet connection.

Windows is bad but it’s not that bad yet.

@corsicanguppy@lemmy.ca · edit-2 9 months ago

expect it to fail to boot any time you don’t have an Internet connection.

So, like the UbiSoft umbilical but for OSes.

Edit: name of publisher not developer.

Sʏʟᴇɴᴄᴇ · 9 months ago

Yep, stuck at the airport currently. All flights grounded. All major grocery store chains and banks also impacted. Bad day to be a crowdstrike employee!

@ari_verse@lemm.ee · 9 months ago

A few years ago when my org got the ask to deploy the CS agent in linux production servers and I also saw it getting deployed in thousands of windows and mac desktops all across, the first thought that came to mind was “massive single point of failure and security threat”, as we were putting all the trust in a single relatively small company that will (has?) become the favorite target of all the bad actors across the planet. How long before it gets into trouble, either because if it’s own doing or due to others?

I guess that we now know

@SupraMario@lemmy.world · 9 months ago

No bad actors did this, and security goes in fads. Crowdstrike is king right now, just as McAfee/Trellix was in the past. If you want to run around without edr/xdr software be my guest.

@corsicanguppy@lemmy.ca · 9 months ago

Hmm. Is it safer to have a potentially exploitable agent running as root and listening on a port, than to not have EDR running on a well-secured low-churn enterprise OS - sit down, Ubuntu - adhering to best practice for least access and least-services and good role-sep?

It’s a pickle. I’m gonna go with “maybe don’t lock down your enterprise Linux hard and then open a yawning garage door of a hole right into it” but YMMV.

@SupraMario@lemmy.world · 9 months ago

Reality is, if your users are educated, then your more secure than any edr with dumbass users. But we all know this is a pipe dream.

@ansiz@lemmy.world · 9 months ago

All of the security vendors do it over enough time. McAfee used to be the king of them.

https://www.zdnet.com/article/defective-mcafee-update-causes-worldwide-meltdown-of-xp-pcs/

https://www.bleepingcomputer.com/news/security/trend-micro-antivirus-modified-windows-registry-by-mistake-how-to-fix/

https://www.techradar.com/news/microsoft-releases-fix-for-botched-windows-defender-update-but-its-still-facing-problems

Major IT outage affecting banks, airlines, media outlets across the world

Major IT outage affecting banks, airlines, media outlets across the world

Live: Major IT outage affecting banks, airlines, media outlets across the world