All our servers and company laptops went down at pretty much the same time. Laptops have been bootlooping to blue screen of death. It’s all very exciting, personally, as someone not responsible for fixing it.

Apparently caused by a bad CrowdStrike update.

Edit: now being told we (who almost all generally work from home) need to come into the office Monday as they can only apply the fix in-person. We’ll see if that changes over the weekend…

  • @jedibob5@lemmy.world · 168 points · 9 months ago

    Reading into the updates some more… I’m starting to think this might just destroy CrowdStrike as a company altogether. Between the mountain of lawsuits almost certainly incoming and the total destruction of any public trust in the company, I don’t see how they survive this. Just absolutely catastrophic on all fronts.

    • @RegalPotoo@lemmy.world · 39 points · 9 months ago

      Agreed, this will probably kill them over the next few years unless they can really magic up something.

      They probably don’t get sued - their contracts will have indemnity clauses against exactly this kind of thing, so unless they seriously misrepresented what their product does, this probably isn’t a contract breach.

      If you are running crowdstrike, it’s probably because you have some regulatory obligations and an auditor to appease - you aren’t going to be able to just turn it off overnight, but I’m sure there are going to be some pretty awkward meetings when it comes to contract renewals in the next year, and I can’t imagine them seeing much growth.

      • @jedibob5@lemmy.world · 5 points · edited · 9 months ago

        Don’t most indemnity clauses have exceptions for gross negligence? Pushing out an update this destructive without it getting caught by any quality control checks sure seems grossly negligent.

      • @Revan343@lemmy.ca · 8 points · 9 months ago

        explain to the project manager with crayons why you shouldn’t do this

        Can’t; the project manager ate all the crayons

      • @candybrie@lemmy.world · 3 points · 9 months ago

        Why is it bad to do on a Friday? Based on your last paragraph, I would have thought Friday is probably the best weekday to do it.

        • Lightor · 18 points · edited · 9 months ago

          Most companies, mine included, try to roll out updates during the middle or start of a week. That way if there are issues the full team is available to address them.

        • @catloaf@lemm.ee · 2 points · 9 months ago

          I’m not sure what you’d expect to be able to do in a safe mode with no disk access.

      • @corsicanguppy@lemmy.ca · 1 point · 9 months ago

        rolling out an update to production that clearly had no testing

        Or someone selected “env2” instead of “env1” (#cattleNotPets names) and tested in prod by mistake.

        Look, it’s a gaffe and someone’s fired. But it doesn’t mean fuck ups are endemic.

    • @ThrowawaySobriquet@lemmy.world · 20 points · 9 months ago

      I think you’re on the nose, here. I laughed at the headline, but the more I read the more I see how fucked they are. Airlines. Industrial plants. Fucking governments. This one is big in a way that will likely get used as a case study.

      • This is fine🔥🐶☕🔥 · 6 points · 9 months ago

        Not everyone is fortunate enough to have a separate testing environment, you know? Manglement has to cut cost somewhere.

    • @Bell@lemmy.world · -2 points · 9 months ago

      Don’t we blame MS at least as much? How does MS let an update like this push through their Windows Update system? How does an application update make the whole OS unable to boot? Blue screens on Windows have been around for decades, why don’t we have a better recovery system?

      • @sandalbucket@lemmy.world · 10 points · 9 months ago

        Crowdstrike runs at ring 0, effectively as part of the kernel. Like a device driver. There are no safeguards at that level. Extreme testing and diligence is required, because these are the consequences for getting it wrong. This is entirely on crowdstrike.

    • Franklin · 65 points · 9 months ago

      The four multinational corporations I worked at were almost entirely Windows servers with the exception of vendor specific stuff running Linux. Companies REALLY want that support clause in their infrastructure agreement.

      • @Avatar_of_Self@lemmy.world · 19 points · 9 months ago

        I’ve worked as an IT architect at various companies in my career, and you can definitely get support contracts for engineering support of RHEL, Ubuntu, SUSE, etc. That isn’t the issue. The issue is that there are a lot of system administrators with “15 years experience in Linux” who have no real experience in Linux. They have experience googling for guides and tutorials, and have cobbled together documents for doing various things without understanding what they are really doing.

        I can’t tell you how many times I’ve seen an enterprise patch their Linux solutions (if they patched them at all, with some ridiculous rubber-stamped POA&M) manually, without deploying a repo and updating from that repo, treating it as you would a WSUS. Hell, I’m pleasantly surprised if I see them joined to a Windows domain (a few times) or an LDAP (once, but they didn’t have a trust with the Domain Forest or use sudoer rules… sigh).

        • Boomer Humor Doomergod · 12 points · edited · 9 months ago

          The issue is that there are a lot of system administrators with “15 years experience in Linux” that have no real experience in Linux.

          Reminds me of this guy I helped a few years ago. His name was Bob, and he was a sysadmin at a predominantly Windows company. The software I was supporting, however, only ran on Linux. So since Bob had been a UNIX admin back in the 80s they picked him to install the software.

          But it had been 30 years since he’d last touched a CLI. Every time I got on a call with him, I’d have to give him every keystroke one by one, all while listening to him complain about how much he hated it. After three or four calls I just gave up and used the screenshare to do everything myself.

          AFAIK he’s still the only Linux “sysadmin” there.

        • @Hotzilla@sopuli.xyz · 5 points · 9 months ago

          “googling answers”, I feel personally violated.

          /s

          To be fair, there is no reason to memorize things that you need once or twice. Google is a tool, and a good one for Linux issues. Why debug some issue for a few hours if you can Google the resolution in minutes?

          • @Avatar_of_Self@lemmy.world · 3 points · edited · 9 months ago

            I’m not against using Google, Stack Exchange, man pages, apropos, tldr, etc., but if you’re trying to advertise competence with a skillset and you can’t do the basics - if frankly it is still essentially a mystery to you - then you’re just being dishonest. Sure, use all the tools available to you, because that’s a good thing to do.

            Just because someone breathed air in the same space occasionally over the years where a tool exists does not mean that they can honestly say that those are years of experience with it on a resume or whatever.

            • @uis@lemm.ee · 4 points · 9 months ago

              Just because someone breathed air in the same space occasionally over the years where a tool exists does not mean that they can honestly say that those are years of experience with it on a resume or whatever.

              Capitalism makes them do it.

      • @uis@lemm.ee · 4 points · 9 months ago

        Companies REALLY want that support clause in their infrastructure agreement.

        RedHat, Ubuntu, SUSE - they all exist on support contracts.

      • @corsicanguppy@lemmy.ca · 2 points · 9 months ago

        doesn’t like a quarter of the internet kinda run on Azure?

        Said another way, 3/4 of the internet isn’t on Unsure cloud blah-blah.

        And Azure is - shhh - at least partially backed by Linux hosts. Didn’t they buy an AWS clone and forcibly inject it with money like Bobby Brown on a date, in the hopes of building AWS better than AWS like they did with Nokia? MS could be more protectively diverse than many of its best customers.

    • @neosheo@discuss.tchncs.de · 1 point · 9 months ago

      I know, I was really surprised how many there are. But honestly, think of how many companies are using Active Directory and Azure.

  • YTG123 · 140 points · 9 months ago

    >Make a kernel-level antivirus
    >Make it proprietary
    >Don’t test updates… for some reason??

    • @CircuitSpells@lemmy.world · 44 points · 9 months ago

      I mean, I know it’s easy to be critical, but this was my exact thought: how the hell didn’t they catch this in testing?

      • @grabyourmotherskeys@lemmy.world · 39 points · 9 months ago

        I have had numerous managers tell me there was no time for QA in my storied career. Or documentation. Or backups. Or redundancy. And so on.

        • The Quuuuuill · 9 points · 9 months ago

          Push that into the technical debt. Then afterwards never pay off the technical debt

      • @Voroxpete@sh.itjust.works · 33 points · 9 months ago

        Completely justified reaction. A lot of the time tech companies and IT staff get shit for stuff that, in practice, can be really hard to detect before it happens. There are all kinds of issues that can arise in production that you just can’t test for.

        But this… This has no justification. An issue this immediate, this widespread, would have instantly been caught with even the most basic of testing. The fact that it wasn’t raises massive questions about the safety and security of Crowdstrike’s internal processes.

        • Midnight Wolf · 5 points · 9 months ago

          most basic of testing

          “I ran the update and now shit’s proper fucked”

        • @madcaesar@lemmy.world · 5 points · 9 months ago

          I think when you are this big you need to roll out any updates slowly, checking along the way that all is good.

          • @Voroxpete@sh.itjust.works · 15 points · 9 months ago

            The failure here is much more fundamental than that. This isn’t a “no way we could have found this before we went to prod” issue, this is a “five minutes in the lab would have picked it up” issue. We’re not talking about some kind of “Doesn’t print on Tuesdays” kind of problem that’s hard to reproduce or depends on conditions that are hard to replicate in internal testing, which is normally how this sort of thing escapes containment. In this case the entire repro is “Step 1: Push update to any Windows machine. Step 2: THERE IS NO STEP 2”

            There’s absolutely no reason this should ever have affected even one single computer outside of Crowdstrike’s test environment, with or without a staged rollout.
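
            For what it’s worth, even the most bare-bones staged rollout with a health gate stops this at the first ring. A minimal sketch of the idea (the ring names and check functions here are invented for illustration; this is not CrowdStrike’s actual pipeline):

            ```python
            # Illustrative only: a generic staged-rollout gate.
            # A single canary ring that never reports back healthy should stop the
            # release long before it reaches every machine on the planet.
            import time

            ROLLOUT_RINGS = ["internal-lab", "canary-1pct", "early-adopters", "everyone"]

            def deploy(ring: str, build: str) -> None:
                """Stand-in for whatever pushes the update to one ring of machines."""
                print(f"pushing {build} to {ring}")

            def ring_is_healthy(ring: str) -> bool:
                """Stand-in health check: did the machines in this ring boot and check in?"""
                return True  # a bricked canary would return False here

            def staged_rollout(build: str) -> bool:
                for ring in ROLLOUT_RINGS:
                    deploy(ring, build)
                    time.sleep(1)  # in reality: hours or days of soak time per ring
                    if not ring_is_healthy(ring):
                        print(f"halting rollout: {ring} failed health checks")
                        return False
                return True

            staged_rollout("channel-file-update")
            ```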

            • @madcaesar@lemmy.world · 6 points · 9 months ago

              God damn, this is worse than I thought… This raises further questions… Was there NO testing at all??

            • @elrik@lemmy.world · 4 points · 9 months ago

              My guess is they did testing but the build they tested was not the build released to customers. That could have been because of poor deployment and testing practices, or it could have been malicious.

              Such software would be a juicy target for bad actors.
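
              If that’s what happened, it’s also the kind of mistake a release pipeline can guard against mechanically: refuse to publish any artifact whose digest doesn’t match the one that passed QA. A rough sketch (the file paths and the surrounding pipeline are hypothetical):

              ```python
              # Illustrative only: compare the artifact that passed testing against the
              # artifact actually being published before letting a release proceed.
              import hashlib
              import sys

              def sha256_of(path: str) -> str:
                  h = hashlib.sha256()
                  with open(path, "rb") as f:
                      for chunk in iter(lambda: f.read(1 << 20), b""):
                          h.update(chunk)
                  return h.hexdigest()

              tested, releasing = sys.argv[1], sys.argv[2]
              if sha256_of(tested) != sha256_of(releasing):
                  sys.exit("refusing to publish: release artifact does not match the tested build")
              print("digests match, ok to publish")
              ```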

              • @Voroxpete@sh.itjust.works · 1 point · 9 months ago

                Agreed, this is the most likely sequence of events. I doubt it was malicious, but definitely could have occurred by accident if proper procedures weren’t being followed.

          • @wizardbeard@lemmy.dbzer0.com · 4 points · 9 months ago

            How exactly is Microsoft responsible for this? It’s a kernel level driver that intercepts system calls, and the software updated itself.

            This software was crashing Linux distros last month too, but that didn’t make headlines because it affected fewer machines.

    • @areyouevenreal@lemm.ee · 2 points · 9 months ago

      Lots of security systems are kernel level (at least partially) this includes SELinux and AppArmor by the way. It’s a necessity for these things to actually be effective.

      • Trailblazing Braille Taser · 37 points · 9 months ago

        And especially now the work week has slimmed down where no one works on Friday anymore

        Excuse me, what now? I didn’t get that memo.

          • @corsicanguppy@lemmy.ca · 2 points · 9 months ago

            I changed jobs because the new management was all “if I can’t look at your ass you don’t work here” and I agreed.

            I now work remotely 100% and it’s in the union contract, with the 21 vacation days and 9x9 compressed time and regular raises. The view out my home office window is partially obscured by a floofy cat and we both like it that way.

            I’d work here until I die.

      • @sasquash@sopuli.xyz · 1 point · 9 months ago

        Actually I was not even joking. I also work in IT and have exactly the same opinion. Friday is for easy stuff!

    • @merc@sh.itjust.works · 2 points · 9 months ago

      You posted this 14 hours ago, which would have made it 4:30 am in Austin, Texas, where CrowdStrike is based. You may have felt the effect on Friday, but it’s extremely likely that the person who made the change did it late on a Thursday.

      • @Hotzilla@sopuli.xyz · 2 points · 9 months ago

        This is AV, and it’s even possible that it was part of the definitions (for example, some Windows file getting deleted as a false positive). You update those daily.

  • Encrypt-Keeper · 98 points · 9 months ago

    Yeah my plans of going to sleep last night were thoroughly dashed as every single windows server across every datacenter I manage between two countries all cried out at the same time lmao

  • kadotux · 78 points · edited · 9 months ago

    Here’s the fix (or rather workaround, released by CrowdStrike):

    1. Boot to safe mode/recovery
    2. Go to C:\Windows\System32\drivers\CrowdStrike
    3. Delete the file matching “C-00000291*.sys”
    4. Boot the system normally
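
    If you’re scripting the cleanup across many machines rather than doing it by hand, step 3 is the only interesting part, and it’s tiny. A minimal sketch (assumes the standard install path and that you can run Python against the affected volume, e.g. from a recovery image - it’s only here to show how small the actual fix is):

    ```python
    # Illustrative sketch of step 3 only; the official guidance is the manual
    # procedure above.
    from pathlib import Path

    drivers = Path(r"C:\Windows\System32\drivers\CrowdStrike")
    for bad_file in drivers.glob("C-00000291*.sys"):
        print(f"removing {bad_file}")
        bad_file.unlink()
    ```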

    • @StV2@lemmy.world · 44 points · 9 months ago

      It’s disappointing that the fix is so easy to perform and yet it’ll almost certainly keep a lot of infrastructure down for hours, because a majority of people seem too scared to try to fix anything on their own machine (or aren’t trusted to, so they can’t even if they know how).

      • @thehatfox@lemmy.world · 25 points · edited · 9 months ago

        Might seem easy to someone with a technical background. But the last thing businesses want to be doing is telling average end users to boot into safe mode and start deleting system files.

        If that started happening en masse we would quickly end up with far more problems than we started with. Plenty of users would end up deleting system32 entirely or something else equally damaging.

        • @Ookami38@sh.itjust.works · 6 points · 9 months ago

          I do IT for some stores. My team lead briefly suggested having store managers try to do this fix. I HARD vetoed that. That’s only going to do more damage.

      • @Grandwolf319@sh.itjust.works · 2 points · 9 months ago

        I wouldn’t fix it if it’s not my responsibility at work. What if I mess up and break things further?

        When things go wrong, best to just let people do the emergency process.

    • @cheeseburger@lemmy.ca · 31 points · 9 months ago

      I’m still on a bridge while we wait for BitLocker recovery keys, so we can actually boot into safe mode, but the BitLocker key server is down as well…

    • @WagnasT@lemmy.world · 9 points · 9 months ago

      Man, it sure would suck if you could still get to safe mode from pressing f8. Can you imagine how terrible that’d be?

    • @resin85@lemmy.ca · 2 points · 9 months ago

      Not that easy when it’s a fleet of servers in multiple remote data centers. Lots of IT folks will be spending their weekend sitting in data center cages.

  • @boaratio@lemmy.world · 77 points · 9 months ago

    CrowdStrike: It’s Friday, let’s throw it over the wall to production. See you all on Monday!

        • @merc@sh.itjust.works · 1 point · 9 months ago

          With all the aircraft on the ground, it was probably a noticeable change. Unfortunately, those people are still going to end up flying at some point, so the reduction in CO2 output on Friday will just be made up for over the next few days.

      • @lagomorphlecture@lemm.ee · 11 points · 9 months ago

        Definitely not small, our website is down so we can’t do any business and we’re a huge company. Multiply that by all the companies that are down, lost time on projects, time to get caught up once it’s fixed, it’ll be a huge number in the end.

        • @frezik@midwest.social · 6 points · edited · 9 months ago

          GDP is typically stated by the year. One or two days lost, even if it was 100% of the GDP for those days, would still be less than 1% of GDP for the year.
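
          The arithmetic backs that up; even a worst-case two full days of zero output is a fraction of a percent of the year:

          ```python
          # Two days as a share of annual output (upper bound, assuming 100% loss).
          days_lost = 2
          print(f"{days_lost / 365:.2%}")  # ~0.55%
          ```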

        • LustyArgonian · 3 points · 9 months ago

          I know people who work at major corporations who said they were down for a bit, it’s pretty huge.

        • @merc@sh.itjust.works · 1 point · 9 months ago

          Does your web server run windows? Or is it dependent on some systems that run Windows? I would hope nobody’s actually running a web server on Windows these days.

    • Jesus · 9 points · 9 months ago

      They did it on Thursday. All of SFO was BSODed when I got off a plane there Thursday night.

    • @merc@sh.itjust.works · 2 points · 9 months ago

      Was it actually pushed on Friday, or was it a Thursday night (US central / pacific time) push? The fact that this comment is from 9 hours ago suggests that the problem existed by the time work started on Friday, so I wouldn’t count it as a Friday push. (Still, too many pushes happen at a time that’s still technically Thursday on the US west coast, but is already mid-day Friday in Asia).

  • @richtellyard@lemmy.world · 68 points · 9 months ago

    This is going to be a Big Deal for a whole lot of people. I don’t know all the companies and industries that use Crowdstrike but I might guess it will result in airline delays, banking outages, and hospital computer systems failing. Hopefully nobody gets hurt because of it.

    • @Hotzilla@sopuli.xyz · -1 points · 9 months ago

      There is nothing more unsafe than local networks.

      AV/XDR is not optional even in offline networks. If you don’t have visibility on your network, you are totally screwed.

  • @Damage@feddit.it · 51 points · 9 months ago

    The thought of a local computer being unable to boot because some remote server somewhere is unavailable makes me laugh and sad at the same time.

    • @rxxrc@lemmy.ml (OP) · 51 points · 9 months ago

      I don’t think that’s what’s happening here. As far as I know it’s an issue with a driver installed on the computers, not with anything trying to reach out to an external server. If that were the case you’d expect it to fail to boot any time you don’t have an Internet connection.

      Windows is bad but it’s not that bad yet.

      • @corsicanguppy@lemmy.ca · 1 point · edited · 9 months ago

        expect it to fail to boot any time you don’t have an Internet connection.

        So, like the UbiSoft umbilical but for OSes.

        Edit: name of publisher not developer.

  • Sʏʟᴇɴᴄᴇ · 44 points · 9 months ago

    Yep, stuck at the airport currently. All flights grounded. All major grocery store chains and banks also impacted. Bad day to be a crowdstrike employee!

  • @ari_verse@lemm.ee · 37 points · 9 months ago

    A few years ago, when my org got the ask to deploy the CS agent on Linux production servers, and I also saw it getting deployed on thousands of Windows and Mac desktops all across, the first thought that came to mind was “massive single point of failure and security threat,” as we were putting all our trust in a single, relatively small company that will (has?) become the favorite target of all the bad actors across the planet. How long before it gets into trouble, either because of its own doing or due to others?

    I guess we now know.