Fess up. You know it was you.

  • @tquid@sh.itjust.works
    71
    1 year ago

    One time I was deleting a user from our MySQL-backed RADIUS database.

    DELETE FROM PASSWORDS;

    And yeah, if you don’t have a WHERE clause? It just deletes everything. About 60,000 records for a decent-sized ISP.

    That afternoon really, really sucked. We had only ad-hoc backups. It was not a well-run business.
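For anyone who hasn't felt this particular pain: a minimal sketch of the failure mode and the guard against it, using Python's sqlite3 as a stand-in for the MySQL backend (table name from the story, everything else invented):

```python
import sqlite3

# Stand-in for the RADIUS passwords table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE passwords (user TEXT, pw TEXT)")
conn.executemany("INSERT INTO passwords VALUES (?, ?)",
                 [("alice", "x"), ("bob", "y"), ("carol", "z")])
conn.commit()

# The dangerous form: no WHERE clause, so every row goes.
conn.execute("DELETE FROM passwords")
remaining = conn.execute("SELECT COUNT(*) FROM passwords").fetchone()[0]
print(remaining)  # 0 -- the whole table, gone

# Running destructive statements inside a transaction gives you a
# chance to inspect the damage and back out before committing.
conn.rollback()
restored = conn.execute("SELECT COUNT(*) FROM passwords").fetchone()[0]
print(restored)  # 3 -- the rollback saved us
```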

    Now when I interview sysadmins (or these days devops), I always ask about their worst cock-up. It tells you a lot about a candidate.

    • @RacerX@lemm.ee (OP)
      32
      1 year ago

      Always skeptical of people that don’t own up to mistakes. Would much rather they own it and speak to what they learned.

      • chameleon
        13
        1 year ago

        It’s difficult because you have a 50/50 chance of having a manager who doesn’t respect mistakes and will immediately get you fired for one (to the best of their abilities), versus one who considers such a mistake to be very expensive training.

        I simply can’t blame people for self-defense. I interned at a ‘non-profit’ where there had apparently been a revolving door of employees being fired for making entirely reasonable mistakes and looking back at it a dozen years later, it’s no surprise that nobody was getting anything done in that environment.

        • @ilinamorato@lemmy.world
          11
          1 year ago

          Incredibly short-sighted, especially for a nonprofit. You just spent some huge amount of time and money training a person to never make that mistake again, why would you throw that investment away?

    • cobysev
      15
      1 year ago

      I was a sysadmin in the US Air Force for 20 years. One of my assignments was working at the headquarters for AFCENT (Air Forces Central Command), which oversees every deployed base in the Middle East. Specifically, I worked on a tier 3 help desk, solving problems that the help desks at deployed bases couldn’t figure out.

      Normally, we got our issues as tickets forwarded to us from the individual base’s Communications Squadron (the IT squadron at a base). But one day, we got a call from the commander of a base’s Comm Sq. Apparently, every user account on the base had disappeared, and he needed our help restoring them!

      The first thing we did was dig through server logs to determine what caused it. No sense fixing it if an automated process was the cause and would just undo our work, right?

      We found one Technical Sergeant logged in who had run a command to delete every single user account in the directory tree. We sought him out and he claimed he was trying to remove one individual, but accidentally selected the tree instead of the individual. It just so happened to be the base’s tree, not an individual office or squadron.

      As his rank implies, he’s supposed to be the technical expert in his field. But this guy was an idiot who shouldn’t have been touching user accounts in the first place. Managing user accounts is an Airman’s job; a simple job given to our lowest-ranking members as they’re learning how to be sysadmins. And he couldn’t even do that.

      It was a very large base. It took 3 days to recover all accounts from backup. The Technical Sergeant had his admin privileges revoked and spent the rest of his deployment sitting in a corner, doing administrative paperwork.

  • 𝕱𝖎𝖗𝖊𝖜𝖎𝖙𝖈𝖍
    46
    1 year ago

    Accidentally deleted an entire column in a police department’s evidence database early in my career 😬

    Thankfully, it only contained filepaths that could be reconstructed via a script. But I was sweating 12+1 bullets. Spent two days rebuilding that.
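A reconstruction script for a case like that might look something like this hypothetical sketch, assuming the evidence files still exist on disk and carry their ID in the filename (both assumptions of mine, not details from the story):

```python
import os
import tempfile

def rebuild_paths(store_root):
    """Map each evidence ID to its on-disk path by walking the file store.

    Assumes (hypothetically) that files are named <evidence_id>.<extension>.
    """
    paths = {}
    for dirpath, _dirs, files in os.walk(store_root):
        for name in files:
            evidence_id = os.path.splitext(name)[0]
            paths[evidence_id] = os.path.join(dirpath, name)
    return paths

# Tiny demo against a throwaway store.
with tempfile.TemporaryDirectory() as store:
    os.makedirs(os.path.join(store, "2013"))
    open(os.path.join(store, "2013", "case-001.jpg"), "w").close()
    rebuilt = rebuild_paths(store)
    print(rebuilt["case-001"])  # path ending in 2013/case-001.jpg
```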

  • Quazatron
    34
    1 year ago

    Did you know that “Terminate” is not an appropriate way to stop an AWS EC2 instance? I sure as hell didn’t.

    • Billegh
      2
      1 year ago

      It doesn’t help that the web UI used to hide the Stop option. I think it still does.

  • Kata1yst
    31
    1 year ago

    It was the bad old days of sysadmin, where literally every critical service ran on an iron box in the basement.

    I was on my first oncall rotation. Got my first call from helpdesk: Exchange was down, it was 3 AM, and the oncall backup and the Exchange SMEs weren’t responding to pages.

    Now I knew Exchange well enough, but I was new to this role and this architecture. I knew the system was clustered, so I quickly pulled the documentation and logged into the cluster manager.

    I reviewed the docs several times, we had Exchange server 1 named something thoughtful like exh-001 and server 2 named exh-002 or something.

    Well, I’d reviewed the docs, and helpdesk and the stakeholders were desperate to move forward, so I initiated a failover from clustered mode (with 001 as the primary) to unclustered mode, pointing directly to server 10.x.x.xx2.

    What’s that you ask? Why did I suddenly switch to the IP address rather than the DNS name? Well that’s how the servers were registered in the cluster manager. Nothing to worry about.

    Well… Anyone want to guess which DNS name 10.x.x.xx2 was registered to?

    Yeah. Not exh-002. For some crazy legacy reason the DNS names had been remapped in the distant past.

    So anyway that’s how I made a 15 minute outage into a 5 hour one.

    On the plus side, I learned a lot and didn’t get fired.
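A cheap pre-flight check can catch exactly this mismatch: before cutting over to a raw IP, confirm it is actually one of the addresses the expected hostname resolves to. A hypothetical sketch in Python (the function and approach are mine, not from the story):

```python
import socket

def ip_matches_host(ip, expected_host):
    """Return True if `ip` is among the addresses `expected_host` resolves to.

    Checks the forward (A/AAAA) lookup rather than reverse DNS, since
    PTR records are often stale or missing.
    """
    try:
        addrs = {info[4][0] for info in socket.getaddrinfo(expected_host, None)}
    except socket.gaierror:
        return False
    return ip in addrs

# e.g. refuse the failover unless ip_matches_host("10.x.x.xx2", "exh-002")
print(ip_matches_host("127.0.0.1", "localhost"))  # True on most systems
```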

  • @treechicken@lemmy.world
    27
    1 year ago

    I once “biased for action” and removed some “unused” NS records to “fix” a flakey DNS resolution issue without telling anyone on a Friday afternoon before going out to dinner with family.

    Turns out my fix did not work and those DNS records were actually important. Checked on the website halfway into the meal and freaked the fuck out once I realized the site went from resolving 90% of the time to not resolving at all. The worst part was that when I finally got the guts to report on the group channel that I’d messed up, DNS was somehow still resolving for both our internal monitoring and for everyone else who tried manually. My issue got shoo-shoo’d away, and I was left there not even sure what to do next.

    I spent the rest of my time on my phone, refreshing the website and resolving domain names in an online Dig tool over and over again, anxiety growing, knowing I couldn’t do anything to fix my “fix” while I was outside.

    Once I came home I ended up reversing everything I did which seemed to bring it back to the original flakey state. Learned the value of SOPs and taking things slow after that (and also to not screw with DNS).

    If this story has a happy ending, it’s that we did eventually fix the flakey DNS issue later, going through a more rigorous review this time. On the other hand, how and why I, a junior at the time, became the de facto owner of an entire product’s DNS infra remains a big mystery to me.

  • @shyguyblue@lemmy.world
    24
    1 year ago

    Updated WordPress…

    Previous Web Dev had a whole mess of code inside the theme that was deprecated between WP versions.

    Fuck WordPress for static sites…

  • @necrobius@lemm.ee
    20
    1 year ago
    1. Create a database.
    2. Have the organisation manually populate it with lots of records using a web app.
    3. Accidentally delete the database.

    All in between backup windows.

  • doc
    20
    1 year ago

    UPDATE without a WHERE.

    Yes in prod.

    Yes it can still happen today (not my monkey).

    Yes I wrap everything in a rollback now.
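The rollback habit the commenter describes looks roughly like this; a Python/SQLite sketch (not their actual setup): run the statement in a transaction, check the affected-row count, and only commit when it matches what you expected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "a@example.com"), (2, "b@example.com"), (3, "c@example.com")])
conn.commit()

# Oops: meant to update one user, but there's no WHERE clause.
cur = conn.execute("UPDATE users SET email = 'new@example.com'")
if cur.rowcount == 1:
    conn.commit()
else:
    conn.rollback()  # touched 3 rows, not 1 -- back out

clobbered = conn.execute(
    "SELECT COUNT(*) FROM users WHERE email = 'new@example.com'"
).fetchone()[0]
print(clobbered)  # 0 -- the rollback prevented the damage
```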

    • @madkins@lemmy.ml
      5
      1 year ago

      I did something similar. It was a list box with a hidden first row representing the ID. Somehow the header row got selected, and an update with “where id=id” got run.

    • @Skyhighatrist@lemmy.ca
      3
      1 year ago

      I did this once. But only once. The panic I felt in that moment is something I will never forget. I was able to restore the data from a recent backup before it became a problem, though.

  • FaceDeer
    20
    1 year ago

    It wasn’t “worst” in terms of how much time it wasted, but the worst in terms of how tricky it was to figure out. I submitted a change list that worked on my machine as well as 90% of the build farm and most other dev and QA machines, but threw a baffling linker error on the remaining 10%. It turned out that the change worked fine on any machine that used to have a particular old version of Visual Studio installed on it, even though we no longer used that version and had phased it out for a newer one. The code I had written depended on a library that was no longer in current VS installs but got left behind when uninstalling the old one. So only very new computers were hitting that, mostly belonging to newer hires who were least equipped to figure out what was going on.

  • @pastermil@sh.itjust.works
    19
    1 year ago

    I accidentally destroyed the production system completely through an improper partition resize. We had a database snapshot, but it was on that server as well. After scrambling around for half a day, I managed to recover some of the older data dumps.

    So I spun up the new server from scratch, restored the database from a slightly outdated dump, installed the code (which was thankfully managed through git), and configured everything to run, all in an hour or two.

    The best part: everybody else knows this as some trivial misconfiguration. This happened in 2021.

  • slazer2au
    19
    1 year ago

    I took down an ISP for a couple hours because I forgot the ‘add’ keyword at the end of a Cisco configuration line.
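For those who haven’t been bitten: the classic place this happens (possibly not the exact line from the story) is the trunk allowed-VLAN list on a switchport, where omitting `add` replaces the whole list instead of appending to it:

```
interface GigabitEthernet0/1
 ! Intent: also allow VLAN 30 on this trunk.
 switchport trunk allowed vlan 30      ! replaces the list -- every other VLAN drops
 switchport trunk allowed vlan add 30  ! appends -- existing VLANs keep working
```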

    • @sloppy_diffuser@sh.itjust.works
      8
      1 year ago

      That’s a rite of passage for anyone working on Cisco’s shit TUI. At least it’s gotten better with some of the newer stuff. IOS-XR supports commits and diffing.

  • @Burninator05@lemmy.world
    19
    1 year ago

    I spent over 20 years in the military in IT. I took down the network at every base I was ever at, each time finding a new way to do it. Sometimes, but rarely, intentionally.

    • @mojofrododojo@lemmy.world
      6
      1 year ago

      Took out a node center by applying the patches gd recommended… took an entire weekend to restore all the shots, and my ass got fed 3/4 of the way into the woodchipper before it came out that the vendor was at fault for this debacle.

  • @hperrin@lemmy.world
    19
    1 year ago

    I fixed a bug and gave everyone administrator access once. I didn’t know that bug was… in use (is that the right way to put it?) by the authentication library. So every successful login request, instead of being returned the user who just logged in, was returned the first user in the DB, “admin”.

    Had to take down prod for that one. In my four years there, that was the only time we ever took down prod without an announcement.
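One way a bug like that can happen, sketched hypothetically (Python/SQLite, all names invented): a lookup whose filter is effectively dropped, so fetchone() always returns the first row in the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("admin", 1), ("alice", 0), ("bob", 0)])

def buggy_lookup(conn, username):
    # The username filter is missing, so this always returns the first
    # row in the table -- which happens to be "admin".
    return conn.execute("SELECT name, is_admin FROM users").fetchone()

def fixed_lookup(conn, username):
    return conn.execute(
        "SELECT name, is_admin FROM users WHERE name = ?", (username,)
    ).fetchone()

print(buggy_lookup(conn, "bob"))  # ('admin', 1) -- everyone logs in as admin
print(fixed_lookup(conn, "bob"))  # ('bob', 0)
```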

  • Rob Bos
    18
    1 year ago

    Plugged a serial cable into a UPS that was not expecting RS232. Took down the entire server room. Beyoop.

  • @Albbi@lemmy.ca
    18
    1 year ago

    Broke teller machines at a bank by accidentally renaming the server all the machines were pointed to. Took an hour to bring back up.