The Great Software Quality Collapse: How We Normalized Catastrophe

onehundredsixtynine@sh.itjust.works · 12 days ago


IrateAnteater@sh.itjust.works · 12 days ago

I think a substantial part of the problem is the employee turnover rates in the industry. It seems to be just accepted that everyone is going to jump to another company every couple of years (usually due to companies not giving adequate raises). This leads to a situation where, consciously or subconsciously, no one really gives a shit about the product. Everyone does their job (and only their job, not a hint of anything extra), but they’re not going to take on major long term projects, because they’re already one foot out the door, looking for the next job. Shitty middle management of course drastically exacerbates the issue.

I think that’s why there’s a lot of open source software that’s better than the corporate stuff. Half the time it’s just one person working on it, but they actually give a shit.

panda_abyss@lemmy.ca · 12 days ago

I’ve been working at a small company where I own a lot of the code base.

I got my boss to accept slower initial work that was more systemically designed, and now I can complete projects that would have taken weeks in a few days.

The level of consistency and quality you get by building a proper foundation and doing things right has an insane payoff. And users notice too when they’re using products that work consistently and with low resources.

Log in | Sign up@lemmy.world · edited 11 days ago

(I write only internal tools and I’m a team of one. We have a whole department of people working on public and customer focused stuff.)

My boss let me spend three months with absolutely no changes to functionality or UI, just to build a better, more configurable back end with a brand new config UI. It was partly due to necessity (a server constraint changed); otherwise I don’t think it would have ever got off the ground as a project. No changes to master for three months, which was absolutely unheard of.

At times it was a bit demoralising to do so much work for so long with nothing to show for it, but I knew the new back end would bring useful extras and faster, robust changes.

The backend config UI is still in its infancy, but my boss is sooo pleased with its effect. For the last few years (the lifetime of the project) he has been used to a turnaround of between 1 and 10 days for simple changes, but now he’s getting used to a reply saying I’ve pushed to live within 1 to 10 minutes.

Brand new features still take time, but now that we really understand what it needs to do after the first few years, it was enormously helpful to structure the whole thing to be much more organised around real world demands and make it considerably more automatic.

Feels good. Feels really good.

Reginald_T_Biter@lemmy.world · 11 days ago

That’s awesome. Your manager had some rare foresight in that case.

Log in | Sign up@lemmy.world · edited 11 days ago

He’s a great boss. He really is.

I had goodwill stored up because, like me, he uses the tool several times a day. He really likes it because it makes some tasks far easier (v0.1), I added loads of extras over the years, and it was me that dreamed it up in the first place.

The new server constraint affected me on the daily but wasn’t going to affect him at all for most of those three months, and even then, not often and there was a workaround for his usage, but he trusted me and he wants my end to be as convenient as his is (very fair minded guy indeed).

I would go a long long way for him. I went to his wedding in 2023 and we sometimes have drinks after work. He knows how it is, has been there, done that and got the T-shirt, and isn’t afraid to speak truth to power:

You know you like to have X? We’re gonna need Y…

Remember the prioritisation of Y you were going to do?..

Yeah, so no, sorry, we don’t quite have X, partly because of this and that mistake we made, but also we weren’t able to get very close to X because we never got Y.

Genuinely, cue recommitment of senior management to Y in the next quarter! It might not happen, but no shouting, no blaming, and rationality all round.

I don’t think they like it at all when he says stuff like that, but they love that the crises pretty much dwindled out when they put him in charge, as he gradually recruited more people who put more effort into making things better than into shouting and blaming, and as the shouters and blamers left to find employment elsewhere where shouting and blaming was effective. It simply does not work on my boss even a little bit, and he simply never does it. Customers now praise his department instead of complaining about it, so he gets a lot of leeway from management to do things his way.

Reginald_T_Biter@lemmy.world · 10 days ago

Brilliant. It’s so valuable to have a manager that actually treats you like a professional in these situations. Sounds like a diamond in the rough alright.

Some agency when working goes a long way to fostering a really good working relationship. I’m still a lot earlier in my career, so generally in my first non-internship role I was expecting to be given little bits of work like change this button, widen this form, that kind of stuff.

Turns out I’d joined one of those “sink or swim” smaller companies where you have to wear a lot of hats. Initially I thought quite negatively about it but once I started to gain some confidence I realised he was giving me the time and space to properly learn stuff and develop it until it was “good”. He, thankfully, still shoots down my sillier ideas but if I have a good one he throws his full support behind me.

Most recently he was like, I need you to investigate how to set up automated fraud prevention checks and flag, let’s say things, for clients to investigate further, and he sent me off for a week to analyse the problem, speak to everyone involved and gather a list of data points and how to calculate them. Then he gave me the time to design the system, including the mental room to develop our first shared lib after .NET Framework.

Really I’m rambling a bit, but my point is, you can get a lot of good work out of people if you invest in them and allow them some agency. Maybe some can’t work well without constant pressure, but I think a lot of people thrive when supported and enabled correctly by management.

Log in | Sign up@lemmy.world · 10 days ago

Yeah. Some people love to micro manage and play power games, but it’s so refreshing to work for someone who has the confidence to just concentrate on doing a great job.

chunes@lemmy.world · 12 days ago

Software has a serious “one more lane will fix traffic” problem.

Don’t give programmers better hardware or else they will write worse software. End of.

nelson@lemmy.world · 12 days ago

This is very true. You don’t need a bigger database server, you need an index on that table you query all the time that’s doing full table scans.
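A rough sketch of what that looks like in practice, using Python's built-in `sqlite3` module. The `orders` table, the `customer_id` column and the index name are made up for illustration; the point is only how the query plan changes once the index exists.

```python
import sqlite3

# Hypothetical schema for illustration only: an "orders" table queried by customer_id.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, i * 0.5) for i in range(10_000)],
)

QUERY = "SELECT total FROM orders WHERE customer_id = ?"


def query_plan(connection, sql, params):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


# Without an index SQLite has to scan the whole table for every lookup.
plan_before = query_plan(conn, QUERY, (42,))

# With an index on the filtered column it can search the index instead.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, QUERY, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan differs between SQLite versions, but the first plan should report a scan of `orders` and the second a search using `idx_orders_customer`. Other databases have their own equivalent of `EXPLAIN`.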

PattyMcB@lemmy.world · 11 days ago

Or sharding on a particular column
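For anyone who hasn't seen it, a minimal sketch of hash-based routing on a shard key, in Python. The shard count and the customer IDs are invented for the example, and real setups usually need more than this (resharding, hot keys and so on).

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a shard-key value to a shard number using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer, then reduce to the shard range.
    return int.from_bytes(digest[:8], "big") % shard_count


# Hypothetical setup: four database shards, rows routed by customer ID.
SHARD_COUNT = 4
for customer_id in ("alice", "bob", "carol"):
    print(customer_id, "-> shard", shard_for(customer_id, SHARD_COUNT))
```

A cryptographic hash is used here rather than Python's built-in `hash()` because the built-in is randomized per process for strings, so the same key could land on a different shard after a restart.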

squaresinger@lemmy.world · edited 11 days ago

The article is very much off point.

It’s not as if software quality was great in 2018 and then suddenly declined. Software quality has been as shit as legally possible since the dawn of (programming) time.
The software crisis has never ended. It has only been increasing in severity.
Ever since, we have been trying to squeeze more programming performance out of software developers at the cost of runtime performance.

The main issue is the software crisis: Hardware performance follows Moore’s law, developer performance is mostly constant.

If the memory of your computer is counted in bytes without an SI prefix and your CPU has maybe a dozen or two instructions, then it’s possible for a single human being to comprehend everything the computer is doing and to program it very close to optimally.

The same is not possible if your computer has subsystems upon subsystems and even the keyboard controller has more power and complexity than the whole Apollo program combined.

So to program exponentially more complex systems we would need exponentially more software developer budget. But since it’s really hard to scale software developers exponentially, we’ve been trying to use abstraction layers to hide complexity, to share and re-use work (no need for everyone to re-invent the templating engine) and to have clear boundaries that allow for better cooperation.

That was already the case way before Electron. Compiled languages started the trend, languages like Java or C# deepened it, and using modern middleware and frameworks just increased it.

OOP complains about the chain “React → Electron → Chromium → Docker → Kubernetes → VM → managed DB → API gateways”. But he doesn’t even consider that even if you run “straight on bare metal” there’s a whole stack of abstractions in between your code and the execution. Every major component inside a PC nowadays runs its own separate dedicated OS that neither the end user nor the developer of ordinary software ever sees.

But the main issue always reverts back to the software crisis. If we had infinite developer resources we could write optimal software. But we don’t so we can’t and thus we put in abstraction layers to improve ease of use for the developers, because otherwise we would never ship anything.

If you want to complain, complain to the managers who don’t allocate enough resources and to the investors who don’t want to dump millions into the development of simple programs. And to the customers who aren’t OK with simple things but who want modern cutting edge everything in their programs.

In the end it’s sadly really the case: Memory and performance get cheaper in an exponential fashion, while developers are still mere humans and their performance stays largely constant.

So which of these two values SHOULD we optimize for?

The real problem in regards to software quality is not abstraction layers but “business agile” (as in “business doesn’t need to make any long term plans but can cancel or change anything at any time”) and lack of QA budget.

Reginald_T_Biter@lemmy.world · 11 days ago

The software crysis has never ended

MAXIMUM ARMOR

JohnAnthony@lemmy.dbzer0.com · edited 11 days ago

I agree with the general idea of the article, but there are a few wild takes that kind of discredit it, in my opinion.

“Imagine the calculator app leaking 32GB of RAM, more than older computers had in total” - well yes, the memory leak went on to waste 100% of the machine’s RAM. You can’t leak 32GB of RAM on a 512MB machine. Correct, but hardly mind-bending.
“But VSCodium is even worse, leaking 96GB of RAM” - again, 100% of available RAM. This starts to look like a bad faith effort to throw big numbers around.
“Also this AI ‘panicked’, ‘lied’ and later ‘admitted it had a catastrophic failure’” - no it fucking didn’t, it’s a text prediction model, it cannot panic, lie or admit something, it just tells you what you statistically most want to hear. It’s not like the language model, if left alone, would have sent an email a week later to say it was really sorry for this mistake it made and felt like it had to own it.

humanspiral@lemmy.ca · 11 days ago

You can’t leak 32GB of RAM on a 512MB machine.

32 GB swap file or crash. Fair enough point that you want to restart the computer anyway, even if you have 128 GB+ of RAM. But a calculator taking 2 years off your SSD’s life is not the best.

squaresinger@lemmy.world · 10 days ago

It’s a bug and of course it needs to be fixed. But the point was that a memory leak leaks memory until it’s out of memory or the process is killed. So saying “It leaked 32GB of memory” is pointless.

It’s like claiming that a puncture on a road bike is especially bad because it leaks 8 bar of pressure instead of the 3 bar of pressure a leak on a mountain bike might leak, when in fact both punctures just leak all the pressure in the tire and in the end you have a bike you can’t use until you fixed the puncture.
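To put the same point in code, here is a toy simulation in Python (not a real leak, and not anything from the article): it "allocates" fixed-size chunks that are never freed until the next one no longer fits. The machine sizes and the 64 MiB chunk size are arbitrary numbers picked for the example.

```python
def leaked_before_failure(available_bytes: int, chunk_bytes: int) -> int:
    """Simulate a leak: keep taking chunks that are never freed until the next one no longer fits."""
    leaked = 0
    while leaked + chunk_bytes <= available_bytes:
        leaked += chunk_bytes
    return leaked


MIB = 1024 ** 2
GIB = 1024 ** 3

for label, ram in (("512 MiB machine", 512 * MIB), ("32 GiB machine", 32 * GIB)):
    leaked = leaked_before_failure(ram, chunk_bytes=64 * MIB)
    print(f"{label}: leaked {leaked / ram:.0%} of available memory")
```

Both lines report 100%. The absolute number is set by the machine, not by how bad the bug is.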

squaresinger@lemmy.world · 11 days ago

Yeah, that’s quite on point. Memory leaks until something throws an out of memory error and crashes.

What makes this really seem like a bad faith argument instead of a simple misunderstanding is this line:

Not used. Not allocated. Leaked.

OOP seems to understand (or at least claims to understand) the difference between allocating (and wasting) memory on purpose and a leak that just fills up all available memory.

So what does he want to say?

Valmond@lemmy.world · 11 days ago

Yeah, I hate that agile way of dealing with things. Business wants prototypes ASAP, but if one is actually deemed useful, you have no budget to productize it, which means that if you don’t want to take all the blame for a crappy app, you have to invest heavily in all of the prototypes. Prototypes that are called next-gen projects but get cancelled nine times out of ten 🤷🏻‍♀️. Make it make sense.

squaresinger@lemmy.world · 11 days ago

This. Prototypes should never be taken as the basis of a product, that’s why you make them. To make mistakes in a cheap, discardable format, so that you don’t make these mistakes when making the actual product. I can’t remember a single time though that this was what actually happened.

They just label the prototype an MVP and suddenly it’s the basis of a new 20 year run time project.

In my current job, they keep switching around everything all the time. Got a new product, super urgent, super high-profile, highest priority, crunch time to get it out in time, and two weeks before launch it gets cancelled without further information. Because we are agile.

azertyfun@sh.itjust.works · 10 days ago

THANK YOU.

I migrated services from LXC to Kubernetes. One of these services has been exhibiting concerning memory footprint issues. Everyone immediately went “REEEEEEEE KUBERNETES BAD EVERYTHING WAS FINE BEFORE WHAT IS ALL THIS ABSTRACTION >:(((((”.

I just spent three months doing optimization work. For memory/resource leaks in that old C codebase. Kubernetes didn’t have fuck-all to do with any of those (which is obvious to literally anyone who has any clue how containerization works under the hood). The codebase just had very old-fashioned manual memory management leaks as well as a weird interaction between jemalloc and RHEL’s default kernel settings.

The only reason I spent all that time optimizing and we aren’t just throwing more RAM at the problem? Due to incredible levels of incompetence business-side I’ll spare you the details of, our 30 day growth predictions have error bars so many orders of magnitude wide that we are stuck in a stupid loop of “won’t order hardware we probably won’t need but if we do get a best-case user influx the lead time on new hardware is too long to get you the RAM we need”. Basically the virtual price of RAM is super high because the suits keep pinky-promising that we’ll get a bunch of users soon but are also constantly wrong about that.

afk_strats@lemmy.world · 12 days ago

Accept that quality matters more than velocity. Ship slower, ship working. The cost of fixing production disasters dwarfs the cost of proper development.

This has been a struggle my entire career. Sometimes, the company listens. Sometimes they don’t. It’s a worthwhile fight but it is a systemic problem caused by management and short-term profit-seeking over healthy business growth

dual_sport_dork 🐧🗡️@lemmy.world · 12 days ago

“Apparently there’s never the money to do it right, but somehow there’s always the money to do it twice.”

Management never likes to have this brought to their attention, especially in a Told You So tone of voice. One thinks if this bothered pointy-haired types so much, maybe they could learn from their mistakes once in a while.

ozymandias117@lemmy.world · 12 days ago

We’ll just set up another retrospective meeting and have a lessons learned.

Then we won’t change anything based off the findings of the retro and lessons learned.

PattyMcB@lemmy.world · 11 days ago

Post-mortems always seemed like a waste of time to me, because nobody ever went back and read that particular Confluence page (especially the executives who made the same mistake again).

shalafi@lemmy.world · 11 days ago

Post mortems are for, “Remember when we saw something similar before? What happened and how did we handle it?”

tehn00bi@lemmy.world · 12 days ago

Twice? Shiiiii

PattyMcB@lemmy.world · 11 days ago

Amateur numbers, lol

ryathal@sh.itjust.works · 11 days ago

There’s levels to it. True quality isn’t worth it, absolute garbage costs a lot though. Some level that mostly works is the sweet spot.

neclimdul@lemmy.world · 12 days ago

“AI just weaponized existing incompetence.”

Daamn. Harsh but hard to argue with.

PattyMcB@lemmy.world · 11 days ago

Weaponized? Probably not. Amplified? ABSOLUTELY!

_stranger_@lemmy.world · 11 days ago

It’s like taping a knife to a crab. Redundant and clumsy, yet strangely intimidating

PattyMcB@lemmy.world · 11 days ago

Love that video. Although it wasn’t taped on. The crab was full on about to stab a mofo

Reginald_T_Biter@lemmy.world · 11 days ago

Yeah, crabby boi fully had stabbin’ on his mind.

vane@lemmy.world · edited 11 days ago

Quality in this economy? We need to fire some people to cut costs and use telemetry to make sure everyone that’s left uses AI to pay AI companies, because our investors demand it, because they invested all their money in AI and they see no return.

panda_abyss@lemmy.ca · 12 days ago

Fabricated 4,000 fake user profiles to cover up the deletion

This has got to be a reinforcement learning issue, I had this happen the other day.

I asked Claude to fix some tests, so it fixed the tests by commenting out the failures. I guess that’s a way of fixing them that nobody would ever ask for.

Absolutely moronic. These tools do this regularly. It’s how they pass benchmarks.

Also you can’t ask them why they did something, they have no capacity of introspection, they can’t read their input tokens, they just make up something that sounds plausible for “what were you thinking”.

FishFace@lemmy.world · 12 days ago

The model we have at work tries to work around this by including some checks. I assume they get farmed out to specialised models and receive the output of the first stage as input.

Maybe it catches some stuff? It’s better than pretend reasoning but it’s very verbose so the stuff that I’ve experimented with - which should be simple and quick - ends up being more time consuming than it should be.

panda_abyss@lemmy.ca · 12 days ago

I’ve been thinking of having a small model like a long-context Qwen 4B run and do quick code review to check for these issues, then just correct the main model.

It feels like a secondary model that only exists to validate that a task was actually completed could work.

FishFace@lemmy.world · 12 days ago

Yeah, it can work, because it’ll trigger the recall of different types of input data. But it’s not magic and if you have a 25% chance of the model you’re using hallucinating, you probably end up still with an 8.5% chance of getting bullshit after doing this.
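A back-of-the-envelope version of that arithmetic in Python. It assumes the checker's misses are independent of the generator's mistakes, which is an assumption made here for the sketch and probably optimistic for two similar models; the 25% and 8.5% figures are the ones from the comment above.

```python
def residual_error(p_generator_wrong: float, p_checker_misses: float) -> float:
    """Chance that the output is wrong AND the checker fails to catch it (independence assumed)."""
    return p_generator_wrong * p_checker_misses


p_wrong = 0.25  # figure from the comment above

# If the checker were exactly as unreliable as the generator and fully independent:
print(f"{residual_error(p_wrong, 0.25):.2%}")  # 6.25%

# Miss rate the checker would need for the 8.5% figure quoted above:
implied_miss_rate = 0.085 / p_wrong
print(f"{implied_miss_rate:.0%}")  # 34%
```

So under this simple model, the 8.5% figure corresponds to a checker that misses about a third of the bad outputs, a bit worse than a fully independent second pass would.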

MelodiousFunk@slrpnk.net · 12 days ago

Also you can’t ask them why they did something, they have no capacity of introspection, (…) they just make up something that sounds plausible for “what were you thinking”.

It’s uncanny how it keeps becoming more human-like.

geoff@midwest.social · 12 days ago

Anyone else remember a few years ago when companies got rid of all their QA people because something something functional testing? Yeah.

The uncontrolled growth in abstractions is also very real and very damaging, and now that companies are addicted to the pace of feature delivery this whole slipshod situation has made normal they can’t give it up.

PattyMcB@lemmy.world · 11 days ago

I must have missed that one

shalafi@lemmy.world · 11 days ago

That was M$, not an industry thing.

geoff@midwest.social · 11 days ago

It was not just MS. There were those who followed that lead and announced that it was an industry thing.

sugar_in_your_tea@sh.itjust.works · 11 days ago

Article describing it.

cygnus@lemmy.ca · 12 days ago

I wonder if this ties into our general disposability culture (throwing things away instead of repairing, etc)

anamethatisnt@sopuli.xyz · 12 days ago

That and also man-hour costs versus hardware costs. It’s often cheaper to buy some extra RAM than it is to pay someone to make the code more efficient.

IninewCrow@lemmy.ca · 12 days ago

Planned Obsolescence … designing things for a short lifespan so that things always break and people are always forced to buy the next thing.

It all originated with light bulbs 100 years ago … inventors did design incandescent light bulbs that could last for years but then the company owners realized it wasn’t economically feasible to produce a light bulb that could last ten years because too few people would buy light bulbs. So they conspired to engineer a light bulb with a limited life that would last long enough to please people but short enough to keep them buying light bulbs often enough.

Pika@sh.itjust.works · 12 days ago

I’m glad that they added CrowdStrike into that article, because it adds a whole extra level of incompetency in the software field. The CS incident should never have happened in the first place if Microsoft had properly enforced the stance they claim to have regarding driver security and the kernel.

The entire reason CS was able to create that systemic failure was because they were (still are?) abusing the system MS has in place for signing kernel-level drivers. The process dodges MS review by using a standalone driver that then live patches, instead of requiring every update to be reviewed and certified. This type of system allowed for a live update that directly modified the kernel via the already certified driver. Remote injection of uncertified code into a secure location should never have been allowed in the first place. It was a failure on every level for both MS and CS.

PattyMcB@lemmy.world · 11 days ago

Non-technical hiring managers are a bane for developers (and probably bad for any company). Just saying.

oyzmo@lemmy.world · edited 11 days ago

64k demos show what is possible with skill

squaresinger@lemmy.world · 11 days ago

They mainly show what’s possible if you

don’t have a deadline
don’t have business constantly pivoting what the project should be like, often last minute
don’t have to pass security testing
don’t have customers who constantly demand something else
don’t have constantly shifting priorities
don’t have tight budget restrictions where you have to be accountable to business for every single hour of work
don’t have to maintain the project for 15-20 years
don’t have a large project scope at all
don’t have a few dozen people working on it, spread over multiple teams or even multiple clusters
don’t have non-technical staff dictating technical implementations
don’t have to chase the buzzword of the day (e.g. Blockchain or AI)
don’t have to work on some useless project that mostly exists for political reasons
can work on the product as long as you want, when you want and do whatever you want while working at it

Comparing hobby work that people do for fun with professional software and pinning the whole difference on skill is missing the point.

The same developer might produce an amazing 64k demo in their spare time while building mass-produced garbage-level software at work. Because at work you aren’t doing what you want (or even what you can) but what you are ordered to.

In most setups, delivering something that wasn’t asked for (even if it might be better) will land you in trouble if you do it repeatedly.

In my spare time I made the Fairberry smartphone keyboard attachment and now I am working on the PEPit physiotherapy game console, so that chronically ill kids can have fun while doing their mind-numbingly monotonous daily physiotherapy routine.

These are projects that dozens of people are using in their daily life.

In my day job I am a glorified code monkey keeping the backend service for some customer loyalty app running. Hardly impressive.

If an app is buggy, it’s almost always bad management decisions, not low developer skill.

humanspiral@lemmy.ca · edited 11 days ago

32 GB+ memory leaks require a reboot on any machine, and need a severity level higher than critical.

The AI later admitted: “This was a catastrophic failure on my part. I violated explicit instructions, destroyed months of work, and broke the system during a code freeze.” Source: The Register

When I started using LLMs, and would yell at their stupidity and how to fix it, most models (OpenAI excepted) were good enough to accept their stupidity. Deleting production databases certainly feels better with an AI’s whoopsie. But being good at apologizing is not the most desired employee skill.

Collapse (Coming soon) Physical constraints don’t care about venture capital

This is naive, though the collapse part is worse. Venture capital doesn’t care about physical constraints. Ridiculously expensive uneconomic SMRs will save us in 10 (ok 15) years. Kill solar now to permit it. But, scarcity is awesome for venture capital. Just buy the utilities, and get a board seat, get cheap, current price lock in, power for datacenters, and raise prices on consumers and non-WH-gifting-guest businesses by 100% to 200%. Physical constraints means scarcity means profits. Surely the only political solution is to genocide the mexican muslim rapists.

odama626@lemmy.world · 11 days ago

Accurate but ironically written by ChatGPT

BillBurBaggins@lemmy.world · 11 days ago

And you can’t even zoom into the images on mobile. Maybe it’s harder than they think if they can’t even pick their blogging site without bugs

AnarchistArtificer@slrpnk.net · 11 days ago

Is it? I didn’t get that sense. What causes you to think it’s written by ChatGPT? (I ask because whilst I’m often good at discerning AI content, there are plenty of times that I don’t notice it until someone points out things that they noticed and I didn’t.)

odama626@lemmy.world · edited 9 days ago

Not x. Not y. Z.

It wasn’t that --em dash–it’s this.

It loves grouping of 3s

Those are just the ones I noticed immediately again when skimming it, there was a lot more I noticed when I originally read it. I read it aloud to my wife while cooking the first time and we were both laughing about how obviously chatjipity it was lol.