It’s the same photo, the same model, the same question. But you won’t get the same answer. Not even close — and the differences are large enough to cause a hypoglycaemic emergency.
OK I wonder if there’s something wrong with the photo.
The photo:

WTF!!??
That’s like estimating the carbs in 2 slices of standard sandwich bread! Of course not all bread has the same amount of sugar, but a reasonable range based on an average should be a dead easy answer. I thought the headline sounded crazy, but try to read the article, and it actually becomes worse. I have said it many times before, these AI chatbots should not be legal, they put lives at risk.
When are people going to realize that an LLM is not a calculator and doesn’t actually know anything?
Well first AI tech corporations need to do advertising that AIs can keep doing all this.
That it is not a calculator and is horrible at determinism is not debatable, however its (very biased) huge knowledge is its core feature
How come it’s inaccurate about 40% of the time when I know the answer then? It’s a bullshit factory. A chatbot that’s fundamentally designed to sound like a person and be able to respond to any prompt. But truth isn’t any part of the fundamental architecture of an LLM.
Bullshit factory is very apt. I was using it for an open book exam and it gave answers entirely skewed to the way the question was asked.
For example, if I asked “is X bacteria a pathogen in Y disease”, it would say yes, it was a very bad pathogen.
If I asked “what effects does X bacteria have in this body system”, it said it was a beneficial bacteria.
Never trust the AI summary, you have to fully read the studies.
It does lie and hallucinate a lot, especially with biased context in the question (the bullshit part). The (biased) knowledge is hiding somewhere in its weights, it is just that it is sometimes quite hard to recover.
Your 40% depends a lot on how you ask the questions and the field of these questions. Humanity’s Last Exam is a more objective benchmark for measuring the wide knowledge of LLMs.
Your 40% depends a lot on how you ask the questions and the field of these questions.
Dude, they fail that exam with even worse error rates than I see!
When you can verify it, it’s OFTEN and REGULARLY wrong. It’s stupid to trust it for anything you can’t personally verify.
The designed purpose of LLMs is to respond to human interaction, not to be correct. They are the showoff who pretends he can answer every question. They are the confident drunkard at the bar who will tell you anything that pops into their head. Intelligent, knowledgeable people say “I don’t know” when they don’t know. LLMs don’t do that. Ever. Trouble is, they don’t “know” anything. They’re a chatbot from the bottom up. Chatbot through and through. It’s their fundamental nature.
Yes there was knowledge and deep understanding in their training data. Also, I ate chicken curry for tea. However, I am not a chicken, I do not cluck, I haven’t started eating worms, I cannot produce any chicken, and my poop is not chicken either. My poop smells faintly of curry. So it is with LLMs and the knowledge and understanding in their training data.
They beat any human on that knowledge benchmark, completely unrelated to your 40% “test”. Try to answer any of the example questions on the main page.
I don’t need a metaphor I know LLMs are hallucinating, lying, bullshitting. That doesn’t invalidate my point.
The models themselves are entirely deterministic. The non-determinism you see is artificially introduced at the application layer to make the output seem more human. It’s usually controlled by a setting called “temperature”, which when set to 0 will give completely reproducible results.
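A minimal sketch of the idea, assuming a toy list of scores rather than a real model: the scores ("logits") are fixed, and the randomness only enters when a token is drawn from them. The function name `sample_token` and the three scores are made up for illustration.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick an index from raw scores; temperature 0 means plain argmax."""
    if temperature == 0:
        # No randomness at all: always the highest-scoring candidate.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax with temperature: divide scores, then exponentiate.
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - top) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.5, 0.3]
rng = random.Random(0)

greedy = {sample_token(logits, 0, rng) for _ in range(100)}
sampled = [sample_token(logits, 1.0, rng) for _ in range(100)]

print("temperature 0 picks:", greedy)
print("temperature 1 picks:", sorted(set(sampled)))
```

With temperature 0 every run lands on the same candidate; with temperature 1 the same scores produce a mix of picks.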
This is correct, I suppose you’re talking about the final softmax layer? When I said they are bad at determinism, I was talking about reasoning on deterministic rules not having deterministic output. For example, LLMs make logical deduction errors, calculation errors etc.
Waste of energy. It’s like asking a person to estimate a non-trivial angle. Either use a model trained for that task, or don’t bother.
The point is they are advertising that these models can do it.
You’d expect the same answer each time. It’s the same photo, the same model, the same question. But you won’t get the same answer.
I don’t know what ads show that, but anyone who knows the first thing about LLMs knows you don’t get the same answer twice.
I’d get this expectation 5 years ago when most people weren’t familiar with it, but come on… you don’t need to feed it an image 500 times to see that.
The point is that:
- It is being used for it, even though it is obviously not capable of giving a reliable and realistic answer
- It allows this usage, even though it is dangerous and not within its capabilities
- Each model gives answers that vary wildly, something that a human wouldn’t do. A human wouldn’t give you answers that are 10x more for the same question randomly.
I tried to build a deck with my smartphone, it couldn’t drive a single nail.
Maybe get a stronger case. 🤷♂️😄
But the guy at the phone store told me it was practically indestructible, I used it practically and it destructable’d.
I’m starting to think this whole ‘phone’ thing is doomed to failure.
I’m basing this entirely on a single anecdotal evidence and all of the other evidence that I’ve selected which confirms my worldview on the topic. I have done my own research (but not with a phone).
imagine that. software that performs strictly language specific operations can’t do math.
And the US is about to, if they haven’t already, put AI in charge of the Internal Revenue Service.
That should be fun.
LLMs are not deterministic like calculators. Wrong tool for the job.
If you supplied humans with the same image and asked for the same estimate I’d be curious to know the difference in results.
Mine would be: “I have no idea” - An answer the LLMs generally refuse to give by their nature (when they do decline to answer, it is usually because something in the context indicates that a refusal is the proper text).
If you really pressed them, they’d probably google each thing and sum the results, so the estimates would be as consistent as first google results.
LLMs have a tendency to emit a plausible answer without regard for facts one way or the other. We try to steer things by stuffing the context with facts roughly based on traditional ‘fact’ based measures, but if the context doesn’t have factual data to steer the output, the output is purely based on narrative consistency rather than data consistency. It may even do that if the context has fact based content in it sometimes.
Custom built LLMs are awesome for specific purposes in terms of dealing with data and providing resources however chatbots ain’t that.
Humans want to follow whatever makes sense to them, they use AI because it’s confident. AI just replaced their god.
Bruh a couple of months ago I asked it (Gemini) to check the number of characters, including spaces, in a potential game character name because I was working at the time and couldn’t stop to check my in-head count. It told me 21–I had counted 20. I thought I must have gotten distracted and miscounted. Later when I had time to actually focus on the issue it turned out AI had miscounted a 20 character string (maybe counting the null terminating character?).
AI doesn’t see individual characters, it sees tokens, with most tokens being a word or part of a word. That’s why per-character questions have such a high failure rate.
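A toy illustration of that point, assuming a made-up four-entry vocabulary and a made-up 20-character name (real tokenizers are far larger and split text differently): the model would be handed four chunks, while ordinary code can simply count characters.

```python
def toy_tokenize(text, vocab):
    """Greedy longest-match split; falls back to single characters."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary matched: emit one character.
            tokens.append(text[i])
            i += 1
    return tokens

# Hypothetical vocabulary and name, for illustration only.
vocab = {"Shadow", " Walker", " of", " Ash"}
name = "Shadow Walker of Ash"

tokens = toy_tokenize(name, vocab)
print(tokens)       # the chunks a model would work with
print(len(tokens))  # number of tokens
print(len(name))    # number of characters, spaces included
```

The character count is trivial for `len()`, but nothing in the token list states it directly.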
If it doesn’t understand the simple concept of the number of letters and spaces, it needs to be reprogrammed.
ETA: sorry folks, not gonna change my view and simp for shit A.I., continue with the downvotes.
It doesn’t understand anything though? It never will. It’s a probability machine. If you choose to believe its output, that’s on you. I use it as a coding assistant to get boring things done faster. Fire a prompt at claude code, grab a coffee, check out the diff. But that last step is crucial. Can’t trust AI output blindly.
The embedding layer post tokenization is not just a probability machine the way you’re suggesting it. You can argue that it is probabilistic with inferred sentiment, but too many people think it works like how text prediction on your phone does and that is just factually inaccurate.
Verify output of course, but saying “it doesn’t understand anything” and “probability machine” is a borderline erroneous short sell. At the level of tokens it “understands” relationships, and those relationships are not probabilistic, though they are fundamentally approximated based on a training corpus.
ah right, and my eyes need to be recreated because they can’t see ultraviolet
People should read the top comments on Hackernews instead of anyone here, they’re more informed on the topic than Lemmy is
Yeah - if you’re after AI fanbois you should head over there. They’re not that bright, but if you check show and tell you can see what claude’s been up to the last two days
HN is full of techno fascists
Better yet, download Qwen 3.5/3.6, with a “raw” notepad like Mikupad. Try it yourself:
https://huggingface.co/ubergarm/Qwen3.6-27B-GGUF
https://github.com/lmg-anon/mikupad
One might observe:
- Chat formatting, and how janky the “thinking” block is.
- How words are broken up into tokens, not characters.
- How particularly funky that gets with numbers.
- Precisely how sampling “randomizes” the answers by visualizing “all possible answers” with the logprobs display.
- And, thus, precisely how and why carb counting in ChatGPT fails, yet a measly local LLM on a desktop/phone could get it right with a little tooling or adjustment.
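One possible reading of “a little tooling”, sketched under assumptions: the model only names the items it sees, and ordinary code does the lookup and the sum, so the same items always give the same total. The table name, item names, and numbers below are placeholders, not nutritional data.

```python
# Placeholder values for illustration only, NOT nutritional data.
CARBS_PER_SERVING_G = {"item_a": 10.0, "item_b": 25.5, "item_c": 0.0}

def total_carbs(items, table):
    """Sum grams of carbohydrate for (name, servings) pairs.

    Raises KeyError for unknown items instead of guessing a value.
    """
    missing = [name for name, _ in items if name not in table]
    if missing:
        raise KeyError(f"no data for: {missing}")
    return sum(table[name] * servings for name, servings in items)

# Hypothetical output of the "identify the items" step.
identified = [("item_a", 2), ("item_b", 1)]

total = total_carbs(identified, CARBS_PER_SERVING_G)
print(total)
```

The arithmetic is deterministic, and an item with no entry produces an error rather than a plausible-sounding number.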
This is exactly what OpenAI/Anthropic don’t want you to do. They want users dumb and tethered, like a cloud subscription or social media platform. Not cognizant of how tools they are peddling as magic lamps actually work. And why, and how, they’re often stupid.



