
How often are calculators confidently wrong?


As a math teacher, I find this such a funny comparison to keep reading.

Yes, there's a difference between a deterministic outcome and a non-deterministic one. But throw humans into the loop, and it becomes more interesting. I can't count the number of times I've listened to someone argue their answer must be right because they got it from the calculator. And it's not just students; as a teacher I've always paid attention to how adults use math.

With calculators, GPT tools, or any other automated assistant, judgement and validation continue to matter.


> I can't count the number of times I've listened to someone argue their answer must be right because they got it from the calculator.

Answers from calculators are always right! But the human may have asked the wrong question.


There are a bunch of well-known areas where popular calculators tend to give incorrect answers: https://apcentral.collegeboard.org/courses/resources/example...

It’s mostly fine until it isn’t. AI will probably operate in the same capacity. We already have so much incorrect information out there that’s part of our pop culture. Even down to things like the fact that Darth Vader never said, “Luke, I am your father,” and Mae West never said, “Why don’t you come up and see me sometime?”

Even basic movie quotes are beyond our ability to get right. Hilariously, I just asked ChatGPT about these quotes and it explained that these are common misquotes, told me what was actually said in these movies, and explained some relevant context.

Sherlock never said, “Elementary, my dear Watson” even once in the books. Kirk never said, “Beam me up, Scotty.” We’re much less correct than we like to think. And somehow we’ve survived.

ChatGPT is fallible just like we are. We’ll manage, just like we always have.


I have another theory about all those quotes. Regarding the Darth Vader quote: if quoted exactly, i.e. "I am your father", it isn't immediately obvious the quote is from Star Wars; "Luke" gives you the context. The Sherlock and Kirk quotes are synthesized from what the characters actually said, and arguably the precise wording doesn't matter, because the point of the quote is to bring up images of the characters and situations, not of those specific words.


Go type .1*.2 into any JavaScript console.

Edit: slapping a few more in here:

https://learn.microsoft.com/en-us/office/troubleshoot/excel/...

https://daviddeley.com/pentbug/index.htm
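
For the curious, this is roughly what you'll see (the exact digits are fixed by IEEE-754 double precision, which every JavaScript engine uses, so this isn't browser-specific):

  > .1 * .2
  0.020000000000000004
  > .1 + .2
  0.30000000000000004

Neither 0.1 nor 0.2 has an exact binary representation, so the result carries a tiny error in the last digit.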


The answer in the JavaScript console is still a correct answer. The user did not specify a level of precision, and web browsers are programmed to use a precision level which is reasonable under most circumstances. If the user needs a higher level of precision, he or she needs to specify that as part of the question (such as by not using floating point numbers).
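
A minimal sketch of that last suggestion, assuming the user only cares about two decimal places: do the arithmetic in integers, where it is exact, and only convert back for display.

  // Work in integer hundredths instead of fractional doubles.
  const a = 10;                  // 0.10, expressed in hundredths
  const b = 20;                  // 0.20, expressed in hundredths
  const product = a * b;         // 200: exact, since integer math involves no rounding here
  console.log(product / 10000);  // 0.02 (hundredths times hundredths gives ten-thousandths)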

I don't mean to be pedantic. I teach coding to elementary school students, and this is something fundamental I try to make them understand. A computer will always do what you tell it to do. A bug is when you accidentally tell a computer to do something different than what you'd intended.

Going back to the calculator example, if a student used a calculator and got the wrong answer, the problem didn't come from the calculator. This is useful to understand; it can help the student work backwards to figure out what did go wrong.

AI is different in that we've instructed the computer to develop and follow its own instructions. When ChatGPT gives the wrong answer, it is in fact giving the right answer according to the instructions it was instructed to write for itself. With this many layers of abstraction, however, the maxim that computers "always do what you tell them" is no longer useful. No human truly knows what the computer is trying to do.


> I don't mean to be pedantic.

I'm sorry in advance, but this reply is just to meet pedantry with pedantry.

> A computer will always do what you tell it to do.

This is the Bohr model of computers. It's the kind of thing you tell elementary school students because it's conceptually simple and mostly right, but I think we know better here on HN. Pedantically, computers don't always do what you tell them to, because they don't always hear what you tell them, and what you tell them can be corrupted even when they do hear it.

For instance, random particles from outer space can cause a computer to behave quite randomly: https://www.thegamer.com/how-ionizing-particle-outer-space-h...

> why was nobody able to pull it off, even when replicating exactly the inputs that DOTA_Teabag had used? Simple: this glitch requires a phenomenon known as a single-event upset, which is very much out of any player's control.

I don't think we can reasonably say that in this instance, the computer behaved according to what the user told it to do. In fact, it responded to the user and the environment.


That's true. An earlier version of my comment called out hardware problems as an exception—insufficient error correction for cosmic-ray bit flips is fundamentally a hardware problem—but I removed it before posting. In a way, I feel hardware bugs do still follow this principle: The electrons in the circuits are behaving as they always do, just not in the way we intended. But I agree this gets philosophically messy—no one "programmed" the electrons.

My underlying point is that, at least in 99.999% of cases, the problem isn't the calculator, it's the human using the calculator incorrectly. And although you could draw some parallels between calculators and AIs with regard to selecting the right tool and knowing when and how to use it, I'd say the randomness involved in an LLM is fundamentally different.


I don't think it's fundamentally different, and I think you're conflating complexity with randomness.


> The answer in the JavaScript console is still a correct answer.

It's wrong in the same way that saying 1/1 = 1.0004 is wrong. It's not a matter of chosen precision: the answer doesn't become correct when you increase the number of zeros between the 1 and the 4.


It makes it less wrong. For most calculations people do, we don't need very many digits of precision for any one calculation.


That's true. I think that it is analogous to the discussion of AI limitations. Both of these are tools and are not categorically exclusive.

In the case of translating floating point numbers between base-2 and base-10, we have to make approximations which will often be slightly wrong forever, no matter how much precision we add.
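
You can watch that "slightly wrong forever" happen in a console: asking for more digits just exposes more of the approximation rather than fixing it.

  > (0.1).toFixed(20)
  '0.10000000000000000555'

The stored value is the base-2 double closest to 0.1, and no amount of displayed precision turns it into exactly 0.1.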

With AI, depending on the pre-conditions, the AI could be stuck in a state of being slightly wrong forever for a specific question without regard to further refinement of the query.

These are both still useful as tools. We just need to be able to work on how refined an answer the AI gives, which may be addressed fairly well through prompt engineering, if not through the advancement of GPT itself.


One deck is made of wood. One deck is made of steel. They will behave differently after years of weathering.

Just because they are both decks doesn't mean they are the same.


Are both useful in some context?


In the same way that nearly any two arbitrary objects are useful in some context.


From the perspective of an investor who just wants their stonks to go up, sure. From the perspective of a sailor who wants the deck to not crumble beneath their feet in a storm, no.


Answers from AI might not always be right and a human has to learn to judge them or refine their prompts accordingly. In either case there's a tool that a human must use and become savvy with.


> Answers from calculators are always right! But the human may have asked the wrong question.

I actually agree with you, but, in the same vein, does it not mean that the user did not ask the correct prompt?


No, they are not.


Calculators have no hallucinations, LLMs do. They can literally say that 1+1 is not 2.


People keep saying this, pointing out the "mistakes with confidence" aspect of LLMs, but as someone who is continually amazed by ChatGPT and finds it very useful in my day-to-day, it's hard for me to take this objection seriously if presented as a reason not to use AI.

That is, for me, the output of ChatGPT or other AI tools is the starting point of my investigation, not the end output. Yes, if you just blindly paste the output from an AI tool you're going to have a bad time, but we also standardize code reviews into the human code-writing process - this isn't that different.

Just giving one specific example, I find ChatGPT to be an incredibly efficient "documentation lookup tool". E.g. it's great if I'm working with a new technology or API and I want to know "what my options are" but don't know what keywords to search for; it can help give me a really good "lay of the land", and from there I can read on my own to get more specifics.


Maybe you haven't used it enough. ChatGPT is wrong all the time for me, sometimes insultingly wrong. The confidence in its incorrect answers just makes it that much worse.

I can't buy any of this hype for a "word-putting-together" algorithm. It's not real intelligence.


Please give some examples then. I've found the GPT-4 version to be remarkably accurate, and when it makes mistakes it's not hard to spot them.

For example, I commented last week that I've found ChatGPT to be a great tool for managing my task list, and for whatever reason the "verbal" back-and-forth works much better for my brain than a simple checklist-based todo app: https://news.ycombinator.com/item?id=35390644 . But, I also pointed out how it will get the sums for my "task estimate totals by group" wrong. Still, this mistake is so easy to see, and after using it for a while I have a good sense of when it's likely to occur, that it doesn't lessen the value I get from using the tool.


OK, here's one: this substack [1] was flying around a week or two ago, asserting that the marginal value of programmers will fall to zero by 2030. What a dream! No more annoying nerds!

The code in the post is wrong. For this "trivial" example, if you just blindly copied it into your code, it would not do what you want it to do. I love this example not just because it's ironic, but because it's a perfect illustration of how you need to know the answer before you ask for the solution. If you don't know what you're doing, you're gonna have a bad time.

I'm not at all concerned about the value of programmers falling to zero. I'm concerned that a lot of bad programmers are going to get their pants pulled down.

[1] https://skventures.substack.com/p/societys-technical-debt-an...

(Edit: and as a totally hot take, while I'm not worried about good programmers, I think the marginal value of multi-thousand word, think-piece blog posts is rapidly falling to zero. Who needs to pay Paul Kedrosky and Eric Norlin to write silly, incorrect articles, when ChatGPT will do it for free?)


OK, so we are 100% in agreement then? I absolutely don't believe the marginal value of programmers will fall to zero by 2030 (but, to clarify, the way you phrased your original sentence I thought it was that an LLM made this assertion, not some random VC dudes). I also highlighted in my posts that I use AI as an aid to my processes, "That is, for me, the output of ChatGPT or other AI tools is the starting point of my investigation, not the end output. Yes, if you just blindly paste the output from an AI tool you're going to have a bad time, but we also standardize code reviews into the human code-writing process - this isn't that different."

Also, I think the coding example in that substack highlights that one of the most important characteristics of good programmers has always been clarifying requirements. I had to read the phrase "remove all ASCII emojis except the one for shrugs" a couple times because it wasn't immediately clear to me what was meant by "ASCII emojis". I think this example also highlights what happens when you have 2 "VC bros" who don't know what they're talking about highlighting the "clever" nature of what ChatGPT did, because it is totally wrong. Still, I'd easily bet that I could create a much clearer prompt and give it to ChatGPT and get better results, and still have it save me time in writing the boilerplate structure for my code.


You asked for an example and I provided one that I thought illustrated the mistakes GPT makes in a vivid way -- mistakes that are already leading people astray. The fact that this particular example was coupled with a silly prediction is just gravy.

In short, I don't know if we "agree", but I think OP is/was correct that GPT generates lots of subtle mistakes. I'd go so far as to say that the folks filling this thread with "I don't see any problems!" comments are probably revealing that they're not very critical readers of the output.

Now for a wild prediction of my own: maybe the rise of GPT will finally mean the end of these absurd leetcode interview problems. The marginal value of remembering leetcode solutions is falling to zero. The marginal value of detecting an error in code is shooting up. Completely different skills.


Getting back to that example from that post, though, thinking about it more, "remove all ASCII emojis except the one for shrugs" makes absolutely no sense, because you can't represent shrugs (either with a unicode "Person shrugging" character emoji, or the "kaomoji" version from that code sample that uses Japanese characters) in ASCII, at all. So yes, asking an LLM a non-sensical question is likely to get you a non-sensical response, and it's important to know when you're asking a non-sensical question.
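
(Easy to check: map the shrug kaomoji's characters to their code points; ASCII only covers 0 through 127.)

  > [...'¯\\_(ツ)_/¯'].map(c => c.codePointAt(0))
  [175, 92, 95, 40, 12484, 41, 95, 47, 175]

Both ¯ (175) and ツ (12484) fall outside the ASCII range.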


Well, explain it however you like, but the point is that GPT is more than happy to confidently emit gibberish, and if you don't know enough to write the code yourself (or you're outsourcing your thinking to it), then you're going to get fooled.

I'd possibly argue that knowing how to ask the right question is tantamount to knowing the answer.


That code is wrong and I wonder if the author is familiar with the property of code encapsulated in the halting problem. Generically, reading code does not grant one the ability to predict what will happen when that code runs.

Whatever, time will tell. I still haven’t quite figured out how to make good use of GPT-4 in my daily work flow, tho it seems it might be possible.

Has anyone asked it to make an entry for the IOCCC?


For a time, I was attempting to use it for game advice during my recent playthrough of Demon's Souls remake (What's the best build for X? What's the best weapon for X?). I asked ChatGPT where to find the NPC The Filthy Woman in a certain level. ChatGPT answered that that NPC doesn't exist, and perhaps I had the wrong game? That NPC most certainly does exist.

I was also using it to generate some Java code for a bit. That is, until it started giving me maven dependencies that didn't exist, and classes that didn't exist, but definitely looked like they would at first glance.


> I asked ChatGPT where to find the NPC The Filthy Woman in a certain level. ChatGPT answered that that NPC doesn't exist, and perhaps I had the wrong game? That NPC most certainly does exist.

OK, wow - that example kind of perfectly proves my point. If I were to ask ChatGPT an extremely specific, low-level question about an extremely niche topic, then I would absolutely be on "high alert" that it wouldn't know the answer. And while I agree the "confidence" with which ChatGPT asserts its answers (though I'd argue the GPT-4 version does a much better job at not being over-confident than 3.5) is off-putting, I think it's pretty easy to detect where it's wrong.

I'd also be curious about your Java example. There was a good YouTube video of a guy that got ChatGPT to write a "population" game for him. In some cases on first try it would output code that had compile errors, e.g. because it had wrong versions of Python dependencies. He would just paste the errors back in to ChatGPT and ChatGPT would correct itself. Again, though, this highlights my point that I use ChatGPT as the start of my processes, a 1st draft if you will. I don't just ask it to write some code, then when I get an error throw my hands up and say "see how dumb ChatGPT is." To each their own, though.


>OK, wow - that example kind of perfectly proves my point. If I were to ask ChatGPT an extremely specific, low-level question about an extremely niche topic, then I would absolutely be on "high alert" that it wouldn't know the answer. And while I agree the "confidence" with which ChatGPT asserts its answers (though I'd argue the GPT-4 version does a much better job at not being over-confident than 3.5) is off-putting, I think it's pretty easy to detect where it's wrong.

I don't consider a popular video game from 2009 to be "extremely niche", and I also shouldn't have to know what ChatGPT knows. And no, I don't think it's easy to detect where it's wrong if you don't know the right answer, and it's actually pretty useless when you have to spend time confirming answers.


I think these types of errors get mostly resolved with a search plugin.


Out of curiosity was this 3.5 or 4?


I don't believe it was version 4 yet.


Do you happen to know what messages are gonna get dropped by the client if the conversation becomes too long?


It's still just guessing. Ask ChatGPT to provide some links to documentation and check them.

LLMs are great for “creative” work: images, poems, games - entertainment based on imaginary things.


There are three types of lies, as the saying goes: lies, damned lies, and statistics. But why are statistics considered lies?

Because of how they're used.

If you think of AI as a source of truth, obviously you're going to run into trouble: it "lies"! But if instead of thinking of it in isolation, you think of the person+AI producing results, then you should trust that person exactly as much as you would whether or not they use AI.


Depends on the calculator. Floating point imprecision is well documented.


That's true. But we know exactly why it can't do it.


How important is the "knowing why" if the mistakes are still there? And in reverse, we "know" GPT doesn't use a calculator unless specially pointed at one.

Floating point errors creeping in is why we have to use quaternions instead of matrices for 3D games. Apparently. I'd already given up on doing my own true-3D game engine by that point.
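
For what it's worth, the drift itself is easy to demonstrate. A rough sketch in JavaScript (2D instead of 3D for brevity; the exact error will vary by machine): compose a tiny rotation with itself a million times, then check the determinant, which a true rotation matrix keeps at exactly 1.

  const step = 0.001;                        // tiny rotation, in radians
  const r = [Math.cos(step), -Math.sin(step),
             Math.sin(step),  Math.cos(step)];
  let m = [1, 0, 0, 1];                      // identity, row-major [a, b, c, d]
  for (let i = 0; i < 1_000_000; i++) {
    m = [m[0] * r[0] + m[1] * r[2], m[0] * r[1] + m[1] * r[3],
         m[2] * r[0] + m[3] * r[2], m[2] * r[1] + m[3] * r[3]];
  }
  console.log(m[0] * m[3] - m[1] * m[2]);    // should be 1; the leftover is accumulated rounding error

Quaternions accumulate error too, but renormalizing a quaternion is one cheap operation, while re-orthogonalizing a drifted matrix is messier.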

In some sense we "know why" humans make mistakes too — and in many fields from advertising to political zeitgeist we manipulate using knowledge of common human flaws.

On this basis I think the application of pedagogical and psychological studies to AI will be increasingly important.


Well documented is the key difference.


As often as people put in incorrect values. As often as someone goes beyond the range. Always when someone tries to add yellow to blue. Often when the wrong formula is used.

And in this case you don't get to put in the formula.


AI is confidently wrong about a lot of things, but that doesn't mean it's useless. It means you need to verify what it generates. Doing that for code is much easier than prose. AI that produces wrong code is immediately and obviously wrong. It can't really fool you. It's easy to test. You can even ask AI to write tests for the code it produces to demonstrate it's correct.
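
A trivial sketch of that workflow (the function and the checks here are hypothetical stand-ins for whatever the AI actually produced):

  // Hypothetical AI-generated function: clamp a value into [lo, hi].
  function clamp(x, lo, hi) {
    return Math.min(Math.max(x, lo), hi);
  }

  // Hypothetical AI-suggested checks, reviewed by a human before being trusted.
  console.assert(clamp(5, 0, 10) === 5, "in-range value passes through");
  console.assert(clamp(-3, 0, 10) === 0, "below range clamps to lo");
  console.assert(clamp(42, 0, 10) === 10, "above range clamps to hi");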


Tests help, but tests can be wrong as well; if this were all so easy, we wouldn't have any bugs.


Just because it is making obvious errors doesn’t mean it isn’t also making subtle errors.


This is the thing. ChatGPT shouldn't be that confident in its wording, IMO. If it just said "according to me" instead of stating things as fact, people would have far fewer problems with it.

We know this wording is just show but people still get swayed by it and believe it must be true.


Yeah, and it can actually reflect on itself and its mistakes when prompted, so I can see a fix like this coming soon. Sometimes just asking "Are you sure?" is enough for it to apologize and say a part of its answer wasn't based in known fact.

Also another point while I'm here: Many many humans I've met are often confident when incorrect as well and can and will bullshit an answer when it suits their comfort.


But no one asserts the existence of those people will forever change society.


People are confidently wrong all the time and yet we still seem to get stuff done.

A tool is a tool. It has good uses and not-so-good uses. As the human, you figure out where it works and where it doesn't.


Every time the human supervising it makes a mistake in their role in the relationship between user and tool.


There's a lot of work happening at the moment around self-reflection and getting LLMs to identify and correct their own hallucinations and mistakes.

https://arxiv.org/pdf/2303.11366.pdf


I suspect some temporality will need to be added. There are times when writing the code you have a question because the code exposes an unexpressed choice in the requirements. When you are coding in linear time, you then know to go ask the question. I am not sure that just generating the most likely or most rewarded response will do that easily. It seems to just arbitrarily pick the most likely requirement.


Every time they are wrong, which is every time the user slips on a key.


How often are people?


I think the difference, right now at least, is that people will go, "well, I'm not sure about this so I think we should look it up, but this is what I think" - the AI doesn't do that. It lies in the same exact way it tells truths. How are you supposed to make decisions based off of that information?


Does it lie? Or just get things wrong sometimes?

Lying requires knowledge that what you are saying is not the truth, and usually there's a motive for doing so.

I don't think ChatGPT is there yet... or is it?


Technically, what ChatGPT is doing is bullshitting because it doesn't have any knowledge of or concern for truthfulness.

https://en.m.wikipedia.org/wiki/On_Bullshit


Sure, it's not lying, you're right; there's no will there, I'm anthropomorphizing. It is producing entirely wrong facts / pseudo-opinions (as it can't actually have an opinion).


I was about to suggest "pathologically dishonest", but then I looked up the term and that seems to require being biased in favour of the speaker and knowing that you're saying falsehoods.

"Confabulate" however, appears to be a good description. Confabulation is, I'm told, associated with Alzheimer's, and GPT's output does sometimes remind me of a few things my mum said while she was ill.


Presumably the same way you make decisions on any piece of information. You should not be blindly trusting a single source.


I was too vague I think. The only place where I can see it being acceptable right now is code because I have a whole other system that will call out failures - I can rely on my IDE and my own expertise to hopefully catch issues when they appear.

Outside of the code use case, what should I rely on ChatGPT for that won't have me also looking for the information somewhere else? I suppose subjective soft things, like writing communications. But I can't rely on it for information.


Again, the idea that you should rely on any single source for information is the issue. Nothing changes with ChatGPT other than the apparent expectation that it is infallible.


So what are the use cases that I would use ChatGPT to find information that would speed up my work but would still require me to verify the information? If nothing changes with ChatGPT what is its use as a tool (assuming you want to use it to get information)?


It seemed to do a good job of outlining a JoJo season in the style of a Shakespearean comedy.

I wouldn’t ride in a vehicle it designed tho, based on my week of asking it to do Go programming.


Sometimes it does, but I asked ChatGPT (not 4) to give me song lyrics for a song that it should have had data for, and it gave me entirely wrong lyrics. I asked again and it gave me more bad lyrics, not even close, and it didn't even let on that it didn't know; the lyrics would have been convincing. If I didn't already know the material I wouldn't know it was confabulating.


Calculators use floating point and can have catastrophic errors if not used correctly. So yes, calculators can be confidently wrong.
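
"Catastrophic" usually means absorption or cancellation, where the error swamps the answer instead of hiding in the last digit. Two quick console illustrations (JavaScript doubles, but pocket-calculator floats fail the same way):

  > (1e16 + 1) - 1e16
  0
  > (0.1 + 0.2) - 0.3
  5.551115123125783e-17

In the first, the 1 is rounded away before the subtraction ever happens; in the second, the true answer is exactly 0 and everything left over is rounding noise.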



