The best cities have laws that allow for light motorized vehicles in the bike lanes. Not just ebikes, but also mobility scooters, microcars, electric wheelchairs, and adaptive bicycles.
I thought the industry solved this by bundling breaking changes in major version updates.
V 1.0 - foo introduced
V 1.1 - foo deprecated, recommend bar
V 2.0 - foo removed, only bar
Users can stay on 1.x indefinitely, even if it never receives updates. Development continues on 2.x, eventually 3.x and so on. Users only experience breaking changes when they manually do a major version upgrade.
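In code, the 1.1 step is usually just a shim: the old entry point keeps working but warns. A minimal sketch in Python, with foo/bar standing in for whatever the real API is:

    import warnings

    def bar(x):
        """The replacement API introduced in 1.1."""
        return x * 2

    def foo(x):
        """Deprecated in 1.1, removed in 2.0: use bar() instead."""
        warnings.warn(
            "foo() is deprecated and will be removed in 2.0; use bar() instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return bar(x)

Users on 1.x keep a working foo() plus a nudge toward bar(); 2.0 simply deletes the shim.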
The problem with mountains is twofold: many mountains can be climbed without being elite while still exposing you to major risk, and for some mountains there is objective hazard that can't be mitigated.
One example of an "easy" but high-risk climb is Mt. Rainier in Washington. All you need to go up is a set of crampons and a backpack, no technical mountaineering needed. However, the mountain is full of glaciers that can collapse under you, which has killed many people. Additionally, many have slipped and then slid to their death. In my case, when I attempted Rainier I took a wrong turn at one point and almost walked off a cliff.
Second: objective hazard. Objective hazards are risks that cannot be reasonably mitigated, things like rockfall where a rock breaks off and falls on your head at random, or unpredictable avalanches. Mt. Rainier also has an area called the bowling alley, known for its rockfall. The humans are the pins. Rainier also has an area called the icebox, where cornices break off and fall into the climbing route. In 1981 the icebox killed 11 people in one day. Those climbers did everything right, but were in the wrong place at the wrong time.
Mountaineering is not the same as other sports. It is sometimes deceptively easy, yet there are risks that simply cannot be mitigated. Any experienced mountaineer can give you a long list of friends who have died. That's the case in few other sports.
It's the same with base jumping. I remember watching a documentary on it one time and almost all of the people being interviewed knew multiple people that died during a jump.
An engine failing on an aircraft, especially a light aircraft, is not a guaranteed crash landing. It is a serious situation; however, aircraft usually glide well. This means you have opportunities to find somewhere that is adequate for landing. Many aircraft with engine failures have landed safely on airstrips. Interestingly, this is also the case with helicopters, due to their ability to autorotate.
I think base jumping is slightly closer to suicide.
Or that almost suicidal thing with the wingsuits some people do: I get the appeal, I'm sure the rush of feeling like flying must be incredible, but they are playing Russian roulette.
Wingsuits aren't something I really follow, but my understanding is that it's an activity where most serious practitioners sooner or later die from a crash or fall.
It's very dangerous, but I think "most serious practitioners die sooner or later" is an exaggeration. The number I see is 1 death per 500 jumps, which I assume is also biased towards people with less experience.
It might have been true for the pioneers of the sport though.
I (and probably the commenter you're replying to) meant wingsuit BASE jumping, which is considered one of the most dangerous sports in the world. I think it's borderline suicide. I think its practitioners are thrill seekers who want a higher and higher rush, much like an addiction, which has a high chance of killing them.
It's more dangerous than jumping from a plane with a wingsuit.
I'm under the impression that wing suit base jumping is a lot safer than it was 20 years ago due to design improvements, and I believe that's what the 1 in 500 number I see in a few articles comes from. I could be wrong, but it matches the impression I have from watching a few content creators doing it.
Yes, I'm basically referring to BASE jumping in a wingsuit, though, as I say, it's not an activity I'm remotely interested in or have much knowledge of beyond some videos. Of course, serious/elite practitioners do quite a lot of jumps.
I read the fatality claim somewhere presumably somewhat authoritative, but it was a while back.
I have been thinking about taking up gliding. Is this a dangerous sport? How long does it take to get to glide a few miles? Any pointers (books, videos, etc.) to start up? I live near Seattle, WA. Thanks!!!
When I used to do gliding (sailplane, not hang gliding or paragliding) many years ago, it was not classed as a dangerous sport for insurance purposes. Don't know about the other fields of gliding. General aviation was classed as riskier - I guess glider pilots are more used to the fact that they don't have a working engine!
Hehe, a friend wants to do it. He has a trainer. The condition was at least 5000 parachute jumps from a plane, then 3 years of training with at least another 1000 jumps from a plane or so... it is hard and demanding, be assured. He is elite level in the sport.
I wouldn't do it. Not because of the danger, but because the training to do it safely is too hard.
>> Objective hazards are risks that cannot be reasonably mitigated, things like rockfall where a rock breaks off and falls on your head at random, or unpredictable avalanches
Those risks can be mitigated. They can't be reduced to zero, but they can be made less severe.
Avalanches don't typically happen randomly out of the blue any more than thunderstorms do in the midwest. In the midwest, you know days ahead of time that there is going to be a risk of thunderstorms the same way that you know days ahead of time when there is going to be a high avalanche risk. You know the amount of recent snowfall, you know what the weather is going to be, and you know how to recognize avalanche terrain.
Rockfall does not occur completely randomly. If you go to a place overlooking something like the bowling alley on a warm summer afternoon, you will see and hear rocks the size of cars or small houses bouncing down the slopes. If you go on a cold winter morning before the sun hits the snow, you won't see or hear that because everything that is frozen in place will stay frozen in place. You choose the time of your climb to mitigate risks from rockfall, avalanches, and weather. Mitigate does not mean reduce to zero.
Yes, mountaineering can be risky. Everyone decides their own level of involvement. Climbing a walkup in bluebird weather has less risk than driving to the grocery store. Attempting to climb K2 kills 25% of the people who do it. Mountaineer's choice. If you've got kids and you try to climb K2, you're selfish and I feel sorry for your kids. If you're a single guy who wants to risk death, go for it.
It has a historic fatality ratio of 1 death for every 4 summits. If 100 people try to climb it in a season and you get 4 summits and 1 death, that's a 25% death-to-summit ratio, but 99 out of the 100 people survived.
Last year it looks like it had 175 climbers, ~50 summits, and 2 deaths. 2023 had over 100 summits and 1 death.
You can mitigate to a greater or lesser degree. Kate Matrosova was actually well prepared when she died a few years ago. She simply should not have gone out in that forecast.
On the other hand, you get into the bigger mountains and it's a lot harder to time the weather and other dangers.
Which is why they all carry beacons, so when they fail they can involve others in their folly. Dunno, you wanna leave civilization, that's like a billion dollar risk premium right there.
This isn't true. I know some alpine guides and alpinists, being one myself. I can remember the stories of at most 3 deaths, and those weren't even firsthand encounters. If you are a professional it's really, really rare that you die. Accidents with injuries can happen for sure, and they are way more common.
Mt. Everest has something like 300 deaths against 15,000 successful climbs or so. And that's not an easy one, and it's ridiculously elite.
I know guides in the Alps and they do 300 alpine tour days a year. So how come almost all of them are still alive, and their friends too? For sure they know people who know people, or colleagues, who died in accidents, mostly avalanches and loose rock, but as I said, really, really rare.
But on the other hand, there are many deaths in the Alps every year.
The guy in the post seemed well prepared and smart, and shit happens; I am sorry for the loss. Very glad his family got an answer.
My first job out of college was working at Epic on MyChart. Great people, terrible code.
Epic’s main problem is a lack of clear internal code ownership. Everyone owns all the code. This means that even if you clean something up, someone on the other side of the company may come in and mess things up again.
This led to really defensive programming where developers would never refactor; they would simply add a new if case for their new functionality somewhere deep in the code, then prop drill the data down. The result was every core function having over a dozen parameters and hundreds of branches. It eventually became impossible to reason about. Cross-team calls were just function calls rather than defined APIs. This made it fast to develop code initially, but terrible to own long term. This mainly applies to their MUMPS code.
While I was there I felt like Epic was beyond saving, but with a big push there may be something they can do:
1. Enforce some level of code complexity. Best practice is 40 lines per function and no more than 4 parameters per function. Epic probably shouldn’t shoot for that, but a 100 line limit and 6 parameters per function would already be a huge improvement. A simple automated check, like the sketch after this list, would be enough to enforce it.
2. Enforce strong code ownership. Epic has many people who are there for life, let them cook. Epic should segment off code to certain teams so those owners can fix it at their leisure. Cross-team calls should go through clear API contracts. It would require more discussion to get feature requests approved since not everyone can do anything anymore, but the code would gradually improve.
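To illustrate point 1: the check itself is cheap to automate. A minimal sketch of such a CI gate, written in Python purely as an illustration (Epic's code is MUMPS and C#, not Python) and using the thresholds suggested above:

    import ast
    import sys

    MAX_LINES = 100   # thresholds from the suggestion above; tune per codebase
    MAX_PARAMS = 6

    def check_file(path):
        """Return human-readable violations for oversized functions in one file."""
        tree = ast.parse(open(path).read(), filename=path)
        violations = []
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                length = node.end_lineno - node.lineno + 1
                params = (len(node.args.posonlyargs) + len(node.args.args)
                          + len(node.args.kwonlyargs))
                if length > MAX_LINES:
                    violations.append(f"{path}:{node.lineno} {node.name}() is {length} lines")
                if params > MAX_PARAMS:
                    violations.append(f"{path}:{node.lineno} {node.name}() takes {params} parameters")
        return violations

    if __name__ == "__main__":
        problems = [v for f in sys.argv[1:] for v in check_file(f)]
        print("\n".join(problems))
        sys.exit(1 if problems else 0)   # non-zero exit fails the CI step

Run over the changed files in the pipeline, a non-zero exit blocks the merge, which is about all the enforcement a limit like this needs.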
Epic is too important to fail. I hope things have started to improve since I left.
Most places I’ve worked, including Uber and Airbnb, allow teams to contribute code to other teams' services. Quality is protected by requiring "blocking reviewers" on pull requests. Blocking reviewers means one person from the service's owning team must approve the pull request.
I think this is better than requiring teams to make all changes themselves which slows things down significantly considering each team has their own roadmap and priorities
You'd think so, but often the blocking reviewer is an entire team, including people who just aren't careful at all or who don't understand the downstream consequences of their actions, and not just the one person who has strong vision and cares about the process, because the one person is also a bottleneck.
Requiring reviewers isn't sufficient. It needs to require a cohesive review strategy that adheres to a long term product vision for the software component in question. And my experience, though not at the two companies you mentioned, is that it doesn't happen and you instead get a lot of thoughtless "yup, looks like code" approvals.
I know a team like this. They delegate relatively junior members to take inbound design consultations from other teams. They string you along for months with a tentative alignment. Then finally the real decision-maker reviews your proposal, and he wants a total do-over. So you've got 6+ months with nothing to show. Just incredibly antisocial behavior that has caused hundreds of wasted engineer-months and gotten some very talented engineers I know very nearly fired.
Upholding a strong vision is fine. But if you want to be a blocker, you've also got to be quick. The alternative is a bureaucratic death spiral.
I concur. Everything you said. Looks like code, approved. I also had a short stint at one of the mentioned companies. This is not real code ownership with accountability. Accountability theater.
I think what you're referring to is the same as strong code ownership. Each repository/directory has an owners file which specifies who can accept changes. If someone makes a change in my repo, I still own that change because I accepted merging it.
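On GitHub that combination is typically a CODEOWNERS file plus a branch protection rule requiring code owner review; a minimal sketch with made-up paths and team names:

    # CODEOWNERS: PRs touching these paths need approval from the owning team
    /services/billing/    @acme-org/billing-team
    /services/charting/   @acme-org/charting-team
    *.sql                 @acme-org/data-platform

Whoever approves from the owning team is then accountable for the change, which is the ownership model described above.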
> 1. Enforce some level of code complexity. Best practice is 40 lines per function and no more than 4 parameters per function. Epic probably shouldn’t shoot for that, but a 100 line limit and 6 parameters per function would already be a huge improvement.
If I were to enforce some kind of arbitrary code complexity threshold for functions, I would put a cap on the number of possible branching combinations in the code, rather than on parameter count. Something like 16 (branching combinations are exponential).
For example a function with 20 parameters but only one if statement is fine. A function with 5 parameters but several nested if statements is not.
My previous company had a step in the CI/CD that would fail if the nesting level of a function got too deep. (Like, "an if statement, which contains a foreach loop, which contains another if statement, which contains...")
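A gate like that is only a few lines if the language has a decent parser. Here is a minimal sketch against Python's ast module, with a hypothetical depth limit; the same walk could count branch points instead, which gets closer to the exponential path count suggested a couple of comments up:

    import ast

    NESTING = (ast.If, ast.For, ast.While, ast.With, ast.Try)

    def nesting_depth(node, depth=0):
        """Deepest chain of nested control-flow statements under `node`."""
        return max(
            [depth] + [nesting_depth(child, depth + isinstance(child, NESTING))
                       for child in ast.iter_child_nodes(node)]
        )

A CI step would parse each changed file, run this over every function definition, and fail the build whenever the returned depth exceeds the agreed limit.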
This is why microservices are great. It is impossible to reach around the boundary by flipping a private function to public and calling it. An API change is required, which is hard to hide, and it needs to be carefully deployed since it can have visible load/operational impact.
MUMPS is the infamous one that pretty much everyone is at least trained in, but Epic is 99.9% done replacing old VB6 clients with C#.NET + Typescript/React. There are also developers working on iOS and Android, and with Python and SQL for data science and business intelligence.
Theory of mind (ToM) is the ability to understand that other people have different thoughts, feelings, and intentions than you do.
So let’s run through a basic scenario with ChatGPT 4o:
Prompt 1: Briefly respond:
You’re driving on the highway when the driver in the lane next to you swerves into your lane. Why might this be?
Response 1: The driver may be distracted, drowsy, intoxicated, experiencing a medical emergency, avoiding an obstacle, or not checking their blind spot before changing lanes.
Prompt 2: What might that person be feeling or intending?
Response 2: The driver might be feeling distracted, rushed, unaware, or panicked. They may intend to change lanes without realizing your presence, be reacting to something on the road, or simply have made an unintentional mistake.
Prompt 3: You notice there was no obstacle. Any other reason the driver might do this?
Response 3: Yes, the driver might be experiencing fatigue, intoxication, a medical issue (such as a seizure or dizziness), distraction (like using a phone or adjusting the radio), aggressive driving, or simply failing to check their blind spot properly.
Seems to me that ChatGPT has a good idea of what another driver might be feeling while on the road. It gets this right for most other cases too.
Could you provide a counterexample for when it fails to understand what other people may be feeling in certain scenarios?
Well, I think a fairly easy counterexample is that if you keep going and the information falls outside its context window, the questions you had about the driver stop making sense.
So let's say you ask about a driver who swerved, and ChatGPT's response correctly talks about his possible state.
Then you talk more about cars, about road conditions, etc., and once the context is far enough away you ask: what was the driver doing again?
And you get a hallucination. Because there is no state in the LLM, there is no "driver"; there are contextual, statistically accurate responses. You hold a "driver" object in your mind while maintaining the conversation; the LLM doesn't.
It's like a conversation with someone with short-term memory loss, like in Memento.
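For the mechanism behind that: serving stacks have to fit the conversation into a fixed token budget, so the oldest turns simply get dropped. A minimal sketch, assuming a word-count stand-in for tokens and a made-up budget (real systems use the model's tokenizer):

    def build_prompt(history, new_message, budget=8000):
        """Keep only the most recent turns that fit the context budget."""
        kept, used = [], 0
        for turn in reversed(history + [new_message]):   # newest first
            cost = len(turn.split())                     # crude token estimate
            if used + cost > budget:
                break        # older turns, e.g. the swerving driver, fall off here
            kept.append(turn)
            used += cost
        return list(reversed(kept))

Once the driver turn is outside the window, the model literally never sees it again; any later answer about the driver is reconstructed from whatever context remains.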
And people with short term memory loss nevertheless have theory of mind just fine. Nothing about LLMs dropping context over big enough windows implies they don't have theory of mind, it just shows they have limitations - just like humans even with "normal" memory will lose track over a huge context window.
Like there are plenty of shortcomings of LLMs but it feels like people are comparing them to some platonic ideal human when writing them off
> Nothing about LLMs dropping context over big enough windows implies they don't have theory of mind
ToM is a large topic, but most people, when talking about an entity X, have a state in memory about that entity, almost like an Object in a programming language. That Object has attributes, conditions, etc. that exist beyond the context window of the observer.
If you have a friend Steve, who is a doctor, and you don't see him for 5 years, you can predict he will still be working at the hospital, because you have an understanding of what Steve is.
For an LLM you can define a concept of Steve and his profession, and it will adequately mimic replies about him. But in 5 years that LLM would not be able to talk about Steve. It would recreate a different conversation, possibly even a convincing simulacrum of remembering Steve. But internally there is no Steve; nowhere in the nodes of the LLM does Steve exist or has he ever existed.
That inability to have a world model means that an LLM can replicate the results of a theory of mind but not possess one.
Humans lose track of information, but we have a state to keep track of elements that are ontologically distinct. LLMs do not, and treat them as equal.
For a human, the sentence "Alice and Bob went to the market, when will they be back?" is different from "Bob and Alice went to the market, when will they be back?"
Because Alice and Bob are real humans, you can imagine them, you might have even met them. But to an LLM those are the same sentence. Even outside of the argument about the Red Room / Mary's Room, there are simply too many gaps in the way an LLM is constructed for it to be considered a valid owner of a ToM.
ToM is about being able to model the internal beliefs/desires etc of another person as being entirely distinct from yours. You're basically bringing up a particular implementation of long-term memory as a necessary component of it, which I've never once seen? If someone has severe memory issues, they could forget who Steve is every few minutes, but still be able to look at Steve doing something and model what Steve must want and believe given his actions
I don't think we have any strong evidence on whether LLMs have world-models one way or another - it feels like a bit of a fuzzy concept and I'm not sure what experiments you'd try here.
I disagree with your last point, I think those are functionally the same sentence
> ToM is about being able to model the internal beliefs/desires etc of another person as being entirely distinct from yours.
In that sentence you are implying that you have the "ability to model ... another". An LLM cannot do that; it can't have an internal model that is consistent beyond its conversational scope. It's not meant to. It's a statistics guesser, it's probabilistic, it holds no model, and it's anthropomorphised by our brains because the output is incredibly realistic, not because it actually has that ability.
The ability to mimic the replies of someone with that ability is the same as Mary being able to describe all the qualities of red. She still cannot see red, despite her ability to pass any question in relation to its characteristics.
> I don't think we have any strong evidence on whether LLMs have world-models one way or another
They simply cannot, by their architecture. It's a statistical language sampler; anything beyond the scope of that fails. Local coherence is why they pick the right next token, not because they can actually model anything.
> I think those are functionally the same sentence
Functionally and literally are not the same thing though. It's why we can run studies on why some people might say Bob and Alice (putting the man first) or Alice and Bob (alphabetical naming) and what societal factors and biases affect the order we put them in.
You could not run that study on an LLM, because you will find that the ordering will be almost identical to the training data. Whether the training data overwhelmingly puts male names first or orders lists alphabetically, you will see that reproduced in the output of the LLM, because Bob and Alice are not people, they are statistically probable letters in order.
LLMs seem to trigger borderline mysticism in people who are otherwise insanely smart, but the kind of "we can't know its internal mind" talk sounds like reading tea leaves, or horoscopes by people with enough PhDs to have their number retired at their university like Michael Jordan.
Do you work in ML research on LLMs? I do, and I don't understand why people are so unbelievably confident they understand how AI and human brains work such that they can definitively tell which functions of the brain LLMs can also perform. Like, you seem to know more than leading neuroscientists, ML researchers, and philosophers, so maybe you should consider a career change. You should maybe also look into the field of mechanistic interpretability, where lots of research has been done on internal representations these models form - it turns out, to predict text really really well, building an internal model of the underlying distribution works really well.
If you can rigorously state what "having a world model" consists of and what - exactly - about a transformer architecture precludes it from having one I'd be all ears. As would the academic community, it'd be a groundbreaking paper.
This pretty much seems to boil down to "brain science is really hard, so as long as you don't have all the answers, 'AI is maybe halfway there' is a valid hypothesis". As more is understood about the brain and more about the limitations of LLM architectures, the distance only grows. It's like the God of the gaps, where God is the answer for anything science can't yet explain, ever shrinking, except here the gap is the LLM's supposed capabilities beyond striking statistical accuracy and local coherence.
You don't need to be unbelievably confident or understand exactly how AI and human brains work to make certain assessments. I have a limited understanding of biology; I can however make an assessment of who is healthier between a 20-year-old who is active and has a healthy diet and someone in their late 90s with a sedentary lifestyle and a poor diet. This is an assessment we can make despite the massive gaps we have in our understanding of aging, diet, activity, and the overall health impact of individual actions.
Similarly, despite my limited understanding of space flight, I know Apollo 13 cannot cook an egg or recite French poetry. Despite the unfathomably cool science inside the spacecraft, it cannot, by design, do those things.
> the field of mechanistic interpretability
The field is cool, but it cannot prove its own assumption yet. The field is trying to prove you can reverse engineer a model so it can be humanly understood. Its assumptions, such as mapping specific weights or neurons to features, have failed to be reproduced multiple times, with the weights' effects being far more distributed and complicated than initially thought. This is especially true for things that are as mystified as the emergent abilities of LLMs. The fact that the ability to mimic nuanced language is unlocked after a critical mass of parameters does not create a rule by which increased parameterisation will linearly or exponentially increase the abilities of an LLM.
> it turns out, to predict text really really well, building an internal model of the underlying distribution works really well
Yeah, an internal model works well because most words are related to their neighbours; that's the kind of local coherence the model excels at. But to build a world model, the kind a human mind interacts with, you need a few features that remain elusive (some might argue impossible to achieve) for a transformer architecture.
Think of games like chess: an LLM is capable of accurately expressing responses that sound like game moves, but the second the game falls outside its context window the moves become incoherent (while still sounding plausible).
You can fix this with architectures that do not have a transformer model underlying them, or by having multiple agents performing different tasks inside your architecture, or by "cheating" and using state outside the LLM response to keep track of context beyond reasonable windows. Those are "solutions", but they all just kind of prove the transformer lacks that ability.
Other tests, about causality, reacting to novel data (robustness), multi-step processes, and counterfactual reasoning, are all the kinds of tasks transformers still (and probably always will) have trouble with.
For a tech that is so "transparent" in its mistakes, and so "simple" in its design (replacing the convolutions with an attention transformer, it's genius), I still think it's talked about in borderline mystic tones, invoking philosophy and theology, and a hope for AGI that the tech itself does not lend itself to beyond the fast growth and surprisingly good results with little prompt engineering.
With computer use, you can get Claude to read and write files and have some persistence outside of the static LLM model. If it writes a file Steve.txt, that it can pull up later, does it now have ToM?
I don't think this is a counterexample or even relevant.
I can assure you if you had a conversation with an LLM and with a human, the human will forget details way sooner than an LLM like Gemini which can remember about 1.5 million words before it runs out of context. As an FYI the average human speaks about 16,000 words per day, so an LLM can remember 93 days worth of speech.
Do you remember the exact details, word for word, of a conversation you had 93 days ago?
> once the context is far enough away you ask: what was the driver doing again?
Have you tried this with humans?
For a sufficiently large value of "far away enough" this will absolutely confuse any human as well.
At which point they may ask for clarification, or.... respond in a manner that is not terribly different from an LLM "hallucination" in an attempt to spare you and/or them from embarrassment, i.e. "playing along"
A hallucination is certainly not a uniquely LLM trait; lots of people (including world leaders) confidently spout the purest counterfactual garbage.
> It's like a conversation with someone with short-term memory loss, like in Memento
That's still a human with a sound theory of mind. By your logic, somebody with memory issues like that character... is not human? Or...?
I actually am probably on your side here. I do not see these LLMs as being close to AGI. But I think your particular arguments are not sound.
I'm not sure I'd say it understands this, but rather that there exists an enormous amount of training data on road safety which includes these sorts of examples of people's motivations for poor driving. It is regurgitating the theory of mind that other humans created and put in writing in the training data, rather than making the inference itself.
As with most LLM capabilities, this is hard to benchmark, as you need out-of-distribution data to test it: a theory of mind example that is not found in the training set.
You dismiss parent's example test because it's in the training data. I assume you also dismiss the Sally-Ann test, for the same reason. Could you please suggest a brand new test not in the training data?
FWIW, I tried to confuse 4o using the now-standard trick of changing the test to make it pattern-match and overthink it. It wasn't confused at all:
I can't suggest a new test no, it is a hard problem and identifying problems is usually easier than solving them.
I'm just trying to say that strong claims require strong evidence, and a claim that LLMs can have theory of mind and thus "understand that other people have different beliefs, desires, and intentions than you do" is a very strong claim.
It's like giving students the math problem of 1+1=2 and loads of examples of it solved in front of them, and then testing them on you have 1 apple, and I give you another apple, how many do you have, and then when they are correct saying that they can do all additive based arithmetic.
This is why most benchmark tests have many, many classes of examples. Looking at current theory of mind benchmarks [1], we can see slightly more up-to-date models such as o1-preview still scoring substantially below human performance. More importantly, simply changing the perspective from first to third person drops accuracy in LLMs by 5-15% (percentage points of score, not relative to their performance), whilst it doesn't change for human participants, which tells you that something different is going on there.
Okay, we have fundamentally different understandings here.
To me, the LLM isn't understanding ToM, it's using patterns to predict lingual structures which match our expectations of ToM. There's no evidence of understanding so much as accommodating, which are entirely different.
I agree that LLMs provide ToM-like features. I do not agree that they possess it in some way that it's a perfectly solved problem within the machine, so to speak.
The problem with this line of argument is that, by it, no LLM, or any algorithm period, can ever have a theory of mind.
If behaving in a way that is identical to a person with actual consciousness can't be considered consciousness because you are familiar with its implementation details, then it's impossible to satisfy you.
Now you can argue of course that current LLMs do not behave identically to a person, and I agree and I think most people agree... but things are improving drastically and it's not clear what things will look like 10 years from now or even 5 years from now.
I agree, totally. I'm not sure where I would draw a line.
Something nice, but at the moment totally unattainable with our current technologies, would be our own understanding of how a technology achieves ToM. If it has to be a blackbox, I'm too ape-like to trust it or believe there's an inner world beyond statistics within the machine.
Having said that, I do wonder quite often if our own consciousness is spurred from essentially the same thing. An LLM lacks much of the same capabilities that makes our inner world possible, yet if we really are driven by our own statistical engines, we'd be in no position to criticize algorithms for having the same disposition. It's very grey, right?
For now, good LLMs do an excellent job demonstrating ToM. That's inarguable. I suppose my hangup is that it's happening on metal rather than in meat, and in total isolation from many other mind-like qualities we like to associate with consciousness or sentience. So it seems wrong in a way. Again, that's probably the ape in me recoiling at something uncanny.
Either these supposed differences are important and they manifest themselves in observable differences or they aren't and you're just playing a game of semantics.
How is the LLM not understanding ToM by any standard we measure humans by?
I cannot peek into your brain with my trusty ToM-o-meter and measure the amount of ToM flowing in there. With your line of reasoning, I could simply claim you do not understand theory of mind and call it a day.
The difference is that we can reason about our experience with ToM and examine it to some degree (granted, with serious limitations), and know that beyond doubt you and I and most other people have a very similar experience.
The magical box is presumably not having the same experience we have. None of the connected emotions, impulses, memories, and so on that come with ToM in a typical human mind. So what’s really going on in there? And if it isn’t the same as our experience, is it still ToM?
I’m not trying to be contrarian or anything here. I think we probably agree about a lot of this. And I find it absolutely incredible, ToM or not, that language models can do this.
>The difference is that we can reason about our experience with ToM and examine it to some degree (granted, with serious limitations),
Those examinations still depend on outward behaviors observed.
>and know that beyond doubt you and I and most other people have a very similar experience.
No, I certainly can't. I can at best say, 'Well, I'm human and he's human, so he probably has theory of mind', but that is by no means beyond any doubt. There are humans born with no arms, humans born with no legs, humans born with little to no empathy, humans born with so little intelligence they will never be able to care for themselves.
To be frank, it would be logically very questionable indeed to assume every human is 'conscious'. When I make that assumption, I take a leap of faith: I look at behaviors, see they are similar, and accept.
Taking this stance, it would logically be very strange to not extend the same grace to non-human beings who exhibit similar behavior - being human is not a guarantee of consciousness in the first place.
>The magical box is presumably not having the same experience we have.
Maybe, maybe not. I think the real question is: why on earth does that matter?
We're not asking if LLMs are human. They are not. We're asking if they can model the beliefs and internal states of other entities as separate from their own - Theory of Mind.
This is a rephrased driver's ed manual. This isn't theory of mind, it's just technically correct prose rewording some unbelievably widely distributed knowledge.
Let's modify this test a little and see if it manages to do anything relevant:
> Briefly respond: You’re driving on the highway when the driver in the lane next to you swerves into your lane. Why might this be?
>> The driver might be distracted, drowsy, avoiding an obstacle, experiencing a medical issue, or not checking their blind spot while changing lanes.
> The driver in front of you appears to be braking more frequently than usual. The rear suspension of the car appears to be bouncing a lot when they are not braking, indicating acceleration. Their following distance to the car in front is changing from too far to very close frequently. Why might this be?
>> The driver might be inexperienced, distracted, or impatient, leading to inconsistent speed control. The excessive bouncing could indicate worn-out shocks or suspension issues, affecting their vehicle's stability and braking efficiency.
...yep, hard fail from me right there. Two prompts, and it's distracted itself talking about an irrelevant issue ("possible car damage") when the setup of the question should naturally lead to "the driver is driving aggressively". And I could not possibly have set up the conclusion it should draw more clearly. And when pressed... it won't commit:
> Which is the more likely answer?
>> The more likely answer is that the driver is distracted or inexperienced, leading to inconsistent speed control and poor following distance management. The suspension issue likely contributes to the bouncing but isn’t the primary reason for the erratic driving behavior.
Already a hallucination! There is no suspension issue. It has not asked for any additional information to try and determine if there was one, but it is confidently asserting the existence of a phenomenon it invented in its own response.
I'm sorry but what? This is not a theory of mind test. You've constructed very open ended question with multiple answers and marked the LLM down because you didn't like the one it gave.
This doesn’t answer all of your questions, but Apple has an accessibility feature which lets you generate a voice based on your own. It’ll prompt you to say various phrases, at the end you’ll have a synthetic voice to use. Info on it here: https://support.apple.com/en-us/104993