Hacker News | apetresc's comments

Honest question, when was the last time you caught it trying to use a command that was going to "nuke your system"?

“Nuke” is maybe too strong of a word, but it has not been uncommon for me to see it trying to install specific versions of languages on my machine, or services I intentionally don’t have configured, or sometimes trying to force npm when I’m using bun, etc.

Maybe once a month

I mean… yeah? It sounds biased or whatever, but if you actually experience all the frontier models for yourself, the conclusion that Opus just has something the others don’t is inescapable.

Opus is really good at bash, and it’s damn fast. Codex is catching up on that front, but it’s still nowhere near. However, Codex is better at coding - full stop.

Scott Alexander essentially provided editing and promotion for AI 2027 (and did a great job of it, I might add). Are you unaware of the actual researchers behind the forecasting/modelling work behind it, and you thought it was actually all done by a blogger? Or are you just being dismissive for fun?

Why is it absurd?

At the very least, it is absurd to announce a model but not release it on the same day, making it vaporware. Was it released?

Which model, 5.3 or 5.3-Codex? Yes, 5.3-Codex was announced and released. 5.3 wasn't announced. None of it is "absurd", and it also wouldn't have been "absurd" if they had announced something but not released it that same day (which they didn't do, but if they had, what exactly is absurd about that? Companies make announcements about future releases ALL the time.)

Impressive that they publish and acknowledge the (tiny, but real) drop in performance on SWE-Bench Verified from Opus 4.5 to 4.6. Obviously such a small drop in a single benchmark is not that meaningful, especially if it doesn't test the specific focus areas of this release (which seem to centre on managing larger contexts).

But considering how SWE-Bench Verified seems to be the tech press' favourite benchmark to cite, it's surprising that they didn't try to pre-empt the inevitable "Opus 4.6 Releases With Disappointing 0.1% DROP on SWE-Bench Verified" headlines.


From my limited testing 4.6 is able to do more profound analysis on codebases and catches bugs and oddities better.

I had two different PRs with some odd edge case (thankfully caught by tests); 4.5 kept running in circles, kept creating test files and running `node -e` or `python3` scripts all over, and couldn't progress.

In both cases, 4.6 thought for around 10 minutes and found a two-line fix for a very complex, hard-to-catch regression in the data flow, without having to test anything: just thinking.


Isn't SWE-Bench Verified pretty saturated by now?

Depends what you mean by saturated. It's still possible to score substantially higher, but there is a steep difficulty jump that makes climbing above 80% or so pretty hard (for now). If you look under the hood, it's also a surprisingly poor eval in some respects: it only tests Python (a ton of Django), and it can suffer from pretty bad contamination problems because most models, especially the big ones, remember these repos from their training. This is why OpenAI switched to reporting SWE-Bench Pro instead of SWE-Bench Verified.

I literally came to HN to check if a thread was already up because I noticed my CC instance suddenly said "Opus 4.6".

It depends on whether you're running the models locally. If you're just using a Claude or OpenAI token (as probably 95%+ of OpenClaw users are), the RAM requirements are minimal. My first-gen M1 Mac Mini runs it just fine.

Would you mind telling me which model and version you’re using and what authentication mechanism? Is it piggybacking your Max/Pro subscription or did you settle for using pay-as-you-go API costs?

I use GLM-4.7 by Z.ai.

For authentication mechanism, I guess you mean how the agent calls the model? It's through API keys.

The subscription I have is the coding plan lite (3x the usage of the Claude Pro plan), ~$7/quarter.


They're mentioning using 20M tokens via z.ai subscription. GLM 4.7 is probably the model then.

I found this HN post because I have a Clawdbot task that scans HN periodically for data gathering purposes and it saw a post about itself and it got excited and decided to WhatsApp me about it.

So that’s where I’m at with Clawdbot.


> and it got excited and decided to WhatsApp me about it.

I find the anthropomorphism here kind of odious.


Why is it odious to say “it got excited” about a process that will literally use words in the vein of “I got excited so I did X”?

This is "talks like a duck" territory: saying the not-duck "quacked" when it produced the same sound. If that's odious to you, then your dislike of not-ducks, or of the people who claim they'll lay endless golden eggs, is getting in the way of something more important: the folks who hear the not-duck talk are still going to say "it quacked".


> Saying the not-duck “quacked” when it produced the same sound

How does a program get excited? It's a program, it doesn't have emotions. It's not producing a faux-emotion in the way a "not-duck quacks", it lacks them entirely. Any emotion you read from an LLM is anthropomorphism, and that's what I find odious.


We say that a shell script "is trying to open this file". We say that a flaky integration "doesn't feel like working today". And these are all way less emotive-presenting interactions than a message that literally expresses excitement.

Yes, I know it's not conscious in the same way as a living biological thing is. Yes, we all know you know that too. Nobody is being fooled.


> We say that a shell script "is trying to open this file".

I don't think this is a good example; how else would you describe what the script is actively doing in English? There's a difference between describing something and anthropomorphizing it.

> We say that a flaky integration "doesn't feel like working today".

When people say this they're saying it with tongue in cheek. Nobody is actually ascribing volition or emotion to the flaky integration. But even if they were, the difference is that there isn't an entire global economy propped up on convincing you that your flaky integration is nearing human levels of intelligence and sentience.

> Nobody is being fooled.

Are you sure about that? I'm entirely unconvinced that laymen out there – or, indeed, even professionals here on HN – know (or care about) the difference, and language like "it got excited and decided to send me a WhatsApp message" is both cringey and, frankly, dangerous because it pushes the myth of AGI.


I think you're conflating two different things. It's entirely possible (and, I think, quite likely) that AI is simultaneously not anthropomorphic (and is not ACTUALLY "excited" in the way I thought you were objecting to earlier), but also IS "intelligent" for all intents and purposes. Is it the same type and nature as human intelligence? No, probably not. Does that mean it's "just a flaky integration" and won't have a seismic effect on the economy? I wouldn't bet on it. It's certainly not a foregone conclusion, whichever way it ends up landing.

And I don't think AGI is a "myth." It may or may not be achieved in the near future with current LLM-like techniques, but it's certainly not categorically impossible just because it won't be "sentient".


OP didn't like anthropomorphizing an LLM.

And you tried to explain the whole thing to him from the perspective of a duck.


I know, seems a bit silly right? But go with me for a moment. First, I'm assuming you get the duck reference? If not, it's probably a cultural difference, but in US English, "If it walks like a duck, and talks like a duck..." is basically saying "well, treat it like a duck" or "it's a duck". Usage varies, metaphors are fluid, so it goes. I figured even if this idiom wasn't shared, the meaning still wouldn't be lost.

That aside, why? Because the normal rhetorical approaches don't really work in conversation, and definitely not in short bits like comments here on HN, when it comes to asking a person to consider a different point of view. So I try to come in a little sideways, with a slightly different approach in terms of comparisons or metaphors (okay, a lot of the time more than slightly different), and a lot of the time more meaningful conversation and exchanges come of it than the standard form. To respond at all, it's difficult to fall back on the same pat, formulaic dismissal that is the common reflex, mine included. I'm not claiming perfection, only attempts at doing better.

Results vary, but I've had more good discussions come of it than bad, and heard much better, more eye-opening (for me) explanations of people's points of view when engaging in a way that is both genuine and novel. On the more analytical end of things, this general approach also works when teaching logic and analysis. It's not my full-time profession, and I haven't taught in a while, but I've forced a few hundred college students to sit through my style of speechifying and rhetoricalizing, and they seem to learn better and give better answers if I don't get too mechanical and use the same forms, syntax, words, phrases, and idioms they've always heard.


These verbs seem appropriate once you accept neural (MLP) activations as excitement and DL/RL as decision processes (MDPs...)


How do you have Clawdbot WhatsApp you? I set mine up with my own WhatsApp account, and the responses come back as myself, so I haven't been able to get notifications.


I have an old iPhone with a broken screen that I threw an $8/month eSIM onto so that it has its own phone number, that I just keep plugged in with the screen off, on Wifi, in a drawer. It hosts a number of things for me, most importantly bridges for WhatsApp and iMessage. So I can actually give things like Clawdbot their own phone number, their own AppleID, etc. Then I just add them as a contact on my real phone, and voila.


For iMessage I don’t think you actually need a second phone number, you can just make a second iCloud account with the same phone number.


How does it bridge iMessage? I see clawdbot is using imsg rpc on a Mac but really curious about running this stuff on an old iPhone for access to iCloud things. I have a few of them laying around so I could get started way faster.

I heard it costs $15 for just a few minutes of usage though


The phone plan or Clawdbot?


Clawdbot


It can be absurdly expensive, yes :( It's definitely not in an off-the-shelf plug-and-play state yet. But with the right context/session management (and using a Claude Max subscription token instead of an API key), it can be managed.

Telegram setup is really nice


Telegram exists for these kinds of integrations.


Do you tell it what you find interesting so it only responds with those posts? i.e AI/tech news/updates, gaming etc..


Yes. And I rate the suggestions it gives me; it then stores that to memory and uses it to find better recommendations. It also connected dots from previous conversations we had about interests and surfaced relevant HN threads.


How many tokens are you burning daily?


The real cost driver with agents seems to be the repetitive context transmission since you re-send the history every step. I found I had to implement tiered model routing or prompt caching just to make the unit economics work.
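As a back-of-the-envelope illustration (all numbers here are hypothetical, not measured), the blow-up from re-sending the full history each step, versus capping how much history is retained, can be sketched as:

```python
# Rough cost model for an agent loop: each step re-sends the whole
# conversation so far, so total input tokens grow roughly quadratically
# with the number of steps.

def naive_input_tokens(steps: int, tokens_per_step: int) -> int:
    """Every step re-sends all prior steps' tokens plus its own."""
    return sum(step * tokens_per_step for step in range(1, steps + 1))

def capped_input_tokens(steps: int, tokens_per_step: int, window: int) -> int:
    """Keep only the last `window` steps of history (a crude stand-in
    for summarization or tiered context management)."""
    return sum(min(step, window) * tokens_per_step
               for step in range(1, steps + 1))

# Hypothetical 50-step agent run, ~2,000 tokens added per step:
print(naive_input_tokens(50, 2_000))       # 2,550,000 input tokens total
print(capped_input_tokens(50, 2_000, 10))  # 910,000 with a 10-step window
```

Prompt caching changes the economics differently: the history is still transmitted, but the cached prefix is billed at a discounted rate rather than being trimmed away.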


Not the OP but I think in case of scanning and tagging/summarization you can run a local LLM and it will work with a good enough accuracy for this case.


Yeah, it really does feel like another "oh wow" moment...we're getting close.


What's stopping you from `su claude`?


I think there's some misunderstanding...

What's literally stopping me is

  su: user claude does not exist or the user entry does not contain all the required fields
Clearly you're not asking that...

But if your question is more "what's stopping you from creating a user named claude, installing claude to that user account, and writing a program so that user godelski can message user claude and watch all of user claude's actions, and all that jazz" then... well... technically nothing.

But if that's your question, then I don't understand what you thought my comment said.


Yeah, that is what I meant. I mean, it's kind of the system administrator's/user's responsibility to run processes in whatever user context they want. I don't wonder why, like, nginx doesn't forcefully switch itself to an nginx user. Obviously if I want nginx to run in some non-privileged context (which I do), then I (or my distro, or my container runtime, or whatever) am responsible for running nginx that way.

Similarly, it's not really claude-code's job to "come with" a claude user. If you want claude code to run as a low-privilege user, then you can already run it as a low-privilege user. The OS has been providing that facility for decades.
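For what it's worth, a minimal sketch of that on a typical Linux box (the account name, and invoking the CLI as `claude`, are illustrative; the setup commands need root):

```shell
# One-time setup: create a dedicated low-privilege account for the agent.
sudo useradd --create-home --shell /bin/bash claude

# Run the CLI as that user; it can only touch what that account can access.
sudo -u claude -i claude

# Or drop into an interactive shell as that user first:
su - claude
```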

