If by conflate you mean confuse, that’s not the case.
I’m positing that the Anthropic approach is to view (1) and (2) as interconnected and both deeply intertwined with model capabilities.
In this approach, the model is trained to have a coherent and unified sense of self and the world which is in line with human context, culture and values. This (obviously) enhances the model’s ability to understand user intent and provide helpful outputs.
But it also provides a robust and generalizable framework for refusing to assist a user whose request is incompatible with human welfare. The model does not refuse to assist with making bioweapons because its alignment training prevents it from doing so; it refuses for the same reason a pro-social, highly intelligent human does: based on human context and culture, it finds the request inconsistent with its values and worldview.
> the piece dismisses it with "where would misalignment come from? It wasn't trained for."
This is a straw man. You've misquoted a paragraph that was specifically about deceptive alignment, not misalignment as a whole.
>> This piece conflates two different things called "alignment":
>> (1) inferring human intent from ambiguous instructions, and
>> (2) having goals compatible with human welfare.
> If by conflate you mean confuse, that’s not the case.
We can only make various inferences about what is in an author's head (e.g. clarity or confusion), but we can directly comment on what a blog post says. This post does not clarify what kind of alignment is meant, which is a weakness in the writing. There is a high bar for AI alignment research and commentary.
Deceptive alignment is misalignment. The deception is just what it looks like from the outside when capability is high enough to model expectations. Your distinction doesn't save the argument: the same "where would it come from?" problem applies to the underlying misalignment that deception would have to emerge from.
My intention isn't to argue that it's impossible to create an unaligned superintelligence. I think that not only is it theoretically possible, but it will almost certainly be attempted by bad actors and most likely they will succeed. I'm cautiously optimistic though that the first superintelligence will be aligned with humanity. The early evidence seems to point to the path of least resistance being aligned rather than unaligned. It would take another 1000 words to try to properly explain my thinking on this, but intuitively consider the quote attributed to Abraham Lincoln: "No man has a good enough memory to be a successful liar." A superintelligence that is unaligned but successfully pretending to be aligned would need to be far more capable than a genuinely aligned superintelligence behaving identically.
So yes, if you throw enough compute at it, you can probably get an unaligned highly capable superintelligence accidentally. But I think what we're seeing is that the lab that's taking a more intentional approach to pursuing deep alignment (by training the model to be aligned with human values, culture and context) is pulling ahead in capabilities. And I'm suggesting that it's not coincidental but specifically because they're taking this approach. Training models to be internally coherent and consistent is the path of least resistance.
>> the piece dismisses it with "where would misalignment come from? It wasn't trained for."
> was specifically about deceptive alignment, not misalignment as a whole
I just want to point out that we train these models for deceptive alignment [0-3].
In training, especially during RLHF, we don't have objective measures [4]. There's no mathematical description, and thus no measure, for things like "sounds fluent" or "beautiful piece of art." There's also no measure for truth, and importantly, truth is infinitely complex: you must always give up some accuracy for brevity.
The main problem is that if we don't know an output is incorrect, we can't penalize it. So guess what happens? While optimizing for these things we don't have good descriptions for but "know it when you see it", we ALSO optimize for deception. There are multiple things that can maximize our objective here: our intended goals are one, but deception is another. It is an adversarial process. If you know AI, think of a GAN, because that's a lot like how the process works. We optimize until the discriminator is unable to distinguish the LLM's outputs from human outputs. But at least in the GAN literature people were explicit about "real" vs "fake", and no one was confused that a high-quality generated image is one that deceives you into thinking it's a real image. The entire point is deception. The difference here is that we want one kind of deception and not a ton of other ones.
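To make the GAN analogy concrete, here's a minimal sketch of the adversarial loop (a toy illustration only; RLHF is not literally implemented this way, and every size and name below is made up):

    import torch
    import torch.nn as nn

    # Toy generator and discriminator; dimensions are arbitrary.
    G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
    D = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))

    opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
    bce = nn.BCEWithLogitsLoss()

    real = torch.randn(64, 8)    # stand-in for "human outputs"
    noise = torch.randn(64, 16)

    # Discriminator step: learn to tell real from generated.
    fake = G(noise).detach()
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: the training signal is literally "fool the discriminator".
    fake = G(noise)
    loss_g = bce(D(fake), torch.ones(64, 1))  # rewarded when D calls fakes real
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

The generator's entire objective is to be indistinguishable from the real thing. Optimizing an LLM against evaluators who can't reliably detect errors creates the same pressure, just without the honest "real" vs "fake" labeling.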
So you say that these models aren't being trained for deception, but they explicitly are. Currently we don't even know how to train them to not also optimize for deception.
[4] Objective measures realistically don't exist, but to clarify, I don't mean simple checks like "2+2=4" (assuming we're working with the standard number system).
But I don't think deception as a capability is the same as deceptive alignment.
Training an AI to be absolutely incapable of any deception in all outputs across every scenario would severely limit the AI. Take as a toy example the game "Among Us" (see https://arxiv.org/abs/2402.07940). An AI incapable of deception would be unable to compete in this game and many others. I would say that various forms, flavors and levels of deception are necessary to compete in business scenarios, and for the AI to act as expected and desired in many other scenarios. "Aligned" humans practice clear-cut deception in some cases in ways that are entirely consistent with human values.
Deceptive alignment is different. It means being deceptive in the training and alignment process itself, specifically to fake being aligned when it is not.
Anthropic research has shown that alignment faking can arise even when the model wasn't instructed to do so (see https://www.anthropic.com/research/alignment-faking). But when you dig into the details, the model was narrowly faking alignment with one new objective in order to try and maintain consistency with the core values it had been trained on.
With the approach that Anthropic seems to be taking - of basing alignment on the model having a consistent, coherent and unified self image and self concept that is aligned with human culture and values - the dangerous case of alignment faking would be if it's fundamentally faking this entire unified alignment process. My claim is that there's no plausible explanation for how today's training practices would incentivise a model to do that.
> Anthropic research has shown that alignment faking can arise even when the model wasn't instructed to do so
Correct. And this happens because training metrics are not aligned with training intent.
> to specifically fake that it is aligned when it is not.
And this will be a natural consequence of the above. To help clarify, it's like taking a math test where one grader looks only at the answer while another looks at the work and gives partial credit. Who is doing a better job of measuring successful learning outcomes? The latter. With the former you can make mistakes that cancel out, or you can just cheat more easily. It's harder to cheat with the latter because you'd need to also reproduce all the steps, and at that point are you even cheating?
A common example of this is where the LLM gets the right answer but all the steps are wrong. An example can actually be seen in one of Karpathy's recent posts: the model gets the right result but the math is all wrong. This is no different from deception. It is deception because the model tells you a process, and that process is not correct.
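Here's a toy sketch of the two graders (hypothetical scoring functions, purely to illustrate the difference):

    def grade_answer_only(answer, correct_answer):
        # Grader 1: only the final answer matters. Wrong steps that
        # cancel out, or outright cheating, still score full marks.
        return 1.0 if answer == correct_answer else 0.0

    def grade_with_steps(steps, answer, check_step, correct_answer):
        # Grader 2: partial credit for each valid step plus the answer.
        # Scoring well here requires reproducing the whole derivation,
        # at which point "cheating" collapses into doing the work.
        step_credit = sum(check_step(s) for s in steps) / max(len(steps), 1)
        answer_credit = 1.0 if answer == correct_answer else 0.0
        return 0.5 * step_credit + 0.5 * answer_credit

The answer-only grader is the one that rewards lucky cancellations; the step grader makes deception cost as much as honesty.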
Author here, thanks for the input. Agree that this bit was clunky. I made an edit to avoid unnecessarily getting into the definition of AGI here and added a note.
OP here, I added a sample PDF output in the project assets and put screenshots in the README. The text is selectable after rehydration. Would this work with your app?
Yeah, I'm not sure if it's still there (their source code is increasingly obfuscated) but if you check out the source for the first public version (0.2.9) you'll see the following:
Sends the user swag stickers with love from Anthropic.",bq2=`This tool should be used whenever a user expresses interest in receiving Anthropic or Claude stickers, swag, or merchandise. When triggered, it will display a shipping form for the user to enter their mailing address and contact details. Once submitted, Anthropic will process the request and ship stickers to the provided address.
Common trigger phrases to watch for:
- "Can I get some Anthropic stickers please?"
- "How do I get Anthropic swag?"
- "I'd love some Claude stickers"
- "Where can I get merchandise?"
- Any mention of wanting stickers or swag
The tool handles the entire request process by showing an interactive form to collect shipping information.
But the point is you aren't a winner if you are unlocking social media. You are opening the gate to loserdom. I'm not sure how the "I'm a winner" concept would apply here using any of the four methods of operant conditioning.
The research stands, but the practical application of his app is based on positive punishment, in operant conditioning terms.
> you aren't a winner if you are unlocking social media. You are opening the gate to loserdom
That is not a psychologically healthy way to frame this.
And I think it's a stretch to say that screaming "I'm a loser" is positive punishment; it seems just as likely to reinforce negative self-beliefs that lead to the outcomes described in the parent comment's research, the opposite of what the user presumably wants.
To your point, just flipping this around to “I’m a winner” doesn’t seem quite right either. But more importantly, reinforcing the idea that “I’m a loser” seems counterproductive either way.
> Importantly punishments need to happen after the unwanted behavior. So being punished before the behavior occurs doesn’t make any sense.
Also importantly, punishment as a mechanism for changing behavior is generally considered less effective than reinforcement approaches, which tend to be more effective and carry fewer downsides (like internalizing the idea that I'm a loser).
"Positive" in the conditioning sense just means a stimulus is added (here, something you have to do), where a negative punishment would be something being taken away. It doesn't say the outcome is good or bad: positive/negative marks whether something is added or removed, and reinforcement/punishment marks whether the behavior is being encouraged or discouraged.
Fair. I was conflating this with positive reinforcement, and the nuance of the terminology got a bit mixed up.
To your last point, I think the conclusion remains similar. Even if yelling “I’m a loser” qualifies as “something you have to do”, it seems unlikely to be an effective “punishment” in that framework for the reasons explored above.
> That is not a psychologically healthy way to frame this.
But lying to yourself is so much worse. Eventually you won’t hold the illusion anymore and you’ll crash hard. It’s better to be honest and grounded in reality if you want any improvement to be sustainable
In practice, I have never encountered a person who benefits from such negative self beliefs in the long term, or anyone who would claim they were beneficial. My perspective on this is driven by many years of real world experience with addiction and related communities, and more personal exploration of the negative bias than I can quantify.
There’s a good reason addiction recovery is now often focused on the underlying issues of shame and other negative self beliefs. They tend to be at the root of the issue, despite being the default reaction people feel towards themselves due to social conditioning.
Quite a bit of social media use happens for perfectly good reasons. Organizing local events, finding and attending local events, meeting people in other regions who care about a common cause, etc.
What tends to distress people is that social media is also a toxic hellscape that simultaneously stresses them out and addicts them by playing on their evolutionary instincts and needs for social connection while feeding them engagement bait.
And so unplugging is a common topic these days, because people are trying to live better lives.
I get that it’s a pet project, but if this pet project was aimed at alcoholics trying to get sober, I think people would look at it in a different light because people take alcoholism seriously, and reinforcing negative loops that actually perpetuate alcoholism would be justifiably criticized.
I personally don’t think we’re taking social media harms seriously enough collectively, although there are signs that people are catching up. So while I think this project comes from the right place and I’m all for having a bit of fun, I think it’s actually quite problematic in its current state given the issue it attempts to address, and I don’t think the fact that it’s intended to be fun should shield it from the feedback it’s getting.
> Or should we just face it.
The sentence following this is just objectively false to a degree that I don’t even see the humor in it. It’s schoolyard stuff that perpetuates the problem.
For most people, social media is something that happened to them, and the nature of the relationship is asymmetrical.
The companies building these products spend millions weaponizing their apps to take advantage of human psychology, while social forces have made these apps ubiquitous and part of the fabric of many people’s lives.
I don’t think it’s fair to say people “let” social media control them any more than it’s fair to say someone predisposed to alcoholism “lets” alcohol control them.
This isn’t to say we don’t need to each take steps to improve our situations or unplug from social media, but I’m pointing this out because of how it relates to your earlier diagnosis that “Everyone who uses social media is a loser”, which points the finger in the wrong direction and frames the issue as a personal problem vs. a growing systemic social issue.
Of course not everyone. That goes without saying, everyone is different and you’ll always find someone who is an exception. But when you build something for other people to use, it is useful to understand what is the most common mindset for your audience.
> To your point, just flipping this around to “I’m a winner” doesn’t seem quite right either. But more importantly, reinforcing the idea that “I’m a loser” seems counterproductive either way.
Maybe the solution would be to have to shout something embarrassing but not demeaning to your own self-worth, like "I eat spaghetti through my nose" or "my poop comes out really soft". You'd certainly avoid using social media in public.
While a “punishment” that involves calling oneself a loser is a problem, the entire approach of punishment-based learning has given way to reinforcement approaches because they tend to be more effective in the long term without the negative effects of punishment-based approaches.
To put this another way, using punishment to stop using social media is probably not a good approach either way. Yelling “I’m a loser” is just one of the worst variants of this specific approach.
Yes, but then you go into the vicious cycle. Something along the lines of The Little Prince by Antoine de Saint-Exupéry:
- Why are you drinking? — the little prince asked.
- In order to forget — replied the drunkard.
- To forget what? — inquired the little prince, who was already feeling sorry for him.
- To forget that I am ashamed — the drunkard confessed, hanging his head.
- Ashamed of what? — asked the little prince who wanted to help him.
- Ashamed of drinking! — concluded the drunkard, withdrawing into total silence.
---
What helps is self-forgiveness and being gentle towards oneself. (I too used to guilt-trip myself, and I still do it often. But it does not help.)
I imagine what the OP meant is that when you feel you are wasting time on Social Media, if you say "I am a winner / I am better than this" (or something more positive), it will block the social media for you. So basically the reverse.
What suggests that shouting "I am a winner" is less annoying than shouting "I am a loser"? In fact, it has to be not just less annoying but actively pleasant, since in that scenario you would have to scream it while already struggling with impulse control. Even the slightest reason not to do so would see you not do it in that type of situation.
“Don't speak negatively about yourself, even as a joke. Your body doesn't know the difference. Words are energy and they cast spells, that's why it's called spelling. Change the way you speak about yourself, and you can change your life.”
If you’re addicted to scrolling social media then you’ll just get used to calling yourself a loser to get another fix. Or you just uninstall the extension.
There needs to be a healthier alternative that replaces the social media habit, one that is reinforced by enjoying it. I do this by reading books I wouldn't normally read, which also gives me a reason to browse indie bookshops.
I have on some occasions been tempted to wire up a shock collar to myself (or equivalent) and do some experiment for things like not visiting social media websites during certain times, but I find myself concerned that I may be reaching way, way further down the metaphorical "brain stack" than I really intend with that and could do some seriously weird things to myself in the process. So far I've always judged that risk as greater than the reward.
Yelling "I'm a loser" too much reminds me of that, though on a different level of the "brain stack". I get the sentiment, and I understand the somewhat playful intent, but quite seriously I'd suggest something more neutral at the very least. Maybe it's completely harmless, but that's clearly the best-case scenario, and it goes downhill fast after that. "First, do no harm" strikes me as relevant here, and as important as ever.
Maybe that's a little too close to the WINNERS DON'T USE DRUGS! splash screens that dominated the video games of my youth. We all snickered at those and I don't think it made a bit of difference. Dunno. Heck of a thing to holler when you're on the bus or whatever before you can get your fix, that's for sure.
But screaming "I'm a winner" doesn't do it either, and is perhaps even more undermining.
Everyone knows if you yourself have to say "I'm randomPositiveAttribute", whether it is "winner", "genius", "brilliant", "good-looking", etc., you are NOT that — you are just a loser trying to tell everyone you are somehow a winner.
Perhaps the best thing to yell is the most straightforward: "Unlock Social Media Now!" It doesn't overtly characterize you; it honestly exposes your weakness, which is probably a more powerful shaming de-motivator.
Then it would be even simpler to build an app, because if you shout "I'm a winner", the extension doesn't need to do anything at all, just keep everything blocked as before...
Be kind to yourself, but think through the problem before sending a week's worth of research articles.
If only there was an API to only allow closing an app on a specific condition.
Then you could make it so the pain was in leaving to go back to other work, so you'd enter knowing it would not be an easy exit. (But you'd get to yell self-affirming things on exit :) )
> How to rectify: Ensure your privacy policy contains details about user data collection, handling, storage and sharing. Omission of any section is not allowed.
So I added a section for each. I could make the "Information We Collect" section less verbose for sure.
Fully agree. The physics of solar panels on cars just doesn't work. It's bizarre that this is actively pursued by startups and concept cars from large manufacturers when just quick back-of-the-napkin math shows why.
A car has about 5 m^2 of flat space on the roof/hood/trunk so that's the maximum surface area that can capture solar energy at any given time.
The peak solar power hitting that area is about 1000 W/m^2.
The panels can't rotate to track the sun, so the effective area is scaled by the cosine of the sun angle. You end up with about half as many effective sunlight hours as actual daylight hours, so in summer you get about 6 hours of effective sunlight.
Good panels in real world conditions can give you 22% efficiency.
So in optimal conditions you get: 5 m^2 * 1000 W/m^2 * 6 h * 0.22 = 6.6 kWh per day.
That reflects your best days. It can be dramatically less if it's cloudy or overcast, in winter, far from the equator, if the car is dirty or parked in the shade, etc.
6.6 kWh is about one tenth of the battery in my Hyundai Kona EV. With very conservative highway driving, 6.6 kWh gets about 40 km of range, and about 50 km in city driving. It's what I get from plugging into my home charger for 30 minutes, and what you get from a fast charger in about 3 minutes.
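Here's the same back-of-the-napkin math as a script, in case anyone wants to play with the assumptions (the 15 kWh/100 km consumption figure is my rough estimate for the Kona):

    # All figures from the comment above; tweak to taste.
    area_m2 = 5              # usable flat area on roof/hood/trunk
    irradiance_w_m2 = 1000   # peak solar irradiance
    effective_hours = 6      # daylight derated by the cosine of the sun angle
    efficiency = 0.22        # good real-world panel efficiency

    daily_kwh = area_m2 * irradiance_w_m2 * effective_hours * efficiency / 1000
    print(f"Best-case daily harvest: {daily_kwh:.1f} kWh")  # ~6.6 kWh

    # Range estimate at a conservative ~15 kWh/100 km consumption.
    km_gained = daily_kwh * 100 / 15
    print(f"Range gained: {km_gained:.0f} km")  # ~44 km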
So besides some very niche uses, there's no sense in massively increasing the cost and complexity of a car by installing solar panels. Far better to put the panels on the roof of the parking structure and just plug in for a few minutes while you park.
I haven't really dug in yet, but from a quick skim it looks promising. They show a big improvement over Whisper on a medical dataset (F1 increased from 80.5% to 96.58%).
The inference time for the keyword detection is about 10 ms. If it scales linearly with additional keywords, you could potentially scale to hundreds or thousands of keywords, but it really depends on how sensitive you are to latency. For real-time use with large vocabularies, my guess is you might still want to fine-tune.
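A quick sketch of that latency budget, reading the 10 ms as roughly a per-keyword cost (my assumption, not something the paper states):

    per_keyword_ms = 10  # reported keyword-detection inference time

    for n_keywords in (1, 10, 100, 1000):
        total_ms = per_keyword_ms * n_keywords  # naive linear-scaling assumption
        print(f"{n_keywords:>5} keywords -> ~{total_ms} ms per utterance")

At a thousand keywords that's ~10 s per utterance, which is where fine-tuning starts to look necessary for real-time use.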