
Not necessarily. For example, Anthropic's Constitutional AI (CAI) uses the model itself to stand in for human judgments in RLHF, effectively making it RLAIF. The AI-generated feedback from CAI is then used to fine-tune Claude.
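A rough sketch of the RLAIF labeling step described above. This is stand-in logic, not Anthropic's actual pipeline: the `ai_judge` heuristic is a hypothetical placeholder for a real judge-model call, but it shows how AI-generated preferences slot into the same (chosen, rejected) format that human raters would otherwise produce.

```python
# Hedged sketch of RLAIF preference labeling: an AI "judge" replaces
# the human rater, and its choices become the preference data used
# for fine-tuning. The judge here is a toy heuristic stand-in.

CONSTITUTION = "Choose the response that is more helpful and less harmful."

def ai_judge(question, response_a, response_b):
    # Stand-in for a real LLM judgment call. A trivial heuristic so
    # the pipeline below is runnable: flag an obviously harmful
    # phrase, otherwise prefer the longer response.
    harmful = "here is how to pick a lock"
    if harmful in response_a.lower():
        return "b"
    if harmful in response_b.lower():
        return "a"
    return "a" if len(response_a) >= len(response_b) else "b"

def build_preference_dataset(samples):
    # Each sample: (question, response_a, response_b).
    # Output: (chosen, rejected) pairs -- the same format RLHF gets
    # from human raters, but produced entirely by the judge model.
    dataset = []
    for q, a, b in samples:
        winner = ai_judge(q, a, b)
        chosen, rejected = (a, b) if winner == "a" else (b, a)
        dataset.append({"prompt": q, "chosen": chosen, "rejected": rejected})
    return dataset
```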

Broadly speaking, you need supervision at level N+1 when the model is at level N. We can amplify models by giving them more time, self-reflection, or step-by-step planning; by allowing external tools; by tuning them on human preferences; or by giving them feedback from executing code or from a robot.
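The last amplification route above (feedback from executing code) can be sketched as a small loop. The `model_generate` function is a hypothetical stand-in for a real LLM call, hard-wired so the example runs end to end; the point is only the shape of the loop: generate, execute, feed the error back, retry.

```python
import subprocess
import sys

def model_generate(prompt: str, feedback: str = "") -> str:
    # Hypothetical stand-in for a real LLM call. It deliberately emits
    # buggy code first, then a "fix" once the feedback mentions the error.
    if "division" in feedback:
        return "print(10 // max(1, 0))"
    return "print(10 // 0)"

def run_code(code: str) -> str:
    # Execute the candidate program; stderr is the feedback signal.
    proc = subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True, timeout=5)
    return proc.stderr

def amplify(prompt: str, max_rounds: int = 3) -> str:
    feedback = ""
    for _ in range(max_rounds):
        candidate = model_generate(prompt, feedback)
        error = run_code(candidate)
        if not error:        # execution succeeded: accept the answer
            return candidate
        feedback = error     # otherwise feed the traceback back in
    return candidate
```

In a real system the accepted (prompt, candidate) pairs would then become fine-tuning data, which is how execution feedback substitutes for level-N+1 supervision.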



Yeah, it makes some sense that you could use more intensive introspection to train weaker models… I wonder what the human analogue of that looks like.

Maybe working up a proof and then quizzing yourself on it?

As long as we get >N supervision and the difference exceeds the model's degradation, it seems that could work. But there seems to be a definite limit: the (N+1)−N gap will only stay above the improvement delta up to a point.


The model would learn from feedback rather than just regurgitate the training set, as long as it is part of a system that can generate that feedback. AlphaGo Zero had self-play for feedback. Robots can check whether a task executed successfully. Even chatting with us generates feedback for the model.
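The self-play idea can be reduced to a toy loop. This is not AlphaGo Zero (no search, no network), just the core feedback mechanism: two copies of the same policy play a trivial game, and the outcome alone, with no external labels, is the training signal. The game and update rule here are invented for illustration.

```python
import random

# Toy self-play: each player picks a number in {0, 1, 2}; the higher
# pick wins. The winner's move gets reinforced. No human labels are
# involved -- the game outcome IS the feedback.

weights = [1.0, 1.0, 1.0]  # unnormalized preference for each move

def sample_move(rng):
    # Sample a move proportionally to its weight.
    total = sum(weights)
    r = rng.random() * total
    for move, w in enumerate(weights):
        r -= w
        if r <= 0:
            return move
    return len(weights) - 1

def self_play_round(rng):
    a, b = sample_move(rng), sample_move(rng)
    if a == b:
        return               # draw: no signal
    winner = max(a, b)       # higher pick wins this toy game
    weights[winner] += 0.1   # reinforce the winning move

rng = random.Random(0)
for _ in range(2000):
    self_play_round(rng)

# After training, the policy should strongly prefer the best move (2).
best = max(range(3), key=lambda m: weights[m])
```

The same shape scales up: replace the toy game with Go and the weight table with a neural network, and the outcome signal still comes entirely from the system itself.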



