
Sorry, but this is a lot of marketing for the same thing over and over again. I'm not against ALOHA as an _affordable_ platform, but skimping on hardware is kind of a bug, not a feature. Moreover, it's not even _low-cost_: its BoM is still around $20k, and collecting all the data is labor-intensive and not cheap.

And if we're focusing on the idea, it has existed since the 1950s and they were doing it relatively well then:

https://www.youtube.com/watch?v=LcIKaKsf4cM



> skimping on hardware is kind of a bug not a feature.

I have to disagree here. Not for $20k, but if you could really build a robot arm out of basically a desk lamp, some servos and a camera, and had some software to control it as precisely as this video claims, that would be a complete game changer. We'd probably see an explosion of attempts to automate all kinds of everyday household tasks that are infeasible to automate cost-effectively today (folding laundry, cleaning up the room, cooking, etc.).

Also, every self-respecting maker out there would probably try to build one :)

> And if we're focusing on the idea, it has existed since the 1950s and they were doing it relatively well then:

I don't quite understand how the video fits here. That's a manually operated robot arm. The point of Aloha is that it's fully controlled by software, right?


If you want a robot that can fold your laundry, clean your room and cook, you need a lot more than cheap hardware. You need an autonomous agent (i.e. "an AI") that can guide the hardware to accomplish the task.

We're still very far from that, and you certainly can't do it with ALOHA in practice, despite what the videos may seem to show. For each of the few discrete tasks you see in the videos, the robot arms have to be trained by demonstration (via teleoperation), and the end result is a system that can only copy the operator's actions with very little variation.

You can check this in the Mobile ALOHA paper on arxiv (https://arxiv.org/abs/2401.02117) where page 6 shows the six tasks the system has been trained to perform, and the tolerances in the initial setup. So e.g. in the shrimp cooking task, the initial position of the robot can vary by 10cm and the position of the implements by 2cm. If everything is not set up just so, the task will fail.

What all this means is that if you could assemble this "cheap" system, you'd then have to train it with a few hundred demonstrations to fold your laundry, and maybe it could do it (probably not), and if you moved the washing machine or got a new one, you'd have to train it all over again.

As to robots cleaning up your room and cooking, those are currently in the realm of science fiction, unless you're a zen ascetic living in an empty room and happy to eat beans on toast every day. Beans from a can, that is. You'll have to initialise the task by opening the can yourself, obviously. You have a toaster, right?


> If you want a robot that can fold your laundry, clean your room and cook, you need a lot more than cheap hardware. You need an autonomous agent (i.e. "an AI") that can guide the hardware to accomplish the task.

Yes, that's my point. Cheap hardware is far harder to control than expensive hardware, so if Google actually developed some AI that can do high-precision tasks on "wobbly", off-the-shelf hardware, that would be the breakthrough.

I agree that extensive training for each single device would be prohibitive, but that feels like a problem that could be solved with more development: with many machine learning tasks, we started by individually training a model for each specific use case and environment. Today we're able to build generalized models which are trained once and can be deployed in a wide variety of environments. I don't see why this shouldn't be possible for a vision-based robot controller either.

Managing the actual high-level task is easy as soon as you're able to do all the low-level ones: converting a recipe into a machine-readable format, dividing it into a tree of tasks and subtasks, etc. is straightforward. The hard parts are actually cutting the vegetables, de-boning the meat, and so on. The kind of complex movement planning necessary for that doesn't exist yet. But this project looks as if it's a step in exactly that direction.
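To illustrate the division of labour: the "easy" high-level part is little more than a tree of subtasks. A toy sketch with made-up names, not any real planner:

    from dataclasses import dataclass, field

    @dataclass
    class Task:
        """A node in a recipe's task tree: a primitive motion or a list of subtasks."""
        name: str
        subtasks: list = field(default_factory=list)

        def leaves(self):
            """Yield the primitive tasks in execution order."""
            if not self.subtasks:
                yield self
            for sub in self.subtasks:
                yield from sub.leaves()

    # Decomposing "chop an onion" is trivial; executing each leaf is the hard part.
    recipe = Task("chop onion", [
        Task("fetch onion"),
        Task("peel onion"),
        Task("cut into slices"),
    ])
    print([t.name for t in recipe.leaves()])  # ['fetch onion', 'peel onion', 'cut into slices']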


These videos are all autonomous. They didn't have that in the 1950s.


I can appreciate that, but they are also recording and replaying motor signals from specific teleoperation demonstrations, something that _was_ possible in the 1950s. You might say that it is challenging to replay demonstrations well on lower-quality hardware, and so there is academic value in trying to make it work on worse hardware, but it would not be my go-to solution for real industry problems. This is not a route I would fund for a startup, for example.


They do not replay recorded motor signals. They use recorded motor signals only to train neural policies, which then run autonomously on the robot and can generalize to new instances of a task (such as the above video generalizing to an adult-size sweater when it was only ever trained on child-size polo shirts).

Obviously some amount of generalization is required to fold a shirt, as no two shirts will ever be in precisely the same configuration after being dropped on a table by a human. Playback of recorded motor signals could never solve this task.
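For anyone wondering what that training actually looks like: it's plain supervised learning on (observation, action) pairs from the teleop demos. A minimal sketch, assuming a PyTorch-style setup; the names, dimensions, and dummy data are mine, not from the ALOHA code:

    import torch
    import torch.nn as nn

    class Policy(nn.Module):
        """Toy stand-in for the real policy: image features + joint positions -> action."""
        def __init__(self, feat_dim=512, joint_dim=14, act_dim=14):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(feat_dim + joint_dim, 256), nn.ReLU(),
                nn.Linear(256, act_dim),
            )

        def forward(self, img_feat, joints):
            return self.net(torch.cat([img_feat, joints], dim=-1))

    # Fake "demonstration" batches standing in for teleop data:
    # (image features, joint positions, operator action).
    demo_loader = [(torch.randn(8, 512), torch.randn(8, 14), torch.randn(8, 14))
                   for _ in range(10)]

    policy = Policy()
    opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

    for img_feat, joints, demo_action in demo_loader:
        pred = policy(img_feat, joints)
        loss = ((pred - demo_action) ** 2).mean()   # imitate the operator's action
        opt.zero_grad(); loss.backward(); opt.step()

    # At deployment nothing is replayed: the policy runs closed-loop on live observations.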


> recorded motor signals only to train neural policies

It's interesting that they are using "Leader Arms" [0] to encode tasks instead of motion capture. Is it just a matter of reduced complexity to get off the ground? I suppose the task of mapping human arm motion to what a robot can do is tough.

0. https://www.trossenrobotics.com/widowx-aloha-set


I appreciate that going from polo shirts to sweaters is a form of "generalisation", but that's only interesting because of the extremely limited capability for generalisation that systems have when they're trained by imitation learning, as ALOHA is.

Note for example that all the shirts in the videos are oriented in the same direction, with the neck facing to the top of the video. Even then, the system can only straighten a shirt that lands with one corner folded under it after many failed attempts, and if you turned a shirt so that the neck faced downwards, it wouldn't be able to straighten it and hang it no matter how many times it tried. Let's not even talk about getting a shirt tangled in the arms themselves (in the videos, a human intervenes to free the shirt and start again). It's trained to straighten a shirt on the table, with the neck facing one way [1].

So the OP is very right. We're no nearer to real-world autonomy than we were in the '50s. The behaviours of the systems you see in the videos are still hard-coded, only they're hard-coded by demonstration, with extremely low tolerance for variation in tasks or environments, and they still can't do anything they haven't been painstakingly and explicitly shown how to do. This is a severe limitation, and without a clear solution to it there's no autonomy.

On the other hand, ιδού πεδίον δόξης λαμπρόν ("behold, a splendid field of glory"), as we say in Greek. This is a wide open field full of hills to plant one's flag on. There's so much that robotic autonomy can't yet do that you can get Google to fund you if you can show a robot tying half a knot.

__________________

[1] Note btw that straightening the shirt is pointless: it will straighten up when you hang it. That's just to show the robot can do some random moves and arrive at a result that maybe looks meaningful to a human, but there's no way to tell whether the robot is sensing that it achieved a goal, or not. The straightening part is just a gimmick.


We're building software for neuromorphic cameras specifically for robotics. If robots could actually understand motion in completely unconstrained situations, then both optimal control and modern ML techniques would easily see an uplift in capability (things work great in simulation, but in the real world you can't get accurate positions and velocities at a high enough rate). Robots already have fast, accurate motors, but their vision systems are like seeing the world through a strobe light.


This is not preprogrammed replay. Replay would not be able to handle even tiny variations in the starting position of the shirt.


So, a couple things here.

It is true that replay in the world frame will not handle changes in the initial position of the shirt. But if the commands are in the frame of the end-effector and the data is object-centric, replay will somewhat generalize. (Please also consider the fact that you are watching the videos that have survived the "should I upload this?" filter.)
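To make the "object-centric" point concrete, a toy 2D sketch: record the gripper waypoints relative to a perceived object pose, then re-anchor the same trajectory when the object moves. Hypothetical names, nothing to do with the actual pipeline:

    import numpy as np

    def rot(theta):
        """2D rotation matrix."""
        c, s = np.cos(theta), np.sin(theta)
        return np.array([[c, -s], [s, c]])

    def to_object_frame(waypoints_world, obj_pose):
        """Express recorded gripper waypoints relative to the object's (x, y, yaw) pose."""
        x, y, yaw = obj_pose
        return (waypoints_world - np.array([x, y])) @ rot(yaw)

    def replay_in_world(waypoints_obj, new_obj_pose):
        """Re-anchor the same relative trajectory on the object's new pose."""
        x, y, yaw = new_obj_pose
        return waypoints_obj @ rot(yaw).T + np.array([x, y])

    demo = np.array([[0.30, 0.10], [0.32, 0.12]])        # recorded world-frame waypoints
    rel = to_object_frame(demo, obj_pose=(0.30, 0.10, 0.0))
    print(replay_in_world(rel, new_obj_pose=(0.50, 0.00, 0.2)))  # shirt shifted and rotated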

The second thing is that large-scale behavior cloning (which is the technique used here) is essentially replay with a little smoothing. Not bad inherently, but just a fact.

My point is that there was an academic contribution back when the first ALOHA paper came out and showed that doing BC on low-quality hardware could work, but this is something like the fourth paper in a row of more or less the same stuff.

Since this is YC, I'll add: as an academic (physics) turned investor, I would like to see more focus on systems engineering and first-principles thinking, and less PR for the sake of PR. I love robotics and really want to see this stuff take off, but for the right reasons.


> large-scale behavior cloning (which is the technique used here), is essentially replay with a little smoothing

A definition of "replay" that involves extensive correction based on perception in the loop is really stretching it. But let me take your argument at face value. This is essentially the same argument that people use to dismiss GPT-4 as "just" a stochastic parrot. Two things about this:

One, like GPT-4, replay with generalization based on perception can be exceedingly useful by itself, far more so than strict replay, even if the generalization is limited.

Two, obviously this doesn't generalize as much as GPT-4. But the reason is that it doesn't have enough training data. With GPT-4-scale training data it would generalize amazingly well and be super useful. Collecting human demonstrations may not get us to GPT-4 scale, but it will be enough to bootstrap a robot useful enough to be deployed in the field. Once there is a commercially successful dexterous robot in the field, we will be able to collect orders of magnitude more data, unsupervised data collection should start to work, and robotics will fall to the bitter lesson just as vision, ASR, TTS, translation, and NLP did before it.


"Limited generalisation" in the real world means you're dead in the water. Like the Greek philosopher Heraclitus pointed out 2000+ years go, the real world is never the same environment and any task you want to carry out is not the same task the second time you attempt it (I'm paraphrasing). The systems in the videos can't deal with that. They work very similar to industrial robots: everything has to be placed just so with only centimeters of tolerance in the initial placement of objects, and tiny variations in the initial setup throw the system out of whack. As the OP points out, you're only seeing the successful attempts in carefully selected videos.

That's not something you can solve with learning from data alone. A real-world autonomous system must be able to deal with situations it has no experience with, it has to deal with them as they unfold, and it has to learn from them general strategies that it can apply to more novel situations. That is a problem that, by definition, cannot be solved by any approach that must be trained offline on many examples of specific situations.


Thank you for your rebuttal. It is good to think about the "just a stochastic parrot" thing. In many ways this is true, but it might not be bad. I'm not against replay. I'm just pointing out that I would not start with an _affordable_ $20k robot with fairly undeveloped engineering fundamentals. It's kind of like trying to dig the foundation of your house with a plastic beach shovel. Could you do it? Maybe, if you tried hard enough. Is it the best bet for success? Doubtful.


The detail about the end-effector frame is pretty critical, as doing this BC with joint angles would not be tractable. You can tell there was a big shift from RL approaches trying to build broadly generalizing algorithms to more recent works that focus heavily on these arms/manipulators, because end-effector control enables flashier results.
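For readers who haven't worked with manipulators: commanding in end-effector space and letting the kinematics produce the joint motion is what keeps the cloning problem tractable. A toy resolved-rate sketch for a planar 2-link arm, purely illustrative and not how any of these systems are actually built:

    import numpy as np

    def jacobian(q, l1=0.30, l2=0.25):
        """Planar 2-link Jacobian: maps joint velocities to end-effector (x, y) velocity."""
        s1, c1 = np.sin(q[0]), np.cos(q[0])
        s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
        return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                         [ l1 * c1 + l2 * c12,  l2 * c12]])

    q = np.array([0.5, 0.8])              # current joint angles
    ee_vel = np.array([0.02, 0.0])        # policy commands a small end-effector motion
    dq = np.linalg.pinv(jacobian(q)) @ ee_vel   # joint velocities via the pseudoinverse
    q = q + dq * 0.1                      # integrate over one 10 Hz control step (dt = 0.1 s)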

Another limiting factor is that data collection is a big problem: not only will you never be sure you've collected enough data, but what they're collecting is data of a human trying to do this work through a janky teleoperation rig. The behavior they're trying to clone is that of a human working poorly, which isn't a great source of data! Furthermore, limiting the data collection to (typically) 10 Hz means that the scene will always have to be quasi-static, and I'm not sure these huge models will speed up enough to actually understand velocity as a 'sufficient statistic' of the underlying dynamics.

Ultimately, it's been frustrating to see so much money dumped into the recent humanoid push using teleop / BC. It's going to hamper the folks actually pursuing first-principles thinking.


BC = Behavioural Cloning.

>> It's going to hamper the folks actually pursing first-principles thinking.

Nah.


What's your preferred approach?


What do you mean by saying that they're replaying signals from teleoperation demonstrations? Like in https://twitter.com/DannyDriess/status/1780270239185588732, did someone demonstrate how to struggle to fold a shirt, and then they put a shirt in the same orientation and had the robot repeat the same motor commands?


Yeah, it's called imitation learning. Check out the Mobile ALOHA paper where they explain their approach in some detail:

https://arxiv.org/abs/2401.02117


I follow this space closely and I had never seen that 1950s teleoperation video; it blows my mind that people had this working back then. Now you just need to connect that to a transformer / diffusion policy and it will be able to perform that task autonomously, maybe 80% of the time with 200+ demonstrations and close to 100% of the time with 1000+ demonstrations.
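To be clear about what "connect that to a transformer" means at run time: the trained policy just predicts a short chunk of future actions from the current observations in a closed loop, with no replay. A schematic sketch with made-up interfaces, loosely in the spirit of action chunking; every name here is mine:

    import torch

    CHUNK = 50   # actions predicted per forward pass (action-chunking style)

    class DummyPolicy(torch.nn.Module):
        """Stand-in for a trained transformer policy: observations -> a chunk of actions."""
        def forward(self, imgs, joints):
            return torch.zeros(CHUNK, 14)            # (chunk length, joint dimension)

    def rollout(policy, observe, send_targets, steps=1000):
        """Schematic autonomous loop: observe, predict a chunk, execute it, repeat."""
        with torch.no_grad():
            for _ in range(steps // CHUNK):
                imgs, joints = observe()             # camera frames + proprioception
                for action in policy(imgs, joints):  # execute the chunk open-loop
                    send_targets(action)

    # With real observe()/send_targets() hooks this runs the robot with no replay at all.
    rollout(DummyPolicy(),
            observe=lambda: (torch.zeros(3, 480, 640), torch.zeros(14)),
            send_targets=lambda a: None)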

ALOHA was not new, but it's still good work, because robotics researchers were not focused on this form of data collection. The issue was that most people went down the simulation rabbit hole, where they had to solve sim-to-real.

Others went into the VR headset and hand-tracking idea, where you never got super precise manipulation, so any robots trained on that data always showed choppy movement.

Others, including OpenAI, decided to go full reinforcement learning, forgoing human demonstrations. That had some decent results, but after six months of RL on an arm farm led by Google and Sergey Levine, the results were underwhelming to say the least.

So yes, it's not like ALOHA invented teleoperation, but they demonstrated that with this mode of teleoperation you can collect a lot of data, train autonomous robot policies easily, and beat other methods, which I think is a great contribution!


I'm not sure you can say that imitation learning has been under-researched. It has been tried before, alongside RL, but it did not generalize well until the advent of generative diffusion models.



