
For the most part that's been on known objects; these are objects it has not seen.


Not specifically trained on, but most likely the vision models have seen it. Vision models like Gemini Flash/Pro are already good at vision tasks on phones[1], like clicking on UI elements and scrolling to find things. The planning of which steps to perform is also quite good with the Pro model (slightly worse than GPT-4o, in my opinion).

1. A framework to control your phone using Gemini - https://github.com/BandarLabs/clickclickclick
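For anyone curious how little glue code the basic loop takes, here's a rough sketch of the idea (my own illustration, not clickclickclick's actual API), assuming the google-generativeai Python SDK and adb on PATH, with a hypothetical "find the Settings icon" prompt:

    # Sketch: vision model picks a tap target from a screenshot (assumptions above).
    import subprocess
    import google.generativeai as genai
    from PIL import Image

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-1.5-flash")

    def screenshot(path="screen.png"):
        # Pull the current screen off the device with adb.
        with open(path, "wb") as f:
            subprocess.run(["adb", "exec-out", "screencap", "-p"],
                           stdout=f, check=True)
        return Image.open(path)

    def tap(x, y):
        # Inject a tap at the model-suggested coordinates.
        subprocess.run(["adb", "shell", "input", "tap", str(x), str(y)],
                       check=True)

    img = screenshot()
    prompt = ("Return the x,y pixel coordinates of the Settings icon "
              "as two integers separated by a comma, nothing else.")
    resp = model.generate_content([prompt, img])
    x, y = (int(v) for v in resp.text.strip().split(","))
    tap(x, y)

A real agent would loop this (screenshot, decide, act, re-screenshot) and let the model plan multi-step tasks, which is roughly what the linked framework does.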


That's a really cool framework you've linked.



