
For the most part that's been on known objects; these are objects it has not seen.


Not specifically trained on, but most likely the vision models have seen it. Vision models like Gemini Flash/Pro are already good at vision tasks on phones[1], like clicking on UI elements and scrolling to find things. The planning of which steps to perform is also quite good with the Pro model (slightly worse than GPT-4o, in my opinion).

1. A framework to control your phone using Gemini - https://github.com/BandarLabs/clickclickclick
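For anyone curious how little glue code the basic loop takes, here's a rough sketch of the idea (my own illustration, not clickclickclick's actual API), assuming the google-generativeai Python SDK and adb on PATH, with a hypothetical "find the Settings icon" prompt:

    # Sketch: vision model picks a tap target from a screenshot (assumptions above).
    import subprocess
    import google.generativeai as genai
    from PIL import Image

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-1.5-flash")

    def screenshot(path="screen.png"):
        # Pull the current screen off the device with adb.
        with open(path, "wb") as f:
            subprocess.run(["adb", "exec-out", "screencap", "-p"],
                           stdout=f, check=True)
        return Image.open(path)

    def tap(x, y):
        # Inject a tap at the model-suggested coordinates.
        subprocess.run(["adb", "shell", "input", "tap", str(x), str(y)],
                       check=True)

    img = screenshot()
    prompt = ("Return the x,y pixel coordinates of the Settings icon "
              "as two integers separated by a comma, nothing else.")
    resp = model.generate_content([prompt, img])
    x, y = (int(v) for v in resp.text.strip().split(","))
    tap(x, y)

A real agent would loop this (screenshot, decide, act, re-screenshot) and let the model plan multi-step tasks, which is roughly what the linked framework does.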


That's a really cool framework you've linked.



