Hacker News

It will be interesting to see whether this sort of approach works better than something using GPT-4's vision capabilities. Obviously websites are built to be easy to use visually rather than easy to use via the DOM. On the other hand, it's much less clear how to ground action proposals in the visual domain - how do you ask GPT where on an image of the screen it wants to click?
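One way to ground click proposals visually (sometimes called a "set-of-mark" style of prompting) is to overlay numbered labels on candidate elements in the screenshot, ask the model to answer with a label, and map the label back to pixel coordinates. The sketch below is purely illustrative: `Candidate` and `click_point` are hypothetical names, and both element detection and the model call are stubbed out.

```python
# Sketch: ground a model's textual answer ("click label 2") in pixel space.
# A detector would produce labelled bounding boxes; the model picks a label;
# we click the centre of the chosen box.

from dataclasses import dataclass

@dataclass
class Candidate:
    label: int  # number drawn on the screenshot overlay
    x: int      # bounding-box top-left corner
    y: int
    w: int      # bounding-box width and height
    h: int

def click_point(candidates: list[Candidate], chosen_label: int) -> tuple[int, int]:
    """Translate the model's chosen label into a click at the box centre."""
    by_label = {c.label: c for c in candidates}
    c = by_label[chosen_label]
    return (c.x + c.w // 2, c.y + c.h // 2)

# Suppose a detector found two clickable regions and the model replied "2":
boxes = [Candidate(1, 10, 10, 100, 30), Candidate(2, 10, 60, 100, 30)]
print(click_point(boxes, 2))  # centre of box 2 -> (60, 75)
```

The hard part, of course, is the detector that proposes the boxes in the first place; the mapping step itself is trivial.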


Sites that adhere to WCAG should actually have quite good support for programmatic manipulation, since it's required for screen readers to work.

It would be pretty interesting if a push to make sites easier to use for AI agents ended up making sites better for blind users as well.
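The point about screen readers can be made concrete: the same ARIA attributes that assistive technology relies on give an agent a machine-readable map of a page. A minimal sketch using only the standard library's HTML parser (the `AriaCollector` class is an illustrative name, not a real API):

```python
# Collect elements that declare an ARIA role or accessible name.
# An agent could use this list to decide what to interact with,
# much as a screen reader enumerates the page for its user.

from html.parser import HTMLParser

class AriaCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "role" in a or "aria-label" in a:
            self.elements.append((tag, a.get("role"), a.get("aria-label")))

html = '<nav role="navigation"><button aria-label="Search">Go</button></nav>'
parser = AriaCollector()
parser.feed(html)
print(parser.elements)
# [('nav', 'navigation', None), ('button', None, 'Search')]
```

A real agent would also need implicit roles (a bare `<button>` is a button even without ARIA markup), which is where an accessibility-tree API beats raw HTML parsing.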


> Sites that adhere to WCAG

According to this survey, 97.4% of websites don't comply with WCAG, which isn't surprising at all to me as someone who has been in the industry since 2004.

https://webaim.org/projects/million/


Since GPT-4 was released, I've been hoping its vision capabilities would very shortly be followed by projects that essentially allow natural-language RPA of a desktop computer.

Copying financials from a PDF to an Excel sheet, for instance, is the kind of task that is tricky to manually automate but seems like it would be trivial for an LLM to execute.
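The PDF-to-spreadsheet task can be sketched as a short pipeline: extract text, have the model restructure it, write rows out. Everything below is a stub under assumptions: `extract_pdf_text` stands in for a real PDF library and `ask_llm` for a real model API, but the glue code keeps the same shape either way.

```python
# Sketch of "copy financials from a PDF into a sheet" as an LLM pipeline.
# Both the PDF extraction and the model call are stubbed with canned data.

import csv
import io

def extract_pdf_text(path: str) -> str:
    # Stub: a real version would use a PDF parsing library here.
    return "Revenue 2023: 1,200\nCOGS 2023: 700\nNet income 2023: 500"

def ask_llm(prompt: str) -> str:
    # Stub: imagine a model returning line items as "item,value" rows.
    return "Revenue,1200\nCOGS,700\nNet income,500"

def pdf_to_csv(path: str) -> str:
    text = extract_pdf_text(path)
    reply = ask_llm(f"Extract line items as CSV (item,value):\n{text}")
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["item", "value"])
    for line in reply.splitlines():
        item, value = line.split(",", 1)
        writer.writerow([item, value])
    return buf.getvalue()

print(pdf_to_csv("report.pdf"))
```

The fragile step in practice is trusting the model's output format; a production version would validate that the returned numbers actually appear in the source text before writing anything.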



