It will be interesting to see whether this sort of approach works better than something using GPT-4's vision capabilities. Obviously websites are built to be easy to use visually, not via the DOM. On the other hand, it's much less clear how to ground action proposals in the visual domain - how do you ask GPT-4 where on an image of the screen it wants to click?
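One workaround people have tried is overlaying labeled regions on the screenshot and asking the model to name a label rather than coordinates. A minimal sketch of the coordinate-mapping half, assuming a fixed uniform grid (the grid scheme and `cell_center` helper are illustrative, not any particular project's API):

```python
# Grid-based grounding sketch: overlay an N x M grid of numbered cells on a
# screenshot, ask the model which cell to click, then map the label back to
# pixel coordinates. Labels are numbered row-major starting at 0.

def cell_center(label: int, screen_w: int, screen_h: int,
                cols: int = 10, rows: int = 10) -> tuple[int, int]:
    """Return the pixel center of the grid cell with the given label."""
    if not 0 <= label < cols * rows:
        raise ValueError(f"label {label} out of range")
    col, row = label % cols, label // cols
    cell_w, cell_h = screen_w / cols, screen_h / rows
    return (int(col * cell_w + cell_w / 2), int(row * cell_h + cell_h / 2))

# On a 1920x1080 screen with a 10x10 grid:
print(cell_center(0, 1920, 1080))   # top-left cell -> (96, 54)
print(cell_center(99, 1920, 1080))  # bottom-right cell -> (1824, 1026)
```

A coarse grid like this caps click precision at the cell size, so in practice you'd presumably need a second, zoomed-in pass for small targets.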
According to this survey, 97.4% of websites don't comply with WCAG, which isn't surprising at all to me as someone who has been in the industry since 2004.
Since GPT-4 was released, I've been hoping its vision capabilities would shortly be followed by projects that essentially allow natural-language RPA of a desktop computer.
Copying financials from a PDF to an Excel sheet, for instance, is the kind of task that is tricky to automate with traditional scripting but seems like it would be trivial for an LLM to execute.
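Even before full RPA, the messy middle step is normalizing accounting-style figures. A minimal sketch, assuming the statement text has already been extracted from the PDF (by a PDF library or the LLM itself); the `parse_figure` helper and the sample lines are made up for illustration:

```python
import csv
import io
import re

def parse_figure(s: str) -> float:
    """Parse an accounting-style figure: strip commas/$, parentheses mean negative."""
    s = s.strip()
    negative = s.startswith("(") and s.endswith(")")
    value = float(re.sub(r"[(),$]", "", s))
    return -value if negative else value

# Hypothetical lines extracted from a financial statement
lines = ["Revenue  1,234,500", "Cost of sales  (812,300)"]

# Split each line on the run of spaces between label and figure
rows = [(label.strip(), parse_figure(num))
        for label, num in (re.match(r"(.+?)\s{2,}(\S+)$", ln).groups()
                           for ln in lines)]

buf = io.StringIO()
csv.writer(buf).writerows(rows)  # CSV opens directly in Excel
print(buf.getvalue())
```

The brittle part is exactly what an LLM would shortcut: real statements vary in layout, footnote markers, and column counts, so the regex above breaks constantly while a model reading the page mostly wouldn't.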