I'm feeling that an API is something much more stable and deterministic than human-readable interface. Also you can train AI to learn which API calls to make for the task by looking at page sources. Why not translating prompts to single API calls instead of a script for clicking through DOM elements achieving the same?