But these models are more like generalists, no? Couldn't they simply be hooked up to more specialized models and defer to them, the way coding agents now use tools to assist?
There would be no point in going via an LLM then; if I had a specialist model ready, I'd just invoke it on the images directly. I don't particularly need or want a chatbot for this.
Current LLMs are doing this for coding, and it's very effective. The LLM delegates to tool calls, and a specialized model can be thought of as just another tool. The model can be weak at things that simple shell scripts or utilities handle well, yet strong at knowing which scripts or commands to call. For example, doing math natively in the model may be inaccurate, but the model may know to write code that does the math. An LLM can automate at a higher level of abstraction, the same way a manager or CEO delegates tasks to specialists.
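To make the "specialist model as just another tool" idea concrete, here's a rough sketch of the dispatch pattern. All the names (`eval_math`, `classify_image`, the hand-rolled dispatch table) are made up for illustration; a real agent would use whatever tool-calling interface its LLM provides, and `classify_image` would wrap an actual specialist model rather than a stub:

```python
# Sketch of the delegation pattern: the LLM only decides *which* tool to call
# and with what arguments; the actual work is done by specialists.
import ast
import operator

def eval_math(expr: str) -> float:
    """Evaluate basic arithmetic safely instead of trusting the LLM's own arithmetic."""
    ops = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in ops:
            return ops[type(node.op)](_eval(node.left), _eval(node.right))
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expr, mode="eval"))

def classify_image(path: str) -> str:
    """Stand-in for a specialized vision model; in practice this would call the real model."""
    return f"specialist result for {path}"

# From the LLM's point of view, the specialist model is just another entry here,
# no different from a calculator or a shell utility.
TOOLS = {
    "eval_math": eval_math,
    "classify_image": classify_image,
}

def run_tool_call(call: dict):
    """Execute a tool call the LLM emitted, e.g. {"name": "eval_math", "args": {"expr": "3*7+1"}}."""
    return TOOLS[call["name"]](**call["args"])

# Pretend the LLM decided to delegate rather than answer directly:
print(run_tool_call({"name": "eval_math", "args": {"expr": "3*7+1"}}))             # 22
print(run_tool_call({"name": "classify_image", "args": {"path": "scan_001.png"}}))
```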
In this case I'm building a batch workflow: images come in, images get analyzed through a pipeline, images go into a GUI for review. The idea of using a VLM was just to avoid hand-building a solution, not because I actually want to use it in a chatbot. It's just interesting that a generalist model that has expert-level handwriting recognition completely falls apart on a different, but much easier, task.