In this case I'm building a batch workflow: images come in, get analyzed through a pipeline, and go into a GUI for review. The idea of using a VLM was just to avoid hand-building a solution, not because I actually want to use it in a chatbot. It's just interesting that a generalist model with expert-level handwriting recognition completely falls apart on a different, but much easier, task.