I absolutely love your "expert tools" approach. If I understand it correctly, you aren't just feeding a video into a multimodal LLM and asking it "what is the bounding box of the optimal caption region?" Instead, you have built tools around discrete algorithms that combine object detection boxes with traditional CV motion analysis, and those tools give "expert opinions" to the LLM through tool calls, such as identifying the regions of minimal saliency and minimal movement as the best places for caption placement.

If the LLM needs to place captions, it calls one of these expert discrete-algorithm tools to determine the best place to put the captions -- you aren't just asking the LLM to do it on its own.
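To make concrete what I mean by a discrete-algorithm tool, here is a rough sketch of how I imagine one could work. This is my guess, not your actual implementation: the function name `best_caption_region`, its parameters, and the assumption that saliency and motion arrive as per-pixel 2D maps of the same shape are all hypothetical. It uses an integral image to find the fixed-size window with the lowest combined saliency and motion.

```python
import numpy as np

def best_caption_region(saliency, motion, box_h, box_w, motion_weight=1.0):
    """Return (top, left) of the box_h x box_w window with the lowest
    summed cost, where cost = saliency + motion_weight * motion.

    Hypothetical sketch: both inputs are assumed to be 2D arrays of
    identical shape (for example, a saliency map and a motion-magnitude map).
    """
    saliency = np.asarray(saliency, dtype=float)
    motion = np.asarray(motion, dtype=float)
    if saliency.ndim != 2 or saliency.shape != motion.shape:
        raise ValueError("saliency and motion must be 2D arrays of the same shape")
    h, w = saliency.shape
    if not (1 <= box_h <= h and 1 <= box_w <= w):
        raise ValueError("caption box must fit inside the frame")

    cost = saliency + motion_weight * motion

    # Integral image padded with a zero row and column, so that
    # integral[i, j] equals the sum of cost[:i, :j].
    integral = np.zeros((h + 1, w + 1))
    integral[1:, 1:] = cost.cumsum(axis=0).cumsum(axis=1)

    # Sum of every box_h x box_w window, indexed by its top-left corner.
    sums = (
        integral[box_h:, box_w:]
        - integral[:-box_h, box_w:]
        - integral[box_h:, :-box_w]
        + integral[:-box_h, :-box_w]
    )

    top, left = np.unravel_index(np.argmin(sums), sums.shape)
    return int(top), int(left)
```

The LLM would then only have to call something like this and receive a deterministic answer, rather than estimating coordinates from pixels itself.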

If I'm correct about that, then I absolutely applaud you -- it feels like THIS is a fantastic model for how agentic tools should be built, and this is absolutely the opposite of AI slop.

Kudos!