I always find myself baffled by “prompt optimization” frameworks. Do people really find themselves needing random perturbations of a fixed prompt to improve accuracy? In my experience, the challenging part of writing a prompt is figuring out what task you actually want done, and understanding which data you need to pass to the model to make that task achievable. None of that can be achieved by “optimizing” the prompt; the hard part sits a layer of abstraction upward.
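For concreteness, the loops these frameworks run look roughly like the sketch below. Everything here is illustrative: the perturbation pool, the dev set, and `run_model` (a stub standing in for a real LLM call) are all made up for the example.

```python
import random

# Tiny labeled dev set the optimizer scores candidate prompts against.
DEV_SET = [("2+2", "4"), ("3+3", "6")]
ANSWERS = {"2+2": "4", "3+3": "6"}


def run_model(prompt: str, query: str) -> str:
    # Stub: a real implementation would call an LLM API here. This
    # stand-in only "answers" when the prompt mentions math, so the
    # optimizer has something to discover.
    if "math" in prompt.lower():
        return ANSWERS[query]
    return "I don't know"


def score_prompt(prompt: str) -> float:
    # Accuracy of the prompt on the dev set.
    hits = sum(run_model(prompt, q) == a for q, a in DEV_SET)
    return hits / len(DEV_SET)


def perturb(prompt: str, rng: random.Random) -> str:
    # "Random perturbation": append one phrase from a fixed pool.
    extras = ["You are a math tutor.", "Be concise.", "Think step by step."]
    return prompt + " " + rng.choice(extras)


def optimize(base: str, steps: int = 20, seed: int = 0) -> str:
    # Greedy hill climbing: keep a perturbed prompt only if it scores
    # strictly better than the current best.
    rng = random.Random(seed)
    best, best_score = base, score_prompt(base)
    for _ in range(steps):
        cand = perturb(best, rng)
        s = score_prompt(cand)
        if s > best_score:
            best, best_score = cand, s
    return best
```

Note the loop can only ever improve on the dev set, which is exactly the worry: it tells you nothing about inputs outside that set.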
It's useful in enterprise scenarios where you need a reliable outcome from some kind of programmatic task and are dealing with throughput in the thousands to hundreds of thousands of jobs.
Depends what you're doing. If you're using ChatGPT via the UI for a one-off question, sure. If you're prompting an LLM that performs a critical task in production millions of times, minor improvements can have significant benefits.
I have done the latter much more than the former. My experience has been that the issues come from inputs you don't foresee, not from reliability on in-distribution uses (which would be your “training” data for prompt optimization). And the worry is that this kind of optimization leads to substantive revisions of the guidelines set out in the prompt, which could further compromise performance out of distribution.
To the extent that you need to eke out reliability on the margins, you are vastly better served by actual fine-tuning, which is available for both open-source models and most major proprietary models.
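The data prep for that is not much work either. Below is a rough sketch of turning labeled examples into the chat-style JSONL format that several fine-tuning APIs (e.g. OpenAI's) accept; the exact schema varies by provider, and the classification task and field contents here are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical task: route support tickets. The system prompt and the
# labeled examples are made up for illustration.
SYSTEM_PROMPT = "Classify the ticket as 'billing' or 'technical'."

EXAMPLES = [
    ("My card was charged twice", "billing"),
    ("The app crashes on launch", "technical"),
]


def to_jsonl(rows, path):
    # One JSON object per line, each holding a full chat transcript:
    # system instructions, user input, and the desired assistant output.
    with open(path, "w") as f:
        for user_text, label in rows:
            record = {
                "messages": [
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": user_text},
                    {"role": "assistant", "content": label},
                ]
            }
            f.write(json.dumps(record) + "\n")


path = os.path.join(tempfile.gettempdir(), "train.jsonl")
to_jsonl(EXAMPLES, path)
```

You'd then upload the file and kick off a fine-tuning job with your provider's API or, for open-source models, feed it to your training harness.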