Hacker News
Ask HN: How do you iterate on and manage prompts in production?
14 points by ivanpashenko on June 11, 2024 | 4 comments
I'm curious about how people handle managing and iterating on their prompts, especially production prompts that many users depend on.

- Where do you store your prompts?
- Do you use version control?
- How do you test prompts after editing?
- Where do you store your test sets?
- Do you evaluate results? If so, how?
- Are you fine-tuning models like GPT-3.5 for better/cheaper results?



We have a script and an input library with a bunch of scoring dimensions, which allows a head-to-head comparison of a new candidate prompt vs. what's in prod. It takes a configuration (prompt, which LLM to use, temperature, etc.), runs it against all the various inputs, and produces a JSON blob of the outputs for scoring.
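
For readers who want a concrete picture, here is a minimal sketch of what a harness like that might look like. The config fields, file names, and use of the OpenAI client are illustrative assumptions, not the commenter's actual code:

    # Hypothetical harness: run one prompt config against every input in
    # the library and dump the outputs to a JSON blob for scoring.
    import json
    from dataclasses import dataclass, asdict
    from openai import OpenAI

    client = OpenAI()

    @dataclass
    class PromptConfig:
        name: str          # e.g. "prod" or "candidate-v2"
        prompt: str        # the system prompt under test
        model: str         # which LLM to use
        temperature: float

    def run_config(config: PromptConfig, inputs: list[str]) -> dict:
        outputs = []
        for text in inputs:
            resp = client.chat.completions.create(
                model=config.model,
                temperature=config.temperature,
                messages=[
                    {"role": "system", "content": config.prompt},
                    {"role": "user", "content": text},
                ],
            )
            outputs.append({"input": text,
                            "output": resp.choices[0].message.content})
        return {"config": asdict(config), "outputs": outputs}

    if __name__ == "__main__":
        inputs = json.load(open("input_library.json"))  # assumed input library
        candidate = PromptConfig("candidate", "Fix the grammar...",
                                 "gpt-4o-mini", 0.0)
        blob = run_config(candidate, inputs)
        json.dump(blob, open("candidate_outputs.json", "w"), indent=2)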

Most of the scoring dimensions are deterministic, but we've added some where we integrated an LLM to do the scoring (... which brings the new problem of scoring the scoring prompt!). We also do a manual scan of the outputs as a sanity check. We're not doing any fine-tuning yet, as we're getting pretty good results with just prompting.
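
One plausible way to structure those scoring dimensions is as plain functions over each output record; the two deterministic checks below are invented examples for illustration, not the ones this commenter actually uses:

    # Hypothetical shape for a scoring dimension: a function from one
    # output record to a score. Deterministic checks are ordinary Python;
    # an LLM-backed dimension would call a judge model instead.
    import re

    def length_ratio(record: dict) -> float:
        # Deterministic: corrected text shouldn't balloon vs. the input.
        return len(record["output"]) / max(len(record["input"]), 1)

    def preserves_urls(record: dict) -> bool:
        # Deterministic: any URL in the input must survive in the output.
        urls = re.findall(r"https?://\S+", record["input"])
        return all(u in record["output"] for u in urls)

    DIMENSIONS = {"length_ratio": length_ratio,
                  "preserves_urls": preserves_urls}

    def score_blob(blob: dict) -> list[dict]:
        # Apply every dimension to every record in the outputs blob.
        return [{name: fn(rec) for name, fn in DIMENSIONS.items()}
                for rec in blob["outputs"]]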


Sounds pretty advanced! Can you share some examples of deterministic dimensions for scoring?

And what about the LLM scoring: does the LLM just output passed/not_passed, or is it more than that?


So for context, our app (https://nativi.sh) is a language-correction app. It takes in text and cleans it up to make it sound more fluent/correct; it's basically geared towards being Grammarly for your second language.

For some of our deterministic LLM tests, we have inputs with known spelling errors but no wrong-word errors, or some other combination of errors. If the config under test doesn't identify the issue, or identifies issues that we know aren't there, then it's marked as wrong for that test case. Then we can test across config x language x kind_of_error.
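
A sketch of how such known-error test cases might be encoded; the example inputs, error categories, and the exact-set matching rule are assumptions for illustration:

    # Hypothetical test-case format for the "known errors" check: each
    # case lists exactly the error kinds the config must find, no more.
    CASES = [
        {"input": "She dont like the wether today.",
         "expected_errors": {"grammar", "spelling"},  # "dont"; "wether"
         "language": "en"},
        {"input": "Der Hund laufen schnell.",
         "expected_errors": {"grammar"},              # verb agreement
         "language": "de"},
    ]

    def check_case(case: dict, found_errors: set[str]) -> bool:
        # Marked wrong if the config misses a known error OR flags an
        # error kind that we know isn't present in the input.
        return found_errors == case["expected_errors"]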

For the LLM vibe-driven scoring, we have it set up to do a head-to-head between the current leading config (usually what's in prod) and the new candidate config, rather than generating an abstract score. It will flag "x config straight up failed question N based on some_reason(s)" so that we can manually check it.
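
A minimal sketch of that kind of head-to-head judge, assuming an OpenAI-style chat API; the prompt wording, judge model, and output labels are invented here, not nativi.sh's code:

    # Hypothetical head-to-head judge: pick a winner per test case and
    # flag outright failures for manual review, rather than scoring.
    from openai import OpenAI

    client = OpenAI()

    JUDGE_PROMPT = """You are comparing two language-correction outputs.
    Original text: {input}
    Output A (current prod config): {a}
    Output B (candidate config): {b}
    Reply with exactly one of: A_WINS, B_WINS, TIE, or FAILED:<config>:<reason>."""

    def judge(rec_a: dict, rec_b: dict) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o",  # assumed judge model
            temperature=0,
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                input=rec_a["input"], a=rec_a["output"], b=rec_b["output"])}],
        )
        return resp.choices[0].message.content.strip()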

My partner wrote the testing framework. She's been thinking about cleaning it up and open sourcing it.


"7 likes / no comments" --> should I read it as: people interested in others people experience, but have nothing to share about their own? - No prompt on production? - No testing or other routines about it yet?

Please share your current status :)



