
To add some context, this isn't that novel an approach. A common way to improve RAG results is to "expand" the underlying chunks using an LLM, so as to increase the semantic surface area to match against. You can further improve your results by running query expansion using HyDE[1], though it's not always an improvement; I use it as a fallback.
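
For anyone unfamiliar, a rough sketch of the HyDE idea (llm_complete, embed, and index.search are placeholders for whatever completion API, embedding model, and vector store you already use):

    # HyDE-style query expansion, per the paper linked below [1].
    # llm_complete / embed / index.search are hypothetical stand-ins.
    def hyde_search(query: str, index, top_k: int = 10):
        # 1. Ask an LLM to write a passage that would plausibly answer the query.
        hypothetical_doc = llm_complete(
            "Write a short passage that answers this question:\n\n" + query
        )
        # 2. Embed the generated passage instead of the raw query; it tends to
        #    share more vocabulary and structure with the real chunks, which is
        #    the "semantic surface area" being expanded.
        query_vector = embed(hypothetical_doc)
        # 3. Retrieve nearest chunks as usual.
        return index.search(query_vector, top_k=top_k)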

I'm not sure what Anthropic is introducing here. I looked at the cookbook code and it's just showing the process of producing said context, but there's no actual change to their API regarding "contextual retrieval".

The one change is prompt caching, introduced a month back, which allows you to very cheaply add better context to individual chunks by providing the entire (long) document as context. Caching is an awesome feature to expose to developers and I don't want to take anything away from that.
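
For reference, a minimal sketch of what that looks like with the Messages API, as I understand it (the model name and prompt wording are my own, and depending on when you read this a prompt-caching beta header may also be required, so check the current docs):

    import anthropic

    client = anthropic.Anthropic()

    def contextualize_chunk(document_text: str, chunk_text: str) -> str:
        # The full document goes in a system block marked with cache_control,
        # so repeated calls for different chunks of the same document can read
        # the cached prefix instead of paying for it as fresh input each time.
        response = client.messages.create(
            model="claude-3-5-sonnet-20240620",
            max_tokens=200,
            system=[
                {
                    "type": "text",
                    "text": "<document>\n" + document_text + "\n</document>",
                    "cache_control": {"type": "ephemeral"},
                }
            ],
            messages=[{
                "role": "user",
                "content": "Here is a chunk from the document above:\n\n"
                           + chunk_text
                           + "\n\nWrite a short context situating this chunk "
                             "within the overall document.",
            }],
        )
        # response.usage.cache_read_input_tokens > 0 on later calls tells you
        # the cached document prefix was actually reused.
        return response.content[0].text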

However, other than that, the only thing I see introduced is a cookbook on how to do a particular RAG workflow.

As an aside, Cohere may be my favorite API to work with (no affiliation). Their RAG API is a delight, and unlike anything the other providers offer. I highly recommend it.
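
From memory it looks roughly like this (exact field names may differ between SDK versions, so treat it as a sketch):

    import cohere

    co = cohere.Client("YOUR_API_KEY")

    # You pass retrieved chunks directly as documents; the model grounds its
    # answer in them and returns citations pointing back at specific documents.
    response = co.chat(
        model="command-r-plus",
        message="What did the Q2 report say about revenue growth?",
        documents=[
            {"title": "Q2 report, p.3", "snippet": "Revenue grew 12% quarter over quarter..."},
            {"title": "Q2 report, p.7", "snippet": "Growth was driven primarily by..."},
        ],
    )

    print(response.text)       # grounded answer
    print(response.citations)  # answer spans tied back to the documents above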

1: https://arxiv.org/abs/2212.10496



I think the innovation is using caching so as to make the cost of the approach manageable. The way they implemented it, each time you create a chunk, you ask the LLM to produce an atomic, self-contained version of it using the whole document as context. You need to do this for all tens of thousands of chunks in your data, which costs a lot. By caching the document, you save on those costs.
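
Rough numbers to make that concrete (the document size and chunk count are made up, and the prices are Anthropic's published prompt-caching rates for Claude 3.5 Sonnet at launch, so double-check against current pricing):

    # Cost of the contextualization pass, document tokens only.
    doc_tokens, num_chunks = 50_000, 500

    # ~$3/MTok normal input, ~$3.75/MTok cache write, ~$0.30/MTok cache read.
    without_cache = doc_tokens * num_chunks / 1e6 * 3.00          # ~$75
    with_cache = (doc_tokens / 1e6 * 3.75                         # one cache write
                  + doc_tokens * (num_chunks - 1) / 1e6 * 0.30)   # cached reads
    print(round(without_cache, 2), round(with_cache, 2))          # 75.0 vs ~7.67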


You could also just save the first generated atomic chunk, store it, and re-use it each time yourself. Easier and more consistent.


I don't understand how that helps here. They're not regenerating each chunk every time; this is about caching the model's state after running a large doc through it. You can only do this kind of thing if you have access to the model itself, or if it's provided by the API you use.


To be fair, that only works if you keep chunk windows static.


Yup. Caching is very nice... but the framing is weird. "Introducing", to me, connotes a product release, not a new tutorial.


I was trying to do this using Prompt Caching like a month ago, but then noticed there's a five-minute maximum lifetime for cached prompts - that doesn't really work for my RAG needs (or probably most), where the queries would be run over the next month or year. I can't see any changes to that policy. A little surprised to see them talk about Prompt Caching in relation to RAG.


They aren’t using the prompt caching on the query side, only on the embedding side… so you cache the document in the context window when ingesting it, but not during retrieval.


It seems a little odd to make multiple requests instead of using one request to create all the context for all the chunks.



