
This is a harsh foot-gun that seems to harm many ollama users.

That 2k default is extremely low, and ollama *silently* discards the leading context. So users have no idea that most of their data hasn’t been provided to the model.

I’ve had to add docs [0] to aider about this, and aider overrides the default to at least 8k tokens. I’d like to do more, but unilaterally raising the context window size has performance implications for users.

Edit: Ok, aider now gives ollama users a clear warning when their chat context exceeds their ollama context window [1].

[0] https://aider.chat/docs/llms/ollama.html#setting-the-context...

[1] https://github.com/Aider-AI/aider/blob/main/aider/coders/bas...



There are several issues in the Ollama GitHub issue tracker related to this, like this[1] or this[2].

Fortunately it's easy to create a variant of the model with increased context size using the CLI[3] and then use that variant instead.

Just be mindful that longer context means more memory required[4].

[1]: https://github.com/ollama/ollama/issues/4967

[2]: https://github.com/ollama/ollama/issues/7043

[3]: https://github.com/ollama/ollama/issues/8099#issuecomment-25...

[4]: https://www.reddit.com/r/LocalLLaMA/comments/1848puo/comment...
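For reference, the non-interactive way to bake a larger context into a variant is a Modelfile (this is standard Ollama usage; the model name and context size below are just examples):

```shell
# Write a Modelfile that inherits the base model and raises num_ctx
cat > Modelfile <<'EOF'
FROM llama3.2
PARAMETER num_ctx 32768
EOF

# Build the variant, then use it like any other model
ollama create llama3.2-32k -f Modelfile
ollama run llama3.2-32k
```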


Thank you! I was looking for how to do this. The example in the issue above shows how to increase the context size in ollama:

    $ ollama run llama3.2
    >>> /set parameter num_ctx 32768
    Set parameter 'num_ctx' to '32768'
    >>> /save llama3.2-32k
    Created new model 'llama3.2-32k'
    >>> /bye
    $ ollama run llama3.2-32k "Summarize this file: $(cat README.md)"
    ...
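If you're hitting the server over its REST API instead of the CLI, num_ctx can also be set per request via the options object (standard Ollama API; values here are illustrative):

```shell
# Per-request context size, no model variant needed
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Summarize this file: ...",
  "options": { "num_ctx": 32768 }
}'
```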
The table in the reddit post above also shows context size vs. memory requirements for 01-ai/Yi-34B-200K (34.395B params, inference):

    Sequence Length vs Bit Precision Memory Requirements
       SL / BP |     4      |     6      |     8      |     16
    --------------------------------------------------------------
           256 |     16.0GB |     24.0GB |     32.1GB |     64.1GB
           512 |     16.0GB |     24.1GB |     32.1GB |     64.2GB
          1024 |     16.1GB |     24.1GB |     32.2GB |     64.3GB
          2048 |     16.1GB |     24.2GB |     32.3GB |     64.5GB
          4096 |     16.3GB |     24.4GB |     32.5GB |     65.0GB
          8192 |     16.5GB |     24.7GB |     33.0GB |     65.9GB
         16384 |     17.0GB |     25.4GB |     33.9GB |     67.8GB
         32768 |     17.9GB |     26.8GB |     35.8GB |     71.6GB
         65536 |     19.8GB |     29.6GB |     39.5GB |     79.1GB
        131072 |     23.5GB |     35.3GB |     47.0GB |     94.1GB
    *   200000 |     27.5GB |     41.2GB |     54.9GB |    109.8GB

    * Model Max Context Size
Code: https://gist.github.com/lapp0/d28931ebc9f59838800faa7c73e3a0...
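Those numbers are consistent with a simple model: total ≈ weight memory + KV-cache memory, both at the chosen precision. A sketch (the Yi-34B shape parameters here — 60 layers, 8 KV heads via GQA, head dim 128 — are assumptions from the model's published config, not taken from the gist):

```python
def estimate_memory_gib(seq_len: int, bits: int,
                        params: float = 34.395e9,
                        layers: int = 60, kv_heads: int = 8,
                        head_dim: int = 128) -> float:
    """Rough total memory: weights plus KV cache, both at `bits` precision."""
    bytes_per_value = bits / 8
    weight_bytes = params * bytes_per_value
    # KV cache stores 2 tensors (K and V) per layer, per token
    kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * seq_len
    return (weight_bytes + kv_bytes) / 2**30
```

Plugging in the table's corners reproduces it closely: ~27.5 GB at 4-bit / 200k tokens, ~109.8 GB at 16-bit / 200k tokens. It also shows why short contexts barely matter — the weights dominate until the sequence length gets large.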


Can context be split on multiple GPUs?


Not my field, but from this[1] blog post, which references this[2] paper, it would seem so. Note that the optimal approaches differ a bit between training and inference. Also note that several of the approaches rely on batching multiple requests (prompts) to exploit the parallelism, so you won't see the same gains if the server is fed only a single prompt at a time.

[1]: https://medium.com/@plienhar/llm-inference-series-4-kv-cachi...

[2]: https://arxiv.org/abs/2104.04473



