
This is a harsh foot-gun that seems to harm many ollama users.

That 2k default is extremely low, and ollama *silently* discards the leading context. So users have no idea that most of their data hasn’t been provided to the model.

I’ve had to add docs [0] to aider about this, and aider overrides the default to at least 8k tokens. I’d like to do more, but unilaterally raising the context window size has performance implications for users.

Edit: Ok, aider now gives ollama users a clear warning when their chat context exceeds their ollama context window [1].

[0] https://aider.chat/docs/llms/ollama.html#setting-the-context...

[1] https://github.com/Aider-AI/aider/blob/main/aider/coders/bas...



There are several issues in the Ollama GitHub issue tracker related to this, like this[1] or this[2].

Fortunately it's easy to create a variant of the model with increased context size using the CLI[3] and then use that variant instead.

Just be mindful that longer context means more memory required[4].

[1]: https://github.com/ollama/ollama/issues/4967

[2]: https://github.com/ollama/ollama/issues/7043

[3]: https://github.com/ollama/ollama/issues/8099#issuecomment-25...

[4]: https://www.reddit.com/r/LocalLLaMA/comments/1848puo/comment...
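For reference, the non-interactive way to bake a larger context into a variant is a Modelfile (this is standard Ollama usage; the model name and context size below are just examples):

```shell
# Write a Modelfile that inherits the base model and raises num_ctx
cat > Modelfile <<'EOF'
FROM llama3.2
PARAMETER num_ctx 32768
EOF

# Build the variant, then use it like any other model
ollama create llama3.2-32k -f Modelfile
ollama run llama3.2-32k
```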


Thank you! I was looking for how to do this. The example in the issue above shows how to increase the context size in ollama:

    $ ollama run llama3.2
    >>> /set parameter num_ctx 32768
    Set parameter 'num_ctx' to '32768'
    >>> /save llama3.2-32k
    Created new model 'llama3.2-32k'
    >>> /bye
    $ ollama run llama3.2-32k "Summarize this file: $(cat README.md)"
    ...
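If you're hitting the server over its REST API instead of the CLI, num_ctx can also be set per request via the options object (standard Ollama API; values here are illustrative):

```shell
# Per-request context size, no model variant needed
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Summarize this file: ...",
  "options": { "num_ctx": 32768 }
}'
```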
The table in the reddit post above also shows context size vs. memory requirements for 01-ai/Yi-34B-200K (34.395B params, inference):

    Sequence Length vs Bit Precision Memory Requirements
       SL / BP |     4      |     6      |     8      |     16
    --------------------------------------------------------------
           256 |     16.0GB |     24.0GB |     32.1GB |     64.1GB
           512 |     16.0GB |     24.1GB |     32.1GB |     64.2GB
          1024 |     16.1GB |     24.1GB |     32.2GB |     64.3GB
          2048 |     16.1GB |     24.2GB |     32.3GB |     64.5GB
          4096 |     16.3GB |     24.4GB |     32.5GB |     65.0GB
          8192 |     16.5GB |     24.7GB |     33.0GB |     65.9GB
         16384 |     17.0GB |     25.4GB |     33.9GB |     67.8GB
         32768 |     17.9GB |     26.8GB |     35.8GB |     71.6GB
         65536 |     19.8GB |     29.6GB |     39.5GB |     79.1GB
        131072 |     23.5GB |     35.3GB |     47.0GB |     94.1GB
    *   200000 |     27.5GB |     41.2GB |     54.9GB |    109.8GB

    * Model Max Context Size
Code: https://gist.github.com/lapp0/d28931ebc9f59838800faa7c73e3a0...
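Those numbers are consistent with a simple model: total ≈ weight memory + KV-cache memory, both at the chosen precision. A sketch (the Yi-34B shape parameters here — 60 layers, 8 KV heads via GQA, head dim 128 — are assumptions from the model's published config, not taken from the gist):

```python
def estimate_memory_gib(seq_len: int, bits: int,
                        params: float = 34.395e9,
                        layers: int = 60, kv_heads: int = 8,
                        head_dim: int = 128) -> float:
    """Rough total memory: weights plus KV cache, both at `bits` precision."""
    bytes_per_value = bits / 8
    weight_bytes = params * bytes_per_value
    # KV cache stores 2 tensors (K and V) per layer, per token
    kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * seq_len
    return (weight_bytes + kv_bytes) / 2**30
```

Plugging in the table's corners reproduces it closely: ~27.5 GB at 4-bit / 200k tokens, ~109.8 GB at 16-bit / 200k tokens. It also shows why short contexts barely matter — the weights dominate until the sequence length gets large.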


Can context be split on multiple GPUs?


Not my field, but from this[1] blog post, which references this[2] paper, it would seem so. Note that the optimal approaches differ a bit between training and inference. Also note that several of the approaches rely on batching multiple requests (prompts) to exploit the parallelism, so you won't see the same gains if the server is fed only a single prompt at a time.

[1]: https://medium.com/@plienhar/llm-inference-series-4-kv-cachi...

[2]: https://arxiv.org/abs/2104.04473



