item 36388563


claudiawerner | 2 years ago

I use locally hosted LLM models almost exclusively for roleplaying, and I've found that the only acceptable models for that use case are 30B or larger (though in my experience the jump from a 40B model to a 65B model hits diminishing returns).

What I'd really like to see improve is larger contexts. The current context size for llama.cpp-compatible models is 2048 tokens, which, after the scenario and character descriptions at the start of the prompt, leaves the LLM only about three paragraphs of memory. It simply forgets anything you said more than three paragraphs ago, which makes for a pretty miserable longer-term RP experience unless you constantly update a summary of what's happened in the story/roleplay so far and send it with every prompt.
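The rolling-summary workaround above can be sketched roughly like this. This is just an illustration, not anyone's actual frontend: a naive whitespace split stands in for the real llama.cpp tokenizer, and the summary itself is assumed to be maintained by hand (or by a separate summarization call, not shown).

```python
CONTEXT_LIMIT = 2048  # llama.cpp's default context size


def count_tokens(text: str) -> int:
    # Placeholder: real counts come from the model's tokenizer.
    return len(text.split())


def build_prompt(character: str, summary: str, turns: list[str],
                 reserve: int = 256) -> str:
    """Assemble a prompt that fits inside the context window.

    Always keeps the character card and the running story summary,
    then packs in as many of the most recent turns as still fit,
    leaving `reserve` tokens of headroom for the model's reply.
    """
    budget = CONTEXT_LIMIT - reserve
    header = character + "\n\n" + summary
    budget -= count_tokens(header)
    kept: list[str] = []
    # Walk backwards so the newest turns survive truncation.
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    return header + "\n\n" + "\n".join(reversed(kept))
```

Anything older than what fits is silently dropped, which is exactly why the summary has to be refreshed as the story goes on: it's the only place long-ago events survive.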

Contexts of 4096 tokens or larger running efficiently (>= 1 T/s) with layer offloading to a 12GB/24GB consumer-grade GPU would be fantastic, and bonus points if we can get 30B or 40B models working with that.


nar001 | 2 years ago

How do you self host them? What are the specs of the computer you use?