Okay, so this is a PySpur ad, alright. Since I'm interested in this kind of tool, and I see on their GitHub that they don't have loops yet, I have to ask: does anyone know of a similar (node/DAG-based) tool that does support looping?
It seems to be a common problem; so far, I've played with Rivet, n8n, and "LLM Party" nodes for ComfyUI, and they all seem to focus on everything except letting you conveniently loop the flows.
I don't much like the bullet-point and listicle formatting, but the content is pretty good: it covers many papers in a lightweight way, and you can get a decent overview in 10 minutes of material that would take hours to research.
Hi, OP here; this article helped me a lot to better understand KV caches, which is ultimately why I co-wrote it with AI and read it several times before posting.
getting tired of these blog posts that end with "this post is AI-generated" as if it's going to surprise us. it's getting repetitive. imo, articles should say up front whether they're ai generated, so the reader doesn't feel stupid after reading the whole thing
with that said, i love the content! will be bookmarking for future reference
Hi, OP here. My intention wasn't to "gotcha" anyone by mentioning that at the end; it was simply to be upfront. A lot of content put out these days is obviously 100% AI-generated, yet that's never mentioned. This one was probably 80/20 (I still made many manual edits).
I feel like we’re living in strange times where your comment appears to be AI generated as well. You complain about the surprise at the end and then offer up a similar structural surprise in your reply.
Not sure if I'm getting this. Is this cache implemented as part of the forward pass through the network, in a general Python data structure like a dict? Or is the cache somehow part of the fabric of the neural network itself?
The KV cache is typically stored in a data structure external to the trained weights—often a buffer or set of tensors kept alongside the model’s forward pass (e.g., in PyTorch, one might store it in a dictionary-like container). It’s not baked into the neural network parameters themselves; instead, it’s an auxiliary memory that holds precomputed key-value pairs so the model doesn’t have to re-encode past tokens on each new inference step.
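To make that concrete, here's a minimal sketch (plain NumPy, made-up shapes, no learned projections — just the caching pattern, not any real model's implementation) of a per-layer cache living in an ordinary dict, grown by one row per decoding step:

```python
import numpy as np

# A KV cache is ordinary program state (here: a plain dict of arrays), not
# part of the trained weights. Shapes and the "layer0" key are made up.
def attend(q, K, V):
    # q: (d,), K/V: (t, d) -- attend over every cached position
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

cache = {"layer0": (np.zeros((0, 4)), np.zeros((0, 4)))}

rng = np.random.default_rng(0)
out = None
for step in range(3):
    x = rng.normal(size=4)        # stand-in for the new token's hidden state
    k, v, q = x, x, x             # real models apply learned K/V/Q projections
    K, V = cache["layer0"]
    K = np.vstack([K, k])         # append -- past keys are never recomputed
    V = np.vstack([V, v])
    cache["layer0"] = (K, V)
    out = attend(q, K, V)

print(cache["layer0"][0].shape)   # (3, 4): one cached key row per token
```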
Neither. Think of it as something like redis or memcached. It's external to the program, and the program will run just fine without it, but it avoids a lot of duplicate work.
It’s a tensor stored in GPU memory to improve inference throughput. Check out the PagedAttention paper (the one that introduced vLLM) for how most systems implement it nowadays.
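For a rough intuition (toy numbers, not vLLM's actual API): PagedAttention carves KV memory into fixed-size physical blocks and gives each sequence a block table mapping logical positions to physical blocks, much like virtual-memory paging:

```python
import numpy as np

# Toy block-table bookkeeping in the spirit of PagedAttention (not vLLM's
# real API): KV memory is a pool of fixed-size blocks, and a per-sequence
# table maps logical block indices to physical ones.
BLOCK = 4                          # tokens per block (vLLM uses e.g. 16)
D = 8                              # key/value width (made up)

pool = np.zeros((16, BLOCK, D))    # physical KV blocks
free = list(range(16))             # free-block list
block_table = []                   # logical -> physical block id
seq_len = 0

def append_kv(vec):
    """Write one token's key/value into the next free slot."""
    global seq_len
    if seq_len % BLOCK == 0:       # current block is full: allocate a new one
        block_table.append(free.pop())
    blk = block_table[seq_len // BLOCK]
    pool[blk, seq_len % BLOCK] = vec
    seq_len += 1

for t in range(10):
    append_kv(np.full(D, float(t)))

print(len(block_table))  # 3 blocks cover 10 tokens; no big contiguous alloc
```

The point of the indirection is that a growing sequence never needs one huge contiguous tensor reallocation; it just grabs another small block.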
Very clean writeup.
On the attention sinks, you mention they enable "infinite-length sequence processing". What does that mean exactly in practice? Isn't deepseek still capped at 128k?
"Infinite-length sequence processing" in StreamingLLM refers to handling much longer sequences than the model's training window (e.g., millions of tokens), by combining a sliding window for recent tokens with fixed attention sinks from the start of the sequence.
I can't speak for DeepSeek, but if I had to guess, I'd say that the infinite context window isn’t practical because storing all past tokens eventually becomes too expensive.
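The eviction policy itself is tiny. A toy sketch (sink/window sizes are made up; the StreamingLLM paper uses a handful of sinks) of which cached positions get kept:

```python
# Toy StreamingLLM-style eviction: keep a few initial "sink" positions plus
# a sliding window of the most recent ones; evict everything in between.
N_SINK, WINDOW = 4, 8

def kept_positions(positions):
    if len(positions) <= N_SINK + WINDOW:
        return positions
    return positions[:N_SINK] + positions[-WINDOW:]

kept = kept_positions(list(range(100)))  # pretend 100 tokens are cached
print(kept)  # [0, 1, 2, 3, 92, 93, 94, 95, 96, 97, 98, 99]
```

The cache size stays constant no matter how long the stream gets, which is what makes "infinite" streaming feasible in memory terms.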
Agreed on the writeup itself. It's beautifully written and presented. Kudos to Jean Kaddour and anyone else who may have been involved in putting it together.
Thanks for reading! In most contexts (including this one), seq length encompasses both the initial input (prompt) tokens and the output tokens the model generates. It’s the total length of all tokens processed by the model so far.
It’s mostly a convention. In many deep learning frameworks (PyTorch, TensorFlow, etc.), inputs are stored with the “batch × length × hidden-dim” shape, effectively making the token embeddings row vectors. Multiplying “xW” is then the natural shape-wise operation. On the other hand, classical linear algebra references often treat vectors as column vectors and write “Wx.”
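The two conventions are the same math up to a transpose. A quick NumPy check:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))
x = rng.normal(size=3)

# Row-vector convention (deep learning): x @ W
# Column-vector convention (classical linear algebra): W.T @ x
assert np.allclose(x @ W, W.T @ x)   # same numbers, transposed convention

# With a batch stored as rows (batch x hidden-dim), xW needs no transposes:
X = rng.normal(size=(5, 3))
assert (X @ W).shape == (5, 3)
```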
The phrase, “the first token looks at 1 token,” is simply a shorthand for the self-attention step when the sequence length is one. Although there are no preceding tokens, we still treat it as an O(1^2) operation where the first token effectively attends to itself (or a special [BOS] token). This approach preserves the big-O analysis when summing over all tokens.
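A small sanity check of that bookkeeping (hypothetical helper names, just totting up the per-step costs):

```python
# Step t of naive decoding recomputes self-attention over all t tokens,
# costing O(t^2); the t = 1 term is the "first token looks at 1 token,
# cost O(1^2)" case. A KV cache cuts step t down to O(t).
def cost_without_cache(n):
    return sum(t * t for t in range(1, n + 1))   # O(n^3) total

def cost_with_cache(n):
    return sum(t for t in range(1, n + 1))       # O(n^2) total

assert cost_without_cache(1) == 1    # the O(1^2) first step
assert cost_without_cache(3) == 1 + 4 + 9
assert cost_with_cache(3) == 1 + 2 + 3
```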
No way to know until you painstakingly verify every single assertion that the AI made! The author of this article certainly didn't, and the content was good enough for them.
evertedsphere | 1 year ago
it's funny that this was clear about 5% in just due to the classic chatgpt-style format and tone
TeMPOraL | 1 year ago
didgeoridoo | 1 year ago
GalaxyNova | 1 year ago
t55 | 1 year ago
llmthrow102 | 1 year ago
visarga | 1 year ago
t55 | 1 year ago
seanvelasco | 1 year ago
t55 | 1 year ago
Glad you liked it overall!
spencerf | 1 year ago
amelius | 1 year ago
t55 | 1 year ago
anvuong | 1 year ago
ahzhou | 1 year ago
deepdarkforest | 1 year ago
t55 | 1 year ago
m348e912 | 1 year ago
spps11 | 1 year ago
Thanks for the post, it was an excellent read!
t55 | 1 year ago
karolist | 1 year ago
t55 | 1 year ago
As far as I know, they are the only ones using it so far
8note | 1 year ago
I'm used to k = Wx, so seeing k = xW is jarring. Is there a reason for using horizontal vectors? Is it common in data science docs?
t55 | 1 year ago
quanto | 1 year ago
sifar | 1 year ago
pama | 1 year ago
t55 | 1 year ago
narmiouh | 1 year ago
yellow_lead | 1 year ago
> First token: Look at 1 token (cost: O(1^2))
Umm, is this right? No token exists before the first token is generated, so how do you look at it? AI slop?
t55 | 1 year ago
Vampiero | 1 year ago
Trust me, AGI is almost there.
deepstake | 1 year ago
[deleted]