DeveloperErrata's comments

DeveloperErrata | 1 month ago | on: FlashAttention-T: Towards Tensorized Attention

Not quite. Most of the recent work on modern RNNs has been addressing this exact limitation. For instance, linear attention yields formulations that can be equivalently interpreted as either a parallel operation or a recurrent one. The consequence is that these parallelizable versions of RNNs are often "less expressive per param" than their old-school non-parallelizable RNN counterparts, though you could argue that they make up for that in practice by being more powerful per unit of training compute, thanks to much better training efficiency.
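
To make the parallel/recurrent equivalence concrete, here is a minimal sketch of causal linear attention computed both ways. It assumes the simplest possible variant (no feature map, no normalization, no decay), which is not necessarily what any particular paper uses; the function names are made up for illustration. The "parallel" version is written with loops for readability, but every output position is independent of the others, so in practice it would be a masked matrix product.

```python
import random

def parallel_form(Q, K, V):
    """Attention-style view: o_t = sum over s <= t of (q_t . k_s) * v_s.
    Each t is computed independently, so this can run in parallel over t."""
    d_v = len(V[0])
    out = []
    for t in range(len(Q)):
        o = [0.0] * d_v
        for s in range(t + 1):  # causal mask: only positions up to t
            score = sum(q * k for q, k in zip(Q[t], K[s]))
            for j in range(d_v):
                o[j] += score * V[s][j]
        out.append(o)
    return out

def recurrent_form(Q, K, V):
    """RNN-style view: state S_t = S_{t-1} + outer(k_t, v_t), output o_t = S_t^T q_t.
    The state has fixed size (d_k x d_v) regardless of sequence length."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q, k, v in zip(Q, K, V):
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] += k[i] * v[j]
        out.append([sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return out

# Toy check on random inputs (sizes chosen arbitrarily).
rng = random.Random(0)
T, d_k, d_v = 5, 3, 2
Q = [[rng.uniform(-1, 1) for _ in range(d_k)] for _ in range(T)]
K = [[rng.uniform(-1, 1) for _ in range(d_k)] for _ in range(T)]
V = [[rng.uniform(-1, 1) for _ in range(d_v)] for _ in range(T)]

max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(parallel_form(Q, K, V), recurrent_form(Q, K, V))
    for a, b in zip(row_a, row_b)
)
print(max_diff)  # the two forms should agree up to floating-point rounding
```

The state update here is purely additive and linear, which is what allows the parallel rewrite. That restriction is also roughly where the "less expressive per param" trade-off comes from.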

DeveloperErrata | 7 months ago | on: LLM architecture comparison

This was really educational for me. It felt like the perfect level of abstraction for learning a lot about the specifics of LLM architecture without the difficulty of parsing the original papers.

DeveloperErrata | 8 months ago | on: Grok 4 Launch [video]

Don't know how Grok is set up, but in earlier models the vision backbone was effectively a separate model, trained to convert vision inputs into a tokenized output. Those outputs took the form of "soft tokens" that the main model would treat as input and attend to just like it would for text token inputs. Because they're two separate things, you can modify each somewhat independently. Not sure how things are currently set up tho.
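
A rough sketch of the "soft token" idea, with toy sizes and random stand-ins for the learned weights. Everything here is hypothetical (the names, the dimensions, and the choice of a single linear projection); it only illustrates that projected vision features and text embeddings end up in one sequence of same-sized vectors.

```python
import random

rng = random.Random(0)
D_VISION, D_MODEL, VOCAB_SIZE = 4, 6, 10  # toy sizes, chosen arbitrarily

# Random stand-ins for learned weights (assumptions, not any real model's parameters).
embedding_table = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(VOCAB_SIZE)]
projection = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(D_VISION)]

def project_patches(patch_features):
    """Map each vision feature vector (length D_VISION) to a soft token (length D_MODEL)."""
    return [
        [sum(p[i] * projection[i][j] for i in range(D_VISION)) for j in range(D_MODEL)]
        for p in patch_features
    ]

def build_input_sequence(patch_features, text_token_ids):
    """Concatenate soft tokens and text embeddings into one input sequence."""
    soft_tokens = project_patches(patch_features)
    text_embeddings = [embedding_table[t] for t in text_token_ids]
    return soft_tokens + text_embeddings

# Pretend the vision backbone produced 3 patch features; the text has 2 tokens.
patches = [[rng.gauss(0, 1) for _ in range(D_VISION)] for _ in range(3)]
sequence = build_input_sequence(patches, [1, 7])
print(len(sequence), len(sequence[0]))
```

Since the main model only sees the projected vectors, the vision side and the language side can be swapped or retrained somewhat independently, as long as the projection keeps mapping into the same embedding space.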

DeveloperErrata | 11 months ago | on: Show HN: Chonky – a neural approach for text semantic chunking

Trueish - for orgs that can't use API models for regulatory or security reasons, or that just need really efficient high throughput models, setting up your own infra for long context models can still be pretty complicated and expensive. Careful chunking and thoughtful design of the RAG system often still matters a lot in that context.

DeveloperErrata | 1 year ago | on: URAvatar: Universal Relightable Gaussian Codec Avatars

Seems like this would (eventually) be big for VR applications, especially if the avatar could be animated using sensors installed on the headset so that the expressions match the headset user. Reminds me of the metaverse demo with Zuckerberg and Lex Fridman.

DeveloperErrata | 1 year ago | on: Show HN: Velvet – Store OpenAI requests in your own DB

I agree, a naive approach to approximate caching would probably not work for most use cases.

I'm speculating here, but I wonder if you could use a two-stage pipeline for cache retrieval (kinda like the distance search + reranker model technique used by lots of RAG pipelines). Maybe it would be possible to fine-tune a custom reranker model to only output True if two queries are semantically equivalent rather than just similar. So the hypothetical model would output True for "how to change the oil" vs. "how to replace the oil" but would output False in your Spain example. In this case you'd do distance-based retrieval first using the normal vector DB techniques, and then use your custom reranker to validate that the potential cache hits are actual hits.
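
A minimal sketch of that two-stage lookup, under heavy assumptions: the bag-of-words `embed` stands in for a real embedding model, and `is_equivalent` is a hard-coded synonym rule standing in for the hypothetical fine-tuned reranker. None of these names come from a real library.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real system would use a learned embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical stand-in for the fine-tuned equivalence model.
SYNONYMS = {"replace": "change"}

def is_equivalent(query_a, query_b):
    """Stage 2 placeholder: True only if the queries match after synonym normalization."""
    def normalize(q):
        return [SYNONYMS.get(w, w) for w in q.lower().split()]
    return normalize(query_a) == normalize(query_b)

def lookup(query, cache, top_k=3, min_similarity=0.5):
    """Return a cached answer only if a candidate passes both stages, else None."""
    # Stage 1: cheap similarity search to collect candidate cache hits.
    q_vec = embed(query)
    scored = sorted(((cosine(q_vec, embed(key)), key) for key in cache), reverse=True)
    candidates = [key for score, key in scored[:top_k] if score >= min_similarity]
    # Stage 2: keep only a candidate the verifier judges truly equivalent.
    for key in candidates:
        if is_equivalent(query, key):
            return cache[key]
    return None

cache = {"how to change the oil": "cached answer about oil changes"}
hit = lookup("how to replace the oil", cache)   # similar AND equivalent
miss = lookup("how to change the tire", cache)  # similar but NOT equivalent
print(hit, miss)
```

Both test queries clear the stage-1 similarity threshold, and only the verifier separates them, which is the failure mode a distance-only cache would get wrong.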

DeveloperErrata | 1 year ago | on: Show HN: Velvet – Store OpenAI requests in your own DB

Seems neat - I'm not sure if you do anything like this, but one thing that would be useful with RAG apps (esp at big scales) is vector-based search over cache contents. What I mean is that users can phrase the same question (which has the same answer) in tons of different ways. If I could pass a raw user query into your cache and get back the end result for a previously computed query (even if the current phrasing is a bit different from the cached phrasing), then not only would I avoid having to submit a new OpenAI call, but I could also avoid having to run my entire RAG pipeline. So kind of like a "meta-RAG" system that avoids having to run the actual RAG system for queries that are sufficiently similar to a cached query, or like an "approximate" cache.
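
Roughly what I have in mind, as a toy sketch. `ApproximateCache`, `toy_embed`, and the threshold value are all made up for illustration; a real version would use a proper embedding model and a vector index instead of a linear scan.

```python
import math

class ApproximateCache:
    """Return a stored answer when a new query is close enough to a cached one."""

    def __init__(self, embed_fn, threshold):
        self.embed_fn = embed_fn    # maps text -> list of floats (assumed interface)
        self.threshold = threshold  # minimum cosine similarity to count as a hit
        self.entries = []           # list of (vector, answer) pairs

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def get_or_compute(self, query, compute_fn):
        vec = self.embed_fn(query)
        best = max(self.entries, key=lambda e: self._cosine(vec, e[0]), default=None)
        if best is not None and self._cosine(vec, best[0]) >= self.threshold:
            return best[1]             # hit: skip the expensive pipeline entirely
        answer = compute_fn(query)     # miss: run the full pipeline once
        self.entries.append((vec, answer))
        return answer

# Toy word-count embedding over a tiny fixed vocabulary (illustration only).
VOCAB = ["how", "do", "i", "to", "change", "the", "oil", "reset", "my", "password"]

def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

pipeline_calls = []

def run_rag_pipeline(query):
    """Stand-in for the real retrieval + LLM call."""
    pipeline_calls.append(query)
    return f"answer for: {query}"

cache = ApproximateCache(toy_embed, threshold=0.6)
first = cache.get_or_compute("how to change the oil", run_rag_pipeline)
second = cache.get_or_compute("how do i change the oil", run_rag_pipeline)
third = cache.get_or_compute("how do i reset my password", run_rag_pipeline)
print(len(pipeline_calls))
```

The threshold is the hard part: too low and you return wrong answers for merely similar questions, too high and you rarely get a hit.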

DeveloperErrata | 1 year ago | on: Show HN: Convert HTML DOM to semantic markdown for use in LLMs

It's neat to see this getting attention. I've used similar techniques in production RAG systems that query over big collections of HTML docs. In our case the primary motivator was higher token efficiency (i.e. representing the same semantic content with a smaller token count).

I've found that LLMs are often bad at understanding "standard" markdown tables (of which there are many different types, see: https://pandoc.org/chunkedhtml-demo/8.9-tables.html). In our case, we found the best results when keeping the HTML tables in HTML and converting only the non-table parts of the HTML doc to markdown. We also strip the table tags of any attributes, with the exception of colspan and rowspan, which are semantically meaningful for more complicated HTML tables. I'd be curious whether there are LLM performance differences between the approach the author uses here (seems like it's based on repeating column names for each cell?) and just preserving the original HTML table structure.
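
For the attribute-stripping step, something along these lines works as a starting point. This is a sketch using Python's standard `html.parser`, not our production code; the class and function names are made up, and it deliberately ignores comments, doctypes, and self-closing tag syntax. It also leaves the markdown conversion of non-table content out entirely.

```python
from html import escape
from html.parser import HTMLParser

TABLE_TAGS = {"table", "thead", "tbody", "tfoot", "tr", "th", "td", "caption", "colgroup", "col"}
KEEP_ATTRS = {"colspan", "rowspan"}  # the only attributes kept on table tags

class TableAttributeStripper(HTMLParser):
    """Re-emit HTML, dropping all attributes on table tags except colspan/rowspan."""

    def __init__(self):
        # convert_charrefs=False so entities like &amp; are passed through unchanged.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in TABLE_TAGS:
            attrs = [(k, v) for k, v in attrs if k in KEEP_ATTRS]
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in attrs
        )
        self.parts.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

def strip_table_attributes(html_text):
    parser = TableAttributeStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)

sample = (
    '<p id="intro">Prices</p>'
    '<table class="grid" border="1"><tr>'
    '<td colspan="2" style="color:red">A &amp; B</td>'
    '</tr></table>'
)
cleaned = strip_table_attributes(sample)
print(cleaned)
```

Non-table tags keep their attributes here, since in our setup those parts get converted to markdown in a separate pass anyway.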

DeveloperErrata | 2 years ago | on: Show HN: AI-Powered Vintage Interactive Fiction Interpreter

I love old-school interactive fiction games (like Zork, etc.) but find the strict syntax endlessly frustrating. I built this ChatGPT-powered "middleman" to translate commands written in natural language into something the simple parsers of old interactive fiction games can understand.

To run, see instructions here: https://github.com/ethan-w-roland/ai-interactive-fiction

For a demo, see video here: https://www.youtube.com/watch?v=JHzeb39VqkM
