DeveloperErrata
|
1 month ago
|
on: FlashAttention-T: Towards Tensorized Attention
Not quite - most of the recent work on modern RNNs has been addressing this exact limitation. For instance, linear attention yields formulations that can be interpreted equivalently as either a parallel operation or a recurrent one. The consequence is that these parallelizable versions of RNNs are often "less expressive per parameter" than their old-school non-parallelizable RNN counterparts, though you could argue that they make up for it in practice by being more powerful per unit of training compute, via much better training efficiency.
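To make the parallel/recurrent duality concrete, here's a minimal NumPy sketch of (un-normalized) linear attention computed both ways. The dimensions and the elu+1-style feature map are illustrative choices, not any particular paper's implementation, and the usual normalization term is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 4
# elu(x)+1 feature map keeps attention scores positive
phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
q, k = phi(Q), phi(K)

# Parallel form: masked quadratic attention, O(T^2) but fully parallel
mask = np.tril(np.ones((T, T)))
out_parallel = (q @ k.T * mask) @ V

# Recurrent form: maintain a d x d state matrix, O(1) memory per step
S = np.zeros((d, d))
out_rec = np.zeros((T, d))
for t in range(T):
    S += np.outer(k[t], V[t])   # state update: S_t = S_{t-1} + phi(k_t) v_t^T
    out_rec[t] = q[t] @ S       # readout: o_t = phi(q_t) S_t

assert np.allclose(out_parallel, out_rec)
```

The parallel form is what you train with (good GPU utilization); the recurrent form is what you run at inference (constant memory per step). The expressivity cost is that the entire history gets squeezed into that fixed-size d x d state.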
DeveloperErrata
|
7 months ago
|
on: LLM architecture comparison
This was really educational for me; it felt at the perfect level of abstraction to learn a lot about the specifics of LLM architectures without the difficulty of parsing the original papers.
DeveloperErrata
|
8 months ago
|
on: Grok 4 Launch [video]
Don't know how Grok is set up, but in earlier models the vision backbone was effectively a separate model trained to convert vision inputs into a tokenized output, where the tokenized outputs took the form of "soft tokens" that the main model would treat as input and attend to just like text token inputs. Because they're two separate things, you can modify each somewhat independently. Not sure how things are currently set up, though.
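A rough sketch of the soft-token idea, with made-up dimensions (patch count, feature widths) just for illustration - vision features get projected into the LLM's embedding space and concatenated with text embeddings, so the LLM attends over them like any other tokens:

```python
import numpy as np

rng = np.random.default_rng(0)
vision_dim, n_patches, llm_dim = 1024, 256, 4096

# Learned projection ("adapter") from vision features to LLM embedding space;
# this can be trained while the two backbones stay frozen.
W = rng.standard_normal((vision_dim, llm_dim)) * 0.01

patch_feats = rng.standard_normal((n_patches, vision_dim))  # vision backbone output
soft_tokens = patch_feats @ W                               # (256, 4096)

text_embeds = rng.standard_normal((32, llm_dim))            # embedded text prompt
# The main model attends over this concatenation exactly as over text tokens
inputs_embeds = np.concatenate([soft_tokens, text_embeds])
```

Because only the projection (and optionally the vision backbone) needs to agree on this interface, you can swap or retune either side somewhat independently.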
DeveloperErrata
|
10 months ago
|
on: Why it is (nearly) impossible that we live in a simulation
Consider the difference between the requirements to simulate the universe and to simulate a person's experience of the universe. As people in the universe, we wouldn't be able to tell the difference, but the latter would have much lower requirements.
DeveloperErrata
|
11 months ago
|
on: Show HN: Chonky – a neural approach for text semantic chunking
True-ish - for orgs that can't use API models for regulatory or security reasons, or that just need really efficient, high-throughput models, setting up your own infra for long-context models can still be pretty complicated and expensive. Careful chunking and thoughtful design of the RAG system often still matters a lot in that context.
DeveloperErrata
|
1 year ago
|
on: Nvidia Dynamo: A Datacenter Scale Distributed Inference Serving Framework
Increasingly so. Many other popular inference tools in this space also expose an OpenAI-compatible API: vLLM, llama.cpp, and LiteLLM all do.
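The practical upshot is that the same client code works against any of these servers by swapping the base URL. A stdlib-only sketch (the helper name, model name, and port are hypothetical; vLLM's server defaults to port 8000):

```python
import json
from urllib import request

def chat_completion(base_url, model, messages, send=False):
    """Build (and optionally send) an OpenAI-style chat completion request.
    Works against any OpenAI-compatible server (vLLM, llama.cpp, LiteLLM...)."""
    url = base_url.rstrip("/") + "/v1/chat/completions"
    payload = {"model": model, "messages": messages}
    if not send:  # actually sending requires a live server
        return url, payload
    req = request.Request(url, data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)

# Point the same code at vLLM, llama.cpp's server, or OpenAI itself:
url, body = chat_completion("http://localhost:8000", "my-local-model",
                            [{"role": "user", "content": "hello"}])
```

The official `openai` client works the same way - you just pass a `base_url` pointing at the local server.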
DeveloperErrata
|
1 year ago
|
on: URAvatar: Universal Relightable Gaussian Codec Avatars
Seems like this would (eventually) be big for VR applications, especially if the avatar could be animated using sensors on the headset so that its expressions match the wearer's. Reminds me of the metaverse demo with Zuckerberg and Lex Fridman.
DeveloperErrata
|
1 year ago
|
on: AI engineers claim new algorithm reduces AI power consumption by 95%
MacBook Pros with an M3 and unified memory (RAM shared between CPU and GPU) can run 70B models :)
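The back-of-the-envelope math for why that works (weight memory only; real quantized file sizes and KV-cache overhead vary):

```python
# Approximate weight memory for a 70B-parameter model at different precisions
params = 70e9
bytes_per_param = {"fp16": 2, "int8": 1, "int4": 0.5}
for fmt, b in bytes_per_param.items():
    gb = params * b / 1024**3
    print(f"{fmt}: ~{gb:.0f} GB")
# int4 quantization needs roughly 33 GB of weights, which fits in 64 GB
# of unified memory with room left over for the KV cache.
```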
DeveloperErrata
|
1 year ago
|
on: Ask HN: Platform for 11 year old to create video games?
I want to plug the LittleBigPlanet series of games - it's what got me into programming when I was young, and I think it still has a lot of charm.
DeveloperErrata
|
1 year ago
|
on: Show HN: Velvet – Store OpenAI requests in your own DB
I agree, a naive approach to approximate caching would probably not work for most use cases.
I'm speculating here, but I wonder if you could use a two-stage pipeline for cache retrieval (kinda like the distance search + reranker technique used by lots of RAG pipelines). Maybe it would be possible to fine-tune a custom reranker model to only output True if two queries are semantically equivalent rather than just similar. So the hypothetical model would output True for "how to change the oil" vs. "how to replace the oil" but would output False in your Spain example. In this case you'd do distance-based retrieval first using the normal vector DB techniques, and then use your custom reranker to validate that the potential cache hits are actual hits.
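A toy sketch of the two-stage shape - everything here is a stand-in: `embed` is a hashed bag-of-words instead of a real embedding model, and `equivalent` is a synonym-table rule instead of a fine-tuned cross-encoder/reranker:

```python
import numpy as np

def embed(text):
    """Toy stand-in for a real embedding model: hashed bag-of-words."""
    v = np.zeros(64)
    for tok in text.lower().split():
        v[hash(tok) % 64] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

SYNONYMS = {"replace": "change"}  # toy normalization just for this demo

def equivalent(q1, q2):
    """Stand-in for a reranker that fires only on *equivalent* queries."""
    norm = lambda q: {SYNONYMS.get(t, t) for t in q.lower().split()}
    return norm(q1) == norm(q2)

class SemanticCache:
    def __init__(self, threshold=0.7):
        self.entries = []          # (query, embedding, cached_result)
        self.threshold = threshold

    def get(self, query):
        qv = embed(query)
        # Stage 1: cheap vector similarity proposes candidates
        candidates = [(float(qv @ ev), q, r) for q, ev, r in self.entries]
        for score, q, r in sorted(candidates, reverse=True):
            if score < self.threshold:
                break
            # Stage 2: strict equivalence check validates the hit
            if equivalent(query, q):
                return r
        return None

    def put(self, query, result):
        self.entries.append((query, embed(query), result))

cache = SemanticCache()
cache.put("how to change the oil", "...cached RAG answer...")
assert cache.get("how to replace the oil") == "...cached RAG answer..."
assert cache.get("best beaches in Spain") is None
```

The stage-1 threshold can stay loose since stage 2 is what prevents the "merely similar" false positives.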
DeveloperErrata
|
1 year ago
|
on: Show HN: Velvet – Store OpenAI requests in your own DB
Seems neat - I'm not sure if you do anything like this, but one thing that would be useful with RAG apps (esp at big scales) is vector-based search over cache contents. What I mean is that users can phrase the same question (which has the same answer) in tons of different ways. If I could pass a raw user query into your cache and get back the end result for a previously computed query (even if the current phrasing is a bit different from the cached phrasing), then not only would I avoid having to submit a new OpenAI call, but I could also avoid having to run my entire RAG pipeline. So kind of like a "meta-RAG" system that avoids having to run the actual RAG system for queries that are sufficiently similar to a cached query, or like an "approximate" cache.
DeveloperErrata
|
1 year ago
|
on: Show HN: Convert HTML DOM to semantic markdown for use in LLMs
It's neat to see this getting attention. I've used similar techniques in production RAG systems that query over big collections of HTML docs. In our case the primary motivator was higher token efficiency (ie to represent the same semantic content but with a smaller token count).
I've found that LLMs are often bad at understanding "standard" markdown tables (of which there are many different types, see: https://pandoc.org/chunkedhtml-demo/8.9-tables.html). In our case, we found the best results when keeping the HTML tables in HTML, only converting the non-table parts of the HTML doc to markdown. We also strip the table tags of any attributes, with the exception of colspan and rowspan, which are semantically meaningful for more complicated HTML tables. I'd be curious if there are LLM performance differences between the approach the author uses here (which seems to be based on repeating column names for each cell?) and just preserving the original HTML table structure.
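The attribute-stripping step can be done with the stdlib alone - a rough sketch (a real pipeline might use lxml or BeautifulSoup, and the sample HTML here is just illustrative):

```python
from html.parser import HTMLParser

KEEP = {"colspan", "rowspan"}  # semantically meaningful for merged cells

class TableStripper(HTMLParser):
    """Re-emit HTML, dropping every attribute except colspan/rowspan."""
    def __init__(self):
        super().__init__()
        self.out = []
    def handle_starttag(self, tag, attrs):
        kept = "".join(f' {k}="{v}"' for k, v in attrs if k in KEEP)
        self.out.append(f"<{tag}{kept}>")
    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")
    def handle_data(self, data):
        self.out.append(data)

html = '<table class="wide" style="x"><tr><td colspan="2" id="c1">A</td></tr></table>'
p = TableStripper()
p.feed(html)
cleaned = "".join(p.out)
# cleaned == '<table><tr><td colspan="2">A</td></tr></table>'
```

The class/style/id noise is what eats tokens; colspan/rowspan are kept because flattening merged cells genuinely changes the table's meaning.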
DeveloperErrata
|
2 years ago
|
on: Show HN: AI-Powered Vintage Interactive Fiction Interpreter