trq_'s comments

trq_ | 1 month ago | on: Claude Code daily benchmarks for degradation tracking

Yes, we do, but harnesses are hard to eval: people use them across a huge variety of tasks, and sometimes different behaviors trade off against each other. We have added some evals to catch this one in particular.

trq_ | 1 month ago | on: Claude Code daily benchmarks for degradation tracking

Hi everyone, Thariq from the Claude Code team here.

Thanks for reporting this. We fixed a Claude Code harness issue that was introduced on 1/26. This was rolled back on 1/28 as soon as we found it.

Run `claude update` to make sure you're on the latest version.

trq_ | 4 months ago | on: Claude Is Down

We're back up! It was about 30 minutes of downtime this morning; our apologies if it interrupted your work.

trq_ | 1 year ago | on: Detecting when LLMs are uncertain

Yeah! I want to use the logprobs API, but you can't for example:

- sample multiple logits and branch (we maybe could with the old text completion API, but this no longer exists)

- add in a reasoning token on the fly

- stop execution, ask the user, etc.

But a visualization of logprobs in a query seems like it might be useful.
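A minimal sketch of what such a visualization could build on. This assumes the OpenAI chat completions endpoint with `logprobs=True, top_logprobs=k`, which returns per-token log-probabilities for the top-k alternatives; the helper name `token_entropy` is hypothetical, and the sample logprobs below are made up rather than taken from a real response:

```python
import math

def token_entropy(top_logprobs):
    """Shannon entropy (in nats) over the top-k alternatives for one token.

    `top_logprobs` is a list of log-probabilities, like the per-token
    alternatives returned when logprobs=True, top_logprobs=k is passed to
    the chat completions endpoint. Since it is truncated to k entries,
    this is only a lower bound on the true entropy.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    return -sum(p * math.log(p) for p in probs if p > 0)

# A confident token: one alternative dominates.
confident = [math.log(0.97), math.log(0.02), math.log(0.01)]
# An uncertain token: probability mass spread across near-synonyms.
uncertain = [math.log(0.35), math.log(0.33), math.log(0.32)]

print(token_entropy(confident))  # low, roughly 0.15 nats
print(token_entropy(uncertain))  # close to log(3) ≈ 1.10 nats
```

A visualizer could simply color each generated token by this score to surface the uncertain spots in a completion.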

trq_ | 1 year ago | on: Detecting when LLMs are uncertain

I mean, LLMs certainly hold representations of what words mean and their relationships to each other; that's what the Key and Query matrices capture, for example.

But in this case, it means that the underlying point in embedding space doesn't map clearly to only one specific token. That's not too different from when you have an idea in your head but can't think of the word.

trq_ | 1 year ago | on: Detecting when LLMs are uncertain

Definitely, but if you can detect when you might be in one of those states, you could reflect to see exactly which state you're in.

So far this has mostly been done using Reinforcement Learning, but catching it and handling it at inference time seems like it could be interesting to explore. It's also much more approachable for open source, since only the big ML labs can do this sort of RL.

trq_ | 1 year ago | on: Detecting when LLMs are uncertain

Yeah that's been my thinking as well.

There are definitely times when entropy can be high without the model actually being uncertain (again, synonyms are the best example), but it seems promising. I want to build a visualizer using the OpenAI endpoints.

trq_ | 1 year ago | on: Detecting when LLMs are uncertain

In this case it would be a low entropy, high varentropy situation. It's confident in a few possible answers, like if it's a set of synonyms.
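A quick numeric sketch of that distinction. This is illustrative only: "varentropy" is taken here as the variance of the per-token surprisal (-log p), the helper name is hypothetical, and the distributions are made up:

```python
import math

def entropy_varentropy(probs):
    """Entropy (nats) and varentropy (variance of the surprisal -log p)
    of a next-token probability distribution."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    v = sum(p * (-math.log(p) - h) ** 2 for p in probs if p > 0)
    return h, v

# Broad uncertainty: uniform over 10 candidates.
# High entropy, zero varentropy (every token is equally surprising).
h_broad, v_broad = entropy_varentropy([0.1] * 10)

# Synonym-style split: two strong candidates plus a thin tail.
# Lower entropy, but nonzero varentropy (surprisal varies across tokens).
h_syn, v_syn = entropy_varentropy([0.48, 0.48, 0.04])

print(h_broad, v_broad)  # ≈ 2.30, 0.0
print(h_syn, v_syn)      # lower entropy, higher varentropy
```

So the two signals together separate "the model has no idea" from "the model is torn between a few specific candidates."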