timbilt's comments

timbilt | 7 months ago | on: Benchmarking GPT-5 on 400 real-world code reviews

Yes, but in a case like this it's a neutral third-party running the benchmark. So there isn't a direct incentive for them to favor one lab over another.

With public benchmarks we're trusting the labs not to cheat. And it's easy to "cheat" accidentally - they actually need to make a serious effort to not contaminate the training data.
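To make "accidental cheating" concrete: a minimal sketch of the kind of decontamination check labs have to run, flagging training documents that share long word n-grams with benchmark items. This is a toy illustration, not any lab's actual pipeline, and all names here are made up.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_doc: str, benchmark_items: list, n: int = 8) -> bool:
    """True if the training document shares any n-gram with a benchmark item."""
    doc_grams = ngrams(train_doc, n)
    return any(doc_grams & ngrams(item, n) for item in benchmark_items)

benchmark = ["def add(a, b): return a + b  # classic interview warm-up question"]
clean_doc = "completely unrelated prose about gardening and soil quality for tomatoes"
leaked_doc = "see this snippet: def add(a, b): return a + b  # classic interview warm-up question"

print(is_contaminated(clean_doc, benchmark))   # False
print(is_contaminated(leaked_doc, benchmark))  # True
```

Real decontamination is much harder than this (paraphrases, translations, and partial overlaps all slip past exact n-gram matching), which is the point: avoiding it takes serious, deliberate effort.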

And there are massive incentives for the labs to cheat in order to get the hype going around their launch and justify their massive investments in training. It doesn't have to be the CEO directing it. It can even be one or a few researchers who are responsible for a specific area of model performance and are under tremendous pressure to deliver.

timbilt | 7 months ago | on: Benchmarking GPT-5 on 400 real-world code reviews

> Unlike many public benchmarks, the PR Benchmark is private, and its data is not publicly released. This ensures models haven’t seen it during training, making results fairer and more indicative of real-world generalization.

This is key.

Public benchmarks are essentially trust-based and the trust just isn't there.

timbilt | 1 year ago | on: SOTA Code Retrieval with Efficient Code Embedding Models

anyone else concerned that training models on synthetic, LLM-generated data might push us into a linguistic feedback loop? relying on LLM text for training could bias the next model towards even more overuse of words like "delve", "showcasing", and "underscores"...

timbilt | 1 year ago | on: Solving key challenges in AI-assisted code reviews

Until we get real-time learning to work in production, every AI tool feels like it's getting dumber over time. It goes very quickly from "wow this is magic" to noticing all the little gaps. I think we have a fundamental expectation that intelligence learns, and when it doesn't, it just doesn't seem that smart.

timbilt | 1 year ago | on: Effective AI code suggestions: less is more

The weirdness of LLMs is that they're so damn good at so many things, but then you see these glaring gaps that instantly make them seem dumb. We desperately need benchmarks and evals that test these kinds of hard-to-pin-down cognitive abilities.

timbilt | 1 year ago | on: Introducing Qodo Cover: Automate Test Coverage

Unit tests are more commonly written to future-proof code against issues down the road than to discover existing bugs. A code base with good test coverage is considered more maintainable — you can make changes without worrying that they will break something in an unexpected place.

I think automating test coverage would be really useful if you needed to refactor a legacy project — you want to be sure that as you change the code, the existing functionality is preserved. I could imagine running this to generate tests and get to good coverage before starting the refactor.

timbilt | 1 year ago | on: Introducing Qodo Cover: Automate Test Coverage

> validates each test to ensure it runs successfully, passes, and increases code coverage

This seems to be based on the cover agent open source which implements Meta's TestGen-LLM paper. https://www.qodo.ai/blog/we-created-the-first-open-source-im...

After generating each test, it's automatically run — it needs to pass and increase coverage, otherwise it's discarded.

This means you're guaranteed to get working tests that aren't repetitions of existing tests. You just need to do a quick review to check that they aren't doing something strange and they're good to go.
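The accept/discard loop described above is simple to sketch. This is a minimal, hypothetical rendering of the idea (generate, run, keep only if it passes and raises coverage), not the cover-agent API; `run_test` and `measure_coverage` are stand-in callables.

```python
def filter_generated_tests(candidates, run_test, measure_coverage, baseline_coverage):
    """Keep only candidate tests that pass AND raise total coverage."""
    accepted = []
    coverage = baseline_coverage
    for test in candidates:
        if not run_test(test):           # must pass when executed
            continue                     # failing test: discard
        new_coverage = measure_coverage(accepted + [test])
        if new_coverage <= coverage:     # must add coverage
            continue                     # redundant test: discard
        accepted.append(test)
        coverage = new_coverage
    return accepted, coverage

# Demo with stubbed runners: each "test" is (name, passes, lines_covered).
def run_test(t):
    return t[1]

def measure_coverage(tests):
    covered = set()
    for t in tests:
        covered |= t[2]
    return len(covered)

candidates = [
    ("t_pass_new",  True,  {1, 2, 3}),  # passes, covers new lines -> kept
    ("t_fail",      False, {4, 5}),     # fails -> discarded
    ("t_redundant", True,  {2, 3}),     # passes, adds no coverage -> discarded
]
kept, cov = filter_generated_tests(candidates, run_test, measure_coverage, 0)
print([t[0] for t in kept], cov)  # ['t_pass_new'] 3
```

The coverage check is what filters out repetitions of existing tests: a test that only re-covers already-covered lines never makes it into the accepted set.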

timbilt | 1 year ago | on: WhisperNER: Unified Open Named Entity and Speech Recognition

I think one of the biggest advantages is the security/privacy benefit — you can see in the demo that the model can mask entities instead of tagging them. This means that instead of transcribing and then scrubbing sensitive info, you can prevent the sensitive info from ever being transcribed. Another potential benefit is lower latency. The paper doesn't specifically mention latency, but it seems to be on par with normal Whisper, so you save all of the time it would normally take to do entity tagging — a big deal for real-time applications.
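For contrast, here's a toy version of the traditional pipeline: transcribe first, then scrub with regexes. Note that the raw sensitive text exists in memory (and potentially in logs) before scrubbing — in-model masking is meant to close exactly that window. The patterns below are illustrative, not a real redaction system.

```python
import re

def scrub(transcript: str) -> str:
    """Redact simple phone-number and email patterns after the fact."""
    transcript = re.sub(r"\b\d{3}-\d{3}-\d{4}\b", "[PHONE]", transcript)
    transcript = re.sub(r"\b[\w.]+@[\w.]+\.\w+\b", "[EMAIL]", transcript)
    return transcript

raw = "call me at 555-867-5309 or mail jane.doe@example.com"
print(scrub(raw))  # call me at [PHONE] or mail [EMAIL]
```

A model that masks at transcription time never emits `raw` in the first place, which is a meaningfully stronger privacy guarantee than any after-the-fact scrubber.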