typpo's comments
typpo | 1 month ago | on: Advancing finance with Claude Opus 4.6
typpo | 9 months ago | on: Visualizing 100k Years of Earth in WebGL
typpo | 1 year ago | on: Show HN: Time Portal – Get dropped into history, guess where you landed
typpo | 1 year ago | on: Open source AI is the path forward
Having run many red teams recently as I build out promptfoo's red teaming feature set [0], I've noticed the Llama models punch above their weight on safety accuracy. People hate excessive guardrails, and Llama seems to thread the needle.
Very bullish on open source.
typpo | 1 year ago | on: Gemma 2: Improving Open Language Models at a Practical Size [pdf]
prompts:
  - 'Answer this coding problem in Python: {{ask}}'
providers:
  - ollama:chat:gemma2:9b
  - ollama:chat:llama3:8b
tests:
  - vars:
      ask: function to find the nth fibonacci number
  - vars:
      ask: calculate pi to the nth digit
  # ...
One small thing I've always appreciated about Gemma is that it doesn't include a "Sure, I can help you" preamble. It just gets right into the code and follows it with an explanation. The training seems to emphasize response structure and ease of comprehension.
Also, it's best to run evals that don't rely on rote memorization of public code... so please substitute your personal tests :)
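Since rote memorization is a concern, one option is to grade model output against a deterministic reference rather than eyeballing it. A minimal sketch for the fibonacci prompt above (the function name and 0-indexed convention are my own illustrative choices, not anything promptfoo requires):

```python
def fib(n: int) -> int:
    """Return the nth Fibonacci number, 0-indexed: fib(0) = 0, fib(1) = 1."""
    if n < 0:
        raise ValueError("n must be non-negative")
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b  # slide the window forward one step
    return a
```

A reference like this can back a custom assertion, so the eval checks values instead of string-matching model prose.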
typpo | 1 year ago | on: Show HN: I built a backend so simple that it fits in a YAML file
typpo | 1 year ago | on: Google scrambles to manually remove weird AI answers in search
This is hallucination mitigation coming full circle: a simple summarization model was meant to reduce hallucination risk, but it isn't discerning enough to exclude untruthful results from the summary.
typpo | 1 year ago | on: Ollama v0.1.33 with Llama 3, Phi 3, and Qwen 110B
For those looking to create their own benchmarks, promptfoo[0] is one way to do this locally:
prompts:
  - "Write this in Python 3: {{ask}}"
providers:
  - ollama:chat:llama3:8b
  - ollama:chat:phi3
  - ollama:chat:qwen:7b
tests:
  - vars:
      ask: a function to determine if a number is prime
  - vars:
      ask: a function to split a restaurant bill given individual contributions and shared items
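For prompts like the prime check above, a deterministic reference implementation makes grading unambiguous. A minimal sketch in Python (the name and trial-division approach are illustrative assumptions, not part of the config):

```python
def is_prime(n: int) -> bool:
    """Trial division up to sqrt(n); plenty fast as a reference for small eval cases."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2  # only odd candidates need checking
    return True
```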
Jumping in because I'm a big believer in (1) local LLMs, and (2) evals specific to individual use cases.
typpo | 1 year ago | on: Show HN: I made a website that converts YT videos into step-by-step guides
I'm curious if you noticed certain models worked better for summarizing and converting to steps. For example, in my projects I've found that Gemini outperforms "better" models like GPT for similar use cases, which I guess makes sense given Google's interest in summarization.
typpo | 1 year ago | on: Meta Llama 3
Replicate created a Llama 3 API [0] very quickly. This can be used to run simple benchmarks with promptfoo [1] comparing Llama 3 vs Mixtral, GPT, Claude, and others:
prompts:
  - 'Answer this programming question concisely: {{ask}}'
providers:
  - replicate:meta/meta-llama-3-8b-instruct
  - replicate:meta/meta-llama-3-70b-instruct
  - replicate:mistralai/mixtral-8x7b-instruct-v0.1
  - openai:chat:gpt-4-turbo
  - anthropic:messages:claude-3-opus-20240229
tests:
  - vars:
      ask: Return the nth element of the Fibonacci sequence
  - vars:
      ask: Write pong in HTML
  # ...
Still testing things, but Llama 3 8b is looking pretty good for my set of random programming questions at least.
Edit: ollama now supports Llama 3 8b, making it easy to run this eval locally:
providers:
  - ollama:chat:llama3
[0] https://replicate.com/blog/run-llama-3-with-an-api
typpo | 1 year ago | on: Google CodeGemma: Open Code Models Based on Gemma [pdf]
prompts:
  - "Solve in Python: {{ask}}"
providers:
  - ollama:chat:codellama:7b
  - ollama:chat:codegemma:instruct
tests:
  - vars:
      ask: function to return the nth number in fibonacci sequence
  - vars:
      ask: convert roman numeral to number
  # ...
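The roman numeral prompt above also has an easy ground-truth reference to compare model output against. A minimal sketch (the lookup-table approach and names here are just one common way to do it, not anything from the config):

```python
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(s: str) -> int:
    """Sum numeral values, subtracting when a smaller one precedes a larger (e.g. IV = 4)."""
    total = 0
    for ch, nxt in zip(s, s[1:] + " "):  # pad so the last numeral pairs with a blank
        value = ROMAN[ch]
        total += -value if ROMAN.get(nxt, 0) > value else value
    return total
```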
YMMV based on your coding tasks, but I notice gemma is much less verbose by default.
typpo | 1 year ago | on: Ask HN: What non-AI products are you working on?
I messed around with some AI features, mostly just for fun and to see if they could help users onboard. But the core product is decidedly not AI.
typpo | 2 years ago | on: Launch HN: Talc AI (YC S23) – Test Sets for AI
I've been interested in automatic test set generation because the chore of writing tests is one of the reasons people shy away from evals. I recently landed eval test set generation for promptfoo (https://github.com/typpo/promptfoo), but it is non-RAG and therefore simpler than your implementation.
Was also eyeballing this paper https://arxiv.org/abs/2401.03038, which outlines a method for generating asserts from prompt version history that may also be useful for these eval tools.
typpo | 2 years ago | on: The Quadrantid meteor shower 2024 peaks tonight alongside a bright moon
It uses meteor data from NASA CAMS [1] to reconstruct the meteoroid cloud that creates the Quadrantids. When Earth passes through the cloud every year, we see a meteor shower.
Each particle in this visualization represents an actual meteor that burned up in the Earth's atmosphere. CAMS reconstructs each meteor's orbit from its entry trajectory by triangulating multiple recordings. CAMS is very cool!
typpo | 2 years ago | on: 2024 Quadrantid meteor shower to peak January 3-4
The Quadrantids are interesting because their source is not obvious; the most likely candidate (as noted in the article) is an asteroid on a relatively unusual orbit that is probably an extinct comet.
typpo | 2 years ago | on: A collection of LLM evaluation tools
typpo | 2 years ago | on: Ask HN: What side projects landed you a job?
typpo | 2 years ago | on: My $500M Mars rover mistake
Incidentally, this happened to Lewicki a few years later when Planetary Resources' first satellite blew up on an Antares rocket: https://www.geekwire.com/2014/rocket-carrying-planetary-reso...
typpo | 2 years ago | on: Cosmological galaxy formation simulation software
It was a nice way to learn about three.js/webgl and how to make many particles performant. There are probably better visualizations out there nowadays.
My anecdotal experience is that GPT 5.2 Pro is decently ahead of Claude Opus 4.5 in this category when it gets to the tricky stuff, both in presentation and accuracy. The long reasoning seems to help a lot. But apparently the benchmarks do not agree.
Edit: I noticed OpenAI also specifically highlights finance use cases in its gpt-5.3-codex blog post: https://openai.com/index/introducing-gpt-5-3-codex/