typpo's comments

typpo | 1 month ago | on: Advancing finance with Claude Opus 4.6

Lately my company has been doing a lot of complex accounting and reporting in spreadsheets. Overall, I was surprised by how well both GPT and Claude handled some of these extremely tedious tasks. It's not uncommon for an hours-long task to be compressed into minutes.

My anecdotal experience is that GPT 5.2 Pro is decently ahead of Claude Opus 4.5 in this category when it gets to the tricky stuff, both in presentation and accuracy. The long reasoning seems to help a lot. But apparently the benchmarks don't agree.

Edit: noticed OpenAI specifically highlights finance use cases in their gpt-5.3-codex blog post as well: https://openai.com/index/introducing-gpt-5-3-codex/

typpo | 1 year ago | on: Open source AI is the path forward

Thanks to Meta for their work on safety, particularly Llama Guard. Llama Guard 3 adds defamation, elections, and code interpreter abuse as detection categories.

Having run many red teams recently as I build out promptfoo's red teaming feature set [0], I've noticed the Llama models punch above their weight on safety accuracy. People hate excessive guardrails, and Llama seems to thread the needle.

Very bullish on open source.

[0] https://www.promptfoo.dev/docs/red-team/

typpo | 1 year ago | on: Gemma 2: Improving Open Language Models at a Practical Size [pdf]

If anyone is interested in evaling Gemma locally, this can be done pretty easily using ollama[0] and promptfoo[1] with the following config:

  prompts:
    - 'Answer this coding problem in Python: {{ask}}'

  providers:
    - ollama:chat:gemma2:9b
    - ollama:chat:llama3:8b

  tests:
    - vars:
        ask: function to find the nth fibonacci number
    - vars:
        ask: calculate pi to the nth digit
    # ...

One small thing I've always appreciated about Gemma is that it doesn't include a "Sure, I can help you" preamble. It just gets right into the code, and follows it with an explanation. The training seems to emphasize response structure and ease of comprehension.

Also, best to run evals that don't rely on rote memorization of public code... so please substitute with your personal tests :)
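Along those lines, promptfoo lets you attach assertions to individual tests so grading isn't purely eyeballed. A rough sketch (the exact `assert` schema here is from memory, so check the docs before copying):

```yaml
  tests:
    - vars:
        ask: function to find the nth fibonacci number
      assert:
        # cheap structural check: the answer should contain a Python function
        - type: contains
          value: 'def '
        # model-graded check for actual correctness
        - type: llm-rubric
          value: provides a correct, runnable fibonacci implementation
```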

[0] https://ollama.com/library/gemma2

[1] https://github.com/promptfoo/promptfoo

typpo | 1 year ago | on: Google scrambles to manually remove weird AI answers in search

The problem in this case is not that it was trained on bad data. The AI summaries are just that - summaries - and there are bad results that it faithfully summarizes.

This is an attempt to reduce hallucinations coming full circle. A simple summarization model was meant to reduce hallucination risk, but now it's not discerning enough to exclude untruthful results from the summary.

typpo | 1 year ago | on: Veo

The amount of negativity in these comments is astounding. Congrats to the teams at Google on what they have built, and hoping for more competition and progress in this space.

typpo | 1 year ago | on: Ollama v0.1.33 with Llama 3, Phi 3, and Qwen 110B

Paul's benchmarks are excellent and they're the first thing I look for to get a sense of a new model's performance :)

For those looking to create their own benchmarks, promptfoo[0] is one way to do this locally:

  prompts:
    - "Write this in Python 3: {{ask}}"
  
  providers:
    - ollama:chat:llama3:8b
    - ollama:chat:phi3
    - ollama:chat:qwen:7b
    
  tests:
    - vars:
        ask: a function to determine if a number is prime
    - vars:
        ask: a function to split a restaurant bill given individual contributions and shared items

Jumping in because I'm a big believer in (1) local LLMs, and (2) evals specific to individual use cases.
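If you want the same checks applied to every case, promptfoo also supports a `defaultTest` block with shared assertions. A hedged sketch (field names from memory of the config schema, so verify against the current docs):

```yaml
  defaultTest:
    assert:
      # every completion should actually define a function
      - type: contains
        value: 'def '
      # and skip the "As an AI..." boilerplate
      - type: not-contains
        value: As an AI
```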

[0] https://github.com/typpo/promptfoo

typpo | 1 year ago | on: Show HN: I made a website that converts YT videos into step-by-step guides

Great idea and congrats on shipping the project!

I'm curious if you noticed certain models worked better for summarizing and converting to steps. For example, in my projects I've found that Gemini outperforms "better" models like GPT for similar use cases, which I guess makes sense given Google's interest in summarization.

typpo | 1 year ago | on: Meta Llama 3

Public benchmarks are broadly indicative, but devs really should run custom benchmarks on their own use cases.

Replicate created a Llama 3 API [0] very quickly. This can be used to run simple benchmarks with promptfoo [1] comparing Llama 3 vs Mixtral, GPT, Claude, and others:

  prompts:
    - 'Answer this programming question concisely: {{ask}}'

  providers:
    - replicate:meta/meta-llama-3-8b-instruct
    - replicate:meta/meta-llama-3-70b-instruct
    - replicate:mistralai/mixtral-8x7b-instruct-v0.1
    - openai:chat:gpt-4-turbo
    - anthropic:messages:claude-3-opus-20240229

  tests:
    - vars:
        ask: Return the nth element of the Fibonacci sequence
    - vars:
        ask: Write pong in HTML
    # ...
Still testing things but Llama 3 8b is looking pretty good for my set of random programming qs at least.

Edit: ollama now supports Llama 3 8b, making it easy to run this eval locally.

  providers:
    - ollama:chat:llama3

[0] https://replicate.com/blog/run-llama-3-with-an-api

[1] https://github.com/typpo/promptfoo

typpo | 1 year ago | on: Google CodeGemma: Open Code Models Based on Gemma [pdf]

If anyone wants to eval this locally versus codellama, it's pretty easy with Ollama[0] and Promptfoo[1]:

  prompts:
    - "Solve in Python: {{ask}}"

  providers:
    - ollama:chat:codellama:7b
    - ollama:chat:codegemma:instruct

  tests:
    - vars:
        ask: function to return the nth number in fibonacci sequence
    - vars:
        ask: convert roman numeral to number
    # ...

YMMV based on your coding tasks, but I notice gemma is much less verbose by default.
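That verbosity difference can be measured instead of eyeballed. One way, assuming promptfoo's `javascript` assertion type (which, as I recall, exposes the raw completion as `output` — double-check the docs):

```yaml
  defaultTest:
    assert:
      # fail any completion that pads the code with long preambles
      - type: javascript
        value: output.split(/\s+/).length < 250
```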

[0] https://github.com/ollama/ollama

[1] https://github.com/promptfoo/promptfoo

typpo | 1 year ago | on: Ask HN: What non-AI products are you working on?

I'm working on https://quickchart.io/, a web API for generating chart images. I've expanded it to a WYSIWYG chart editor at https://quickchart.io/chart-maker/, which lets you create an endpoint that you can use to generate variations of custom charts. This is useful for creating charts quickly, or using them in places that don't support dynamic charting (email, SMS, various app plugins, etc).

I messed around with some AI features, mostly just for fun and to see if they could help users onboard. But the core product is decidedly not AI.

typpo | 2 years ago | on: Launch HN: Talc AI (YC S23) – Test Sets for AI

Congrats on the launch!

I've been interested in automatic testset generation because I find that the chore of writing tests is one of the reasons people shy away from evals. Recently landed eval testset generation for promptfoo (https://github.com/typpo/promptfoo), but it is non-RAG so more simplistic than your implementation.

Was also eyeballing this paper https://arxiv.org/abs/2401.03038, which outlines a method for generating asserts from prompt version history that may also be useful for these eval tools.

typpo | 2 years ago | on: The Quadrantid meteor shower 2024 peaks tonight alongside a bright moon

I posted this visualization of mine in a recent thread on the Quadrantids, but sharing again because people seemed to enjoy it: https://www.meteorshowers.org/view/Quadrantids

It uses meteor data from NASA CAMS [1] to reconstruct the meteoroid cloud that creates the Quadrantids. When Earth passes through the cloud every year, we see a meteor shower.

Each particle in this visualization represents an actual meteor that burned up in the Earth's atmosphere. CAMS reconstructs the orbit of the meteor based on its entry trajectory by triangulating multiple recordings. CAMS is very cool!

[1] http://cams.seti.org/

typpo | 2 years ago | on: A collection of LLM evaluation tools

Evals are important for LLM app development. I've noticed dozens of tools in this space, including 11 (!) YC companies, so I put them together on a page.

typpo | 2 years ago | on: Cosmological galaxy formation simulation software

I like seeing how familiar structures appear at cosmological scale. Long ago I created a webgl visualization of the Millennium Run, an early large-scale cosmological simulation: https://www.asterank.com/galaxies/

It was a nice way to learn about three.js/webgl and how to make many particles performant. There are probably better visualizations out there nowadays.
