typpo's comments

typpo | 2 years ago | on: Bard can now connect to your Google Apps and services

"Extensions" and integration into the rest of the Google ecosystem could be how Bard wins at the end of the day. There are many tasks where I'd prefer an integration with my email/docs over a slightly smarter LLM. Unlike ChatGPT plugins, Google has the luxury of finetuning its model for each of their integrations.

The new feature for enriching outputs with citations from Google Search is also pretty cool.

typpo | 2 years ago | on: Asking 60 LLMs a set of 20 questions

In case anyone's interested in running their own benchmark across many LLMs, I've built a generic harness for this at https://github.com/promptfoo/promptfoo.

I encourage people considering LLM applications to test the models on their _own data and examples_ rather than extrapolating general benchmarks.

This library supports OpenAI, Anthropic, Google, Llama and CodeLlama, any model on Replicate, and any model on Ollama out of the box. As an example, I wrote up a benchmark comparing censorship in GPT vs. Llama models here: https://promptfoo.dev/docs/guides/llama2-uncensored-benchmar.... Hope this helps someone.

typpo | 2 years ago | on: GPT-4 is getting worse over time, not better

As far as I can tell, you are the only person in this thread who actually skimmed the paper. Thank you for pointing this out!

The API clearly delineates the March and June versions. The paper authors ran tests on different API versions. The fact that these versions are different is clear & transparent. Anyone can use the March version of GPT by calling the API.

gpt-4-0314: very slow, smart

gpt-4-0613: fast, less smart

typpo | 2 years ago | on: Ask HN: How are you improving your use of LLMs in production?

I'm responsible for multiple LLM apps with hundreds of thousands of DAU total. I have built and am using promptfoo to iterate: https://github.com/promptfoo/promptfoo

My workflow is based on testing: start by defining a set of representative test cases and using them to guide prompting. I tend to prefer programmatic test cases over LLM-based evals, but LLM evals seem popular these days. Then, I create a hypothesis, run an eval, and if the results show improvement I share them with the team. In some of my projects, this is integrated with CI.

The next step is closing the feedback loop and gathering real-world examples for your evals. This can be difficult to do if you respect the privacy of your users, which is why I prefer a local, open-source CLI. You'll have to set up the appropriate opt-ins to gather this data, if you gather it at all.

typpo | 2 years ago | on: GPT-Prompt-Engineer

Thanks for mentioning promptfoo. For anyone else who might prefer deterministic, programmatic evaluation of LLM outputs, I've been building this for evaluating prompts and models: https://github.com/typpo/promptfoo

Example asserts include basic string checks, regex, is-json, cosine similarity, etc. (and LLM self-eval is an option if you'd like).
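Checks like these are easy to reproduce outside any framework. Here's a rough Python sketch of the ideas; the `check_*` helpers are illustrative, not promptfoo's actual API:

```python
import json
import math
import re

def check_contains(output: str, expected: str) -> bool:
    """Basic string check: does the output contain the expected substring?"""
    return expected in output

def check_regex(output: str, pattern: str) -> bool:
    """Regex check: does the output match the pattern anywhere?"""
    return re.search(pattern, output) is not None

def check_is_json(output: str) -> bool:
    """is-json check: does the output parse as valid JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

The cosine similarity check would of course need an embeddings model on top; the rest are fully deterministic.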

typpo | 2 years ago | on: Hard stuff when building products with LLMs

This is a great summary of why productionizing LLMs is hard. I'm working on a couple of LLM products, including one that's in production for >10 million users.

The lack of formal tooling for prompt engineering drives me bonkers, and it compounds the problems outlined in the article around correctness and chaining.

Then there are the hot takes on Twitter from people claiming prompt engineering will soon be obsolete, or people selling blind prompts without any quality metrics. It's surprisingly hard to get LLMs to do _exactly_ what you want.

I'm building an open-source framework for systematically measuring prompt quality [0], inspired by best practices for traditional engineering systems.

0. https://github.com/typpo/promptfoo

typpo | 2 years ago | on: Brex’s Prompt Engineering Guide

Are there established best practices for "engineering" prompts systematically, rather than through trial-and-error?

Editing prompts is like playing whack-a-mole: once you clear an edge case, a new problem pops up elsewhere. I'd really like to be able to say, "this new prompt performs 20% better across all our test cases".
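Quantifying a claim like "20% better" is mostly bookkeeping once you have programmatic checks. A minimal sketch, with a stand-in `run_prompt` function since the real call depends on your model:

```python
def pass_rate(run_prompt, prompt, test_cases):
    """Fraction of test cases whose output passes its programmatic check."""
    passed = sum(
        1 for case in test_cases if case["check"](run_prompt(prompt, case["vars"]))
    )
    return passed / len(test_cases)

def relative_improvement(run_prompt, old_prompt, new_prompt, test_cases):
    """How much better the new prompt scores; 0.2 means '20% better'."""
    old = pass_rate(run_prompt, old_prompt, test_cases)
    new = pass_rate(run_prompt, new_prompt, test_cases)
    return (new - old) / old
```

Each test case here is just a dict of template variables plus a pass/fail check; whack-a-mole becomes visible as a pass rate that goes down instead of up.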

Because I haven't found a better way, I am building https://github.com/typpo/promptfoo, a CLI that outputs a matrix view for quickly comparing outputs across multiple prompts, variables, and models. Good luck to everyone else out there tuning prompts :)

typpo | 2 years ago | on: Ancient Earth Globe

Very true. As the author of this visualization, I airbrushed the country borders off the two most recent textures, because it was too obvious that the borders showed the Soviet Union :)

Professor Scotese was a great partner and instrumental in putting this visualization together. He's acknowledged on the site, but for those interested here is his website: http://www.scotese.com/. I believe he has a more modern iteration of the paleomap that is not downloadable on the web, but for various reasons I did not get those textures in this visualization (they didn't wrap properly iirc).

He also has a nice writeup of the methods used here: https://drive.google.com/file/d/1-q0WIa7ofISFHyBe4UxvN8DIPs8...

typpo | 2 years ago | on: Show HN: Promptfoo – CLI for testing & improving LLM prompt quality

Looks like the playground is mainly for comparison between models, not prompts, and doesn't support templating? Vercel's is similar but not free and open-source.

I'm running these tests in bulk, so I prefer to automate with the CLI, or integrate with a test framework like Jest. I think the web UI is good for tinkering, but does not fit as well into real workflows.

typpo | 2 years ago | on: Show HN: Promptfoo – CLI for testing & improving LLM prompt quality

Hi HN,

I built this because I'm tuning a bunch of prompts and don't have a great way to do this systematically.

This CLI tool helps you pick the best prompt and model by allowing you to configure multiple prompts and variables. It outputs "before" and "after" so you can easily compare LLM outputs side-by-side and determine if the prompt has improved the quality of each example.

Example use cases:

- Deciding whether it's worth using GPT-4 over GPT-3.5

- Evaluating quality improvements to your prompt across a large range of examples

- Catching regressions in edge cases as you iterate on your prompt
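The regression case in particular reduces to diffing pass/fail results between two runs. A toy sketch of the idea (not the tool's internals):

```python
def find_regressions(baseline: dict[str, bool], latest: dict[str, bool]) -> list[str]:
    """Return IDs of test cases that passed in the baseline run but fail now."""
    return [
        case_id
        for case_id, passed in baseline.items()
        if passed and not latest.get(case_id, False)
    ]
```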

It supports a handful of useful output formats (console, HTML table view, CSV, JSON, YAML), so you can integrate it into your workflow as needed. It can also be used as a library rather than a CLI.

I'm interested in hearing your thoughts and suggestions on how to improve this tool further. Thanks!

typpo | 2 years ago | on: Show HN: Text-to-Chart – embeddable natural language charts

Hi HN,

I maintain a chart generation service, QuickChart (https://github.com/typpo/quickchart), which renders millions of charts per day. The most consistent pain point for users is that charts require some programming ability, or at least conformance to a strict JSON schema.

The idea is to make chart creation more approachable. Instead of messing around with D3 or Chart.js for a one-off, you can just embed https://quickchart.io/natural/red_bar_chart in an image tag or iframe and call it a day. GPT generates a reasonable look & feel.

After you have a template that you're happy with, you can modify the chart with precision, e.g. https://quickchart.io/natural/red_bar_chart?data1=3,5,7. The idea is that you don't have to mess around with chart configs, hosting, etc.

I welcome your thoughts & feedback.

typpo | 3 years ago | on: Replacing a SQL analyst with 26 recursive GPT prompts

I've been building something similar that handles the dirty business of formatting a large database schema into a prompt. Additional work that I've found helpful includes:

1. Using embeddings to filter context into the prompt

2. Identifying common syntax errors or hallucinations of non-existent columns

3. Flagging queries that write instead of read

Plus lots of prompt finessing to get it to avoid mistakes.
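Steps 2 and 3 can be done with plain string inspection before anything touches the database. A rough sketch; the identifier extraction is deliberately naive (a real implementation wants a SQL parser) and the helper names are mine, not from any library:

```python
import re

WRITE_KEYWORDS = {"insert", "update", "delete", "drop", "alter", "create", "truncate"}

def is_write_query(sql: str) -> bool:
    """Flag queries that write instead of read, based on the leading keyword."""
    first_word = sql.strip().split(None, 1)[0].lower()
    return first_word in WRITE_KEYWORDS

def hallucinated_identifiers(sql: str, schema_names: set[str]) -> set[str]:
    """Naively extract identifiers and report any that don't exist in the schema
    (tables and columns alike). Misses quoted identifiers, aliases, etc."""
    identifiers = set(re.findall(r"\b[a-z_][a-z0-9_]*\b", sql.lower()))
    sql_keywords = {"select", "from", "where", "and", "or", "not", "order", "by",
                    "group", "limit", "as", "on", "join", "inner", "left", "right"}
    return identifiers - sql_keywords - schema_names
```

Anything flagged by either check gets sent back to the model (or to the human in the loop) instead of being run.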

It doesn't execute the queries, yet. For an arbitrary db, it's still helpful to have a human in the loop to sanity check the SQL (for now at least).

Demo at https://www.querymuse.com/query if anyone's interested

typpo | 3 years ago | on: Show HN: Generate SQL Queries from English

Hi HN,

I built this tool because I found it useful (1) as a way to learn, and (2) to start basic data analysis quickly.

It uses OpenAI's completion and embeddings API, although I might be able to move it to a cheaper model eventually.

Also worth noting that state is managed in browser localStorage. I don't store your database structure unless you explicitly save/share.

typpo | 3 years ago | on: Farewell, Building in Public

I relate to this a lot. I used to do a yearly roundup of all my side projects, which were generally well received by HN. Then some of my projects started making money and the copycats came.

Carving out a niche is more important to me than building in public. I'm not one of those guys on Twitter making $1M/yr and I don't need a personal brand.

Good luck to Cory - I've enjoyed reading his stuff.
