typpo | 2 years ago | on: Bard can now connect to your Google Apps and services
typpo's comments
typpo | 2 years ago | on: Asking 60 LLMs a set of 20 questions
Example here: https://promptfoo.dev/docs/guides/factuality-eval
typpo | 2 years ago | on: Asking 60 LLMs a set of 20 questions
I encourage people considering LLM applications to test the models on their _own data and examples_ rather than extrapolating general benchmarks.
This library supports OpenAI, Anthropic, Google, Llama and Codellama, any model on Replicate, and any model on Ollama, etc. out of the box. As an example, I wrote up an example benchmark comparing GPT model censorship with Llama models here: https://promptfoo.dev/docs/guides/llama2-uncensored-benchmar.... Hope this helps someone.
typpo | 2 years ago | on: Show HN: NotYetNews – AI-Generated News from the Future
If anyone else wants a peek behind the curtain, here is the GPT-4 call: https://github.com/johnpolacek/notyetnews/blob/main/cron/ope...
typpo | 2 years ago | on: GPT-4 is getting worse over time, not better
The API clearly delineates the March and June versions. The paper authors ran tests on different API versions. The fact that these versions are different is clear & transparent. Anyone can use the March version of GPT by calling the API.
gpt-4-0314: very slow, smart
gpt-4-0613: fast, less smart
typpo | 2 years ago | on: Ask HN: How are you improving your use of LLMs in production?
My workflow is based on testing: start by defining a set of representative test cases and using them to guide prompting. I tend to prefer programmatic test cases over LLM-based evals, but LLM evals seem popular these days. Then, I create a hypothesis, run an eval, and if the results show improvement I share them with the team. In some of my projects, this is integrated with CI.
The next step is closing the feedback loop and gathering real-world examples for your evals. This can be difficult to do if you respect the privacy of your users, which is why I prefer a local, open-source CLI. You'll have to set up the appropriate opt-ins etc. to gather this data, if at all.
typpo | 2 years ago | on: GPT-Prompt-Engineer
Example asserts include basic string checks, regex, is-json, cosine similarity, etc. (and LLM self-eval is an option if you'd like).
typpo | 2 years ago | on: Hard stuff when building products with LLMs
typpo | 2 years ago | on: Hard stuff when building products with LLMs
typpo | 2 years ago | on: Hard stuff when building products with LLMs
The lack of formal tooling for prompt engineering drives me bonkers, and it compounds the problems outlined in the article around correctness and chaining.
Then there are the hot takes on Twitter from people claiming prompt engineering will soon be obsolete, or people selling blind prompts without any quality metrics. It's surprisingly hard to get LLMs to do _exactly_ what you want.
I'm building an open-source framework for systematically measuring prompt quality [0], inspired by best practices for traditional engineering systems.
typpo | 2 years ago | on: Brex’s Prompt Engineering Guide
Editing prompts is like playing whack-a-mole: once you clear an edge case, a new problem pops up elsewhere. I'd really like to be able to say, "this new prompt performs 20% better across all our test cases".
Because I haven't found a better way, I am building https://github.com/typpo/promptfoo, a CLI that outputs a matrix view for quickly comparing outputs across multiple prompts, variables, and models. Good luck to everyone else out there tuning prompts :)
typpo | 2 years ago | on: Ancient Earth Globe
Professor Scotese was a great partner and instrumental in putting this visualization together. He's acknowledged on the site, but for those interested here is his website: http://www.scotese.com/. I believe he has a more modern iteration of the paleomap that is not downloadable on the web, but for various reasons I did not get those textures in this visualization (they didn't wrap properly iirc).
He also has a nice writeup of the methods used here: https://drive.google.com/file/d/1-q0WIa7ofISFHyBe4UxvN8DIPs8...
typpo | 2 years ago | on: Show HN: Promptfoo – CLI for testing & improving LLM prompt quality
I'm running these tests in bulk, so I prefer to automate with the CLI, or integrate with a test framework like Jest. I think the web UI is good for tinkering, but does not fit as well into real workflows.
typpo | 2 years ago | on: Show HN: Promptfoo – CLI for testing & improving LLM prompt quality
typpo | 2 years ago | on: Show HN: Promptfoo – CLI for testing & improving LLM prompt quality
I built this because I'm tuning a bunch of prompts and don't have a great way to do this systematically.
This CLI tool helps you pick the best prompt and model by allowing you to configure multiple prompts and variables. It outputs "before" and "after" so you can easily compare LLM outputs side-by-side and determine if the prompt has improved the quality of each example.
Example use cases:
- Deciding whether it's worth using GPT-4 over GPT-3.5
- Evaluating quality improvements to your prompt across a large range of examples
- Catching regressions in edge cases as you iterate on your prompt
It supports a handful of useful output formats: console, HTML table view, csv, json, yaml, so you can integrate into your workflow as needed. It also can be used as a library, not a CLI.
I'm interested in hearing your thoughts and suggestions on how to improve this tool further. Thanks!
typpo | 2 years ago | on: Show HN: Text-to-Chart – embeddable natural language charts
I maintain a chart generation service, QuickChart (https://github.com/typpo/quickchart), which renders millions of charts per day. The most consistent pain point for users is that charts require some programming ability, or at least a strict JSON schema.
The idea is to make chart creation more approachable. Instead of messing around with D3 or Chart.js for a one-off, you can just embed https://quickchart.io/natural/red_bar_chart in an image tag or iframe and call it a day. GPT generates a reasonable look & feel.
After you have a template that you're happy with, you can modify the chart with precision, e.g. https://quickchart.io/natural/red_bar_chart?data1=3,5,7. The idea is that you don't have to mess around with chart configs, hosting, etc.
I welcome your thoughts & feedback.
typpo | 3 years ago | on: Launch HN: Buildt (YC W23) – Conversational semantic code search
typpo | 3 years ago | on: Replacing a SQL analyst with 26 recursive GPT prompts
1. Using embeddings to filter context into the prompt
2. Identifying common syntax errors or hallucinations of non-existent columns
3. Flagging queries that write instead of read
Plus lots of prompt finessing to get it to avoid mistakes.
It doesn't execute the queries, yet. For an arbitrary db, it's still helpful to have a human in the loop to sanity check the SQL (for now at least).
Demo at https://www.querymuse.com/query if anyone's interested
typpo | 3 years ago | on: Show HN: Generate SQL Queries from English
I built this tool because I found it useful (1) as a way to learn, and (2) to start basic data analysis quickly.
It uses OpenAI's completion and embeddings API, although I might be able to move it to a cheaper model eventually.
Also worth noting that state is managed in browser localStorage. I don't store your database structure unless you explicitly save/share.
typpo | 3 years ago | on: Farewell, Building in Public
Carving out a niche is more important to me than building in public. I'm not one of those guys on Twitter making $1M/yr and I don't need a personal brand.
Good luck to Cory - I've enjoyed reading his stuff.
The new feature for enriching outputs with citations from Google Search is also pretty cool.