This tool doesn’t benchmark prompts against how a model actually responds to them. Instead, it trusts GPT-4 to rank prompts purely on how well it imagines they will perform head-to-head. So there’s no way to tell whether the chosen ‘best prompt’ actually is the best, because there’s no ground truth from real responses.
Why is this so popular, then (more popular than promptfoo, which I think is a much better tool in the same vein)? AI devs seem enamored with the idea of LLMs evaluating LLMs; everything is ‘auto-’ this and that. They’re in for a rude awakening. The truth is, there are no shortcuts to evaluating performance in real-world applications.
Because this is the upswing, if not the peak, of the hype bubble. All you have to do is use GPT for a task; nobody cares whether it actually works, and you still get whatever VC funding is left after the interest rate hikes.
Grifters. I won't say that the person working on this is a grifter. Instead, it's so popular right now because of grifters. The same type of NFT grifters and crypto grifters who are mostly silent now. They've moved on.
How ethical would it be to sell things to these grifters, to sell the shovels they will use? But I'm always hung up on the idea that they will use those shovels on others and exploit them.
Thanks for mentioning promptfoo. For anyone else who might prefer deterministic, programmatic evaluation of LLM outputs, I've been building this for evaluating prompts and models: https://github.com/typpo/promptfoo
Example asserts include basic string checks, regex, is-json, cosine similarity, etc. (and LLM self-eval is an option if you'd like).
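For anyone unfamiliar with the idea, a deterministic assert is just a pure function of the model output. A minimal sketch of the string/regex/is-json style of check in Python (these function names are illustrative, not promptfoo's actual API):

```python
import json
import re

# Minimal deterministic checks in the style described above.
# Function names are illustrative, not promptfoo's real assert API.

def assert_contains(output: str, substring: str) -> bool:
    """Pass if the model output contains a required substring."""
    return substring in output

def assert_regex(output: str, pattern: str) -> bool:
    """Pass if the output matches a regular expression."""
    return re.search(pattern, output) is not None

def assert_is_json(output: str) -> bool:
    """Pass if the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except ValueError:
        return False

checks = [
    assert_contains('{"name": "Ada"}', "Ada"),
    assert_regex("Order #1234 confirmed", r"#\d+"),
    assert_is_json('{"name": "Ada"}'),
    not assert_is_json("not json at all"),
]
print(all(checks))  # True when every assert passes
```

Because each check is a pure function of the output string, the same evaluation run gives the same verdict every time, which is the whole point of the deterministic approach.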
There is a paper on arXiv showing that GPT-4’s correlation with human evaluators on a variety of tasks is strongly positive. I am also uncomfortable with it, but using GPT-4 as a grader is not as bad as you think.
Well, you could keep everything else in the project and put yourself or a human as the "does this result feel better than the other one" decision maker
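Mechanically that swap is small: the comparison loop stays the same and only the judge changes. A sketch of pairwise human ranking (the names and the stub judge are mine; a real setup would prompt a person instead):

```python
from itertools import combinations

def rank_prompts_by_human(candidates, outputs, ask_human):
    """Rank prompt candidates by pairwise human judgment.

    candidates: list of prompt identifiers
    outputs: dict mapping prompt -> model response for a fixed input
    ask_human: callable (response_a, response_b) -> 'a' or 'b'
    """
    wins = {p: 0 for p in candidates}
    for a, b in combinations(candidates, 2):
        choice = ask_human(outputs[a], outputs[b])
        wins[a if choice == "a" else b] += 1
    return sorted(candidates, key=lambda p: wins[p], reverse=True)

# A real UI would call input(); this stub judge prefers shorter responses
# so the sketch runs offline.
prompts = ["p1", "p2", "p3"]
responses = {"p1": "long rambling answer....", "p2": "concise", "p3": "medium answer"}
judge = lambda a, b: "a" if len(a) < len(b) else "b"
print(rank_prompts_by_human(prompts, responses, judge))  # ['p2', 'p3', 'p1']
```

Swapping `judge` for a function that shows both responses to a person and records their pick is the entire change; all the candidate-generation machinery upstream is untouched.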
It seems to me, too, that this is very much some sort of snake oil for the LLM era. Prompt effectiveness varies from LLM to LLM, and I doubt GPT-4 can do a reasonable evaluation, given that it knows nothing at all about other models.
I think the real etymology of it was "social engineering", which also means clever hacks rather than real engineering; the term deliberately subverted the meaning.
Isn’t engineering an exact science while prompt engineering is completely not?
Although, even if software engineering is an exact science, it is a funny one: most of us don’t get certified the way, say, mechanical engineers do. Would they say we are engineers?
So perhaps the “engineer” term got overloaded in recent years?
Prompt "engineering" is just writing prayers to forest faeries.
Whilst BASIC/JavaScript/etc are all magic incantations to a child, a child will soon figure out there's underlying logic, and learn to reason about what code does, and what certain changes will do.
With prompts, it's all faerie logic. There is nothing to learn, there are only magic incantations that change drastically if the model is updated.
Worse yet, the incantations cannot be composed. E.g. take the SQL statement "SELECT column FROM table WHERE column = [%s]". For any given string you insert here, the output is predictable. You can even know which characters would trigger an injection attack.
With prompts you cannot predict results. Any word, phrase, or sequence of characters may upset the faeries and cause the model to misbehave in who knows what way. No processing of user-input will stop injection attacks.
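To make the contrast concrete: with a parameterized SQL query, hostile input is inert data by construction, which is exactly the guarantee prompts lack. A small sqlite3 illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice'), ('bob')")

# Classic injection payload. As a bound parameter it is just a string:
# the query's shape is fixed before the value is ever seen, so the
# "incantation" composes predictably with any input.
hostile = "alice' OR '1'='1"
rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (hostile,)
).fetchall()
print(rows)  # [] -- no user is literally named "alice' OR '1'='1"
```

There is no analogous binding mechanism for prompts: every token of user input lands in the same channel as the instructions.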
Whilst it's dubious to call current software development practices "engineering", it's utterly ridiculous to do so for prompt-writing.
Anyway, the best engineer for a ChatGPT prompt may be ChatGPT itself.
People seem to think that “automation will create new jobs”, but in the age of AI, those job opportunities will be very temporary, as the companies making the AI automate that thin layer.
Similarly, people think that humans will control AIs. That’s a bit quaint, a bit like humans controlling a corporation. The thin layer of “control” can be easily swapped out and present an improvement in the market, so that the number of totally autonomous (no human in the loop) workflows will grow.
That can include predictive policing with Palantir (thanks Peter Thiel!), autonomous killbots in war etc. Seeing how reckless companies have been in releasing the current AI in an arms race, I don’t see how they would be restrained in a literal arms race of slaughterbot swarms and panopticon camera meshes.
PS: I remember this exact phase when computers like Deep Blue beat Garry Kasparov. For a while he and others advocated “centaurs” — humans collaborating with computers. But over the last decade hardly anyone will claim that a system with humans in the loop can beat a system that’s fully automated: https://en.m.wikipedia.org/wiki/Advanced_chess
“Engineer” originally just meant someone who builds engines (in a broad sense of that word). The formal titles requiring certification, etc. are the more recent development.
BTW, GPT-Engineer is openly collecting all of your data: user prompts and other metadata. And they were even defending it until they received some strong responses from the community: https://github.com/AntonOsika/gpt-engineer/issues/415 They now explicitly ask for consent regarding user data, but can we really trust their motives?
Is this prompt generation for the purposes of prompt engineering? Is this then a kind of meta engineering? Engineering for the purposes of engineering which then hopefully will generate working code for the computer that generated the prompt and the response to the prompt.
Usage query:
It looks like this could get expensive quite quickly. The approach is great, but with GPT-4 especially, the cost could be prohibitive.
Is it worth using with 3.5 as a first pass then switching prompts to GPT4 once you've got the best prompt?
I am not sure this approach is workable, since GPT-4 is capable of solving assignments that GPT-3.5 gets wrong. For example, GPT-3.5 fails to solve this prompt (with the dvdrental sample database schema added [1]):
> find customers who didn't rent a movie in the last 12 months but rented a movie in the 12 months before that
GPT-4 solves this without a problem [2]. The same goes for combining logic (this time without any database schema added):
> find all users who lives in Paris using lat/lng and who visited the south of France within the last month
GPT-3.5 can't understand this at all, GPT-4 solves it [3].
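For reference, the first query has a compact shape: customers with a rental in the 12-to-24-month window, minus customers with a rental in the last 12 months. A runnable hand-written sketch against a minimal stand-in for the dvdrental `customer`/`rental` tables (this is my SQL, not GPT output):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE rental (rental_id INTEGER PRIMARY KEY,
                         customer_id INTEGER, rental_date TEXT);
    INSERT INTO customer VALUES (1, 'lapsed'), (2, 'active'), (3, 'gone');
    -- 'lapsed' rented 18 months ago only; 'active' rented last month;
    -- 'gone' last rented 3 years ago.
    INSERT INTO rental VALUES
        (1, 1, date('now', '-18 months')),
        (2, 2, date('now', '-1 month')),
        (3, 3, date('now', '-36 months'));
""")

query = """
SELECT c.name FROM customer c
WHERE EXISTS (SELECT 1 FROM rental r
              WHERE r.customer_id = c.customer_id
                AND r.rental_date >= date('now', '-24 months')
                AND r.rental_date <  date('now', '-12 months'))
  AND NOT EXISTS (SELECT 1 FROM rental r
                  WHERE r.customer_id = c.customer_id
                    AND r.rental_date >= date('now', '-12 months'))
"""
print([row[0] for row in conn.execute(query)])  # ['lapsed']
```

The interesting thing is that the hard part is exactly the anti-join ("did NOT rent recently"), which is the kind of negation GPT-3.5 tends to fumble.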
Isn’t that what this codebase is doing? I haven’t grokked it 100% yet.
Recently I’ve been trying to engineer a prompt that I intend to run 1k times.
Noticing GPT4 bug out on several responses, I’ve talked it through the problem more and asked it to rewrite the prompt. So an automated approach to help build better prompts based upon held out gold data is useful to me.
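The held-out-gold-data loop can be as simple as scoring each candidate prompt by accuracy on labeled examples you never used while hand-tuning. A sketch with a stubbed model call (everything here is illustrative):

```python
def score_prompt(prompt_template, gold_examples, call_model):
    """Fraction of held-out gold examples a prompt gets right.

    gold_examples: list of (input_text, expected_output) pairs.
    call_model: callable (full_prompt) -> model output; stubbed below.
    """
    hits = 0
    for text, expected in gold_examples:
        output = call_model(prompt_template.format(input=text))
        hits += output.strip().lower() == expected.strip().lower()
    return hits / len(gold_examples)

# Stub standing in for a real API call, so the sketch runs offline.
fake_model = lambda prompt: "positive" if "great" in prompt else "negative"

gold = [("great movie", "positive"), ("terrible plot", "negative")]
candidates = ["Classify sentiment: {input}", "Label this review: {input}"]
best = max(candidates, key=lambda p: score_prompt(p, gold, fake_model))
print(best, score_prompt(best, gold, fake_model))
```

At 1k runs per prompt, even exact-match scoring over a few dozen gold examples catches the "bugs out on several responses" failures before they hit the full batch.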
A bit like with autoGPT, I did not immediately see any kind of token limit. But I did not look carefully. On a complex problem, or one that accesses a lot of data, the cost might ramp up.
Currently working on something similar for myself; this doesn't seem to fit my needs (I'm benchmarking free-form generations too, rather than just classification). I only have a crude cosine similarity metric for accuracy for now. Also, I'm using function calling rather than the normal completions.
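For what it's worth, even the crude version of that metric is only a few lines if you compare bag-of-words vectors; a real setup would use embedding vectors from an API instead:

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Crude bag-of-words cosine similarity between two generations."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity("the cat sat", "the cat slept"))  # ~0.67
print(cosine_similarity("the cat sat", "the cat sat"))    # 1.0
```

Swapping the `Counter` vectors for embedding vectors keeps the same cosine formula but makes "phrased differently, means the same" score high, which matters for free-form generations.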
I was hoping this would do something more interesting with multiple messages (if using a chat model) rather than just dumping the entire prompt in one message. The assistant role lets you do things like include worked examples.
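Concretely, the chat-format version of a prompt can carry few-shot examples as alternating user/assistant turns instead of one text blob. The message structure below is the standard OpenAI-style list; the content is made up:

```python
def build_chat_prompt(system, examples, user_input):
    """Assemble a chat-model prompt with few-shot example turns."""
    messages = [{"role": "system", "content": system}]
    for example_in, example_out in examples:
        messages.append({"role": "user", "content": example_in})
        messages.append({"role": "assistant", "content": example_out})
    messages.append({"role": "user", "content": user_input})
    return messages

msgs = build_chat_prompt(
    "Reply with only a sentiment label.",
    [("great movie", "positive"), ("terrible plot", "negative")],
    "surprisingly fun",
)
print([m["role"] for m in msgs])
# ['system', 'user', 'assistant', 'user', 'assistant', 'user']
```

Putting the examples in assistant turns, rather than pasting them into one user message, shows the model the exact output format it is expected to produce.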
It would be cool if, given a handful of test cases, you could send those off to the LLM to generate even more test cases.
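The mechanical half of that is just a meta-prompt plus a strict parse of the reply; the hard part is deciding whether to trust the generated cases. A sketch (the meta-prompt wording and the stubbed reply are mine):

```python
import json

def make_case_generation_prompt(seed_cases, n_more=5):
    """Ask an LLM for more test cases in the same shape as the seeds."""
    return (
        f"Here are example test cases as JSON:\n{json.dumps(seed_cases)}\n"
        f"Generate {n_more} more diverse cases in the same JSON format. "
        "Reply with only a JSON array."
    )

def parse_generated_cases(reply: str):
    """Parse the model reply, rejecting anything that isn't a JSON array."""
    cases = json.loads(reply)
    if not isinstance(cases, list):
        raise ValueError("expected a JSON array of test cases")
    return cases

seeds = [{"input": "great movie", "expected": "positive"}]
prompt = make_case_generation_prompt(seeds)
# Stub reply, standing in for a real model call:
reply = '[{"input": "awful pacing", "expected": "negative"}]'
print(parse_generated_cases(reply))
```

A human skim of the generated `expected` labels is still worth it, otherwise you are back to the LLM grading itself.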
My first thought when looking over this tool was "Why do I have to do all the work?" The ideal scenario is that I give the high-level description and the LLM does the hard work of creating the best prompt.
I haven't read the article at all, but what you said reminded me that I've been using ChatGPT to create Dungeons & Dragons campaigns / adventures / worlds (it's crazy good at that), and one trick I've started doing at the end of a fruitful conversation is to ask it to summarize the discussion so far _in a format to be used as a prompt_. It works more or less.
Off topic, but Jupyter cells on GitHub can't display horizontally long content, and it frustrates me a lot. A small piece of code for the browser console helps me see more content, but it only works on large displays.
Try editing the pre style so it has `white-space: pre-wrap`. That works for me. -- This uses `pre` whitespace formatting but wraps to the next line if it is too long, unlike the default `pre` element behaviour.
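Since the original console snippet wasn't quoted, here is a reconstruction of what it presumably amounts to, using the same `pre-wrap` idea; it's wrapped in a function so it can also run outside a browser:

```javascript
// Apply white-space: pre-wrap to every <pre> element so long notebook
// lines wrap instead of forcing horizontal scrolling.
function wrapPreElements(doc) {
  for (const pre of doc.querySelectorAll("pre")) {
    pre.style.whiteSpace = "pre-wrap";
  }
}

// In the browser console on a GitHub notebook page:
// wrapPreElements(document);
```

This keeps `pre` whitespace semantics but allows soft wrapping, matching the behaviour described above.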
You need to learn how to use this tool to generate a good prompt, so why not just learn how to write good prompts directly? This code is basically asking for something, with examples, and then asking a few real questions to test it.
This could work really well if it replaced GPT-X-judged performance ranking with human-in-the-loop ranking of prompts, but that’s not as exciting, I guess.
I think human-in-the-loop evaluation of prompts is a red herring, since the LLMs were trained with RLHF, which optimized them to be convincing. Humans aren't reliable for evaluating these things, because the model was literally trained to make us think it's doing well.
But literally the first sentence of the readme is "Prompt engineering is kind of like alchemy."
If coders can call themselves engineers, no reason why anybody else solving puzzles for a living can't.
Links for the dvdrental SQL examples above:
[1]: https://www.postgresqltutorial.com/postgresql-getting-starte...
[2]: https://aihelperbot.com/snippets/cljy8km2h0000my0fgq8kut5w
[3]: https://aihelperbot.com/snippets/cljy8q6gz000al70fvfzxt2hh
It reminds me of those aircraft that folks in rural India build from time to time.