
GPT-Prompt-Engineer

356 points | sturza | 2 years ago | github.com

160 comments

[+] fatso784|2 years ago|reply
This tool doesn’t benchmark based on how a model actually responds to the generated prompts. Instead, it trusts GPT4 to rank prompts simply in terms of how well it imagines they will perform head-to-head. Thus, there’s no way to tell if the chosen ‘best prompt’ actually is the best, because there’s no ground truth against actual responses.

Why is this so popular, then (more popular than promptfoo, which I think is a much better tool in the same vein)? AI devs seem enamored with the idea of LLMs evaluating LLMs —everything is ‘auto-‘ this and that. They’re in for a rude awakening. The truth is, there are no shortcuts to evaluating performance in real world applications.
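The distinction the parent draws can be made concrete. A minimal sketch of ground-truth evaluation: run each candidate prompt against real test cases and score the actual outputs, instead of asking an LLM to guess which prompt would win head-to-head. `generate` here is a hypothetical stand-in for a model call, not part of any real tool.

```python
# Ground-truth evaluation sketch: score prompts by what the model
# actually produces on held-out cases, not by an LLM's opinion.

def generate(prompt: str, case: str) -> str:
    # Placeholder "model" so the sketch is runnable: uppercases input.
    return case.upper()

def score_prompt(prompt: str, cases: list[tuple[str, str]]) -> float:
    """Fraction of test cases whose real output matches the expected answer."""
    hits = sum(1 for inp, expected in cases if generate(prompt, inp) == expected)
    return hits / len(cases)

cases = [("ok", "OK"), ("hi", "HI"), ("no", "NOPE")]
print(score_prompt("Uppercase the input.", cases))  # two of three cases match
```

The point is that `score_prompt` compares real outputs to expected answers, which is exactly the ground truth that LLM-judged ranking skips.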

[+] immibis|2 years ago|reply
Because this is the upside if not the peak of the hype bubble. All you have to do is use GPT for a task, and nobody cares whether it actually works, you still get whatever VC funding is left after the interest rate hikes.
[+] jasonlotito|2 years ago|reply
> Why is this so popular

Grifters. I won't say that the person working on this is a grifter. Instead, it's so popular right now because of grifters. The same type of NFT grifters and crypto grifters who are mostly silent now. They've moved on.

How ethical would it be to sell things to these grifters, to sell the shovels they will use? But I'm always hung up on the idea that they will use those shovels on others and exploit them.

[+] typpo|2 years ago|reply
Thanks for mentioning promptfoo. For anyone else who might prefer deterministic, programmatic evaluation of LLM outputs, I've been building this for evaluating prompts and models: https://github.com/typpo/promptfoo

Example asserts include basic string checks, regex, is-json, cosine similarity, etc. (and LLM self-eval is an option if you'd like).
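For readers unfamiliar with this style of eval, here's a minimal sketch of deterministic checks in the spirit of the asserts listed above (string containment, regex, is-json). The function names and structure are illustrative, not promptfoo's actual API.

```python
# Deterministic output checks: each returns a plain boolean, so results
# are reproducible and need no LLM to grade them.
import json
import re

def check_contains(output: str, needle: str) -> bool:
    return needle in output

def check_regex(output: str, pattern: str) -> bool:
    return re.search(pattern, output) is not None

def check_is_json(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except ValueError:
        return False

output = '{"status": "ok", "items": [1, 2, 3]}'
print(check_contains(output, "ok"))      # True
print(check_regex(output, r'"items":'))  # True
print(check_is_json(output))             # True
```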

[+] donkeyboy|2 years ago|reply
There is a paper on arXiv showing that GPT-4's correlation with human evaluators on a variety of tasks is strongly positive. I am also uncomfortable with it, but using GPT-4 as a grader is not as bad as you think.
[+] toxicFork|2 years ago|reply
Well, you could keep everything else in the project and put yourself or a human as the "does this result feel better than the other one" decision maker
[+] larodi|2 years ago|reply
It seems to me, too, that this is very much some sort of snake oil for the LLM era. Prompt generation varies from LLM to LLM, and I doubt GPT-4 can do a reasonable evaluation, given that it knows nothing at all about other models.
[+] mangecoeur|2 years ago|reply
Should we really call it "engineering" if it's a case of "try random things until one of them works without really knowing why"?
[+] muzani|2 years ago|reply
Looks more like science here. Run a bunch of experiments and see which does better.

But literally the first sentence of the readme is "Prompt engineering is kind of like alchemy."

[+] mnky9800n|2 years ago|reply
That's called machine learning.
[+] andrewdb|2 years ago|reply
"I don't play craps. I'm a dice engineer!" - prompt engineers, satirically
[+] js8|2 years ago|reply
I think the real etymology of it comes from "social engineering", which also means clever hacks rather than real engineering; the term deliberately subverted the meaning.
[+] classified|2 years ago|reply
Hasn't software "engineering" been like that forever?
[+] brunoluiz|2 years ago|reply
Isn’t engineering an exact science while prompt engineering is completely not?

Although, even granting software engineering is an exact science, it is a funny one: most of us don't get certified the way, say, mechanical engineers do. Would they say we are engineers?

So perhaps the “engineer” term got overloaded in recent years?

[+] FishInTheWater|2 years ago|reply
Prompt "engineering" is just writing prayers to forest faeries.

Whilst BASIC/JavaScript/etc are all magic incantations to a child, a child will soon figure out there's underlying logic, and gain the ability to reason about what code does, and what certain changes will do.

With prompts, it's all faerie logic. There is nothing to learn, there are only magic incantations that change drastically if the model is updated.

Worse yet, the incantations cannot be composed. E.g. take the SQL statement "SELECT column FROM table WHERE column = [%s]". For any given string you insert here, the output is predictable. You can even know which characters would trigger an injection attack.

With prompts you cannot predict results. Any word, phrase, or sequence of characters may upset the faeries and cause the model to misbehave in who knows what way. No processing of user-input will stop injection attacks.

Whilst it's dubious to call current software development practices "engineering", it's utterly ridiculous to do so for prompt-writing.
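The SQL half of the contrast above can be demonstrated directly: a parameterized query behaves predictably for any input, including a classic injection payload, while a prompt built by string concatenation offers no equivalent guarantee. A runnable sketch (the prompt template at the end is illustrative):

```python
# Parameterized SQL treats user input strictly as data, so even an
# injection payload cannot change the query's structure.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

payload = "alice' OR '1'='1"  # would dump every row if interpolated raw

# The placeholder binds the payload as a literal string: no rows match.
rows = conn.execute("SELECT name FROM users WHERE name = ?", (payload,)).fetchall()
print(rows)  # []

# A prompt has no analogous mechanism: the model receives one
# undifferentiated string, and any phrasing may alter its behavior.
prompt = f"Summarize this user comment: {payload}"
```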

[+] mellosouls|2 years ago|reply
Software engineer has always been a daft, grandiose term that seems aimed at prestige rather than reality in the majority of cases.

If coders can call themselves engineers, no reason why anybody else solving puzzles for a living can't.

[+] groestl|2 years ago|reply
I am a certified Software Engineer (my country certifies them, just as it does Mechanical Engineers). I still think the term is overloaded, though.
[+] EGreg|2 years ago|reply
Anyway, the best engineer for ChatGPT prompt can be ChatGPT itself.

People seem to think that “automation will create new jobs”, but in the age of AI those job opportunities will be very temporary, as the companies making the AI automate away that thin layer.

Similarly, people think that humans will control AIs. That’s a bit quaint, a bit like humans controlling a corporation. The thin layer of “control” can be easily swapped out and present an improvement in the market, so that the number of totally autonomous (no human in the loop) workflows will grow.

That can include predictive policing with Palantir (thanks Peter Thiel!), autonomous killbots in war etc. Seeing how reckless companies have been in releasing the current AI in an arms race, I don’t see how they would be restrained in a literal arms race of slaughterbot swarms and panopticon camera meshes.

PS: I remember this exact phase when computers like Deep Blue beat Garry Kasparov. For a while, he and others advocated “centaurs”: humans collaborating with computers. But for the last decade, hardly anyone has claimed that a system with humans in the loop can beat a fully automated one: https://en.m.wikipedia.org/wiki/Advanced_chess

[+] umanwizard|2 years ago|reply
“Engineer” originally just meant someone who builds engines (in a broad sense of that word). The formal titles requiring certification, etc. are the more recent development.
[+] shivams|2 years ago|reply
BTW, GPT-Engineer is openly collecting all of your data: user prompts and other metadata. And they were even defending it until they received some strong responses from the community: https://github.com/AntonOsika/gpt-engineer/issues/415 They now explicitly ask for consent regarding user data, but can we really trust their motives?
[+] Towaway69|2 years ago|reply
Is this prompt generation for the purposes of prompt engineering? Is this then a kind of meta engineering? Engineering for the purposes of engineering which then hopefully will generate working code for the computer that generated the prompt and the response to the prompt.
[+] danielbln|2 years ago|reply
A bit like code generation, really. Transpile one code to another and have the execution engine run that.
[+] Apfel|2 years ago|reply
Usage query: It looks like this could get expensive quite quickly. The approach is great, but with GPT-4 especially, could be very difficult. Is it worth using with 3.5 as a first pass then switching prompts to GPT4 once you've got the best prompt?
[+] l5870uoo9y|2 years ago|reply
I am not sure this approach is workable, since GPT-4 is capable of solving assignments that GPT-3.5 gets wrong. For example, GPT-3.5 fails to solve this prompt (with the dvdrental sample database schema added [1]):

> find customers who didn't rent a movie in the last 12 months but rented a movie in the 12 months before that

GPT-4 solves this without a problem [2]. The same goes for combining logic, as in this prompt (without any additional database schema added):

> find all users who lives in Paris using lat/lng and who visited the south of France within the last month

GPT-3.5 can't understand this at all, GPT-4 solves it [3].

[1]: https://www.postgresqltutorial.com/postgresql-getting-starte...

[2]: https://aihelperbot.com/snippets/cljy8km2h0000my0fgq8kut5w

[3]: https://aihelperbot.com/snippets/cljy8q6gz000al70fvfzxt2hh
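The cascade the grandparent asked about can still be sketched despite this objection: try the cheap model first and escalate to the expensive one only when a deterministic check fails. `cheap_model` and `expensive_model` are hypothetical stand-ins for GPT-3.5 and GPT-4 calls, and `looks_like_sql` is a deliberately crude validity check.

```python
# Model cascade sketch: cheap model first, expensive model on failure.

def cheap_model(prompt: str) -> str:
    return ""  # pretend the cheap model fails on this prompt

def expensive_model(prompt: str) -> str:
    return "SELECT customer_id FROM rental WHERE return_date IS NOT NULL"

def looks_like_sql(output: str) -> bool:
    # Crude check standing in for a real validator (parse, dry-run, etc.).
    return output.strip().upper().startswith("SELECT")

def cascade(prompt: str) -> str:
    out = cheap_model(prompt)
    if looks_like_sql(out):
        return out
    return expensive_model(prompt)  # escalate only when the check fails

print(cascade("find customers who rented last year but not this year"))
```

The catch, as the parent notes, is that when the task itself is beyond the cheap model, every request escalates and the cascade saves nothing.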

[+] hoc|2 years ago|reply
Douglas Adams would've had so much fun these days.
[+] Deestan|2 years ago|reply
His genius lay not only in seeing which technologies were coming, but in predicting what perversions business people would twist them into.
[+] dontupvoteme|2 years ago|reply
Imagine what Russell or Wittgenstein would think of all this.
[+] xmcqdpt2|2 years ago|reply
I think one should just use GPT to generate the prompts so as to reduce the human input further still, a kind of gpt-gpt-prompt-engineer-engineer.
[+] bravura|2 years ago|reply
Isn’t that what this codebase is doing? I haven’t grokked it 100% yet.

Recently I’ve been trying to engineer a prompt that I intend to run 1k times.

After noticing GPT-4 bug out on several responses, I talked it through the problem and asked it to rewrite the prompt. So an automated approach that helps build better prompts from held-out gold data is useful to me.

[+] Deestan|2 years ago|reply
And a GPT to receive the output, write a summary, and insert it directly into a CEO's trash folder.
[+] Makhini|2 years ago|reply
I think it's exactly what they do here
[+] jwestbury|2 years ago|reply
It's turtles^W GPT all the way down, I guess.
[+] Kiro|2 years ago|reply
How are they actually ranked?
[+] asimpleusecase|2 years ago|reply
A bit like AutoGPT. I did not immediately see any kind of token limit, though I did not look carefully. On a complex problem, or one that accesses a lot of data, the cost might ramp up.
[+] msp26|2 years ago|reply
Currently working on something similar for myself; this doesn't seem to fit my needs (I'm benchmarking generation too, rather than just classification). I only have a crude cosine similarity metric for accuracy for now. Also, I'm using function calling rather than the normal completions.

I was hoping this would do something more interesting with multiple messages (if using a chat model) rather than just dumping the entire prompt into one message. The assistant role lets you do things like seed examples.
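The "crude cosine similarity metric" mentioned above can be sketched in plain Python. The vectors here are hypothetical embeddings; a real setup would obtain them from an embeddings API.

```python
# Cosine similarity between a gold-answer embedding and a generated-
# answer embedding: 1.0 means identical direction, 0.0 means orthogonal.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

reference = [0.1, 0.9, 0.2]   # embedding of the gold answer (made up)
generated = [0.1, 0.8, 0.3]   # embedding of the model's answer (made up)
print(round(cosine_similarity(reference, generated), 3))
```

It is crude in exactly the sense the comment admits: two answers can be semantically close yet factually different, and the score won't tell them apart.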

[+] iRomain|2 years ago|reply
Cool. What if you contributed to this one, or open-sourced yours?
[+] hacksoi|2 years ago|reply
Are we really going down this path of prompt-prompt-engineering?
[+] PUSH_AX|2 years ago|reply
It would be cool if, given a handful of test cases, you could send those off to the LLM to generate even more test cases.

My first thought when looking over this tool was "Why do I have to do all the work?", the ideal scenario is that I give the high level description and the LLM does the hard work to create the best prompt.
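The test-case bootstrap described above is mostly prompt construction: embed the seed cases in a prompt and ask the model for more in the same format. A sketch, where `build_expansion_prompt` and the seed cases are hypothetical:

```python
# Build a prompt that shows the model a handful of seed test cases and
# asks it to produce more in the same machine-parseable format.
import json

def build_expansion_prompt(seed_cases: list[dict], n_more: int) -> str:
    return (
        "Here are example test cases as JSON objects:\n"
        + "\n".join(json.dumps(c) for c in seed_cases)
        + f"\nGenerate {n_more} more test cases in the same JSON format, one per line."
    )

seeds = [
    {"input": "2+2", "expected": "4"},
    {"input": "3*3", "expected": "9"},
]
prompt = build_expansion_prompt(seeds, 10)
print(prompt.splitlines()[0])
```

One caveat worth noting: model-generated test cases inherit the model's blind spots, so they complement rather than replace human-written ones.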

[+] Kichererbsen|2 years ago|reply
I haven't read the article at all, but what you said reminded me that I've been using ChatGPT to create Dungeons & Dragons campaigns / adventures / worlds (it's crazy good at that), and one trick I've started using at the end of a fruitful conversation is to ask it to summarize the discussion so far _in a format to be used as a prompt_. It works, more or less.
[+] hamasho|2 years ago|reply
Off topic but Jupyter cells on GitHub can't display horizontally long content and it frustrates me a lot. This small piece of code for the browser console helps me see more content, but it only works in large displays.

    $("[data-type='ipynb']").style.width = '100%'
[+] rhdunn|2 years ago|reply
Try editing the pre style so it has `white-space: pre-wrap`. That works for me. -- This uses `pre` whitespace formatting but wraps to the next line if it is too long, unlike the default `pre` element behaviour.
[+] m3kw9|2 years ago|reply
You need to learn how to use this to generate a good prompt, but why not just learn how to write good prompts yourself? This code is basically asking for something with examples, then asking a few real questions to test it.
[+] sgt101|2 years ago|reply
this is supervised machine learning on top of unsupervised machine learning with some interesting wrinkles in both steps!

it reminds me of those aircraft that folks in rural india build from time to time.

[+] namuol|2 years ago|reply
This could work really well if it replaced GPT-X-judged performance ranking with human-in-the-loop ranking of prompts, but that’s not as exciting, I guess.
[+] __loam|2 years ago|reply
I think the human-in-the-loop evaluation of prompts is a red herring, since the LLMs were trained with RLHF, which optimized them to be convincing. Humans aren't reliable for evaluating these things, because the model was literally trained to make us think it's doing well.
[+] fdondi|2 years ago|reply
Does it only work with ChatGPT? Seems it would be useful also for local Llamas etc.