
ChatGPT-4 significantly increased performance of business consultants

309 points | bx376 | 2 years ago | d3.harvard.edu

272 comments

[+] hubraumhugo|2 years ago|reply
Well, these sound like perfect tasks for GPT:

"Participants responded to a total of 18 tasks (or as many as they could within the given time frame). These tasks spanned various domains. Specifically, they can be categorized into four types: creativity (e.g., “Propose at least 10 ideas for a new shoe targeting an underserved market or sport.”), analytical thinking (e.g., “Segment the footwear industry market based on users.”), writing proficiency (e.g., “Draft a press release marketing copy for your product.”), and persuasiveness (e.g., “Pen an inspirational memo to employees detailing why your product would outshine competitors.”)."

Here is the GPT response to the first task: https://chat.openai.com/share/db7556f7-6036-4b3d-a61a-9cd253...

A confident GPT hallucination is almost indistinguishable from typical management consulting material...

[+] highwaylights|2 years ago|reply
1) Your ideas are bad.

2) Spreadsheets exist.

3) No-one cares about your marketing copy.

4) No-one finds your c-suite babble inspirational.

This is almost perfect input to an LLM exactly because of how low value it is in the first place.

[+] iterminate|2 years ago|reply
> A confident GPT hallucination is almost indistinguishable from typical management consulting material...

If you're measuring based on output, sure, but... the value of any knowledge worker is primarily driven by the input, that is, a client doesn't want "10 ideas" they want "10 [valuable] ideas [informed by the understanding of the business and the market they're operating in]". If a management consultant said "boat shoes" in response to this question they would not have a client much longer.

You could apply this same nonsense task to software engineering, i.e. ask ChatGPT to "write 10 lines of code" and it'll be indistinguishable from the code we churn out day after day.

[+] shakow|2 years ago|reply
> Target: Dog Walkers

> Features: Built-in waste bag dispenser

I'm not yet sure whether I hate it or love it.

> Target: Visually Impaired Individuals

> Features: Haptic feedback

Haptic shoes, how revolutionary!

[+] segfaultex|2 years ago|reply
> A confident GPT hallucination is almost indistinguishable from typical management consulting material...

Sounds perfect for both Harvard and those linked to the institution.

My employer has hired McKinsey a few times, known to recruit from HYP, and their output has been subpar to say the least. My entire experience with these institutions has been fairly uniform in that regard.

I know it’s anecdotal. But it feels like there’s a lot of confirmation bias with these sorts of studies.

[+] AdamCraven|2 years ago|reply
Well, they buried the lede with this one. Using LLMs was better for some tasks and actually made performance worse for others.

The first task was a generalist task ("inside the frontier" as they refer to it), which I'm not surprised improved performance, as it was purposely made to fall into an LLM's areas of strength: research into well-defined areas where you might not have strong domain knowledge. This is also the mainstay of early consultants' work, in which they are generalists early in their careers – usually as business analysts or similar – until they become more valuable and specialise later on.

LLMs are strong in this area of general research because they have generalised a lot of information. But this generalisation is also their weakness. A good way to think about it is that an LLM is like a journalist of research. If you've ever read a newspaper, you often think you're getting a lot of insight. However, as soon as you read an article in an area of your specialisation, you realise they've made many errors in the analysis; they don't understand your subject anywhere near the level you do.

The second task (outside the frontier) required analysis of a spreadsheet, interviews and a more deeply analytical take with evidence to back it up. These are all tasks that LLMs aren't strong at currently. Unsurprisingly, the non-LLM group scored 84.5%, and between 60% and 70.6% for LLM users.

The takeaway should be that LLMs are great for generalised research but less good for specialist analytical tasks.

[+] doitLP|2 years ago|reply
I was thinking about this last night. It's a new version of Gell-Mann amnesia. I call it LLM-man amnesia.

When I ask a programming question, ChatGPT hallucinates something about 20% of the time, and I can only tell because I'm skilled enough to see it. For all the other domains I ask it about, I should assume at least as much hallucination and incorrect information.

[+] RandomLensman|2 years ago|reply
LLMs are broadly good at things that average knowledge workers are good at or can be trained to be good at reasonably quickly.
[+] genman|2 years ago|reply
Comparing LLMs to journalists is a good insight.
[+] Hippocrates|2 years ago|reply
This is hilarious. As impressive as GPT-3/4 has been at writing, what's more shocking is just how bullshit-y human writing is. And a "business consultant" is the epitome of a role requiring bullshit writing. ChatGPT could certainly out-business-consultant the very best business consultants.

Sometimes, to be taken seriously at work, you need to take some concise idea or data and fluff it up into multiple pages or a slide deck JUST so that others can immediately see how much work you put in.

The ideal role for ChatGPT at this moment is probably to take concise writing and expand it into something way larger and full of filler. On the receiving end, people will endure your long-winded document or slide deck, recognize you "put in the work", and then feed it back into ChatGPT to get the original key points summarized.

[+] klabb3|2 years ago|reply
> As impressive as GPT-3/4 has been at writing, what's more shocking is just how bullshit-y human writing is.

Yeah. Most people have focused on what LLMs can do, but I think it’s equally if not more interesting what can they not do, and why?

When we say LLMs can generate text we’re painting brush strokes as broad as a 10-lane highway. Apparently we have quite limited vocabulary about what writing actually is, and specifically what categories and levels exist.

For instance, it’s fun (and in my view completely expected) to see that courteous emails, LinkedIn inspirational spam, corp-speech etc, GPT outperforms humans with flying colors, on the first attempt too! Whereas if you’re asking for the next book of Game of Thrones or any well-written literature it falls flat – incredibly boring, generic, full of platitudes and empty arcs and characters.

We have to start mapping the field of writing to a better conceptual space. Currently it seems like we can’t even differentiate between the equivalent of arithmetic and abstract algebra.

[+] benreesman|2 years ago|reply
LLMs are stunningly good at language tasks: almost all of what us old-timers called NLP is just crushed these days. Summarization, Q&A, sentiment, the list goes on and on. Truly remarkable stuff.

And where there isn’t a bright line around “fact”, and where it doesn’t need to come together like a Pynchon novel, the generative stuff is smoking hot: short-form fiction, opinion pieces, product copy? Massive productivity booster, you can prototype 20 ideas in one minute.

But that’s about where we are: lift natural language into a latent space with some clear notion of separability, do some affine (ish) transformations, lower back down.

Fucking impressive for a computer. But if it can really carry water for an expensive Penn grad?

You’re paying for something other than blindingly insightful product strategy.

[+] Jeff_Brown|2 years ago|reply
I wonder how long it takes AI to get good at law. Right now the verbal tasks it excels at are similar to the artistic ones: namely, solving problems with enormous solution spaces that are robust to small perturbations. That is, change a good picture of an angry tree man slightly and it's still probably a good picture of an angry tree man.
[+] aofjfdgnionio|2 years ago|reply
I don't buy it. LLMs cannot do anything reliably, no matter how constrained the domain. Their outputs are of acceptable quality when fed back to a person who will use their human brain to paper over the cracks. People can recognize when the output is garbage, figure out minor ambiguities, and subconsciously correct minor factual or logical errors. But I would never feed LLM results directly into another computer program. This rules out most traditional NLP tasks.
[+] bugglebeetle|2 years ago|reply
> almost all of what us old-timers called NLP is just crushed these days

For this to be true for most production service use cases, LLMs would need to be at least ~10X faster. I generally agree they can be quite good at these tasks, but the performance is not there to do them on large datasets.
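To put rough numbers on the latency point, a per-document LLM round trip quickly dominates at dataset scale. This back-of-envelope sketch uses assumed timings purely for illustration, not measurements:

```python
# Back-of-envelope comparison with assumed (not measured) timings:
docs = 1_000_000
llm_seconds_per_doc = 2.0        # assumed LLM round trip per document
classic_seconds_per_doc = 0.002  # assumed classic NLP pipeline per document

llm_hours = docs * llm_seconds_per_doc / 3600
classic_hours = docs * classic_seconds_per_doc / 3600
print(f"{llm_hours:.0f} h vs {classic_hours:.1f} h")  # 556 h vs 0.6 h
```

Even with generous batching, a three-orders-of-magnitude gap per document is hard to close without a much faster model.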

[+] qingcharles|2 years ago|reply
Try asking it for specifically NLP-ized text and it does it very well...

(but then tells you not to use it as it is "unethical")

[+] mgaunard|2 years ago|reply
Says more about how useless BCG consultants are.
[+] awestroke|2 years ago|reply
Not surprised. It's frighteningly good, and a perfect match for programming.

I often ask GPT4 to write code for something, and try if it works, but I seldom copy and paste the code it writes - I rewrite it myself to fit into the context of the codebase. But it saves me a lot of time when I am unsure about how to do something.

Other times I don't like the suggestion at all, but that's useful as well, as it often clarifies the problem space in my head.

[+] brap|2 years ago|reply
I used ChatGPT yesterday for code for the first time.

I gave it a nontrivial task I couldn’t google a solution for, and wasn’t sure it was even possible:

Given a python object, give me a list of functions that received this object as an argument. I cannot modify the existing code, only how the object is structured.

It gave me a few ideas that didn't quite work (e.g. modifying the functions or wrapping them in decorators, or looking at the current stack trace to find such functions), and after some back and forth it came up with hijacking the Python tracer to achieve this. And it actually worked.

The crazy thing is that I don't believe it encountered anything like this in its training set; it was able to put pieces together, which is near human level. When asked, it easily explained the shortcomings of this solution (e.g. interfering with the debugger).
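For the curious, the tracer approach described above can be sketched in a few lines. This is a rough reconstruction, not the commenter's actual code, and all names are illustrative:

```python
import sys

def find_receivers(target, run):
    """Run `run()` under a global tracer and record the names of
    functions that received `target` as a declared argument."""
    receivers = []

    def tracer(frame, event, arg):
        if event == "call":
            code = frame.f_code
            # Only inspect declared parameters, not every local.
            for name in code.co_varnames[:code.co_argcount]:
                if frame.f_locals.get(name) is target:
                    receivers.append(code.co_name)
                    break
        return None  # no per-line tracing needed

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        run()
    finally:
        sys.settrace(previous)  # restore any debugger/coverage tracer
    return receivers

def f(x): pass
def g(y): pass

obj = object()
print(find_receivers(obj, lambda: (f(obj), g(obj), f(42))))  # ['f', 'g']
```

As the commenter notes, this interferes with debuggers, which rely on the same `sys.settrace` hook.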

[+] Rexxar|2 years ago|reply
The published article is not at all about programming tasks but about generating text for "strategy consultants".

Some examples, found on page 10 of the original article:

   - Propose at least 10 ideas for a new shoe targeting an underserved market or sport.
   - Segment the footwear industry market based on users.
   - Draft a press release marketing copy for your product.
   - Pen an inspirational memo to employees detailing why your product would outshine competitors.
Nothing of real value imho.
[+] bryancoxwell|2 years ago|reply
I’ve also found the act of describing my problem to GPT4 is sometimes just a helpful as the answer itself. It’s almost like enhanced rubber duck debugging.
[+] dkjaudyeqooe|2 years ago|reply
This is one step removed from "try different things until it works" style of programming.

Not to say you're one of those programmers, but it certainly enables those sorts of programmers.

[+] 101008|2 years ago|reply
Beware of that practice. If for some reason you get too used to it, one day you may not have it, and you won't know where to start writing a function yourself.

It's similar to what happens to people who know a (non-programming) language, stop using it or fall back on a translator, and then find they're unable to use it themselves when they need to.

[+] simonmesmith|2 years ago|reply
Having been a consultant, what strikes me about this is the seemingly obvious next question: what if you removed the consultants entirely and had GPT-4 do the work directly for the client?

If you’re a client and need a consultant to do something, you have to explain the requirement to them, review the work, give feedback, and so forth. There will likely be a few meetings in there.

But if GPT-4 can make consultants so much better, I imagine it can also do their work for them. And if you combine this with the reduction in communications overhead that comes from not working with an outside group, why wouldn’t clients just accrue all the benefits to themselves, plus the benefit of not paying outside consultants or dealing with the overhead of managing them?

This is especially the case when the client is already a domain expert but just needs some additional horsepower. For example, marketing brand managers may work with marketing consultants even though they know their products and marketing very well. They just need more resources, which can come in the form of consultants for reasons such as internal head-count restrictions.

Anyway, I just wonder if BCG thought through the implications of participating in this study. To me it feels like a very short step from “helps consultants help their clients” to “helps clients directly and shows consultants aren’t really necessary.”

Especially so if the client just hires an intern and gives them GPT-4.

[+] olalonde|2 years ago|reply
HN is so bad at predictions. Just a few months ago HN was awash with comments that confidently claimed LLMs were no more than stochastic parrots and unlikely to amount to anything.

> I can't help but think the next AI winter is around the corner. [0]

Yeah, right.

[0] https://news.ycombinator.com/item?id=23886325

[+] xbmcuser|2 years ago|reply
There is a lot of office work that will, over time, be optimized using GPT-like services. I was tech-savvy enough to know that a lot of the office work I do is repeatable and can be done with scripts, but not good enough to write those scripts myself. Using ChatGPT allowed me to write them; it took me, I think, 15-20 hours to get the scripts working perfectly. I knew just a little bit of Python scripting and didn't know anything about pandas or XlsxWriter, etc., but was able to create something that saves me an estimated 20-25 hours a week.

In my opinion, a lot of people here on Hacker News, being good at programming themselves, underestimate how services like ChatGPT can open a new world to non-programmers. They also probably make the non-inquisitive learn less. Previously, to learn how to stop multiple snapd services with a script, I would have googled and cobbled something together; today I just ask ChatGPT and get a working script in less than a minute.
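The kind of repeatable report-crunching being described really can be a handful of lines. Here is a minimal stdlib-only sketch; the file contents and column names are made up for illustration, and a real script would open the actual .csv/.xlsx file rather than an in-memory string:

```python
import csv
import io
from collections import defaultdict

# Stand-in for open("weekly_report.csv"); data is illustrative.
raw = io.StringIO(
    "region,amount\n"
    "north,120\n"
    "south,80\n"
    "north,40\n"
)

# Sum the amounts per region, the sort of roll-up that is often
# done by hand in a spreadsheet every week.
totals = defaultdict(int)
for row in csv.DictReader(raw):
    totals[row["region"]] += int(row["amount"])

print(dict(totals))  # {'north': 160, 'south': 80}
```

For real .xlsx files the same loop works with pandas' `read_excel` and an XlsxWriter-backed `to_excel`, which matches the libraries the commenter mentions.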

[+] dontupvoteme|2 years ago|reply
Couldn't agree more. I've gone multiple times now from "I wonder if X is possible/how would you do X" to hacking out a crude proof of concept to a problem that I wouldn't even know how to google.
[+] skepticATX|2 years ago|reply
Two things mentioned in the abstract that are worth pointing out.

> For each one of a set of 18 realistic consulting tasks within the frontier of AI capabilities

They specifically picked tasks that GPT-4 was capable of doing. GPT-4 could not do many tasks, so when we say that performance was significantly increased this is only for tasks GPT-4 is well suited to. There is still value here but let's put these results into context.

> Consultants across the skills distribution benefited significantly from having AI augmentation, with those below the average performance threshold increasing by 43% and those above increasing by 17% compared to their own scores

Even when cherry-picking tasks that GPT-4 is particularly suited for, above average performers only increased performance by 17%. This increase is still impressive, were it to be seen across the board. But I do think that 17% is a lot less than some people are trying to sell.

[+] denton-scratch|2 years ago|reply
Hmmm. Perhaps below-average performers are more likely to take GPT output at face-value, being less competent to review and edit it. And above-average performers are more likely to hack the GPT output around, because they're confident in their own abilities.

Therefore below-average types will produce finished output more quickly; and this was a time-constrained test, so velocity matters.

ChatGPT is very good at waffling, and marketing-speak and inspirational messages are essentially waffle. IOW, the tasks were tailor-made for unaided ChatGPT, so high-performers were penalized.

[+] ellyagg|2 years ago|reply
You're underestimating it because it compounds. Small gains in efficiency lead to huge advantages in long-term growth. A 17% improvement would be absolutely monumental.
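A toy illustration of the compounding point, assuming (generously) that the 17% gain applies anew each planning cycle:

```python
gain = 1.17   # assumed 17% per-cycle improvement
cycles = 5    # hypothetical number of cycles
print(f"{gain ** cycles:.2f}x")  # 2.19x cumulative improvement
```

Whether productivity gains actually compound like this, rather than applying once, is of course the contested part.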
[+] Sirikon|2 years ago|reply
Pipe /dev/random, transform to decimal, and you just got an amazing increase in performance for calculating decimals of Pi. Nobody said precision was important anyway.
[+] segfaltnh|2 years ago|reply
Honestly if you don't care about precision, /dev/zero is going to give you more throughput. Plus, I personally guarantee it's correct to within an error margin of 4.0. You can't offer the same with /dev/random!
[+] seanhunter|2 years ago|reply
We're not trying to hit a comet with a rocket here. 1 significant figure is more than sufficient for an initial consultation. Any additional accuracy required would be billable follow-on work.
[+] nopinsight|2 years ago|reply
More details in this blog post by a Wharton professor: https://www.oneusefulthing.org/p/centaurs-and-cyborgs-on-the...

My questions to naysayers:

* Do you or anyone you know use GPT-4 (not the free GPT-3.5) to do productive tasks like coding and found it to help in many cases?

* If you insist it’s useless, why do millions of people pay $20 a month to access GPT-4 and plugins?

[+] syntaxing|2 years ago|reply
Yes, GPT-4 is great for doing “boring work” and allows me to focus on the “fun work”. You still need to know what you’re doing though, you can’t blindly copy and paste.

And for the second one, although I am paying for it too, that argument is more or less flawed nowadays. Utilization is a very hand-wavy thing when it comes to this stuff. Like a purse: millions would pay money for one, some even pay thousands. But I have no use for it and wouldn't even pay $1 for one.

[+] footy|2 years ago|reply
I have free access to copilot because I do some open source work. I haven't been impressed by what it can do and I wouldn't pay even $3/month to use it.

The second question doesn't make sense to me. There are tons of things I think are useless (or worse) that people pay for anyway. Meal kit boxes come to mind, and at least you can eat those at the end of the day.

[+] ofjcihen|2 years ago|reply
Obviously anecdata, but:

1: I’ve been scripting for 5 years using Python. I purchased a subscription to use GPT4 to see if it could assist me.

In the end it took me more time to fix its mistakes than to just apply my knowledge of knowing what to Google and reading docs.

Additionally the largest hurdle I encountered was when it hallucinated a package that didn’t exist and I spent time trying to find it.

2: I don’t know about most people but I’m terrible at cancelling services that are “cheap”. I used ChatGPT for a few hours that first month and didn’t cancel it for another 5 months.

[+] NBJack|2 years ago|reply
My prediction? In about 6 months, every test, task, or use of an LLM for anything that requires a modicum of creativity is going to find that it only has a fixed set of "ideas" before it starts regurgitating them. [0] I can easily imagine this in their hypothetical shoe pitch question, and many models going for more factual answers have been rapidly showing this bias by design.

[0] https://www.marktechpost.com/2023/06/16/this-paper-tests-cha...

[+] simonw|2 years ago|reply
I'm very unimpressed by that study. Look at how they generated the jokes - they fed it a prompt that was a slight variation on "please tell me a joke" and then wrote about how the jokes weren't varied enough.

https://github.com/DLR-SC/JokeGPT-WASSA23/blob/main/01_joke_...

That's a bad way to use an LLM for joke generation.

Try "tell me a joke about a sea lion" - then replace sea lion with any other animal.

Or "tell me ten jokes about a lawyer on the moon" - combine concepts like that and you get an infinite variety of jokes.

Some of them might even be funny!
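The combinatorial trick being suggested is easy to see in miniature: crossing even small lists of subjects and settings multiplies the prompt variety (the lists here are made up):

```python
from itertools import product

subjects = ["a sea lion", "a lawyer", "a barista"]
settings = ["on the moon", "at a job interview", "in a submarine"]

# Every (subject, setting) pair yields a distinct joke prompt.
prompts = [
    f"Tell me a joke about {who} {where}."
    for who, where in product(subjects, settings)
]
print(len(prompts))  # 9 prompts from 6 inputs
```

The variety grows multiplicatively with each added axis, which is the point: the study's fixed prompt sampled one cell of this space and measured its variance.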

[+] FrustratedMonky|2 years ago|reply
Can confirm. I popped the 20 bucks for GPT-4, and have been using it more and more, every day for 3 weeks. Not sure how I could get by without it now. It's just so easy to have a normal conversation and get answers. Like having an expert friend across the hall you can shout questions to, and ask for simple reminders and recommendations.

Who cares if it gets things wrong sometimes, you would double check your co-workers answers also. And there are times when I insist I am correct, and GPT will argue back and eventually I find I was wrong.

[+] _pferreir_|2 years ago|reply
Maybe this tells us more about BCG consultants than it does about GPT-4?
[+] brabel|2 years ago|reply
That's what you would like to think, isn't it? I'm afraid this would be just as true with any other kind of subjects, and as far as I know there's no evidence either way, so this is just a cheap stab you're having at them.
[+] refurb|2 years ago|reply
Meh. I mean, a lot of consulting is tasks like writing or idea generation. Using something like ChatGPT to do it [faster, better] doesn't negate the value in what they do: they are hired to do those tasks, and those tasks are required for the broader work.
[+] User23|2 years ago|reply
I bet early search engines had similar or even better figures under similar conditions.

I suppose this because I recall how much search improved my productivity over flipping through books and I know how for certain tasks ChatGPT is a better source of knowledge on how to do it than search. While often the GPT output isn’t entirely correct, more often than not it suffices to make the correct solution obvious thus saving a lot of time.

[+] makach|2 years ago|reply
Guilty! GPT is the best colleague I ever had, but boy does it talk. You can't just copy-paste, but treating its responses as input, I find myself less dependent on other senior consultants sharing their insights. It also makes me more confident in my assessments and deliveries.

The purpose of technology is to enhance our performance, and GPT is very much doing so – but with great power comes great responsibility.

[+] leoff|2 years ago|reply
This is a good thing, since increased performance means the clients will have fewer billed hours, right? Right?
[+] pydry|2 years ago|reply
BCG : We know layoffs are in fashion and we'd just like you to know that if you need industrial grade ass covering excuses from a legitimate-ish sounding authority to justify what you were planning to do anyway, our 23 year old consultants and their PowerPoint presentations have got you covered.