Unexpected responses from ChatGPT: Incident Report

[+] montroser|2 years ago|reply

This explanation feels unsatisfying. It's so high-level that it's mostly void of any actual information.

What was the wrong assumption that the code made that caused this wrong behavior? Why was it not caught in the many layers of automated testing before it made its way to production? What process and procedural changes are being implemented to reduce the risk of this class of bug happening again?

Presumably all of that is playing out internally, but if the public postmortem is meant to instill confidence, you have to actually share some of the details, or else it becomes meaningless.

[+] __MatrixMan__|2 years ago|reply

I think your questions all grew up in a world where the people operating the thing knew some rationalist who could think deductively about its operation.

But neural networks... they're an exercise in empiricism. We only ever understood that it works, never why. It's sort of a miracle that it doesn't produce buggy output all the time.

What do you tell people when they want to know why the miracles have stopped? Root cause: the gods are angry.

[+] iforgotpassword|2 years ago|reply

To be honest

> On February 20, 2024, an optimization to the user experience

At that point, about 10 words in, I already wanted to stop reading because it starts with the "we only wanted the best for our customers" bullshit newspeak. Anyone else going off on that stuff too? I'm pretty much already conditioned to expect whatever company is messaging me that way to take away some feature, increase pricing, or otherwise piss me off. In that case it was "not give any interesting detail at all".

[+] hn_throwaway_99|2 years ago|reply

I had the exact opposite reaction. I am in no way an AI expert (or novice for that matter), but I generally have an understanding of how tokenization works and how LLMs parse text strings into a series of tokens. Thus, I thought this paragraph was particularly well-written in a manner that explained pretty clearly what happened, but in a manner accessible to a layperson like me:

> In this case, the bug was in the step where the model chooses these numbers. Akin to being lost in translation, the model chose slightly wrong numbers, which produced word sequences that made no sense.

I liked this because when I first saw the example word salads I was so impressed by them - they look to be syntactically correct, but semantically they're gibberish. But knowing the basics of how LLMs choose the next token let me imagine some bugs where the "lookup table" if you will of word-to-token or vice versa (and I realize that may not be exactly the best analogy) was slightly offset.

Further, this sentence, "More technically, inference kernels produced incorrect results when used in certain GPU configurations." led me to understand how this could make it out into production - I'm sure ChatGPT has to be using tons of different GPUs in different configs to handle their volume, so it's understandable to me that their was a new matrix of config settings + code that made this bug show up.

I don't say any of the above to "excuse" OpenAI, but I also don't think they need any "excusing" to begin with. I don't think this was an unforgivable bug, and I appreciate them being so quick to explain what happened.

[+] laborcontract|2 years ago|reply

The explanation is definitely unsatisfying. If I had to guess it was an issue with quantization.

OpenAI moves so quickly with their product and doesn't seem to be shy about quickly pushing changes to prod. There are too many times to count that, using ChatGPT, I've seen the frontend changing before my eyes or seen responses change as backend modifications are pushed.

On one hand it's refreshing to see their lack of aversion to push changes. On the other hand, it leads to probably one of the most unpredictable experiences I've had using any consumer product.

[+] jasonjmcghee|2 years ago|reply

Huh. Thought it was pretty clear. I read it as:

~ "We tried a new optimization technique that modifies how next token candidates are chosen and there was a bug."

That would definitely produce the behavior people saw and makes sense.

[+] rmbyrro|2 years ago|reply

"in certain GPU configurarions" gives a hint for why it wasn't caught in automated testing.

It looks like it was a mandelbug, which is hard to catch in a test environment.

[+] nomel|2 years ago|reply

> inference kernels produced incorrect results when used in certain GPU configurations.

Seems pretty clear. A good interpretation is that they had a test escape for certain GPU configs.

[+] Zetobal|2 years ago|reply

FP is not deterministic across GPU generations. They basically say with a lot of mambo jambo that they deployed code on the wrong GPU architecture.

[+] cfn|2 years ago|reply

I don't see the issue, in this type of software a one off bug in tokenization, for example, would create the type of nonsense people saw. That would make sense given their high level explanation.

[+] unknown|2 years ago|reply

[deleted]

[+] mavamaarten|2 years ago|reply

Honestly, I don't think we're entitled to a deep technical write-up. We are users of their product, they messed something up, and that's it. Unless they signed an SLA agreement, it's just shit that happens. They don't have to give a technical view into their internal kitchen if they don't want to.

[+] bjackman|2 years ago|reply

I don't a the problem with this communication, except they just shouldn't call it a "postmortem".

In a medical setting this is equivalent to saying "the patient was coughing up blood, and now they're dead". It's relaying some potentially useful info, and I guess it's _technically_ a post-mortem analysis but it doesn't fit expectations for the phrase!

[+] chime|2 years ago|reply

I hope I'm allowed to share GPT output here because I asked GPT4 to explain OpenAI's postmortem message and this is what it said:

> Large language models (LLMs) like GPT operate using statistics to predict the next word in a sentence. Think of it like a highly sophisticated version of your phone's autocorrect that predicts what you're going to type next, but on a much larger scale.

> The system represents words as numbers, or "tokens," as part of its internal language. It uses these tokens to generate text by selecting them based on how likely they are to come after what's already been written.

> On February 20, 2024, an update meant to improve the system introduced a glitch in the part of the model that picks out these numbers. Imagine trying to grab a red marble from a bag without looking and accidentally grabbing a blue one instead—similarly, the model grabbed the wrong tokens.

> Because of this, the system started spewing out words that didn't make sense together. It's like if your autocorrect suddenly started inserting random, incorrect words into your sentences.

> The technical side of the bug involved something called "inference kernels," which are part of the system's operations. When used with certain types of graphics processing units (GPUs)—special hardware to process data—the kernels didn't work properly.

> Once the error was spotted, a correction was made to the system. After the fix, everything went back to normal, and the model resumed generating coherent responses.

It would be better if they elaborated on what "certain GPU configurations" meant because that's basically the central piece here.

[+] block_dagger|2 years ago|reply

I experienced this personally and it kinda freaked me out. Here is the chat in question, it occurs about halfway through (look for ChatGPT using emojis)

https://chat.openai.com/share/74bd7c02-79b5-4c99-a3a5-97b83f...

EDIT: Note that my personal instructions tell ChatGPT to refer to itself as Chaz in the third person. I find this fun.

EDIT2: Here is a snippet of the conversation on pastebin: https://pastebin.com/AXzd6PvM

[+] atleastoptimal|2 years ago|reply

Almost a tautological PR paragraph: "The words were wrong because the model chose the wrong words"

[+] nf3|2 years ago|reply

Is it possible that even us developers and hackers, who should know better, have fallen for the hugely exaggerated promise of AI? I read the comments on here and it's as if people really expect to be having an intelligent conversation with a rational being.

A kind reminder people: it's just a machine, the only thing that might be intelligent about it is the designs of its makers, and even then I'm not so sure...

People are talking about ChatGPT hallucinating. I think it's rather us humans who are.

[+] BeefySwain|2 years ago|reply

I am used to postmortems posted to here being a rare chance for us to take a peek behind the curtain and get a glimpse into things like architecture, monitoring systems, disaster recovery processes, "blameless culture", etc for large software service companies.

In contrast, I feel like like the greatest insight that could be gleaned from this post is that OpenAI uses GPU's.

[+] bee_rider|2 years ago|reply

We also know it uses the GPUs to generate numbers. But these numbers, they were the wrong ones. More technically, part of the computation didn’t work when run on some hardware.

[+] dimatura|2 years ago|reply

Yeah, definitely opaque. If I had to guess it sort of sounds like a code optimization that resulted in a numerical error, but only in some GPUs or CUDA versions. I've seen that sort of issue happen a few times in the pytorch framework, for example.

[+] jldugger|2 years ago|reply

It sounds like something went sideways with the embedding mapping. Either some kind of quantization, different rounding, or maybe just an older embedding.

[+] alephxyz|2 years ago|reply

Someone posted an explanation that lines up with their postmortem: https://news.ycombinator.com/item?id=39450978

[+] yieldcrv|2 years ago|reply

ChatGPT had a stroke. Haven't seen that since the 3B parameter models from 8 months ago

[+] prmoustache|2 years ago|reply

They could have said "shit happened" and it would have been as informative tbh.

[+] minimaxir|2 years ago|reply

EDIT: misread

[+] Bjorkbat|2 years ago|reply

If you had given me their explanation of events before I had any knowledge of the output from ChatGPT, then I would infer that the output would be random gibberish, or perhaps something more akin to creating an accidental shift cipher of sorts. Instead, while ChatGPT's outputs still made no sense, they still followed a certain "order".

In one amusing Twitter example, some guy asked it a math problem and ChatGPT replied with "It's the mix, it's the match, it's the meter, it's the method." and repeated sentences of this structure for who knows how long.

I guess what I'm getting at is that it's kind of an underwhelming, unsatisfying explanation of events given how truly bizarre some of the outputs are. Like, you'd assume it would be something more other than "Oops, picked the wrong numbers, gonna repeat this sentence 100x but with slightly different word choices each time".

[+] tky|2 years ago|reply

While the postmortem is nice to see at all, it’s the postmortem equivalent to “Bug fixes and performance improvements.”

[+] cm2012|2 years ago|reply

This should maybe help out the people who think ChatGPT has actual consciousness. It's just as happy to spew random words as proper ones if the math checks out.

[+] dkarras|2 years ago|reply

I have no skin on the consciousness game but those were not "random" words and humans do something similar when they are mentally ill. https://en.wikipedia.org/wiki/Clanging

Not to mention that a hypothetical "conscious" system that works by emitting token probabilities will still sound completely random if you do not choose the tokens according to the emitted probabilities.

[+] raldi|2 years ago|reply

Similarly, I thought my Chinese friend was conscious, but as soon as his translator app stopped working, everything was just gibberish.

[+] bbor|2 years ago|reply

Posting one more time: this is proof that AI is connected to human-like linguistic patterns, IMO. No, it obviously doesn’t have “consciousness” in the sense of an ongoing stream-of-consciousness monologue, but that doesn’t mean it’s not mimicking some real part of human cognition.

https://en.wikipedia.org/wiki/Colorless_green_ideas_sleep_fu...

[+] tsunamifury|2 years ago|reply

It literally doesn’t matter. If it effectively quacks like a duck then ducks are fucked.

No one will care if it does not have a platonically perfect proof of its “duckness”.

[+] skissane|2 years ago|reply

Whether "ChatGPT has actual consciousness" depends on what you consider "consciousness" to be, and what are your criteria for deciding whether something has it.

Panpsychists [0] claim that everything is actually conscious, even inanimate objects such as rocks. If rocks have actual consciousness, why can't ChatGPT have it too? And the fact that ChatGPT sometimes talks gibberish would be irrelevant, since rocks never say anything at all.

Of course, you obviously aren't a panpsychist – nor am I. Still, can we prove that they are wrong? Not sure if anyone actually can.

[0] https://iep.utm.edu/panpsych/

[+] stevenAthompson|2 years ago|reply

Are you arguing that people with aphasia are not people?

[+] swyx|2 years ago|reply

bad argument that i'm very tired of. some might say that current/former world leaders also exhibit this property. not getting political but just because "math fucked up sometimes produces bad results" does not invalidate the idea that consciousness can emerge from a pile of biological or digital neurons.

[+] pushedx|2 years ago|reply

Midjourney's image models have multiple "temperature-like" parameters, such as --weird and --chaos. In the documentation you can see examples of how they visually affect the output. With high enough values the images seem almost unrelated to the prompt. My (almost entirely unfounded) guess is that ChatGPT has similar parameters, and on a new class of hardware or configuration there was an overflow or underflow issue which caused these parameters to be set to very high values.

https://docs.midjourney.com/docs/weird-1

https://docs.midjourney.com/docs/chaos-1

[+] FridgeSeal|2 years ago|reply

Wow, could they have put any less effort in?

“Stuff happened. It was bad but now it’s good”

Yep? Ok great, solid PM guys - did you have ChatGPT write this for you on one of its “lazy days”?

[+] john2x|2 years ago|reply

Is this the future of postmortems? "We tweaked a random seed, and it resulted in unfavorable random outcomes. "

[+] antihipocrat|2 years ago|reply

Organisations are starting to use LLMs in production with the aim to reduce human operations.

I wonder how long until there is a major incident that seriously impacts customers experience and/or business systems.

[+] oliverpk|2 years ago|reply

Not really a postmortem in my opinion. While they did answer the central question of "why", I feel like a proper postmortem should question the problem more.

[+] okdood64|2 years ago|reply

Say what you will about Google, but bugs such as these, released to the public (or their enterprise customers) this debilitating to a core product, are exceedingly rare.

[+] 1oooqooq|2 years ago|reply

Hope all those people replying emails via chatGPT without reading much had iphone-style signatures to excuse them of any typo^H^H^H^Hallucinations.

[+] mcguire|2 years ago|reply

"Akin to being lost in translation, the model chose slightly wrong numbers, which produced word sequences that made no sense. More technically, inference kernels produced incorrect results when used in certain GPU configurations."

Several of the examples I saw involved ChatGPT going into what looked like repetitive (but not completely so) output loops. I'm not sure that "explanation" matches the things.

[+] treyd|2 years ago|reply

I can imagine it'd get screwy when the incorrect output token selection gets fed back into the model with the correct incorrect token selection with nonsense tokens. It's plausible.

[+] Fatalist_ma|2 years ago|reply

I remember Bing Chat doing that sometimes in the first days when it was rolled out. Could it be the "temperature" set too high (or interpreted incorrectly) in some instances?

[+] noduerme|2 years ago|reply

>> inference kernels produced incorrect results when used in certain GPU configurations.

It seems reasonable to assume that GPT inference is done entirely on Nvidia GPUs. I wonder if this is a subtle clue that they're experimenting with getting it to run on competing hardware.

[+] ShamelessC|2 years ago|reply

Why would they subtly hint at anything they’re trying to keep secret?

[+] vinay_ys|2 years ago|reply

Since when did incident postmortems became this watered down non-technical BS. If I'm a paying corporate customer, I would expect a much greater detailed RCA and action plan to prevent similar occurrence in the future. Publishing these postmortems is about holding yourself accountable in public in a manner that shows how thorough and seriously you take it and this does not accomplish that.

At minimum, I want a why-5 analysis. Let me start with the first question:

1. Why did ChatGPT generate gibberish?

A: the model chose slightly wrong numbers.

2. Why did the model choose slightly wrong numbers?

A: ??

[+] redskyluan|2 years ago|reply

Lessons learned:

The Importance of Model Agnosticism: With the rapid evolution of AI models, building applications that are model-agnostic has become more critical than ever.

Control and Interpretability Matter: Relying solely on large language models (LLMs) poses significant challenges for creating applications that can be deployed in real-world scenarios.

The Need for Open Models: Lastly, the push for more open models has never been more apparent. Open-source models are essential for fostering innovation, ensuring accessibility, and maintaining the integrity of our work in the AI field.

[+] carlosbaraza|2 years ago|reply

If a human had this failure, it would probably be something like a psychotic episode. If a super intelligence had a psychotic episode because of a bug, it could be pretty destructive.

260 comments