This explanation feels unsatisfying. It's so high-level that it's mostly void of any actual information.
What was the wrong assumption that the code made that caused this wrong behavior? Why was it not caught in the many layers of automated testing before it made its way to production? What process and procedural changes are being implemented to reduce the risk of this class of bug happening again?
Presumably all of that is playing out internally, but if the public postmortem is meant to instill confidence, you have to actually share some of the details, or else it becomes meaningless.
I think your questions all come from a world where the people operating the thing knew some rationalist who could think deductively about its operation.
But neural networks... they're an exercise in empiricism. We only ever understood that it works, never why. It's sort of a miracle that it doesn't produce buggy output all the time.
What do you tell people when they want to know why the miracles have stopped? Root cause: the gods are angry.
> On February 20, 2024, an optimization to the user experience
At that point, about 10 words in, I already wanted to stop reading, because it starts with the "we only wanted the best for our customers" bullshit newspeak. Anyone else going off on that stuff too? I'm pretty much already conditioned to expect whatever company is messaging me that way to take away some feature, increase pricing, or otherwise piss me off. In this case it was "not giving any interesting detail at all".
I had the exact opposite reaction. I am in no way an AI expert (or novice for that matter), but I generally have an understanding of how tokenization works and how LLMs parse text strings into a series of tokens. Thus, I thought this paragraph was particularly well written, explaining pretty clearly what happened in a manner accessible to a layperson like me:
> In this case, the bug was in the step where the model chooses these numbers. Akin to being lost in translation, the model chose slightly wrong numbers, which produced word sequences that made no sense.
I liked this because when I first saw the example word salads I was so impressed by them - they look to be syntactically correct, but semantically they're gibberish. But knowing the basics of how LLMs choose the next token let me imagine some bugs where the "lookup table", if you will, of word-to-token or vice versa (and I realize that may not be exactly the best analogy) was slightly offset.
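To make that "offset lookup table" idea concrete, here's a toy sketch in Python. This is emphatically not how GPT's tokenizer actually works - the vocab and ids are made up - but it shows how a one-position shift in decoding produces fluent-looking nonsense rather than random noise:

```python
# Toy sketch (not OpenAI's actual tokenizer): decoding token ids
# through a lookup table that has been shifted by one position.
vocab = ["the", "cat", "sat", "on", "a", "warm", "mat", "today"]

def decode(ids, offset=0):
    # An off-by-one in this index is enough to turn a sensible
    # sentence into gibberish that still looks like real words.
    return " ".join(vocab[(i + offset) % len(vocab)] for i in ids)

ids = [0, 1, 2, 3, 4, 6]
print(decode(ids))            # the cat sat on a mat
print(decode(ids, offset=1))  # cat sat on a warm today
```

Every output token is still a legitimate word from the vocabulary, which is why the broken output reads as "syntactically plausible" gibberish rather than line noise.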
Further, this sentence, "More technically, inference kernels produced incorrect results when used in certain GPU configurations.", led me to understand how this could make it out into production - I'm sure ChatGPT has to be using tons of different GPUs in different configs to handle their volume, so it's understandable to me that there was a new matrix of config settings + code that made this bug show up.
I don't say any of the above to "excuse" OpenAI, but I also don't think they need any "excusing" to begin with. I don't think this was an unforgivable bug, and I appreciate them being so quick to explain what happened.
The explanation is definitely unsatisfying. If I had to guess, it was an issue with quantization.
OpenAI moves so quickly with their product and doesn't seem to be shy about quickly pushing changes to prod. There are too many times to count that, using ChatGPT, I've seen the frontend changing before my eyes or seen responses change as backend modifications are pushed.
On one hand it's refreshing to see their lack of aversion to push changes. On the other hand, it leads to probably one of the most unpredictable experiences I've had using any consumer product.
I don't see the issue; in this type of software a one-off bug in tokenization, for example, would create the type of nonsense people saw. That would make sense given their high-level explanation.
Honestly, I don't think we're entitled to a deep technical write-up. We are users of their product, they messed something up, and that's it. Unless they signed an SLA, it's just shit that happens. They don't have to give a technical view into their internal kitchen if they don't want to.
I don't see a problem with this communication, except they just shouldn't call it a "postmortem".
In a medical setting this is equivalent to saying "the patient was coughing up blood, and now they're dead". It's relaying some potentially useful info, and I guess it's _technically_ a post-mortem analysis but it doesn't fit expectations for the phrase!
I hope I'm allowed to share GPT output here because I asked GPT4 to explain OpenAI's postmortem message and this is what it said:
> Large language models (LLMs) like GPT operate using statistics to predict the next word in a sentence. Think of it like a highly sophisticated version of your phone's autocorrect that predicts what you're going to type next, but on a much larger scale.
> The system represents words as numbers, or "tokens," as part of its internal language. It uses these tokens to generate text by selecting them based on how likely they are to come after what's already been written.
> On February 20, 2024, an update meant to improve the system introduced a glitch in the part of the model that picks out these numbers. Imagine trying to grab a red marble from a bag without looking and accidentally grabbing a blue one instead—similarly, the model grabbed the wrong tokens.
> Because of this, the system started spewing out words that didn't make sense together. It's like if your autocorrect suddenly started inserting random, incorrect words into your sentences.
> The technical side of the bug involved something called "inference kernels," which are part of the system's operations. When used with certain types of graphics processing units (GPUs)—special hardware to process data—the kernels didn't work properly.
> Once the error was spotted, a correction was made to the system. After the fix, everything went back to normal, and the model resumed generating coherent responses.
It would be better if they elaborated on what "certain GPU configurations" meant because that's basically the central piece here.
I experienced this personally and it kinda freaked me out. Here is the chat in question; the glitch occurs about halfway through (look for ChatGPT using emojis).
Is it possible that even us developers and hackers, who should know better, have fallen for the hugely exaggerated promise of AI? I read the comments on here and it's as if people really expect to be having an intelligent conversation with a rational being.
A kind reminder people: it's just a machine, the only thing that might be intelligent about it is the designs of its makers, and even then I'm not so sure...
People are talking about ChatGPT hallucinating. I think it's rather us humans who are.
I am used to postmortems posted here being a rare chance for us to take a peek behind the curtain and get a glimpse into things like architecture, monitoring systems, disaster recovery processes, "blameless culture", etc for large software service companies.
In contrast, I feel like the greatest insight that could be gleaned from this post is that OpenAI uses GPUs.
We also know it uses the GPUs to generate numbers. But these numbers, they were the wrong ones. More technically, part of the computation didn’t work when run on some hardware.
Yeah, definitely opaque. If I had to guess it sort of sounds like a code optimization that resulted in a numerical error, but only in some GPUs or CUDA versions. I've seen that sort of issue happen a few times in the pytorch framework, for example.
It sounds like something went sideways with the embedding mapping. Either some kind of quantization, different rounding, or maybe just an older embedding.
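On the quantization/rounding guess: here's a tiny, hypothetical illustration of how a small numerical error - the kind a low-precision kernel on one GPU config might introduce - can flip which token wins. The numbers are invented and real inference kernels are far more complex; this is just the mechanism in miniature:

```python
def quantize(xs, scale=0.1):
    # Crude uniform quantization: snap each value to the nearest
    # multiple of `scale`, as a low-precision code path might.
    return [round(x / scale) * scale for x in xs]

def argmax(xs):
    # Index of the largest value (first one wins on ties).
    return max(range(len(xs)), key=lambda i: xs[i])

logits = [2.28, 2.31, -1.0]      # token 1 barely wins at full precision
print(argmax(logits))            # 1
print(argmax(quantize(logits)))  # 0 -- both leaders snap to 2.3, and the
                                 # tie now resolves to a different token
```

A per-step error that tiny wouldn't make the output random - it would make the model confidently pick a plausible-but-wrong neighbor, which matches the "fluent gibberish" people saw.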
If you had given me their explanation of events before I had any knowledge of the output from ChatGPT, I would have inferred that the output would be random gibberish, or perhaps something more akin to an accidental shift cipher of sorts. Instead, while ChatGPT's outputs still made no sense, they still followed a certain "order".
In one amusing Twitter example, some guy asked it a math problem and ChatGPT replied with "It's the mix, it's the match, it's the meter, it's the method." and repeated sentences of this structure for who knows how long.
I guess what I'm getting at is that it's kind of an underwhelming, unsatisfying explanation of events given how truly bizarre some of the outputs are. Like, you'd assume it would be something more than "Oops, picked the wrong numbers, gonna repeat this sentence 100x but with slightly different word choices each time".
This should maybe help out the people who think ChatGPT has actual consciousness. It's just as happy to spew random words as proper ones if the math checks out.
I have no skin in the consciousness game, but those were not "random" words, and humans do something similar when they are mentally ill. https://en.wikipedia.org/wiki/Clanging
Not to mention that a hypothetical "conscious" system that works by emitting token probabilities will still sound completely random if you do not choose the tokens according to the emitted probabilities.
Posting one more time: this is proof that AI is connected to human-like linguistic patterns, IMO. No, it obviously doesn’t have “consciousness” in the sense of an ongoing stream-of-consciousness monologue, but that doesn’t mean it’s not mimicking some real part of human cognition.
Whether "ChatGPT has actual consciousness" depends on what you consider "consciousness" to be, and on what your criteria are for deciding whether something has it.
Panpsychists [0] claim that everything is actually conscious, even inanimate objects such as rocks. If rocks have actual consciousness, why can't ChatGPT have it too? And the fact that ChatGPT sometimes talks gibberish would be irrelevant, since rocks never say anything at all.
Of course, you obviously aren't a panpsychist – nor am I. Still, can we prove that they are wrong? Not sure if anyone actually can.
[0] https://iep.utm.edu/panpsych/
Bad argument that I'm very tired of. Some might say that current/former world leaders also exhibit this property. Not getting political, but just because "math fucked up sometimes produces bad results" does not invalidate the idea that consciousness can emerge from a pile of biological or digital neurons.
Midjourney's image models have multiple "temperature-like" parameters, such as --weird and --chaos. In the documentation you can see examples of how they visually affect the output. With high enough values the images seem almost unrelated to the prompt. My (almost entirely unfounded) guess is that ChatGPT has similar parameters, and on a new class of hardware or configuration there was an overflow or underflow issue which caused these parameters to be set to very high values.
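For what it's worth, here's roughly what an out-of-range temperature does to the standard softmax over some made-up logits (a sketch of the textbook formula, not OpenAI's code). If an overflow or config bug pushed temperature far too high, every token would become nearly equally likely:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # Standard temperature-scaled softmax, computed stably by
    # subtracting the max before exponentiating.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.0, 2.0, 0.0, -3.0]
print(softmax_with_temperature(logits, 1.0))  # sharply peaked on token 0
print(softmax_with_temperature(logits, 1e6))  # nearly uniform: the model's
                                              # preferences are washed out
```

With a sane temperature the top token dominates; with an absurd one the distribution flattens and sampling degenerates toward picking words almost at random - broadly consistent with the "high --weird/--chaos" look.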
Not really a postmortem in my opinion. While they did answer the central question of "why", I feel like a proper postmortem should probe the problem more deeply.
Say what you will about Google, but bugs this debilitating to a core product, released to the public (or their enterprise customers), are exceedingly rare.
"Akin to being lost in translation, the model chose slightly wrong numbers, which produced word sequences that made no sense. More technically, inference kernels produced incorrect results when used in certain GPU configurations."
Several of the examples I saw involved ChatGPT going into what looked like repetitive (but not completely so) output loops. I'm not sure that "explanation" matches what people actually saw.
I can imagine it'd get screwy when the incorrect output token selections get fed back into the model as context, filling it with nonsense tokens. It's plausible.
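A toy way to see how that feedback could trap generation in a near-repetitive loop: below, a deterministic lookup table stands in for the model's next-token choice (purely illustrative - a real LM is probabilistic). Once a wrong pick lands on a token whose continuations form a cycle, the output repeats forever, loosely like the looping outputs in the screenshots:

```python
# Toy deterministic "next token" table (illustrative, not a real LM).
# A single wrong pick drops the generation into a cycle it never leaves.
next_token = {
    "it's": "the",
    "the": "mix,",
    "mix,": "it's",   # cycle: it's -> the -> mix, -> it's -> ...
    "what": "is",
    "is": "2+2?",
}

def generate(start, steps):
    # Autoregressive loop: each output token becomes the next input.
    out = [start]
    for _ in range(steps):
        out.append(next_token[out[-1]])
    return " ".join(out)

print(generate("what", 2))  # what is 2+2?
print(generate("it's", 7))  # it's the mix, it's the mix, it's the
```

In an actual model the cycle wouldn't be exact - sampling noise would vary the wording each pass - which would give you the "repeated sentence with slightly different word choices" pattern people reported.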
I remember Bing Chat doing that sometimes in the first days when it was rolled out. Could it be the "temperature" set too high (or interpreted incorrectly) in some instances?
>> inference kernels produced incorrect results when used in certain GPU configurations.
It seems reasonable to assume that GPT inference is done entirely on Nvidia GPUs. I wonder if this is a subtle clue that they're experimenting with getting it to run on competing hardware.
Since when did incident postmortems become this watered-down, non-technical BS? If I'm a paying corporate customer, I would expect a much more detailed RCA and an action plan to prevent similar occurrences in the future. Publishing these postmortems is about holding yourself accountable in public in a manner that shows how thoroughly and seriously you take it, and this does not accomplish that.
At minimum, I want a why-5 analysis. Let me start with the first question:
1. Why did ChatGPT generate gibberish?
A: the model chose slightly wrong numbers.
2. Why did the model choose slightly wrong numbers?
A: ??
The Importance of Model Agnosticism: With the rapid evolution of AI models, building applications that are model-agnostic has become more critical than ever.
Control and Interpretability Matter: Relying solely on large language models (LLMs) poses significant challenges for creating applications that can be deployed in real-world scenarios.
The Need for Open Models: Lastly, the push for more open models has never been more apparent. Open-source models are essential for fostering innovation, ensuring accessibility, and maintaining the integrity of our work in the AI field.
If a human had this failure, it would probably be something like a psychotic episode. If a super intelligence had a psychotic episode because of a bug, it could be pretty destructive.
~ "We tried a new optimization technique that modifies how next token candidates are chosen and there was a bug."
That would definitely produce the behavior people saw and makes sense.
It looks like it was a mandelbug, which is hard to catch in a test environment.
Seems pretty clear. A good interpretation is that they had a test escape for certain GPU configs.
https://chat.openai.com/share/74bd7c02-79b5-4c99-a3a5-97b83f...
EDIT: Note that my personal instructions tell ChatGPT to refer to itself as Chaz in the third person. I find this fun.
EDIT2: Here is a snippet of the conversation on pastebin: https://pastebin.com/AXzd6PvM
https://en.wikipedia.org/wiki/Colorless_green_ideas_sleep_fu...
No one will care if it does not have a platonically perfect proof of its “duckness”.
https://docs.midjourney.com/docs/weird-1
https://docs.midjourney.com/docs/chaos-1
“Stuff happened. It was bad but now it’s good”
Yep? Ok great, solid PM guys - did you have ChatGPT write this for you on one of its “lazy days”?
I wonder how long until there is a major incident that seriously impacts customers experience and/or business systems.