How a 20 year old bug in GTA San Andreas surfaced in Windows 11 24H2

1371 points | yett | 10 months ago | cookieplmonster.github.io

298 comments

bombcar|10 months ago

This is the kind of thing I'd expect from Raymond Chen - which is extremely high praise!

I'm glad they tracked it down even further to figure out exactly why.

aneutron|10 months ago

Or randomascii. A freaking legend (although he had a heartbreaking streak of bad events ... I wish him the best)

martinsnow|10 months ago

Raymond is a wizard. I've read his blog for many years and love his style and knowledge.

amenghra|10 months ago

IMHO, if something isn’t part of the contract, it should be randomized. E.g., if iteration order of maps isn’t guaranteed in your language, then your language should go out of its way to randomize it. Otherwise, you end up with brittle code: code that works fine until it doesn’t.

bri3d|10 months ago

There are various compiler options like -ftrivial-auto-var-init to initialize uninitialized variables to specific (or random) values in some situations, but overall, randomizing (or zeroing) the full content of the stack in each function call would be a horrendous performance regression and isn't done for this reason.

frollogaston|10 months ago

Randomization at this level would be too expensive. There are tools that do this for debug purposes, and your stuff runs a lot slower in that mode.

abnercoimbre|10 months ago

Regarding contracts, there's an additional lesson here, quoting from the source:

> This is an interesting lesson in compatibility: even changes to the stack layout of the internal implementations can have compatibility implications if an application is bugged and unintentionally relies on a specific behavior.

I suppose this is why Linux kernel maintainers insist on never breaking user space.

tantalor|10 months ago

Nope. You have to remember https://www.hyrumslaw.com/

  With a sufficient number of users of an API,
  it does not matter what you promise in the contract:
  all observable behaviors of your system
  will be depended on by somebody.

If you promise randomization, then somebody will depend on that :)

And then you can never remove it!

ormax3|10 months ago

one might argue that one of the advantages of languages like C is that you only pay for the features you choose to use, no unnecessary overhead like initializing unused variables

willcipriano|10 months ago

Then you are wasting runtime clock cycles randomizing lists.

gzalo|10 months ago

I agree, this can also detect brittle tests (e.g., test methods/classes that only pass if executed in a particular order). But applying it to all data could be computationally expensive.

mras0|10 months ago

Not really the ethos of C(++), though of course this particular bug would be easily caught by running a debug build (even 20 years ago). However, this being a game "true" debug builds were probably too slow to be usable. That was at least my experience doing gamedev in that timeframe. Then again code holding up for 20 years in that line of biz is more than sufficient anyway :)

plutaniano|10 months ago

Aren't you just creating another contract? Users might write code that depends on it being random.

codebje|10 months ago

I once updated a little shy of 1mloc of Perl 5.8 code to run on Perl 5.32 (ish). There were, overall, remarkably few issues that cropped up. One of these issues (that showed itself a few times) was more or less exactly this: the iteration order through a hash is not defined. It has never been defined, but in Perl 5.8 it was consistent: for the same insertion order of the same set of keys, a hash would always iterate in the same way. In a later Perl it was deliberately randomised, not just once, but on every iteration through the hash.

It turned out there were a few places that had assumed a predictable - not just stable, but deterministic - hash key iteration order. Mostly this showed up as tests that failed 50% of the time, which suggested to me a rough measure: how annoying an error is to track down is inversely correlated with how often the error appears in tests.
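
A minimal C++ sketch of the usual fix for this class of failure: never rely on the container's iteration order, and sort the keys whenever a stable sequence matters. `sorted_keys` is a hypothetical helper, and `std::unordered_map` only stands in for the Perl hash here (its iteration order is likewise unspecified).

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Returns the keys of an unordered_map in sorted order, so callers get a
// deterministic sequence no matter how the container lays out its buckets.
std::vector<std::string> sorted_keys(const std::unordered_map<std::string, int>& m) {
    std::vector<std::string> keys;
    keys.reserve(m.size());
    for (const auto& entry : m) {
        keys.push_back(entry.first);  // visited in unspecified order
    }
    std::sort(keys.begin(), keys.end());
    assert(std::is_sorted(keys.begin(), keys.end()));  // postcondition
    return keys;
}
```

A test that compares against `sorted_keys(m)` passes regardless of bucket layout, whereas one that walks `m` directly only passes by accident.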

(Other issues were mostly due to the fact that Perl 5 is all but abandoned by its former community: a few CPAN modules are just gone, some are so far out of date that they can't be coerced to still work with other modules that have been updated over time. )

roseway4|10 months ago

iirc, Go intentionally randomizes map ordering for just this reason.

jandrese|10 months ago

> Not ignore the compilation warnings – this code most likely threw a warning in the original code that was either ignored or disabled!

What compiler error would you expect here? Maybe not checking the return value from scanf to make sure it matches the number of parameters? Otherwise this seems like a data file error that the compiler would have no clue about.

kristianp|10 months ago

Trying g++ version 11.4, there's no warning by default if you don't check the return value of sscanf. Even `g++ -Wall -Wextra -Wunused-result` produces no warnings for a small example.

burch45|10 months ago

It's undefined behavior to access uninitialized memory. A sanitizer would have flagged that.

phire|10 months ago

Good point. When reading, I kind of just assumed the "use of uninitialised memory" warning would pick this up.

But because the whole line is parsed in a single sscanf call, the compiler's static analysis is forced to assume the variables have now been initialised. There doesn't seem to be any generic static analysis approach that can catch this bug.

Though... you could make a specialised warning just for scanf that forced you to either pass in pre-initialised values or check the return result.
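
As a rough sketch of the second option, assuming a made-up three-field line rather than the game's real format: check how many fields sscanf converted and only commit the results when all of them did. `WheelScales` and `parse_wheel_line` are hypothetical names.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical record; the field names are illustrative, not the game's.
struct WheelScales {
    float front;
    float rear;
};

// Parses "<id> <front> <rear>". Returns true only when all three fields were
// converted; on failure the caller's id and out are left untouched.
bool parse_wheel_line(const char* line, int& id, WheelScales& out) {
    int parsed_id = 0;
    float front = 0.0f;
    float rear = 0.0f;
    // sscanf returns the number of fields it assigned (or EOF on early failure).
    const int converted = std::sscanf(line, "%d %f %f", &parsed_id, &front, &rear);
    if (converted != 3) {
        return false;
    }
    id = parsed_id;
    out = WheelScales{front, rear};
    return true;
}
```

A short line such as `"1"` makes sscanf return 1, so the function reports failure instead of handing back whatever happened to be in the output variables.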

maz1b|10 months ago

I always enjoy reading deeply technical writeups like these. I only wonder how much more rare they may or may not get in the AI era.

Cthulhu_|10 months ago

I don't think they will get more rare; there will always be a top % of engineers that do deep dives. I hope anyway.

But AI won't replace them, nor did the past 50+ years of software development innovation. There are millions (tens of millions?) of higher-level programming language developers who don't know the difference between stack and heap besides maybe some theory they half remember from school, but they don't care because they don't have to think about it for their day job.

senda|10 months ago

I think the shift will be from craftsmen to tradesmen with regard to general software engineers, but these types of write-ups stem from an artisan style all its own.

adzm|10 months ago

I'm more curious about what changed with the critical section locking/unlocking implementation in this version of Windows!

mjevans|10 months ago

It looks like the utilized stack, or a stack protection area, increased.

rossant|10 months ago

Am I the only one to be annoyed by this...?

while (this->m_fBladeAngle > 6.2831855) { this->m_fBladeAngle = this->m_fBladeAngle - 6.2831855; }

Like, "let's just write a while loop that could turn into an infinite loop coz I'm too lazy to do a division"

nemothekid|10 months ago

I want to assume that the GTA developers did this hack because it was faster than floating point division on the PlayStation 2 or something.

But knowing they were able to blow up loading GTA5 by 5 minutes by just parsing json with sscanf, I don't have much hope.

GeoAtreides|10 months ago

I'm willing to bet it was done for performance reasons; subtraction is cheaper than floating-point division. Probably the compiler also has some tricks to optimize this further.

There is absolutely no way this could turn into an infinite loop. It could underflow, but for that to happen the angle would have to be less than 2*pi, therefore exiting the loop.

anal_reactor|10 months ago

Long shot, but maybe if the value is small, then this loop could be faster than division.

hoten|10 months ago

for real. The author clearly never heard of fmod
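
For comparison, a small sketch of both approaches side by side. `wrap_by_loop` mirrors the snippet quoted upthread and `wrap_by_fmod` is the `std::fmod` alternative; both names and `kTwoPi` are mine, not the game's. Note the two are not identical: the loop leaves negative angles and exactly 2*pi untouched, while `std::fmod` returns a remainder with the sign of its first argument.

```cpp
#include <cassert>
#include <cmath>

constexpr float kTwoPi = 6.2831855f;

// Loop version: the number of iterations grows with the size of the angle.
// For a float so large that subtracting 2*pi no longer changes it
// (e.g. 1.0e30f), this loop would not terminate.
float wrap_by_loop(float angle) {
    while (angle > kTwoPi) {
        angle -= kTwoPi;
    }
    return angle;
}

// fmod version: a single call regardless of how large the angle is.
float wrap_by_fmod(float angle) {
    return std::fmod(angle, kTwoPi);
}
```

For a blade angle that only ever overshoots by a fraction of a turn per frame, the loop runs at most once, which may be why nobody bothered; that is a guess, not something the article states.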

rs186|10 months ago

Knowing C/C++, I more or less guessed what's happening (uninitialized variable) early in the blog post.

It blows my mind that the languages allow you to leave variables uninitialized which has caused countless bugs (including production bugs that I have seen first hand), and you often need to rely on additional compiler flags or static analysis tools/valgrind etc to catch them. Even though newer languages often use a different solution (default zero value or must initialize a variable before use), people still go back to C/C++ all the time.

dusted|10 months ago

> all these findings prove that the bug is NOT an issue with Windows 11 24H2, as things like the way the stack is used by internal WinAPI functions are not contractual and they may change at any time, with no prior notice.

This reminds me of an excellent article I read a while back, the gist of it was that, given sufficient success, there's no such thing as a private API.

someperson|10 months ago

Could you please find this article and link it here. I'm curious about the arguments.

kidfiji|10 months ago

I know there’s an XKCD comic about this

carlos-menezes|10 months ago

Much love to Silent, who’s been improving my favorite game for over... a decade now?

pmarreck|10 months ago

My takeaway, speaking as someone who leans towards functional programming and immutability, is "this is yet another example of a mutability problem that could never happen in a functional context"

(so, for example, this bug would have never been created by Rust unless it was deeply misused)

grishka|10 months ago

This is more of a problem of the C/C++ standard that it allows uninitialized variables but doesn't give them defined values, considering it "undefined behavior" to read from an uninitialized variable. Java, for example, doesn't have this particular problem because it does specify default values for variables.

smcl|10 months ago

I think the response to that would be: yes but the game would simply not have been made if it wasn't written in C++. That's not to say you couldn't or that you can't make something like GTA:SA in Rust in 2025 or in a safer different language in the early 2000s. It just would take a great deal more time and expense as you'd have needed to construct a lot of tooling and do a lot of training to ensure all of the employees were up to speed before getting started. C++ was, and I think to some extent still is, the lingua franca of the gaming industry - there are some fun exceptions (Naughty Dog implementing much of Crash Bandicoot in a home-grown LISP, and presumably dozens or hundreds of DSLs and other little bespoke scripting languages in use at other studios).

And that's not to mention the uncomfortable truth that while doing this correctly in something like Rust may very well take less effort overall than in C++, that is not the bar we are aiming to clear. They wanted to implement something that was correct-enough, and given that this bug wasn't hit for 20+ years and that the game was a roaring success on all the major platforms - I think that was the right decision.

isatty|10 months ago

The constant rust evangelism on this site is such a turn off from actually wanting to use the language.

smj-edison|10 months ago

I'd actually say that Rust is a third option between "everything is immutable" and "mutable soup". Rust is more of "one mutator at a time". Because, Rust really embraces being able to mutate stuff (so not functional in that sense), it just makes sure that it's in a controlled way.

bentcorner|10 months ago

FWIW I think a linter or other similar code quality checker would have caught this as well. From a practical perspective (e.g., how do you prevent this from happening again in your game studio's multi-million line code base) that would have been the right thing to do here.

gavinray|10 months ago

Rust protects you from external file data you read being incorrect?

That's one hell of a language!

jdndndb|10 months ago

Could you elaborate? I cannot see how a functional programming language would have protected you from reading a nonexistent value while not providing a default.

nayuki|10 months ago

> all these findings prove that the bug is NOT an issue with Windows 11 24H2, as things like the way the stack is used by internal WinAPI functions are not contractual and they may change at any time, with no prior notice. The real issue here is the game relying on undefined behavior (uninitialized local variables), and to be honest, I’m shocked that the game didn’t hit this bug on so many OS versions, although as I pointed out earlier, it was extremely close

This sentence is the real takeaway point of the article. Undefined behavior is extremely insidious and can lull you into the belief that you were right, when you already made a mistake 1000 steps ago but it only got triggered now.

I emphasized this point in my article from years ago (but after the game was released):

> When a C or C++ program triggers undefined behavior, anything is allowed to happen in the program execution. And by anything, I really mean anything: The program can crash with an error message, it can silently corrupt data, it can morph into a colorful video game, or it can even give the right result.

> If you’re lucky, the program triggering UB will show an appropriate error message and/or crash, making you immediately aware that something went wrong. If you’re unlucky, the program will quietly mangle data, and by the time you notice the problem (via effects such as crashes or incorrect output) the root cause has been buried in the past execution history. And if you’re very unlucky, the program will do exactly what you hoped it should do, until you change some unrelated code / compiler versions / compiler vendors / operating systems / hardware platforms – and then a new bug becomes visible, and you have no clue why seemingly correct code now fails to work properly.

-- https://www.nayuki.io/page/undefined-behavior-in-c-and-cplus...

As I wrote in my article, this point really got hammered into me when a coworker showed me a patch that he made - which added a couple of innocuous, totally correct print statements to an existing C++ program - and that triggered a crash. But without his print statements, there was no crash. It turned out that there was a preexisting out-of-bounds array write, and the layout of the stack/heap somehow masked that problem before, and his unlucky prints unmasked the problem.

Okay so then, how can we do better as developers today?

0) Read, understand, and memorize what actions in C or C++ are undefined behavior. Avoid them in your code at all costs. Also obey the preconditions of any API you use, whether in the standard library, operating system, etc.

1) Compile your application in Debug mode and compare its behavior to Release mode. If they differ by anything other than speed, then you have a serious problem on your hands.

2) Compile and run with sanitizers like -fsanitize=undefined,address to catch undefined behavior at runtime.

3) Use managed languages like Java, C#, Python, etc. where you basically don't have to worry about UB in normal day-to-day code. Or use very well-designed low-level languages like Rust that are safe by default and minimize your exposure to UB when you really need to do advanced things. Whereas C and C++ have been a bonanza of UB like we have never seen before in any other language.

spookie|10 months ago

Other than C#, there is no reason to use those other languages for game dev. Unless the game is fairly simple, or you want to risk a fairly long project by employing a language that hasn't been proven in the space yet (Rust). No shade at any of those languages, I don't even like C#, just being pragmatic.

wat10000|10 months ago

I would add: code defensively. Initialize your variables (either to a sensible value, or an outrageously wrong value) before passing pointers to them, even when you "know" that the value will be overwritten. Check for errors. Always consider what happens when things go wrong, not just when things go right. Any time you find yourself thinking, "condition X is guaranteed to hold, so I don't need to check for it" consider checking it anyway just in case you're wrong about that, or it changes later.
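
A minimal sketch of the "outrageously wrong value" idea, assuming a made-up two-float line format: seed the outputs with NaN before handing their addresses to sscanf, so any field it never filled is detectable afterwards. `read_two_scales` is a hypothetical helper.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <limits>

// Reads up to two floats from `line` and returns how many were converted.
// Both outputs start as NaN, so a field sscanf skipped stays NaN instead of
// holding whatever was previously in that memory.
int read_two_scales(const char* line, float& a, float& b) {
    a = std::numeric_limits<float>::quiet_NaN();
    b = std::numeric_limits<float>::quiet_NaN();
    return std::sscanf(line, "%f %f", &a, &b);
}
```

With the input `"0.5"` the call returns 1, `a` is 0.5 and `std::isnan(b)` is true, so the missing field can be caught by either the return value or the sentinel.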

semi-extrinsic|10 months ago

I learned this lesson many moons ago, on a Fortran code I wrote for a university assignment. It was a basic genetic algorithm, and for some reason it was converging much more slowly than expected. So I was sprinkling some WRITEs to debug, and suddenly the code converged a hundred times faster.

csours|10 months ago

Once this category of error is raised to your attention, you start to notice it more and more.

A little piece of technology made sense in the original context, but then it got moved to a different context without realizing that move broke the contract. Specifically in this case a flying boat became an airplane.

---

I recently worked a bug that feels very similar:

A linux cups printer would not print to the selected tray, instead it always requested manual feed.

Ok. Try a bunch of command line options, same issue.

Ok. Make the selection directly in the PPD (postscript printer definition) file. Same issue.

Ok! Decompile the PXL file. Wrong tray is set in pxl file... why?

Check Debug2 log level for cups - Wrong MediaPosition is being sent to ghostscript (which compiles the printer options into the print job) by a cups filter... why?

Cups filter is translating the MediaPosition from the PPD file... because the philosophy of cups is to do what the user intended. The intention inferred from MediaPosition in the PPD file (postscript printer definition) is that the MediaPosition corresponds to the PWG (Printer Working Group) MediaPosition, NOT the vendor MediaPosition (or local equivalent - in this case MediaSource).

AHA!! My PPD file had been copied from a previous generation of server, from a time when that cups filter did NOT translate the MediaPosition, so the VENDOR MediaSource numbers were used. Historically, this makes sense. The vendor tray number is set in the vendor ppd file because cups didn't know how to translate that.

Fast forward to a new execution context, and cups filters have gotten better at translating user intention, now it's translating a number that doesn't need to be translated, and silently selecting the wrong tray.

TLDR; There is no such thing as a printer command, only printer suggestions.

twic|10 months ago

Infamously, this is also why Ariane 501 blew up.

(a component being reused in a new context where a contract is broken, not bad CUPS drivers)

gigatexal|10 months ago

Use a debugger folks. A 10x dev cited this story to me about the ills of not using one.

epolanski|10 months ago

I always wonder, why not write these games on top of a virtual machine like Carmack started doing in Quake, a usage he then later extended to Quake 2 and 3 [1].

I'm ignorant about game development, virtual machines and system programming but from the little I understand it seems a sensible choice to make.

While there is an initial price to pay, modeling 99% of the game to be implemented on a user-implemented stack seems a sensible approach to me.

[1] https://fabiensanglard.net/quake3/qvm.php

glandium|10 months ago

The article mentions using breakpoints, so they did use a debugger.

assassinator42|10 months ago

This is a game; I don't think a debug configuration (with checks for things like this enabled) would run fast enough to be playable on contemporary hardware.

ajross|10 months ago

Tools like valgrind/asan/msan would have flagged this instantly too. Just a unit test of that vehicle loader would have seen it.

Really this is more a story about poor development practice than it is an interesting bug.

claiir|10 months ago

Okay, but why did `LeaveCriticalSection` change? Compiler changes, new features, refactoring, etc? That’s the most interesting part (and absent)!

robohoe|10 months ago

Maybe it’s the real reason why CJ couldn’t follow the dang train.

kristofferR|10 months ago

I hope someone can figure out the Red Dead Redemption 2 bug where random animals and characters disappear silently if you have too many texture mods installed.

I spent hours looking for a badger.

josephcsible|10 months ago

tl;dr of the explanation: the Skimmer vehicle is missing a wheel scale definition, so its wheel scale gets read from uninitialized memory. On previous versions of Windows, this happened to be the wheel scale of the previously-loaded vehicle, so things happened to work fine. Starting on Windows 11 24H2, LeaveCriticalSection (which gets called between loading vehicles) uses more stack space than before, so it now overwrites that memory with a gigantic value, resulting in the Skimmer spawning so high up that it may as well not exist at all.
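
A deterministic way to picture this without actually invoking undefined behaviour: model the reused stack slot as an explicit `Scratch` struct that outlives each call. Everything here (`Scratch`, `load_vehicle`, `clobber`, the line format and the 1.0e30f value) is a hypothetical stand-in, not the game's code.

```cpp
#include <cassert>
#include <cstdio>

// Stand-in for the stack slot that the loader's local variable occupied.
struct Scratch {
    float wheel_scale;
};

// Parses "<name> [scale]". When the scale is missing, sscanf never writes to
// scratch.wheel_scale, so it keeps whatever value was there before.
float load_vehicle(const char* line, Scratch& scratch) {
    char name[32];
    // Return value deliberately ignored to mirror the original mistake.
    (void)std::sscanf(line, "%31s %f", name, &scratch.wheel_scale);
    return scratch.wheel_scale;
}

// Stand-in for an unrelated call that now happens to write over the same memory.
void clobber(Scratch& scratch) {
    scratch.wheel_scale = 1.0e30f;
}
```

Loading `"car 0.7"` and then `"skimmer"` yields 0.7 both times, which looks fine. Insert a `clobber` call between the two loads and the second one returns the gigantic value instead.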

userbinator|10 months ago

On Windows 11 24H2, more stack space was modified by a new implementation of Critical Sections.

IMHO this shows the downfall of Microsoft. Why did they do that? Critical sections have been there for many decades and should be basically bug-free by now. My best guess is someone thought they'd "improve" things and rewrote it, then made some microbenchmark that maybe showed the dubious improvement.

The other comment here mentions Raymond Chen, who wrote this article about why backwards-compatibility is very important (and arguably what got Microsoft into the position it's in today):

https://devblogs.microsoft.com/oldnewthing/20031224-00/?p=41...

and also this memorable case: https://news.ycombinator.com/item?id=2281932

voidspark|10 months ago

This is an existing bug in GTA, not Windows 11.

kittoes|10 months ago

Really? Someone depending on UB in their software represents the downfall of Microsoft?! What a hot take...

smcameron|10 months ago

Surprised to see the return value of sscanf being ignored, that seems like a pretty rookie mistake, and this bug would never have made it out of the original programmer's system if they had bothered to check it.

canucker2016|10 months ago

Yes, it would have made it out of the original programmer's system for that initial commit.

FTA:

    I have a likely explanation for why Rockstar made this specific mistake in the data to begin with – in Vice City, Skimmer was defined as a boat, and therefore did not have those values defined by design! When in San Andreas they changed Skimmer’s vehicle type to a plane, someone forgot to add those now-required extra parameters. Since this game seldom verifies the completeness of its data, this mistake simply slipped under the radar.

So the original code (or at least a working code + data version) in GTA Vice City had no visible problems, at least with the Skimmer object, since the vehicles.ide file had the correct number of values for the Skimmer boat object.

Someone changed the Skimmer object from a boat to a plane for GTA San Andreas, BUT they DID NOT update the object to have the REQUIRED wheel values for a plane object.

Now the GTA code is expecting more values than it gets.

The vehicles.ide wasn't validated for correctness after the Skimmer object change to plane. Maybe there are more gotchas in that file...
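
A sketch of what such a validation pass could look like, under heavy assumptions: the real vehicles.ide format has many more fields, and the rule below (planes need two trailing wheel-scale fields, everything else needs none) plus the names `required_fields` and `line_is_complete` are invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical rule: how many whitespace-separated fields a line of the given
// vehicle type must have.
std::size_t required_fields(const std::string& type) {
    if (type == "plane") {
        return 4;  // name, type, front wheel scale, rear wheel scale
    }
    return 2;      // name, type
}

// Returns true when the line has at least as many fields as its declared
// type requires. Lines without a name and type are rejected outright.
bool line_is_complete(const std::string& line) {
    std::istringstream in(line);
    std::string name;
    std::string type;
    if (!(in >> name >> type)) {
        return false;
    }
    std::size_t count = 2;
    std::string token;
    while (in >> token) {
        ++count;
    }
    return count >= required_fields(type);
}
```

Under this toy rule, `"skimmer boat"` passes, while `"skimmer plane"` is flagged until the two extra values are added.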

At least users can fix the problem with a text editor instead of waiting and hoping that Rockstar would fix the problem and release an update.

cadamsdotcom|10 months ago

It has always been too easy to read & write beyond the stack. This should fail, plain and simple.

Mitigations exist - ASLR, NX pages, stack-smashing protection etc. but nothing comprehensively stops reads of stale data beyond the stack.

Thought experiment for a moment. What if the hardware ensures the unused part of a stack region cannot be read or written.

There are many ways to skin this cat, here’s one based around tracking each stack’s start address A, size S, and current depth D

1. Add an instruction to inform the CPU there is a stack at address A of size S. Its depth D is initially 0.

2. Add a jump instruction which reserves N bytes on the stack at address A, growing depth D to (D+N). Maybe this can be its own “reserve” instruction so as not to need a new jump instruction.

3. Give existing return instructions stack awareness. If returning to an address inside a stack, un-reserve the bytes reserved by the most recent jump, making the new depth (D-N).

4. Fail reads or writes to the stack region beyond its current depth. In other words fail all reads and writes between A+S-D and A+S.

5. The arithmetic is reversed on architectures whose stacks grow downwards.
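
To make the bookkeeping concrete, here is one possible software model of those steps. It is only a sketch of one reading of the scheme: it assumes a stack that grows upward from A, so the live bytes are [A, A + D), and it has the caller pass the frame size to `release` rather than keeping a per-frame reservation stack. `TrackedStack` and its members are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Software model of one tracked stack: base address A, size S, depth D.
struct TrackedStack {
    std::size_t base;   // A
    std::size_t size;   // S
    std::size_t depth;  // D, starts at 0 (step 1)

    // Step 2: reserve n bytes for a new frame; refuse if it would exceed S.
    bool reserve(std::size_t n) {
        if (n > size - depth) {
            return false;
        }
        depth += n;
        return true;
    }

    // Step 3: give back the n bytes of the frame being returned from.
    void release(std::size_t n) {
        assert(n <= depth);
        depth -= n;
    }

    // Step 4: an access is allowed if it falls outside the stack region
    // entirely, or inside the currently live part of it.
    bool access_allowed(std::size_t addr) const {
        const bool in_region = addr >= base && addr - base < size;
        return !in_region || addr - base < depth;
    }
};
```

Note what the model does not catch: any address inside a live frame is allowed, including bytes of that frame the current function has not written yet.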

Downsides I can see:

It cements one calling convention. The CPU memory manager will need a lot of state per stack, of which there are many per process: address A, size S, current depth D, plus a reservation stack - i.e. sizes of each frame’s stack memory. That’s a lot of bookkeeping! It’s far from zero cost. The limits of how much bookkeeping the CPU can do impose limits on how deep a stack can go and how many stacks are supported - so when there are too many stacks or one goes too deep, either the CPU needs to signal failure or engage a fallback mode and revert to behaving as CPUs do today. And of course fallback puts things back to the start. It’d therefore only mitigate situations in which an attacker cannot control the depth of the stack / a bug always happens inside the max depth the CPU can bookkeep for.

That said, stacks are ubiquitous! Hardware stack awareness opens up all kinds of new mitigations.

Why isn’t this a common idea? Has it been tried?

LegionMammal978|10 months ago

This bug wasn't caused by a read beyond the current bounds of the stack, but a stale value from a prior call to the same function at the exact same location on the stack. Buffer-overflow protections like you describe wouldn't help here.

mjevans|10 months ago

Any solution I can think of uses a lot of resources. Those sorts of methods are useful in some contexts, such as highly secure operations, but seem very excessive for the sort of abuse and leak encountered in this example.

anal_reactor|10 months ago

I love how many bugs go from "why doesn't this work?!" to "how on Earth did this work previously?!"

rkunde|10 months ago

I wonder if they fixed the vehicle definition file as well, or just the parser. The latter would be an incomplete fix.

ddm999|10 months ago

SilentPatch (for GTAs, at least) specifically is a code-only mod, such that the single .asi file can be removed to uninstall it & all its changes.

A real update should fix both (note: I don't believe the later releases did, they also just added defaults to the parser) but for SilentPatch: a mod is not a real update, and being as simple as possible to remove & reducing conflicts with other mods is more important here than a fix that digs as deep as possible.

tonmoy|10 months ago

Given that those parameters are for wheels on a plane that doesn’t have wheels, I would say fixing the parser is the better fix

smallstepforman|10 months ago

The core problem is some compilers initialising memory to zero in Debug mode, masking behaviour of uninitialised data, since in most cases zero is a legit value. In Release mode, this zeroing doesn’t happen.

Devs need to be aware that the following C++ initialiser exists which zeros data structures for you:

MyStruct s = { };
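
A slightly fuller sketch of that initialiser in use. The members of `MyStruct` are made up for illustration, and the zeroing shown applies to a plain aggregate like this one (no user-provided constructors, no default member initialisers).

```cpp
#include <cassert>

// Hypothetical aggregate; the members are illustrative only.
struct MyStruct {
    int id;
    float scale;
    char tag[4];
};

// Returns an instance initialised with empty braces: every member of this
// aggregate is zero-initialised, including each element of the array.
MyStruct make_zeroed() {
    MyStruct s = {};
    return s;
}
```

By contrast, a plain local `MyStruct s;` leaves all three members with indeterminate values.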

xxpor|10 months ago

For some reason this domain is blocked by work dns filtering?

kstrauser|10 months ago

Let your IT department know that their denylist is broken.

gitroom|10 months ago

Pretty wild how bugs can stick around that long - I'd never think something from 20 years ago would pop up just because Windows changed.

vortico|10 months ago

>Scientists claim to have discovered a ‘new color’ no one has seen before.

LOL!

olvy0|10 months ago

Just like $dayjob.

DuckOnFire|10 months ago

How does a bug from 20 years ago even still work today?

db48x|10 months ago

[flagged]

trinix912|10 months ago

Putting the (very valid) reasons for not having human-readable game saves aside, are you sure it's worse than using a 3rd party library that's built to accept semi-valid input values, possibly evaluates user input in some way, and has difficult-to-debug bugs that occur only under certain inputs? I agree that writing a stable and safe parser for a binary data file isn't easy, but there are fewer things that can go wrong when you can hardcode it to reject any remotely suspicious input. Third party XML/JSON libraries OTOH try to interpret as much as possible, even when the values are bogus. Also no need to deal with different text encoding bugs, line endings...

jhatemyjob|10 months ago

After finishing the article I immediately did ctrl+f "rust" and was disappointed to not see any of the results I wanted, but actually this comment is more hilarious than anyone saying "why didnt rockstar use rust in 2004!!!1111!!???" it's a bit more of a sophisticated joke since there's an IYKYK factor but it is no less hilarious. Bravo sir, bravo.

dogleash|10 months ago

Telling a 3d engine programmer not to have opinions on data formats? Good luck with that.

mschuster91|10 months ago

To u/db48x whose post got flagged and doesn't reappear despite me vouching for it as I think they have a point (at least for modern games): GTA San Andreas was released in 2004. Back then, YAML was in its infancy (2001) and JSON was only standardized informally in 2006, and XML wasn't something widely used outside of the Java world.

On top of that, the hardware requirements (256MB of system RAM, and the PlayStation 2 only had 32MB) made it enough of a challenge to get the game running at all. Throwing in a heavyweight parsing library for any of these three languages was out of the question.