top | item 42439275


tomysshadow | 1 year ago

To be completely honest, it's surprising to me as well. I expected it to be bad, but not as bad as it was. I entirely expected the slow part to be decoding, not copying. In fact, my initial plan was to convert the remaining images that couldn't be DDS to Targa, on the assumption that Targa would decode faster. However, when I investigated the slow functions and found they were only copying, I changed tactics, because in theory the format would then make no difference.

There is no fixed amount of per-frame work. After the 550ms hardcoded timer is up, the game blocks while loading those images, and during this phase all animations on screen are completely still. I thought to check for this, because it did occur to me that if the game tried to render a frame in between loading each image to keep the app responsive, that would push the load time out significantly, and that would be a pretty normal thing to want to do! But I found no evidence of this happening. Furthermore, I never changed anything but the actual image loading code - if the game pushed out a frame after every image load, or every x image loads, those frames wouldn't go away just by making the images load faster, so it could never have become as instant as it did without further changes.

The only explanation I can really fathom is the one I provided. The L_GetBitmapRow function has a bunch of branches at the start, it's a DLL export so the actual copy loop happens in a different DLL, and it gets called row by row for 500+ images per node... I can only guess it must come down to poor CPU cache behavior; it's the only thing that makes sense given the data I got. It probably doesn't help that the images are loaded in single-threaded fashion, either.

That said, there have been plenty of criticisms of my profiling methodology here in these comments, so it would be nice to perhaps have someone more experienced in low level optimizations back me up. At the end of the day, I'm pretty sure I'm close enough to right, at least close enough to have created a satisfactory solution :)


mananaysiempre | 1 year ago

I absolutely did not mean to imply that you did a bad job at any point, or to discourage you. The mere fact that you reached that far into the game’s internals, achieved the speedup you were aiming for, and left it completely functional is extremely impressive to me.

And that’s part of why I’m confused. If you’d screwed up the profiling in some obvious way, I’d have chalked it up to bad profiling and been perfectly unconfused. But your methods are good as far as I can see, and with the detail you’ve gone into I feel I see sufficiently far. Also, well, whatever you did, it evidently did help. So the question of what the hell is happening is all the more poignant.

(I agree with the other commenter that you may have dismissed WaitForSingleObject too quickly—can your tools give you flame graphs?.. In general, though, if machine code produced by an optimizing compiler takes a minute on a modern machine—i.e. hundreds of billions of issued instructions—to process data not measured in gigabytes, then something has gone so wrong that even the most screwed-up of profiling methodologies shouldn’t miss the culprit that much. A minute of work is bound to be a very, very target-rich environment, enough so that I’d expect even ol’ GDB & Ctrl-C to be helpful. Thus my discounting the possibility that your profiling is wrong.)