(no title)
tomysshadow | 1 year ago
There is no fixed amount of per-frame work. After the 550ms hardcoded timer is up, it blocks while loading those images, and during this phase all animations on screen are completely still. I thought to check for this, because it did occur to me that if it tried to render a frame in between loading each image to keep the app responsive, that would push the load time significantly longer, and that would be a pretty normal thing to want to do! But I found no evidence of this happening. Furthermore, I never changed anything but the actual image loading code - if it tried to push out a frame after every image load, or after every x image loads, those x frames wouldn't go away just by making the images load faster, so it could never have become as instant as it did without even more changes.
The only explanation I can really fathom is the one I provided. The L_GetBitmapRow function has a bunch of branches at the start of it, and since it's a DLL export, the actual loop happens in a different DLL - and that happens row by row for 500+ images per node. I can only guess it must be down to poor CPU cache utilization; it's the only thing that makes sense given the data I got. It probably doesn't help that the images are loaded in single-threaded fashion, either.
That said, there have been plenty of criticisms of my profiling methodology in these comments, so it would be nice to have someone more experienced in low-level optimization back me up. At the end of the day, I'm pretty sure I'm close enough to right - at least close enough to have created a satisfactory solution :)
mananaysiempre | 1 year ago
And that’s part of why I’m confused. If you’d screwed up the profiling in some obvious way, I’d have chalked it up to bad profiling and been perfectly unconfused. But your methods are good as far as I can see, and with the detail you’ve gone into I feel I see sufficiently far. Also, well, whatever you did, it evidently did help. So the question of what the hell is happening is all the more poignant.
(I agree with the other commenter that you may have dismissed WaitForSingleObject too quickly—can your tools give you flame graphs?.. In general, though, if machine code produced by an optimizing compiler takes a minute on a modern machine—i.e. hundreds of billions of issued instructions—to process data not measured in gigabytes, then something has gone so wrong that even the most screwed-up of profiling methodologies shouldn’t miss the culprit that much. A minute of work is bound to be a very, very target-rich environment, enough so that I’d expect even ol’ GDB & Ctrl-C to be helpful. Thus my discounting the possibility that your profiling is wrong.)