kenjin4096|11 months ago
First, I want to say thank you to Nelson for spending almost a month getting to the root of this issue.
Secondly, I want to say I'm extremely embarrassed and sorry that I made such a huge oversight. I, and probably the rest of the CPython team, did not expect the compiler we were using for the baseline to have that bug.
I posted an apology blog post here: https://fidget-spinner.github.io/posts/apology-tail-call.htm...
jraph|11 months ago
But it's nothing like this. You announced a 10-15% perf improvement, but that improvement is more like 1-5% on a non-buggy compiler. It's not even that the 10-15% figure is wrong; it's just that it's correct only under very specific conditions, unknown to you.
IIUC, you did your homework: you made an improvement, you measured a 10-15% perf improvement, the PR was reviewed by other people, etc. It just so happens that this 10-15% figure is misleading because of an issue with the version of Clang you happened to use to measure. Unless I'm missing something, it looks like a fair mistake anyone could have reasonably made. It even looks like it was hard not to fall into this trap. You could have been more suspicious seeing such a high number, but hindsight is 20/20.
Apparently, you still brought significant performance improvements, and your work also helped uncover a compiler regression. The wrong number seems quite minor in comparison. I wonder who was actually hurt by this. I'm only discovering the "case" right now, but at first glance it doesn't feel like you owe anyone an apology. Kudos for all this!
rtollert|11 months ago
Hah! Is this a Gettier problem [0]?
1. True: The PR improves Python performance 10-15%.
2. True: Ken believes that the PR improves Python performance 10-15%.
3. True: Ken is justified in believing that the PR improves Python performance 10-15%.
Of course, PR discussions don't generally revolve around whether or not the PR author "knows" that the PR does what they claim it does. Still: these sorts of epistemological brain teasers seem to come up in the performance measurement field distressingly often. I wholeheartedly agree that Ken deserves all the kudos he has received; still, I also wonder if some of the strategies used to resolve the Gettier problem might be useful for code reviewers to center themselves every once in a while. Murphy's Law and all that.
[0]: https://en.wikipedia.org/wiki/Gettier_problem
DannyBee|11 months ago
Beyond that: 3-5% is a lot for something as old as the Python interpreter, if it holds up. I would still be highly proud of that.
After 30 years, I've learned (as I expect you have) to be suspicious of any significant (i.e., >1%) performance improvement in a system that has existed a long time.
They happen for sure, but are less common. Often, people are just shifting time around, so it isn't part of your benchmark anymore [1]. Second, benchmarking is often done in controlled environments, to try to isolate the effect. That seems like the right thing to do, but then the software is run in non-isolated environments (i.e., with a million other things on a VM or desktop computer), which isn't what you benchmarked it in. I've watched plenty of verifiably huge isolated gains disappear or go negative when put in production environments.
You have the particularly hard job of targeting lots of environments: you can't even say "okay, look, if it doesn't actually speed it up in production, it didn't really speed it up," because you have no single production target. That's a really hard world to live in and try to improve.
In the end, performance tuning and measurement is really hard. You have nothing to be sorry for, except learning that :)
Don't let this cause you to be afraid to get it wrong - you will get it wrong anyway. We all do. Just do what you are doing here: say "whoops, I think we screwed this up," try to figure out how to deal with it, and try to figure out how to avoid it in the future (if you can).
[1] This is common not just in performance but in almost anything, including human processes. For example (to make something up), the team working on the code review tool says "we've reduced code review time by 15% and thus sped up everyone's workflow!" Awesome! But actually, it turns out they made more work for some other part of the system, so the workflow didn't get any faster; they just moved the 15% into a part of the world they weren't measuring :)
theLiminator|11 months ago
Laughs in corporate code
haberman|11 months ago
> Theoretically, this control flow graph paired with a profile should give the compiler all of the information it needs to generate the most optimal code [for a traditional switch()-based interpreter]. In practice, when a function is this big and connected, we often find ourselves fighting the compiler. It spills an important variable when we want it to keep it in a register. It hoists stack frame manipulation that we want to shrink wrap around a fallback function invocation. It merges identical code paths that we wanted to keep separate for branch prediction reasons. The experience can end up feeling like trying to play the piano while wearing mittens.
That second-to-last sentence is exactly what has happened here. The "buggy" compiler merged identical code paths, leading to worse performance.
The "fixed" compiler no longer does this, but the fix is basically just tweaking a heuristic inside the compiler. There's no actual guarantee that this compiler (or another compiler) will continue to have the heuristic tweaked in the way that benefits us the most.
The tail call interpreter, on the other hand, lets us express the desired machine code pattern in the interpreter itself. Between "musttail", "noinline", and "preserve_none" attributes, we can basically constrain the problem such that we are much less at the mercy of optimizer heuristics.
For this reason, I think the benefit of the tail call interpreter is more than just a 3-5% performance improvement. It's a reliable performance improvement that may be even greater than 3-5% on some compilers.
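To make the pattern concrete, here is a minimal sketch of tail-call dispatch for a toy stack machine. This is hypothetical illustration code, not CPython's actual generated interpreter: the VM struct, opcodes, and DISPATCH macro are all invented, musttail needs a recent Clang (or GCC 15+), and I've left out the preserve_none and noinline attributes CPython also applies:

    #include <stdint.h>
    #include <stdio.h>

    enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

    typedef struct {
        const uint8_t *ip;  /* next instruction to execute */
        long stack[64];
        int sp;
    } VM;

    static void op_push1(VM *vm);
    static void op_add(VM *vm);
    static void op_print(VM *vm);
    static void op_halt(VM *vm);

    /* One handler per opcode, indexed by opcode value. */
    static void (*const table[])(VM *) = { op_push1, op_add, op_print, op_halt };

    /* Every handler ends in a guaranteed tail call to the next handler, so
     * each opcode keeps its own function body and its own indirect branch
     * instead of sharing one dispatch point inside a giant switch. */
    #define DISPATCH(vm) __attribute__((musttail)) return table[*(vm)->ip++](vm)

    static void op_push1(VM *vm) { vm->stack[vm->sp++] = 1; DISPATCH(vm); }
    static void op_add(VM *vm)   { vm->sp--; vm->stack[vm->sp - 1] += vm->stack[vm->sp]; DISPATCH(vm); }
    static void op_print(VM *vm) { printf("%ld\n", vm->stack[vm->sp - 1]); DISPATCH(vm); }
    static void op_halt(VM *vm)  { (void)vm; }  /* returning unwinds the chain */

    int main(void) {
        const uint8_t code[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };
        VM vm = { .ip = code, .sp = 0 };
        table[*vm.ip++](&vm);  /* runs the program; prints 2 */
        return 0;
    }

Because musttail makes the compiler either emit a true tail call or fail loudly, the shape of the generated code is pinned down by the source rather than left to an optimizer heuristic.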
kzrdude|11 months ago
Why do I even bring this up? I want to say: let's not let what we see in the news influence our perception of the real people of the world. Just because fraud and crimes get elevated in the news does not mean that the common man is a criminal or a fraud. :)
ptx|11 months ago
[0] https://github.com/faster-cpython/benchmarking-public
cb321|11 months ago
As alluded to in https://news.ycombinator.com/item?id=43319010, I see these tests were collected against just 2 Intel and 2 ARM CPUs. So, if you are looking for feedback to improve, you should probably also include (at least) an AMD Zen4 or Zen5 in there. CPU & compiler people have both been trying to "help perf while not trusting the other camp" for as long as I can remember, and I doubt that problem will ever go away.
A couple more CPUs will help, but won't solve the generalizability of results. E.g., if somebody tests against some ancient 2008 Nehalem hardware, they might get very different answers. Similarly, answers today might not reflect 2035 very well.
The reality of our world of complex hardware deployment (getting worse with GPUs) is that "portable performance" is almost a contradiction in terms. We all just do the best we can at some semblance of the combination. The result is some kind of weighted average: source that isn't too heavily #ifdef'd, made "a bit" faster when informally averaged over our user population and their informally averaged workloads. And this applies at many levels of the whole computing stack.
EDIT: And, of course, a compiled impl like Cython or Nim is another way to go if you care about performance, but I do understand the pull & network effects of the Python ecosystem. So, that may not always be practical.
kenjin4096|11 months ago
Clang 19 was released last year; we only benched it a few months ago. We did notice there was a significant slowdown on macOS, but that was against Xcode Clang, which is a different compiler. I thought it might've been an Xcode thing, which has bitten CPython before (such as Xcode LTO working/not working versus normal Clang), so I didn't investigate deeper (facepalming now in retrospect) and chalked it up to a compiler difference.
TLDR: We didn't run benchmarks of Clang 19 versus 18. We only ran benchmarks of Clang 19 versus GCC, Xcode Clang, and MSVC, none of which are apples-to-apples with Clang 19, so I naively thought it was a compiler difference.
EDIT: As to how we could improve this process, I'm not too sure, but I know I'll now be more discerning when there's a >4% perf hit after upgrading compilers.
delijati|11 months ago
[1] https://github.com/python/cpython/pull/128718
kenjin4096|11 months ago
The blog post mentions it brings a 1-5% perf improvement, which is still significant for CPython. It does not complicate the source, because we use a DSL to generate CPython's interpreters. So the only complexity is in autogenerated code, which is usually meant for machine consumption anyway.
The other benefit (for us maintainers, I guess) is that it compiles way faster and is more debuggable (perf and other tools work better) when each bytecode is a smaller function. So I'm inclined to keep it for perf and productivity reasons.
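To illustrate what "each bytecode is a smaller function" buys us, here is a toy sketch of the traditional monolithic form (again hypothetical code, not our actual generated interpreter): with one big switch, a profiler attributes every sample to the single eval function and the compiler must optimize the whole body at once, whereas per-bytecode handlers each get their own symbol and are compiled as small units.

    #include <stdint.h>
    #include <stdio.h>

    enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

    /* Monolithic form: hundreds of real opcodes would all live inside this
     * one function, so perf reports every sample as eval_switch and the
     * compiler allocates registers across all of the cases at once. */
    static void eval_switch(const uint8_t *ip) {
        long stack[64];
        int sp = 0;
        for (;;) {
            switch (*ip++) {
            case OP_PUSH1: stack[sp++] = 1; break;
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
            case OP_PRINT: printf("%ld\n", stack[sp - 1]); break;
            case OP_HALT:  return;
            }
        }
    }

    int main(void) {
        const uint8_t code[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };
        eval_switch(code);  /* prints 2; all time shows up under one symbol */
        return 0;
    }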
nickm12|11 months ago
Some of the sibling comments are saying there's "no need to apologize". One way this might be read is "no need to hold yourself to a higher standard". If this is the sentiment, then I disagree—we live in an age where accountability and intellectual integrity are undervalued. Being able to say that you and your team should have investigated a too-good-to-be-true result further before reporting it is setting a higher standard for yourselves.
But another way to read this is "no need to feel bad" and on this I agree. We all often fail to sufficiently challenge our own work, especially when it looks like something we worked hard on had a big impact.
throwaway2037|11 months ago
In what age were accountability and intellectual integrity "correctly" valued?
ok_dad|11 months ago
You have a new account here and your blog has just one article so far, so you might be a new-ish developer(?), but you are doing great. Keep it up! If you are not a new developer, you are still doing great. Keep it up!
SmellTheGlove|11 months ago
Smart people go and build things. Other smart people find problems. Nothing's broken with that.
immibis|11 months ago
It's possible to improve your luck by applying more care, but it's also possible to apply so much care that you do not try things that would have been useful, so I'd rather you keep erring on the side that you did!
ggu|11 months ago
Ken Jin is a volunteer who's been tirelessly working to make CPython faster over the past couple of years, for too little credit. IMHO, Ken did nothing wrong here, and there's really nothing to be embarrassed about. The fact that it took a month (more, if you count the folks helping to reproduce the original results) for anyone to notice anything wrong shows how complicated the situation was! Put yourself in Ken's shoes: having updated to LLVM 19 to enable the preserve_none attribute, would you then hypothesize that LLVM 19 may have introduced an unrelated performance regression that no one noticed for five months? Lots of people underestimate how hard benchmarking is, IMO.
A 1-5% performance improvement is also pretty valuable, just not quite as spectacular as we thought originally :-)