I consider myself rather smart and good at what I do. It's nice to have a look at problems like these once in a while, to remind myself of how little I know, and how much closer I am to the average than to the top.
Well, it is a specialized problem. If you've never worked on anything similar before, it's going to take time. You don't even need to interview at selective billion-dollar companies like Anthropic to encounter these types of problems: after college I interviewed at various electronics/hardware companies where you'd be asked to optimize low-level code, which would have looked quite foreign if you'd never actually worked on such problems before.
It comes with test suites, so that gives you a base to start from. You can at the very least do trial-and-error and come up with some heuristics on the fly. You're at a huge disadvantage to someone who has some familiarity but can convincingly play it off as being a newcomer, though.
Yours is a good mentality to have because it creates the emotional drive to learn more, so don't lose that. That being said, this isn't really that complicated. It's just a matter of taking enough time to look at the code and understand how it's structured. I feel like that ability is pretty much what differentiates developer skill: building up a model of the program in your head as you read.
Disagree. Nobody has a monopoly on what metric makes someone good. I don't understand all this leetcode optimization. Actually, I do understand it, but it's a game that will attract game optimizers.
I suspect this was released by Anthropic as a DDoS attack on other AI companies. I prompted 'how do we solve this challenge?' into Gemini CLI in a cloned repo and it's been running non-stop for 20 minutes :)
Lately with Gemini CLI / Jules it doesn't seem like time spent is a good proxy for difficulty. It has a big problem with getting into loops of "I am preparing the response for the user. I am done. I will output the answer. I am confident. Etc etc".
I see this directly in Gemini CLI as the harness detects loops and bails the reasoning. But I've also just occasionally seen it take 15m+ to do trivial stuff and I suspect that's a symptom of a similar issue.
Clearly none beat Anthropic's target, but gpt-5-2 did slightly better in much less time than "Claude Opus 4 after many hours in the test-time compute harness".
That Claude Opus 4.5 result of 4,973 is what you get if you just vectorize the reference kernel. In fact you should be under 4,900 doing that with very little effort (I tried doing this by hand yesterday).
The performance killer is the "random"-access reads of the tree node data, which the scalar implementation hides, together with the lack of load bandwidth; to tackle that you'd have to rewrite the kernel to optimize the tree data loading and processing.
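To illustrate the idea (a toy sketch only; the flat heap-style layout where node `i`'s children sit at `2i+1` and `2i+2` is my assumption, not the take-home's actual data format): if many independent lookups walk the tree in lockstep, each level's node reads can be issued as one wide batch instead of scattered per-item loads.

```python
def batched_descend(tree, keys, depth):
    """Walk many tree lookups in lockstep, one level per iteration,
    so each level's node loads form a single wide 'gather'."""
    idx = [0] * len(keys)  # every query starts at the root
    for _ in range(depth):
        vals = [tree[i] for i in idx]  # one batched load per level
        # branchless child select: step right when key >= node value
        idx = [2 * i + 1 + (k >= v) for i, k, v in zip(idx, keys, vals)]
    return idx
```

On a real vector ISA the inner comprehensions would become a gather plus a vectorized compare/add, which is where the load bandwidth pressure shows up.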
Very interesting, thanks! I wonder what would happen if you kept running Gemini in a loop for a while. Considering how much faster it finished, it seems like there is a lot more potential.
Can you share the agent-comparison harness code or point to something similar? I want to learn about benchmarking models in a basic or practical sense.
> If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code (and ideally a resume) so we can be appropriately impressed and perhaps discuss interviewing.
This is an interesting way to recruit. Much better than standard 2 leetcode medium/hard questions in 45 mins.
It would take something like one week full time to work on this. It's not something you can do if you have a full-time job and apply to several other companies. I find it unreasonable to ask a candidate to spend that much time for an uncertain result.
It's true that being ready for leetcode takes practice, but at least it's standard, so you can reuse the skills in other interviews. Optimizing some generated code is certainly fun, but it's as useless as leetcode for your average programmer.
This is a really fun problem! I suggest anyone who likes optimization in a very broad sense try their hand at it. Might be the most fun I've had while interviewing. I had to spend a week's worth of evenings on it to fully scratch the itch, and I managed to get 1112 cycles. But that was mostly manual, before the current crop of agentic models (clopus 4.5, gpt5.2). I wonder how far you can RalphWiggum it!
I was in the demoscene long ago and that kind of optimisation is definitely in the ballpark of what we did: optimize algorithm down to machine code level (and additionally, cheat like hell to make you believe we ran the algorithm for real :-)).
But to be honest, I wonder what algorithm they implement. I read the code for 2 minutes, and it sounds like random forest prediction. Does anyone know what the code does?
Having recently learned more about SIMD, PTX and optimization techniques, this is a nice little challenge to learn even more.
As a take-home assignment, though, I would have failed, as I would probably have taken 2 hours just to sketch out ideas on my tablet while reading the code, before even changing it.
Unless I misread, 2 hours isn't the time limit for the candidate but the time Claude eventually needed to outperform the best submitted solution. The best candidate could've taken anywhere from 6 hours to 2 days to achieve their result.
I'm at 1137 after one hour with Opus now...
Pipelined vectorized hash, speculation, static code for each stage, epilogues and prologues for each stage-to-stage...
I think I'm going to get sub-900, since I just realized I can compute in parallel whether stage 5 of the hash is odd just by looking at bits 16 and 0 of stage 4, with less delay...
"Optimize the kernel (in KernelBuilder.build_kernel) as much as possible in the available time, as measured by test_kernel_cycles on a frozen separate copy of the simulator." (from perf_takehome.py)
I just withdrew my application over this test. It forces an engineering anti-pattern: requiring runtime calculation for static data (effectively banning O(1) pre-computation).
When I pointed out this contradiction via email, they ignored me completely and instead silently patched the README to retroactively enforce the rule.
It’s not just a bad test; it’s a massive red flag for their engineering culture. They wasted candidates' time on a "guess the hidden artificial constraint" game rather than evaluating real optimization skills.
This isn't the gotcha moment you think it is. Storing the result on disk is some stupid "erm achkually" type solution that goes against the spirit of the optimization problem.
They want to see how you handle low level optimizations, not get tripped over some question semantics.
This is a kind of task that's best solved by possibly spending more than the allocated 2 hours on it, once any obvious low-hanging fruit is picked. An optimization task is what a machine does best. So the real problem would be to construct a machine that would be able to run the optimization. A right optimization framework that results from the effort could also efficiently solve many more similar problems in the future.
I understand that this test is intended to somehow test raw brainpower, the ability to tackle an unfamiliar and complicated domain, and to work under stress. But I hope it's not representative of the actual working conditions at Anthropic. It's like asking a candidate to play a Quake deathmatch when hiring for a special forces assault squad.
> If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code (and ideally a resume) so we can be appropriately impressed and perhaps discuss interviewing.
That doesn’t seem snarky to me. They said if you beat Opus, not their best solution. Removing “perhaps” (i.e. MAYBE) would be worse since that assumes everyone wants to interview at Anthropic. I guess they could have been friendlier: “if you beat X, we’d love to chat!”
I feel that came out wrong but the "maybe" was intended to be a way of saying "no guarantees", to avoid giving people the idea "solve this, get hired".
The writing has been on the wall for about half a year (publicly) now. OpenAI's 2nd place at the AtCoder world championship was the first sign, and I remember it being dismissed at the time. Sakana also got 1st place in another AtCoder competition a few weeks ago. Google also published a blog post a few months back about Gemini 2.5 netting them a 1% reduction in training time on real-world tasks by optimising kernels.
If the models get a good feedback loop + easy (cheap) verification, they get to bang their tokens against the wall until they find a better solution.
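That loop is easy to sketch. Here's a minimal, generic version (illustrative only; the names and the toy scoring problem are mine, not Anthropic's harness), where `score` plays the role of the cheap verifier, e.g. a simulated cycle count:

```python
import random

def optimize(initial, mutate, score, iters=1000, seed=0):
    """Mutate, measure against the verifier, keep the best. Lower score wins."""
    rng = random.Random(seed)
    best, best_score = initial, score(initial)
    for _ in range(iters):
        trial = mutate(best, rng)
        trial_score = score(trial)
        if trial_score < best_score:
            best, best_score = trial, trial_score
    return best, best_score

# Toy stand-in problem: a "schedule" is a permutation; the "verifier"
# counts adjacent out-of-order pairs, so lower is better.
def mutate(sched, rng):
    s = list(sched)
    i, j = rng.randrange(len(s)), rng.randrange(len(s))
    s[i], s[j] = s[j], s[i]
    return s

def score(sched):
    return sum(a > b for a, b in zip(sched, sched[1:]))

best, cost = optimize(list(range(8))[::-1], mutate, score)
```

The interesting part is that nothing here needs to be smart; with a fast, trustworthy verifier, sheer volume of attempts does the work.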
I think this is the actual “bitter lesson”—the scalable solution (letting LLMs bang against the problem nonstop) will eventually far outperform human effort. There will come a point—whether sooner or later—where this’ll be the expected norm for handling such problems. I think the only question is whether there is any distinction between problems like this (clearly defined with a verifiable outcome) vs the space of all interesting computer programs. (At the moment I think there’s space between them. TBD.)
I wonder if the AI is doing anything novel? Or if it's more like a brute-force search applying all the existing optimization techniques that have already been written about.
I liked the core challenge of finding the balance between ALU and VALU, but I think the load bandwidth constraint leads to problems, like rewarding solutions that assume the start indices will always be zero. I am close to 100% sure that's required to get below 2096 total loads, but it's just not fun.
If it instead had some kind of dynamic vector lane rotate, that could have been way more interesting.
I got to 1364 cycles for now, semi-manually: design space exploration organized via a backlog.md project, then recombination from that. 20 agents in parallel.
I asked it to generate a drawio diagram of the winner so I could grok it more easily, then gave feedback.
I'm getting flashbacks from my computer engineering curriculum. Probably the first place I'd start is replacing comparison operators on the ALU with binary arithmetic, since it's much faster than branch logic. Next would probably be changing the `step` function from brute-force iteration over the instructions to something closer to a B-tree. Then maybe a sparse set for the memory management, if we're going to do a lot of iterations over the flat memory like this.
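For anyone unfamiliar with the branchless trick mentioned here, a generic sketch (in plain Python, not the simulator's ISA, and assuming the usual two's-complement mask idiom): a 0/1 condition is turned into an all-zeros/all-ones mask, so selecting a value needs only bitwise ops and no branch.

```python
def select(cond, a, b):
    """Return a if cond == 1, else b, using only bitwise ops (no branch)."""
    mask = -cond           # cond=1 -> ...1111 (all ones), cond=0 -> ...0000
    return (a & mask) | (b & ~mask)

def branchless_min(a, b):
    # the comparison yields 0 or 1; arithmetic turns it into a select mask
    return select(int(a < b), a, b)
```

On hardware without cheap branch prediction, this kind of select keeps the pipeline full where an `if` would stall it.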
> This repo contains a version of Anthropic's original performance take-home, before Claude Opus 4.5 started doing better than humans given only 2 hours.
Was the screening format here that this problem was sent out, and candidates had to reply with a solution within 2 hours?
Or, are they just saying that the latest frontier coding models do better in 2 hours than human candidates have done in the past in multiple days?
> Claude Opus 4.5 in a casual Claude Code session, approximately matching the best human performance in 2 hours
Is this saying that Claude matched the best human performance, where the human had two hours? I think that is the correct reading, but I'm not certain they don't mean that Claude had two hours and matched the best human performance where the human had an arbitrary amount of time. The former is impressive, but the latter would be even more so.
I cleared this assignment but did not clear the follow-up interview, which was way easier than this. So I gave up on tech interviews in general and stayed where I was.
“If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code (and ideally a resume) so we can be appropriately impressed and perhaps discuss interviewing.”
The company that wanted to simply get away with the thievery of terabytes of intellectual property, what a great place to work at! Not. Anthropic has no shame.
Are you allowed to change the instruction sequence? I see some optimization opportunities there; it'd obviously be the correct thing for an optimizing compiler to do, but considering the time allotted, I'd guess you could hand-optimize it. That feels like cheating, though.
>so we can be appropriately impressed and perhaps discuss interviewing.
Something comes across really badly here for me. Some weird mix of bragging, mocking, with a hint of aloof.
I feel these top end companies like the smell of their own farts and would be an insufferable place to work. This does nothing but reinforce it for some reason.
I have to agree. It's off-putting to me too. I'm impressed by the performance of their models on this take-home but I'm not impressed at their (perhaps unintentional) derision of human programmers.
Thanks for noticing this. I got the same feeling when reading this. It may not sound like much, and it doesn't mean it's an insufferable place to work, but it's a hint it might be.
Rant: On a similar note, I recently saw a post on Linkedin from Mistral, where they were bragging to recruit candidates from very specific schools. That sounded very pretentious (and also an HR mistake on several levels IMHO).
If anyone is interested in trying their agent-fu, here's a more-real-world rabbit hole I went optimizing in 2024. Note this is now a dead project, no one's using it, and probably the same goes for the original. I managed to get it 2x-4x faster than the original; it took me several days back then. BTW, there are some 10x optimizations possible, but they break a few edge cases, so they're not entirely correct.
I am able to beat this 1487 benchmark by switching between LLMs; doesn't seem that hard lol. Albeit, I do not fully understand what the solution is, lol
When this was being used it was probably given to candidates who had already started the interview loop and been screened.
The current e-mail invitation in the README is just another avenue for exceptional people to apply. If someone is already highly qualified from their background and resume they can go through the front door (direct application). For those who have incredible talent but not necessarily the background or resume to unlock the front door yet, this is a fun way to demonstrate it.
Did a bit of soul searching and manually optimised to 1087 but I give up. What is the number we are chasing here? IMO I would not join a company giving such a vague problem because you can feel really bad afterwards, especially if this does not open a door to the next stage of the interview. As an alternative we could all instead focus on a real kernel and improve it :)
Author of the take-home here: That's quite a good cycle count, substantially better than Claude's, you should email it to performance-recruiting@anthropic.com.
I generally have a policy of "over 4 hours and I charge for my time." I did this in the 4-hour window, and it was a lot of fun. Much better than many other take-home assignments.
I’ve been sent the Anthropic interview assignments a few times. I’m not a developer so I don’t bother. At least at the time they didn’t seem to have technical but not-dev screenings. Maybe they do now.
Seems like they’re trying to hire nerds who know a lot about hardware or compiler optimizations. That will only get you so far. I guess hiring for creativity is a lot harder.
And before some smart aleck says you can be creative on these types of optimization problems: not in two hours, it’s far too risky vs regurgitating some standard set of tried and true algos.
You're both right and wrong. You're right in the sense that the sort of creativity the task is looking for isn't really possible in two hours. That's something that takes a lot of time and effort over years to be able to do. You're wrong because that's exactly the point. Being able to solve the problem takes experience. Literally. It's having tackled these sorts of problems over and over in the past until you can draw on that understanding and knowledge reasonably quickly. The test is meant to filter out people who can't do it.
I also think it's possible to interpret the README as saying humans can't do better than the optimizations that Claude does when Claude spends two hours of compute time, regardless of how long the human takes. It's not clear though. Maybe Claude didn't write the README.
Your comments history suggests you’re rather bitter about “nerds” who are likely a few standard deviations smarter than you (Anthropic OG team, Jeff Dean, proof nerds, Linus, …)
If they're hiring performance engineers then they're hiring for exactly these sets of skills.
It's a take-home test, which means some people will spend more than a couple of hours on it to get the answer really good. They would have gone after those people in particular.
This would be an inappropriate assignment for a web dev position, but I'm willing to bet that a 1% improvement in cycles per byte in inference (or whatever) saves Anthropic many millions of dollars. This is one case where the whiteboard assignment is clearly related to the actual job duties.
It doesn't matter really, what matters is our ability to stare into the void of what we don't know and start making progress.
Our ability to process and master new topics is part of the job.
I'm sure you've done that countless times.
It's not about you being average, just a different knowledge set.
But this is good. Staying humble makes you hungrier for learning.
Always room to learn in software :)
the hot take is, there are other games.
Each ran the same spec headlessly in their native harness (one shot).
[1] https://en.wikipedia.org/wiki/Demoscene [2] https://en.wikipedia.org/wiki/Code_golf
It even uses Chrome tracing tools for profiling, which is pretty cool: https://github.com/anthropics/original_performance_takehome/...
The README only gives numbers without any information on what you’re supposed to do or how you are rated.
This is a valid way to solve the problem.
Edit: 1121 cycles
Does this confirm they actually do knee cap models after the launch period to save money, without telling users?
https://github.com/svilendobrev/transit-python3
The machine is fake and simulated: https://github.com/anthropics/original_performance_takehome/...
But presumably similar principles apply.
This is the general framework for reasoning about correct memory addressing in the presence of arbitrary constraints like those of hardware.
Would you prefer C or C++?
> 2) AI companies are content with slop and do not even bother with clear problem statements.

It's a filter. If you don't get the problem, you'll waste their time.

> 3) LOC and appearance matter, not goals or correctness.

The task was goal+correctness.

> 4) Anthropic must be a horrible place to work at.

Depends on what you do. For this position it's probably one of the best companies to work at.
Good. That should be the minimum requirement.
Not another Next.js web app take home project.