samsartor | 9 months ago

I'm doing my PhD in ML shit. Before that I was a systems programming guy: lots of C++, a bit of CUDA, big fan of Rust. On the side I'm obsessed with RISC-V. I own a couple of boards. I made a stupid little CUDA-like compiler on top of the RISC-V vector extension, just for fun.

What I'm saying is, Tenstorrent couldn't find a more excitable third-party developer if they grew one in a lab. And you know what? I can't make heads or tails of all their various abstractions. I've tried! I've read the docs, I've read the examples, I've gone to meetups. I think OP is right that "one more abstraction bro" probably doesn't solve the problem.

At a guess, the problem isn't a technical one; it's an organizational one. They don't have anybody to stand in for me, or for devs like me (e.g. dumb people). There is no product leadership on the API design, just a lot of really brilliant engineers obsessively tuning for their own use cases, unwilling to ever trade a hit in performance or expressivity for readability or writability.

liaopeiyuan | 9 months ago

This is my sentiment too after trying to get a Blackhole to run a recent VLM (like Pixtral) over the weekend. Not just unit tests, but actual training loops. I write a lot of JAX in my day job training large models, and I used to do a bit of ML compiler development, which I guess also puts me in the dumb-people crowd. I'm equally impressed by how smooth the lower-level setup is and frustrated by how little progress I made on the seemingly last mile of "just rewrite the code a little bit more bro, I just need to get rid of this one HLO op because it's not supported."
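For concreteness, the debugging loop looks roughly like this (a toy stand-in for the model, not the actual VLM code): lower the jitted function without running it and grep the StableHLO it emits for whatever op the backend rejects.

    import jax
    import jax.numpy as jnp

    def attention_block(q, k, v):
        # Hypothetical stand-in for the real model; whatever op the
        # backend can't lower (a gather, a custom-call, ...) shows up
        # by name in the dumped IR.
        scores = jnp.einsum("qd,kd->qk", q, k) / jnp.sqrt(q.shape[-1])
        return jax.nn.softmax(scores) @ v

    q = k = v = jnp.ones((128, 64))

    # Lower without executing and dump the StableHLO the compiler sees.
    lowered = jax.jit(attention_block).lower(q, k, v)
    print(lowered.as_text())  # grep this for the unsupported op

The frustrating part is that finding the op is the easy half; rewriting the model so it never gets emitted is the part that eats the weekend.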

I don't think anyone is seriously training an NN on TT hardware at the moment, and I think that's an issue. I think tinygrad works not only because geohot is one hell of an engineer but also because comma dogfoods it. TT's engineers are absolutely brilliant (judging from their commits) but I think they are stretched too thin. Bounties are not going to work: you can't expect an outsider with no internal access/bandwidth/knowledge to suddenly make e.g. Mixtral work when the issue spans at least tt-xla and tt-mlir.

And to agree with the parent: training is the kind of artifact where good CX can only be derived from strong leadership and a leaner view of the stack. NVIDIA accumulated that over the decades, and the rest are trying to catch up by aggressive hiring (not that hiring is strictly necessary). E.g. Annapurna had a presence on the CMU campus when I was there, and has the Anthropic team to test things out.

I'm an incredibly excited third-party developer, as I think the pitch appeals a lot to grad students (the ones doing model research) who need to run small experiments in the sub-13B range and reasonably scale them up to draw the first half of the scaling curve.
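To spell out what I mean by "the first half of the scaling curve," here's a toy sketch (all numbers made up): fit a power law to the small runs in log-log space, then extrapolate to the sizes you can't afford.

    import jax.numpy as jnp

    # Made-up results from small runs: params in billions, eval loss.
    n_params = jnp.array([0.125, 0.35, 1.3, 2.7, 6.9, 13.0])
    loss = jnp.array([3.90, 3.55, 3.10, 2.92, 2.71, 2.58])

    # A power law L(N) = a * N^b (b < 0) is linear in log-log space,
    # so a degree-1 polyfit recovers the exponent and prefactor.
    b, log_a = jnp.polyfit(jnp.log(n_params), jnp.log(loss), 1)
    print(f"exponent = {float(b):.3f}, prefactor = {float(jnp.exp(log_a)):.2f}")

    # Extrapolate to the half of the curve you can't afford to run.
    print(float(jnp.exp(log_a) * 70.0 ** float(b)))  # predicted loss at 70B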

I lose too much productivity to abstractions and incomplete e2e support in TT's current shape. I'd love to give it another go in 6 months.

1024bees | 9 months ago

There is a natural tension between developing an API that is nice to use and having a full-fledged graph compiler. Most graph compilers, and the hardware that requires them, will be complex and difficult to approach. The "original sin" was PyTorch vs TensorFlow -- TensorFlow capturing the entire graph and then compiling it with XLA (or whatever it was before, I'm probably mixing up TF1 and TF2 here) was such an intractable mess to actually hack on (the runtime also had unapproachable complexity, from what I recall). This has probably changed, but PyTorch won out because it was nice both to use and to develop against.
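To make the eager-vs-capture split concrete, a minimal JAX sketch (JAX being the descendant of the TF/XLA lineage): the same function can run op-by-op or be traced into a single graph for the compiler.

    import jax
    import jax.numpy as jnp

    def layer(w, x):
        return jax.nn.relu(x @ w)

    w = jnp.ones((4, 4))
    x = jnp.ones((2, 4))

    # Eager (PyTorch-style): ops run as Python executes, so you can
    # print, branch, and debug in the middle of the model.
    y = layer(w, x)

    # Graph capture (TF/XLA-style): trace the whole function into one
    # IR first, then hand it to the compiler. Great for cross-op fusion;
    # miserable when the thing you need to hack on is the trace itself.
    print(jax.make_jaxpr(layer)(w, x))
    y_compiled = jax.jit(layer)(w, x)  # the captured graph, compiled by XLA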

There are clear reasons why a hardware company would use a graph compiler: they think such an approach yields higher performance and makes Tenstorrent look better on performance per dollar when compared to competitors (read: NVDA).

There is some legitimate criticism of TT here: their hardware is composed of simple blocks that combine into a complex system (5 separate CPUs being programmed per Tensix tile, many tiles per chip), and that complexity has to be wrangled in the software stack. Paying for that complexity in hardware, so there is less of a VLIW-style model in software, might remove a few abstractions.