
Show HN: I built a tensor library from scratch in C++/CUDA

119 points | nirw4nna | 9 months ago | github.com | reply

Hi HN,

Over the past few months, I've been building `dsc`, a tensor library from scratch in C++/CUDA. My main focus has been on getting the basics right, prioritizing a clean API, simplicity, and clear observability for running small LLMs locally.

The key features are:

- C++ core with CUDA support, written from scratch.
- A familiar, PyTorch-like Python API.
- Runs real models: it's complete enough to load a model like Qwen from HuggingFace and run inference on both CUDA and CPU with a single line change[1].
- Simple, built-in observability for both Python and C++.
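To illustrate the "single line change" idea, here is a minimal sketch of a PyTorch-like device-dispatch pattern, where the backend is selected by a `device` argument. All names below are illustrative assumptions, not dsc's actual API:

```python
# Sketch of device dispatch via a `device` argument (hypothetical names,
# not dsc's real API): ops would route to a CPU or CUDA kernel based on
# the tensor's device, so switching backends is a one-argument change.
class Tensor:
    def __init__(self, data, device="cpu"):
        self.data = data
        self.device = device  # "cpu" or "cuda"

    def __add__(self, other):
        # a real library would dispatch to a CPU or CUDA kernel here
        return Tensor([a + b for a, b in zip(self.data, other.data)],
                      device=self.device)

def ones(n, device="cpu"):
    return Tensor([1.0] * n, device=device)

x = ones(3, device="cpu")  # switching backends is just device="cuda"
y = x + x
print(y.data)  # [2.0, 2.0, 2.0]
```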

Next on the roadmap is adding BF16 support; after that, I'll be working on visualization for GPU workloads.

The project is still early and I would be incredibly grateful for any feedback, code reviews, or questions from the HN community!

GitHub Repo: https://github.com/nirw4nna/dsc

[1]: https://github.com/nirw4nna/dsc/blob/main/examples/models/qw...

28 comments

[+] aklein|9 months ago|reply
I noticed you interface with the native code via ctypes. I think cffi is generally preferred (e.g., https://cffi.readthedocs.io/en/stable/overview.html#api-mode...). Although you'd have even more flexibility if you built your own Python extension module (e.g. using pybind), which would free you from a simple/strict ABI. Curious if this strict separation of C & Python was a deliberate design choice.
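For context, the ctypes approach the parent mentions looks roughly like this. Here `strlen` from the C runtime stands in for a library symbol; dsc's actual symbol names are not shown and are not assumed:

```python
# ctypes sketch: declare argument/return types for a C function exported
# by a shared library, then call it from Python. Loading the process's
# own symbols (POSIX) avoids needing a separate .so for this demo.
import ctypes

libc = ctypes.CDLL(None)  # POSIX: search the running process's symbols
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"tensor"))  # 6
```

The same pattern (declare `argtypes`/`restype`, then call) scales to a C tensor API, at the cost of per-call marshalling overhead, which is part of what cffi's API mode and pybind-style extensions reduce.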
[+] nirw4nna|9 months ago|reply
Yes, when I designed the API I wanted to keep a clear distinction between Python and C. At one point I had two APIs: one in Python and the other in high-level C++, both sharing the same low-level C API. I find this design quite clean and easy to work with when multiple languages are involved. When I get to performance work I plan to experiment a bit with nanobind (https://github.com/wjakob/nanobind) and see if there's a noticeable difference vs. ctypes.
[+] helltone|9 months ago|reply
This is very cool. I'm wondering if some of the templates and switch statements would be nicer if there was an intermediate representation and a compiler-like architecture.

I'm also curious about how this compares to something like Jax.

Also curious about how this compares to zml.

[+] nirw4nna|9 months ago|reply
You are absolutely correct! I started working on a sort of compiler a while back but decided to get the basics down first. The templates and switch statements are not really the issue; the real cost is going back and forth between C & Python. This is an experiment I did a few months ago: https://x.com/nirw4nna/status/1904114563672354822. As you can see, there is a ~20% perf gain just from generating a single naive C++ kernel instead of calling 5 separate kernels in the case of softmax.
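The fusion win described above can be sketched in plain Python: the unfused version materializes an intermediate buffer per "kernel" pass, while the fused version does the same arithmetic in one loop (a stand-in for the generated C++ kernel, not dsc's actual code):

```python
# Softmax as five separate "kernel" passes (max, subtract, exp, sum,
# divide) versus one fused pass with no intermediate buffers between
# stages. Both compute the same numerically stable softmax.
import math

def softmax_5_passes(xs):
    m = max(xs)                            # pass 1: reduce max
    shifted = [x - m for x in xs]          # pass 2: subtract
    exps = [math.exp(x) for x in shifted]  # pass 3: exponentiate
    s = sum(exps)                          # pass 4: reduce sum
    return [e / s for e in exps]           # pass 5: divide

def softmax_fused(xs):
    # single loop body; the sum accumulates as the exponentials are made
    m = max(xs)
    s = 0.0
    out = []
    for x in xs:
        e = math.exp(x - m)
        out.append(e)
        s += e
    return [e / s for e in out]

print(softmax_fused([0.5, 1.5, -1.0]))
```

On a GPU each pass is a kernel launch plus a round trip through memory, which is why fusing them into one generated kernel pays off.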
[+] kajecounterhack|9 months ago|reply
Cool stuff! Is the goal of this project personal learning, inference performance, or something else?

Would be nice to see how inference speed stacks up against say llama.cpp

[+] nirw4nna|9 months ago|reply
Thanks! To be honest, it started purely as a learning project. I was really inspired when llama.cpp first came out and tried to build something similar in pure C++ (https://github.com/nirw4nna/YAMI), mostly for fun and to practice low-level coding. The idea for DSC came when I realized how hard it was to port new models to that C++ engine, especially since I don't have a deep ML background. I wanted something that felt more like PyTorch, where I could experiment with new architectures easily. As for llama.cpp, it's definitely faster! They have hand-optimized kernels for a whole bunch of architectures, models and data types. DSC is more of a general-purpose toolkit. I'm excited to work on performance later on, but for now I'm focused on getting the API and core features right.
[+] liuliu|9 months ago|reply
Both use cuBLAS under the hood, so I think prefill performance is similar (of course, this framework is still early and doesn't seem to have FP16/BF16 support for GEMM yet). Hand-rolled GEMV is faster for token generation, hence llama.cpp is better there.
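For readers unfamiliar with the distinction: during token generation the batch is a single token, so each weight-matrix multiply collapses from GEMM (matrix-matrix) to GEMV (matrix-vector). A toy sketch, just to show the shapes involved:

```python
# GEMV: y[i] = sum_j A[i][j] * x[j]. During decoding, x is the single
# new token's activation vector, so this memory-bound product dominates
# and hand-tuned GEMV kernels beat a general GEMM path.
def gemv(A, x):
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

A = [[1.0, 2.0],
     [3.0, 4.0]]
x = [1.0, 1.0]
print(gemv(A, x))  # [3.0, 7.0]
```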
[+] rrhjm53270|9 months ago|reply
Do you have any plans for serialization and deserialization in your tensor and nn library?
[+] nirw4nna|9 months ago|reply
Right now I can load tensors directly from a safetensors file or from a NumPy array, so I don't plan to add my own custom format, but I do plan to support GGUF files.
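For anyone curious, the safetensors layout itself is simple: an 8-byte little-endian header length, a JSON header mapping tensor names to dtype/shape/byte offsets, then the raw tensor data. A minimal stdlib-only sketch (F32 only; this is an illustration of the format, not dsc's actual loader):

```python
# Minimal safetensors-style writer/reader using only the stdlib.
# Layout: u64 LE header size, JSON header, raw tensor bytes. The JSON
# "data_offsets" are byte ranges relative to the start of the data blob.
import json
import struct

def write_safetensors(path, name, data, shape, dtype="F32"):
    payload = struct.pack(f"<{len(data)}f", *data)  # raw LE float32
    header = json.dumps({name: {"dtype": dtype, "shape": shape,
                                "data_offsets": [0, len(payload)]}}).encode()
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header)))
        f.write(header)
        f.write(payload)

def read_safetensors(path):
    with open(path, "rb") as f:
        (n,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(n))
        blob = f.read()
    out = {}
    for name, meta in header.items():
        a, b = meta["data_offsets"]
        count = (b - a) // 4  # this sketch handles F32 only
        out[name] = list(struct.unpack(f"<{count}f", blob[a:b]))
    return out
```

Because the header is plain JSON and tensors are flat byte ranges, loaders can memory-map the file and view tensors in place, which is a big part of the format's appeal.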
[+] einpoklum|9 months ago|reply
It's very C-like, heavy use of macros, prefixes instead of namespaces, raw pointers for arrays etc. Technically you're compiling C++, but... not really.

No negative or positive comment on its usability though, I'm not an ML/Neural Network simulation person.

[+] caned|9 months ago|reply
I've found adherence to C++ conventions in low-level software to be a rather contentious issue, most recently when working in an ML compiler group. One camp abhorred the use of macros, the other any kind of polymorphism or modern C++ feature.

Coming from a background of working with OS kernels and systems software, I don't mind the kind of explicit "C++ lite" style used by the OP. Left to my own devices, I usually write things that way. I would think twice if I was trying to design a large framework, but ... I try to avoid those.

[+] nirw4nna|9 months ago|reply
Yes! This was actually one of my initial goals. I like to work in a C-style C++, let's say, where I turn off the C++ features I don't need and use only the ones I actually need, like templates, objects, etc. I find this style easy to reason about when it comes to performance.
[+] amtk2|9 months ago|reply
super n00b question: what kind of laptop do you need for a project like this? Is a Mac OK, or do you need a dedicated Linux laptop?
[+] nirw4nna|9 months ago|reply
I developed this on an HP Omen 15 with an i7-8750H, a GTX 1050 Ti, and 32GB of RAM, running Linux Mint as my OS.
[+] kadushka|9 months ago|reply
Any laptop with an Nvidia card
[+] revskill|9 months ago|reply
Why not Zig?
[+] nirw4nna|9 months ago|reply
Because I happen to know C++ and I just wanted to build something rather than learn a new language. Zig looks very interesting though, there are already other projects in this space that use it with great success (see: https://github.com/zml/zml).