I heard their inference framework has way lower cost than typical deployment methods. Can this be verified from that open-source project? How does it stack up against vLLM or llama.cpp?
I suspect it's much higher throughput than vLLM, which in turn is much higher throughput than llama.cpp. The MLA kernel they just open-sourced seems to indicate that, although we'll see how it does in third party benchmarks on non-hobbled GPUs vs FlashAttention. They only released the BF16 version — whereas most people, including DeepSeek themselves, serve in FP8 — so it might not be immediately useful to most companies quite yet, although I imagine there'll be FP8 ports soon enough.
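For anyone who wants to sanity-check the throughput claim themselves, here's a minimal sketch using vLLM's offline generation API — the model ID and prompt set are placeholders, and a fixed batch of identical prompts is nowhere near a rigorous benchmark, just a starting point:

    # Rough tokens/sec measurement with vLLM's offline API.
    # Model ID and prompts are placeholders, not a real benchmark setup.
    import time
    from vllm import LLM, SamplingParams

    prompts = ["Explain multi-head latent attention in one paragraph."] * 64
    sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

    llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite")  # placeholder model ID
    start = time.perf_counter()
    outputs = llm.generate(prompts, sampling_params)
    elapsed = time.perf_counter() - start

    # Count generated tokens across all requests to get output throughput.
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"{generated / elapsed:.1f} output tokens/s")

You'd want to repeat the same run against a llama.cpp server (and eventually against their stack) with matched batch sizes and sequence lengths before drawing any conclusions.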
Right, I think the PTX use is a bigger deal than it's getting coverage for. It creates an opening for other vendors to get their foot in the door with PTX-to-LLVM-IR translation for existing CUDA kernels.
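For context on what that translation layer would consume: a minimal sketch that shells out to nvcc (assumes the CUDA toolkit is installed and on PATH) to dump the PTX for a trivial kernel — that textual IR is the input a hypothetical PTX-to-LLVM-IR translator would lift:

    # Dump the PTX for a trivial CUDA kernel. The printed text is what a
    # hypothetical PTX -> LLVM IR translator would take as input.
    # Assumes nvcc (CUDA toolkit) is installed and on PATH.
    import pathlib
    import subprocess
    import tempfile

    KERNEL = r"""
    __global__ void axpy(float a, const float* x, float* y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }
    """

    with tempfile.TemporaryDirectory() as tmp:
        src = pathlib.Path(tmp) / "axpy.cu"
        ptx = pathlib.Path(tmp) / "axpy.ptx"
        src.write_text(KERNEL)
        # -ptx stops compilation at the PTX stage instead of emitting SASS/cubin
        subprocess.run(["nvcc", "-ptx", str(src), "-o", str(ptx)], check=True)
        print(ptx.read_text())  # e.g. ".visible .entry _Z4axpyfPKfPfi(..."

Kernels hand-tuned at the PTX level (as DeepSeek reportedly did) skip the CUDA C++ stage entirely, which is exactly why a PTX-level entry point matters for other vendors.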
I don't think the decision is based on infra or any technical reasons. It's more about the service and support side: how would a 200-person company support 44M iPhone users in China?