I heard their inference framework has way lower cost than typical deployment methods. Can this be verified from that open-source project? How does it stack up against vLLM or llama.cpp?
I suspect it's much higher throughput than vLLM, which in turn is much higher throughput than llama.cpp. The MLA kernel they just open-sourced seems to indicate that, although we'll see how it does in third party benchmarks on non-hobbled GPUs vs FlashAttention. They only released the BF16 version — whereas most people, including DeepSeek themselves, serve in FP8 — so it might not be immediately useful to most companies quite yet, although I imagine there'll be FP8 ports soon enough.
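For anyone who wants to sanity-check the throughput claim themselves, here's a minimal sketch using vLLM's offline generation API — the model ID and prompt set are placeholders, and a fixed batch of identical prompts is nowhere near a rigorous benchmark, just a starting point:

    # Rough tokens/sec measurement with vLLM's offline API.
    # Model ID and prompts are placeholders, not a real benchmark setup.
    import time
    from vllm import LLM, SamplingParams

    prompts = ["Explain multi-head latent attention in one paragraph."] * 64
    sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

    llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite")  # placeholder model ID
    start = time.perf_counter()
    outputs = llm.generate(prompts, sampling_params)
    elapsed = time.perf_counter() - start

    # Count generated tokens across all requests to get output throughput.
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"{generated / elapsed:.1f} output tokens/s")

You'd want to repeat the same run against a llama.cpp server (and eventually against their stack) with matched batch sizes and sequence lengths before drawing any conclusions.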
Right, I think the PTX use is a bigger deal than it's getting coverage for. It creates an opening for other vendors to get their foot in the door with PTX-to-LLVM-IR translation for existing CUDA kernels.
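For context on what that translation layer would consume: a minimal sketch that shells out to nvcc (assumes the CUDA toolkit is installed and on PATH) to dump the PTX for a trivial kernel — that textual IR is the input a hypothetical PTX-to-LLVM-IR translator would lift:

    # Dump the PTX for a trivial CUDA kernel. The printed text is what a
    # hypothetical PTX -> LLVM IR translator would take as input.
    # Assumes nvcc (CUDA toolkit) is installed and on PATH.
    import pathlib
    import subprocess
    import tempfile

    KERNEL = r"""
    __global__ void axpy(float a, const float* x, float* y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }
    """

    with tempfile.TemporaryDirectory() as tmp:
        src = pathlib.Path(tmp) / "axpy.cu"
        ptx = pathlib.Path(tmp) / "axpy.ptx"
        src.write_text(KERNEL)
        # -ptx stops compilation at the PTX stage instead of emitting SASS/cubin
        subprocess.run(["nvcc", "-ptx", str(src), "-o", str(ptx)], check=True)
        print(ptx.read_text())  # e.g. ".visible .entry _Z4axpyfPKfPfi(..."

Kernels hand-tuned at the PTX level (as DeepSeek reportedly did) skip the CUDA C++ stage entirely, which is exactly why a PTX-level entry point matters for other vendors.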
I don't think the decision is based on infra or any technical reasons. It's more about the service and support side: how would a 200-person company support 44M iPhone users in China?