
sidk24 | 7 months ago

LLMs are fast, until they aren't.

Most frameworks show token latency. But why are some tokens slow? What’s stalling the GPU? Is it poor SM occupancy? Kernel launch delay? Cache stalls?

I built *LLMTraceFX*, a token-level GPU profiler for LLM inference workloads.

*What it does:*

- Parses GPU execution traces (like vLLM outputs)
- Analyzes performance at token granularity
- Detects kernel-level bottlenecks: stall %, cache latency, launch overhead, etc.
- Uses the Claude API to explain why a token was slow and how to optimize it (e.g. "fuse kernels", "fix memory access pattern")
- Generates flame graphs + bottleneck dashboards
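To make "token granularity" concrete, here's a minimal sketch of per-token kernel attribution. The event schema (`token_id`, `kernel`, `duration_us`, `stall_pct`) is a simplified assumption for illustration, not LLMTraceFX's actual trace format:

```python
# Hypothetical sketch: aggregate kernel events per token and flag
# the kernel contributing the most stalled time. Field names are
# illustrative assumptions, not the real vLLM trace schema.
from collections import defaultdict

def summarize_tokens(events):
    """Group kernel events by token; report total time and worst stall."""
    per_token = defaultdict(list)
    for ev in events:
        per_token[ev["token_id"]].append(ev)
    report = {}
    for tok, evs in per_token.items():
        total_us = sum(e["duration_us"] for e in evs)
        # Rank kernels by stalled time (duration weighted by stall %).
        worst = max(evs, key=lambda e: e["duration_us"] * e["stall_pct"] / 100)
        report[tok] = {
            "total_us": total_us,
            "worst_kernel": worst["kernel"],
            "worst_stall_pct": worst["stall_pct"],
        }
    return report

events = [
    {"token_id": 0, "kernel": "attn_fwd", "duration_us": 120.0, "stall_pct": 35.0},
    {"token_id": 0, "kernel": "mlp_gemm", "duration_us": 80.0, "stall_pct": 10.0},
    {"token_id": 1, "kernel": "attn_fwd", "duration_us": 300.0, "stall_pct": 70.0},
]
print(summarize_tokens(events))
```

A summary like this is what makes the slow tokens stand out: token 1 above spends 300 µs in a single 70 %-stalled attention kernel.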

*Output*: JSON reports, HTML dashboards, CLI summaries, Claude suggestions.
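For the Claude suggestions, the per-token bottleneck record gets turned into a prompt. A hedged sketch of that step, where the record fields and prompt wording are illustrative assumptions rather than the project's actual format:

```python
import json

def build_explain_prompt(token_id, record):
    """Format a per-token bottleneck record into a prompt for an LLM.
    The record fields here are illustrative assumptions."""
    return (
        f"Token {token_id} was slow. Kernel stats:\n"
        f"{json.dumps(record, indent=2)}\n"
        "Explain the likely GPU bottleneck and suggest one optimization."
    )

record = {"total_us": 300.0, "worst_kernel": "attn_fwd", "worst_stall_pct": 70.0}
print(build_explain_prompt(1, record))
```

The resulting string would then be sent to the Claude API; the response ("fuse kernels", "fix memory access pattern", etc.) is attached to that token in the report.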

*Stack*: Python, FastAPI, Plotly, Modal.com for GPU runtime, Claude API (no infra required)

*GitHub repo*: https://github.com/Siddhant-K-code/LLMTraceFX

---

Would love feedback on:

- Other trace formats to support (HuggingFace, llama.cpp, ONNX?)
- What to show beyond kernel breakdowns
- Ideas for integrating with compilers and optimizers

Cheers!
