sidk24 | 7 months ago
Most frameworks show token latency. But why are some tokens slow? What’s stalling the GPU? Is it poor SM occupancy? Kernel launch delay? Cache stalls?
I built *LLMTraceFX*, a token-level GPU profiler for LLM inference workloads.
*What it does:*
- Parses GPU execution traces (like vLLM outputs)
- Analyzes performance at token granularity
- Detects kernel-level bottlenecks: stall %, cache latency, launch overhead, etc.
- Uses the Claude API to explain why a token was slow and how to optimize it (e.g. "fuse kernels", "fix memory access pattern")
- Generates flame graphs + bottleneck dashboards
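For a feel of what token-level bottleneck detection looks like, here is a minimal sketch. The trace schema (`stall_pct`, `cache_latency_us`, `launch_overhead_us`) and the function name are illustrative assumptions, not LLMTraceFX's actual format or API:

```python
# Hypothetical sketch: given the kernels executed for one token,
# weigh each bottleneck category and report the dominant one.
# The field names below are assumed for illustration.

def classify_token(kernels):
    """Return the dominant bottleneck for one token's kernel list."""
    totals = {"stall": 0.0, "cache": 0.0, "launch": 0.0}
    for k in kernels:
        # Time lost to SM stalls, weighted by kernel duration
        totals["stall"] += k["duration_us"] * k["stall_pct"] / 100
        totals["cache"] += k["cache_latency_us"]
        totals["launch"] += k["launch_overhead_us"]
    return max(totals, key=totals.get)

token = [
    {"name": "attn_fwd", "duration_us": 420.0, "stall_pct": 35.0,
     "cache_latency_us": 12.0, "launch_overhead_us": 4.0},
    {"name": "mlp_gemm", "duration_us": 180.0, "stall_pct": 10.0,
     "cache_latency_us": 6.0, "launch_overhead_us": 90.0},
]
print(classify_token(token))  # prints "stall"
```

The real tool works from actual GPU trace data rather than a toy heuristic, but the idea is the same: attribute each token's latency to a cause, then hand that summary to Claude for an optimization suggestion.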
*Output*: JSON reports, HTML dashboards, CLI summaries, Claude suggestions.
*Stack*: Python, FastAPI, Plotly, Modal.com for GPU runtime, Claude API (no infra required)
*GitHub repo*: https://github.com/Siddhant-K-code/LLMTraceFX
---
Would love feedback on:
- Other trace formats to support (HuggingFace, llama.cpp, ONNX?)
- What to show beyond kernel breakdowns
- Ideas for integrating with compilers and optimizers
Cheers!