zhye's comments
zhye | 1 year ago | on: Universal LLM Deployment Engine with ML Compilation
zhye | 2 years ago | on: Punica: Serving multiple LoRA finetuned LLM as one
It will take some effort to implement the operators, but not too much (CUTLASS's grouped GEMM already supports different M/N/K shapes per group). However, the performance benefit is marginal compared to simply padding all LoRA ranks to the same rank, because these kernels are not compute-bound.
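A minimal NumPy sketch of the padding approach described above (hypothetical shapes, not Punica's actual kernels): zero-padding every LoRA's factors to the maximum rank lets one uniform batched GEMM serve requests with different ranks, producing the same result as per-request matmuls.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r_max = 16, 4
ranks = [1, 2, 4]  # hypothetical: different LoRA ranks per request

# Per-request LoRA factors A_i (d x r_i) and B_i (r_i x d)
As = [rng.standard_normal((d, r)) for r in ranks]
Bs = [rng.standard_normal((r, d)) for r in ranks]
xs = rng.standard_normal((len(ranks), d))  # one input row per request

# Pad every rank up to r_max with zeros -> uniform shapes for one batched GEMM
A_pad = np.stack([np.pad(A, ((0, 0), (0, r_max - A.shape[1]))) for A in As])
B_pad = np.stack([np.pad(B, ((0, r_max - B.shape[0]), (0, 0))) for B in Bs])

# Batched: y_i = x_i @ A_i @ B_i, computed with the padded uniform shapes
y_batched = np.einsum('bd,bdr,bre->be', xs, A_pad, B_pad)

# Reference: per-request matmuls with the true (unpadded) ranks
y_ref = np.stack([x @ A @ B for x, A, B in zip(xs, As, Bs)])
assert np.allclose(y_batched, y_ref)
```

The zero columns/rows contribute nothing to the product, so correctness is preserved; the cost is wasted FLOPs, which matters little here since these kernels are memory-bound.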
zhye | 2 years ago | on: Punica: Serving multiple LoRA finetuned LLM as one
Good question. In general, implementing kernels on page tables is tricky in tensor compilers because integer set analysis can sometimes fail (though this can be fixed with some tweaks). I think using compilers like TVM can help deploy serving systems on different platforms (e.g. AMD GPUs), and I'm optimistic about this direction (we also have to make tensor compilers more user-friendly).
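To make the difficulty concrete, here is a toy Python sketch (hypothetical layout, not any system's actual KV-cache format) of the indirection a paged KV cache introduces: every token access goes through a page table, and it is exactly this data-dependent `page_table[t // PAGE]` indexing that a tensor compiler's integer set analysis struggles to reason about.

```python
import numpy as np

PAGE = 4                       # tokens per page (assumed size)
num_pages, d = 8, 2
# Physical KV pool: num_pages pages, each holding PAGE tokens of width d
kv_pool = np.arange(num_pages * PAGE * d, dtype=np.float32).reshape(num_pages, PAGE, d)

# One sequence's page table: logical page index -> physical page index
page_table = np.array([5, 2, 7])  # pages are non-contiguous in the pool
seq_len = 10                      # spans 3 pages, the last only partially

def gather_kv(t):
    """Fetch the KV entry for logical token t through the page table."""
    phys = page_table[t // PAGE]  # data-dependent indirection
    return kv_pool[phys, t % PAGE]

gathered = np.stack([gather_kv(t) for t in range(seq_len)])

# Same result as if the sequence's pages were physically contiguous
ref = np.concatenate([kv_pool[p] for p in page_table])[:seq_len]
assert np.allclose(gathered, ref)
```

A dense-layout kernel sees only affine index expressions, while the paged version inserts a table lookup between the loop variable and memory, which is where standard polyhedral-style analysis needs the tweaks mentioned above.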
zhye | 2 years ago | on: Scaling LLama2-70B with Multiple Nvidia/AMD GPU
Serving LLMs with AMD GPUs looks impressive; MLC is evolving fast!
Any results on NVLink/xGMI instead of PCIe?
Any ideas on how those edge and cloud models could collaborate on compound tasks (e.g. compound AI systems: https://bair.berkeley.edu/blog/2024/02/18/compound-ai-system...)?