Why not let us try the new model for free, like the 5 free uses available for the 70B model? Seems like a no-brainer to hook new users if what you're selling is worth it, eh?
> The model, based on Meta Llama 3.1 8B, runs on a Phind-customized NVIDIA TensorRT-LLM inference server that offers extremely fast speeds on H100 GPUs. We start by running the model in FP8, and also enable flash decoding and fused CUDA kernels for MLP.
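To make the FP8 part of that quote concrete, here is a minimal sketch using TensorRT-LLM's high-level Python LLM API. The `QuantConfig`/`QuantAlgo` names come from recent TensorRT-LLM releases and may differ by version; Phind's flash-decoding and fused-MLP kernel customizations are not public, so only the FP8 quantization setup is shown, and the model name and sampling settings are illustrative:

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

# Quantize weights (and optionally the KV cache) to FP8 for H100-class GPUs.
quant_config = QuantConfig(
    quant_algo=QuantAlgo.FP8,
    kv_cache_quant_algo=QuantAlgo.FP8,
)

# Build/serve a Llama 3.1 8B engine with that quantization config.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # illustrative checkpoint
    quant_config=quant_config,
)

prompts = ["Write a function that reverses a linked list in Python."]
for output in llm.generate(prompts, SamplingParams(max_tokens=256, temperature=0.2)):
    print(output.outputs[0].text)
```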
As far as I know you are running your own GPUs. What do you do in overload? Have a queue system? What do you do in underload? Just eat the costs? Is there a "serverless" system here that makes sense / is anyone working on one?
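For concreteness, a "queue system in overload" usually means something like the sketch below: a bounded queue in front of a fixed pool of GPU workers, shedding requests once the queue fills, while idle workers in underload simply block (the limits, names, and the fake model call are purely illustrative, not Phind's actual infrastructure):

```python
import asyncio

MAX_QUEUE = 64        # illustrative limits, not real numbers
NUM_GPU_WORKERS = 8

queue: asyncio.Queue = asyncio.Queue(maxsize=MAX_QUEUE)

async def handle_request(prompt: str) -> str:
    fut: asyncio.Future = asyncio.get_running_loop().create_future()
    try:
        queue.put_nowait((prompt, fut))   # overload: queue full -> shed the request
    except asyncio.QueueFull:
        return "503: overloaded, retry later"
    return await fut                      # wait for a GPU worker to finish it

async def gpu_worker(worker_id: int) -> None:
    while True:
        prompt, fut = await queue.get()   # underload: workers idle here (paid-for GPUs sit waiting)
        await asyncio.sleep(0.05)         # stand-in for the real model call
        fut.set_result(f"[worker {worker_id}] completion for: {prompt!r}")
        queue.task_done()

async def main() -> None:
    workers = [asyncio.create_task(gpu_worker(i)) for i in range(NUM_GPU_WORKERS)]
    replies = await asyncio.gather(*(handle_request(f"prompt {i}") for i in range(100)))
    print(sum(r.startswith("503") for r in replies), "requests shed")
    for w in workers:
        w.cancel()

if __name__ == "__main__":
    asyncio.run(main())
```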
rushingcreek|1 year ago
johndough|1 year ago
fshr|1 year ago
swyx|1 year ago