yelite's comments

yelite | 2 years ago | on: Efficient Memory Management for Large Language Model Serving with PagedAttention

> How doesn't paging worsen speed performance though?

It does worsen the performance of the attention kernel compared to kernels that take keys and values in a contiguous memory layout.

> Wouldn't the speed improvements be coming from that instead? Don't put an expected short input and output in the same batch as a big input and big output?

Actually it puts everything in the same batch. The reason for its high throughput is that sequences are removed from the batch as soon as they finish, and new sequences can be added to the batch on the fly if there is enough space in the KV cache. This is called continuous batching (https://www.anyscale.com/blog/continuous-batching-llm-infere...).

Paged attention and a "virtualized" KV cache play an important role in an efficient implementation of continuous batching. Text generation with an LLM is a dynamic process, and it's not possible to predict how long the output will be when scheduling incoming requests. Therefore a dynamic approach to KV cache allocation is needed, even though it hurts the performance of the attention kernel.
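A rough sketch of what dynamic allocation means here, assuming the cache is split into fixed-size blocks handed out on demand. The class and method names (`PagedKVCache`, `can_admit`, `append_token`, and so on) are hypothetical and only do the bookkeeping. The actual implementation also has to make the attention kernel follow the block table, which is where the slowdown comes from.

```python
class PagedKVCache:
    """Toy block allocator: the cache is split into fixed-size blocks and
    each sequence owns a list of physical block ids (its block table)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def _blocks_for(self, num_tokens):
        return -(-num_tokens // self.block_size)  # ceiling division

    def can_admit(self, prompt_len):
        # Only the prompt has to fit; the output length is unknown.
        return self._blocks_for(prompt_len) <= len(self.free_blocks)

    def add_sequence(self, seq_id, prompt_len):
        if not self.can_admit(prompt_len):
            raise MemoryError("not enough free blocks")
        needed = self._blocks_for(prompt_len)
        self.block_tables[seq_id] = [self.free_blocks.pop() for _ in range(needed)]
        self.lengths[seq_id] = prompt_len

    def append_token(self, seq_id):
        # A new block is needed only when the last one is exactly full.
        if self.lengths[seq_id] % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("KV cache is full")
            self.block_tables[seq_id].append(self.free_blocks.pop())
        self.lengths[seq_id] += 1

    def free_sequence(self, seq_id):
        # A finished sequence returns its blocks for new requests to use.
        self.free_blocks.extend(self.block_tables.pop(seq_id))
        del self.lengths[seq_id]
```

The blocks of one sequence don't have to be adjacent, so a sequence can keep growing one block at a time without reserving space for the longest possible output up front.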

yelite | 5 years ago | on: “Location-Based Pay” – Who are we to complain?

Before remote work recently became a trend, location-based pay was just a result of price being determined by supply and demand, plus the fact that location is a major constraint on both job seeking and recruiting. For anyone who believes their work has intrinsic value: if you try to turn this value into a number (a salary), you ultimately need to use some kind of market reference (like "I am able to get an offer of $xxxx from another company"). This market reference is heavily based on location if remote work isn't a viable option for you.

Now, why do companies still stick to location-based pay when many other companies are embracing remote work? I think that's just cultural inertia, and eventually software engineers will be paid without taking their location into account. But that's not a good thing for everyone, because salaries at that point will probably be much lower than what people get paid in the SF area today.

yelite | 6 years ago | on: Ways to reduce the costs of an HTTP(S) API on AWS

> consider that those savings probably won't even pay for more than a quarter of a one of their developers

Although I have never run a business, I do believe this kind of optimization is quite meaningful, even though it will never be the top priority of a business.

Those optimizations lower operational cost while being mostly maintenance-free (except the one that switches away from AWS Certificate Manager, which may add some effort when renewing certificates), risk-free (unlike refactoring a large legacy system), and requiring little engineering effort (maybe 10 engineering days from investigation to writing the blog post?).

In addition, the blog post itself brings intangible benefits for their branding, website ranking, and hiring.
