top | item 45955515

jarboot | 3 months ago

> Autoscaling is configured via CloudWatch alarms on CPU usage:
> Scale-out policy adds workers when CPU > 30%.
> Scale-in policy removes idle workers when CPU < 20%.

Does this handle the case where there are longer-running activities that have low CPU usage? Couldn't these be canceled during scale-in?

Temporal would retry them, but it would make some workflow runs take longer, which could be annoying for some user-interactive workflows.

The alternative I've seen is hitting the worker metrics endpoint and querying things like `worker_task_slots_available` to decide when to scale up, or querying pending activities, pending workflows, etc. to decide when to scale down per worker.
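Roughly what I mean, as a minimal sketch and not a tested setup. It assumes the worker exposes metrics in the Prometheus text format and that the exported name ends in `worker_task_slots_available` (the prefix depends on how the SDK is configured). `sum_metric`, `decide`, `total_slots`, and the two thresholds are hypothetical names and values I made up for illustration.

```python
import re

# One sample line of the Prometheus text exposition format, e.g.
#   metric_name{label="value"} 12.0
# Assumption: label values contain no "}" character.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def sum_metric(metrics_text: str, name_suffix: str) -> float:
    """Sum every sample whose metric name ends with name_suffix."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match and match.group(1).endswith(name_suffix):
            total += float(match.group(3))
    return total


def decide(slots_available: float, total_slots: float, pending_tasks: int,
           low_water: float = 0.1, high_water: float = 0.9) -> str:
    """Return "scale_out", "scale_in", or "hold" for a single worker.

    total_slots is the worker's configured concurrency (an assumption:
    you have to supply it, it is not read from the metrics here).
    """
    if total_slots <= 0:
        return "hold"
    free_ratio = slots_available / total_slots
    if free_ratio <= low_water:
        return "scale_out"  # almost no free slots: add workers
    if free_ratio >= high_water and pending_tasks == 0:
        return "scale_in"   # idle and nothing pending: safe to remove
    return "hold"
```

The point of the `pending_tasks == 0` check is that a worker holding a long-running, low-CPU activity never looks idle, so scale-in doesn't cancel it.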

norapap|3 months ago

They can be cancelled if CPU drops below the scale-in threshold. In my case the activities were CPU-heavy, batch-style, and not client-facing, so I preferred occasional retries and slightly longer runtimes over blowing up the AWS bill. For that workload, CPU-based autoscaling was perfectly fine.

I originally ran this setup on Temporal Cloud, and pulling detailed worker/queue metrics directly from Cloud can be tricky: you need to expose custom worker metrics yourself, then pipe them into CloudWatch. If you host Temporal yourself, it is easier :)
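For the "pipe them into CloudWatch" part, something like the following sketch would do it, assuming boto3 is installed and AWS credentials are configured. The metric name `WorkerTaskSlotsAvailable`, the `TaskQueue` dimension, and the `Temporal/Workers` namespace are names I picked for illustration, not anything Temporal or AWS defines.

```python
def build_metric_data(slots_available: float, task_queue: str) -> list:
    """Build a MetricData list in the shape CloudWatch PutMetricData expects."""
    return [{
        "MetricName": "WorkerTaskSlotsAvailable",  # hypothetical name
        "Dimensions": [{"Name": "TaskQueue", "Value": task_queue}],
        "Value": float(slots_available),
        "Unit": "Count",
    }]


def publish(metric_data: list, namespace: str = "Temporal/Workers") -> None:
    """Send the custom metric to CloudWatch (namespace is an assumption)."""
    import boto3  # third-party; imported lazily so the builder works without it

    boto3.client("cloudwatch").put_metric_data(
        Namespace=namespace,
        MetricData=metric_data,
    )
```

Each worker would call `publish(build_metric_data(slots, "my-queue"))` on a timer, and the CloudWatch alarm then targets that custom metric instead of CPU.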