stsffap's comments

stsffap | 23 days ago

ADK is a great framework for building agents: it is runtime agnostic, so you can choose which properties are important to you. You can also run your ADK agents on Restate (https://google.github.io/adk-docs/integrations/restate) if you want to turn them into durable agents that can reliably communicate with other agents.

stsffap | 23 days ago | on: A remote control for your agents

Hey HN, I work on Restate.

We kept seeing the same problem with AI agents in production: you can observe them (traces, logs, dashboards), but when something goes wrong, your options are basically restart the process and lose all progress, or wait and hope.

We built Restate as a durable execution engine, and it turns out that the primitives it provides (journaling every step, giving each execution a stable ID) map really well onto the control problem for agents.

This post walks through concrete scenarios: cancelling hundreds of agents stuck retrying a dead endpoint, pausing agents during DB maintenance instead of letting them burn through retries, and restarting a failed three-hour workflow from the exact step that failed (without redoing the expensive work before it).

Curious what control problems others have hit with agents in production. Happy to answer questions.

stsffap | 4 months ago | on: A Durable Coding Agent – With Modal and Restate

We built this to explore what it takes to move a coding agent from "works on my laptop" to "handles millions of users in production."

The core insight: durability matters more than you'd think for agents. When an agent takes 5-10 minutes on a task, crashes become inevitable. Rate limits hit. Sandboxes time out. Users interrupt mid-task. Traditional retry logic gets messy fast.

Our approach uses Restate for durable execution (workflows continue from the last completed step) and Modal for ephemeral sandboxes. We get automatic failure recovery, interruptions for new input, great scalability, and scale-to-zero without any custom retry code. The tradeoffs: coupling to Restate's execution model and requiring discipline around deterministic replay.
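To make "continue from the last completed step" concrete, here is a minimal sketch of journal-based replay in plain Python. It does not use the Restate SDK; `Journal`, `workflow`, and the step names are hypothetical, and a real engine would persist the journal durably instead of keeping it in memory.

```python
from typing import Any, Callable, Dict, List


class Journal:
    """Hypothetical in-memory journal; a real engine stores this durably."""

    def __init__(self) -> None:
        self.entries: Dict[str, Any] = {}

    def run(self, name: str, action: Callable[[], Any]) -> Any:
        # A step that already completed is answered from the journal
        # instead of being executed again.
        if name in self.entries:
            return self.entries[name]
        result = action()
        self.entries[name] = result  # record only after the step succeeded
        return result


calls: List[str] = []  # tracks which steps really executed


def workflow(journal: Journal, fail_at_test: bool) -> str:
    plan = journal.run("plan", lambda: calls.append("plan") or "plan-v1")
    journal.run("generate", lambda: calls.append("generate") or f"code for {plan}")

    def run_tests() -> str:
        calls.append("test")
        if fail_at_test:
            raise RuntimeError("sandbox timed out")
        return "tests passed"

    return journal.run("test", run_tests)


journal = Journal()
try:
    workflow(journal, fail_at_test=True)  # first attempt crashes in "test"
except RuntimeError:
    pass

# The retry replays "plan" and "generate" from the journal and only re-runs "test".
result = workflow(journal, fail_at_test=False)
print(calls, result)
```

The sketch also hints at why deterministic replay needs discipline: the retry must reach the same steps in the same order for the journal lookups to line up.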

How are you handling long-running agent workflows to make them run reliably at scale?

stsffap | 4 months ago | on: Keep your applications running while AWS is down

Author here. AWS’s recent us-east-1 outage inspired us to share our approach to geo-replication with Restate.

The core idea: geo-replication should be a deployment concern, not something you architect into every line of application code. You write normal business logic, then configure replication policies at deployment time and let Restate handle the rest.

The configuration is straightforward: `default-replication = "{region: 2, node: 3}"` ensures data is replicated to at least 2 regions and 3 nodes. With this setting, your apps can tolerate a region outage or the loss of two arbitrary nodes while staying fully available. Behind the scenes, Restate handles leader election, log replication, and state synchronization. We use S3 cross-region replication for snapshots with delayed log trimming to ensure consistency.
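As a rough sanity check of that claim, the sketch below enumerates every way to place 3 copies across at least 2 regions and verifies that at least one copy survives a region loss or any two node losses. The 6-node layout and the "at least one copy survives" criterion are simplifying assumptions of mine; this is not how Restate decides availability internally.

```python
from itertools import combinations

# Hypothetical layout: 2 nodes in each of 3 regions.
NODES = {
    "n1": "us-east-1", "n2": "us-east-1",
    "n3": "us-east-2", "n4": "us-east-2",
    "n5": "us-west-1", "n6": "us-west-1",
}


def valid_placements(min_regions: int = 2, copies: int = 3):
    """Yield every set of `copies` nodes that spans at least `min_regions` regions."""
    for placement in combinations(NODES, copies):
        if len({NODES[n] for n in placement}) >= min_regions:
            yield set(placement)


def survives(placement: set, failed_nodes: set) -> bool:
    """Simplified criterion: at least one copy is on a node that is still up."""
    return bool(placement - failed_nodes)


def region_nodes(region: str) -> set:
    return {n for n, r in NODES.items() if r == region}


placements = list(valid_placements())

survives_any_region_loss = all(
    survives(p, region_nodes(r)) for p in placements for r in set(NODES.values())
)
survives_any_two_node_loss = all(
    survives(p, set(failed)) for p in placements for failed in combinations(NODES, 2)
)
print(len(placements), survives_any_region_loss, survives_any_two_node_loss)
```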

We tested this with a 6-node cluster across 3 AWS regions under 400 req/s load. Killing an entire region resulted in sub-60-second automatic failover with zero downtime and no data loss. Only 1% of requests saw latency spikes during the failover window. Once nodes in us-east-1 were no longer running, P50 latency increased when replication shifted from nearby us-east-1/us-east-2 to distant us-east-2/us-west-1.

Happy to answer technical questions or discuss tradeoffs!

stsffap | 5 months ago | on: Control and Alt and Restate 1.5

Restate 1.5 adds some nice QoL improvements for durable execution:

Observability: Full execution history with live timelines showing retries, nested calls, and state changes. All stored locally in RocksDB, no external deps needed.

Better failure handling: Instead of dead-letter queues, you can now pause invocations that hit terminal errors and restart them via UI once you fix the root cause. Invocations retain their progress/state.

Granular retry policies: Configure retries per service/handler. Invocations can pause after max retries instead of failing (useful for config errors, blocked APIs, etc).

Performance: SQL queries 5-20x faster, making the UI much snappier.

AWS Lambda: Automatic payload compression when approaching the 6MB limit, plus new Rust SDK support.

Also includes a docs overhaul with new tutorials for AI agents, workflows, and microservice orchestration.

Cloud version is now public (managed option), or self-host the open source version.

stsffap | 5 months ago | on: Restate Cloud Is Open to Everyone – Build Durable Workflows and Agents Today

Restate Cloud is now publicly available with usage-based pricing (free tier: 50k actions/month, no CC required).

Restate provides durable execution for workflows and AI agents - think "transactional guarantees for distributed code." It handles state persistence, automatic retries, and crash recovery so your workflows always complete. Already being used for AI orchestration, payment processing, and banking infrastructure.

The platform combines the dev-ex of cloud-native orchestration with database-level guarantees, while also running as a single binary that scales from localhost to multi-region. New features include detailed execution timelines, client-side encryption with customer keys, and seamless integration with Cloudflare Workers/Vercel/Deno that automatically handles the versioning problem (no more breaking durable executions with code changes).

Open source core + managed cloud offering.

stsffap | 8 months ago | on: Restate 1.4: We've Got Your Resiliency Covered

We’re excited to announce Restate v1.4, a significant update for developers and operators building and supporting resilient applications. The new release improves cluster resiliency and workload balancing, and also adds a multitude of efficiency and ergonomics improvements across the board. Experience less unavailability and achieve more with fewer resources.

stsffap | 1 year ago

If events need to be processed in strict order across different partitions, then you need to send these events to a single key (and thereby to a single partition). A partition consists of multiple service keys for which Restate ensures strict order processing. It is noteworthy that different keys don't block each other from being executed (no head of line blocking across different keys).

The order in which events are processed for each key is their arrival order. If you need to handle out-of-order events, then you can implement this as part of a virtual object which can store events and re-order them based on other events that carry some form of watermark or based on time.
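A minimal sketch of such a re-ordering buffer in plain Python, assuming each event carries a gap-free sequence number per key (my assumption; a watermark or timestamp would work along the same lines). `ReorderBuffer` and `handle` are hypothetical names, and the dictionary stands in for the per-key state a virtual object would hold.

```python
from typing import Dict, List


class ReorderBuffer:
    """State for one key: buffers events and releases them in sequence order."""

    def __init__(self) -> None:
        self.next_seq = 0
        self.pending: Dict[int, str] = {}

    def on_event(self, seq: int, payload: str) -> List[str]:
        if seq < self.next_seq:
            return []  # already released, treat as a duplicate
        self.pending[seq] = payload
        released: List[str] = []
        # Release the longest gap-free run starting at next_seq.
        while self.next_seq in self.pending:
            released.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return released


# One buffer per key, so different keys never block each other.
buffers: Dict[str, ReorderBuffer] = {}


def handle(key: str, seq: int, payload: str) -> List[str]:
    return buffers.setdefault(key, ReorderBuffer()).on_event(seq, payload)


out1 = handle("user-1", 1, "b")  # arrives early, held back
out2 = handle("user-2", 0, "x")  # other key is unaffected
out3 = handle("user-1", 0, "a")  # fills the gap, releases both
out4 = handle("user-1", 2, "c")
print(out1, out2, out3, out4)
```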

stsffap | 1 year ago

Hi everyone, I am helping build Restate. If you want to try out deploying a distributed Restate cluster, then you can do this with only a few commands. All you need is Docker and to follow our guide here: https://docs.restate.dev/guides/cluster.

Let us know what you think about it :-)

stsffap | 1 year ago

We also put a lot of energy into making operations of Restate as simple as possible. We learned the hard way when building Apache Flink that operating a distributed system is challenging, especially if it relies on external systems like ZooKeeper. Therefore, Restate comes as a batteries-included single binary that does not need any external dependencies, so you don't have to understand and operate multiple systems. Moreover, you can start with a single-node deployment and later turn it into a multi-node deployment by "simply" starting new processes that connect to the existing cluster.

Restate itself sits in between your services and your user's requests. It is designed to push invocations to your service endpoints which allows it to play nicely together with serverless platforms such as AWS Lambda, Cloud Run functions, etc.

stsffap | 1 year ago

By default, failing ctx.run() calls (like the accountService call) will be retried indefinitely until they succeed unless you have configured a retry policy for them. In the case of a configured retry policy where you have exhausted the number of retry attempts, Restate will mark this call as terminally failed and record it in its log as such and return it to the caller.
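The behavior described above can be sketched in plain Python as follows. This is not the Restate SDK: `run_with_retries`, `TerminalError`, and `max_attempts` are hypothetical names, and the backoff between attempts is left out to keep the example short.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class TerminalError(Exception):
    """Hypothetical marker for a failure that must not be retried."""


def run_with_retries(action: Callable[[], T], max_attempts: Optional[int] = None) -> T:
    """Retry `action` until it succeeds; with a policy, give up after `max_attempts`."""
    attempt = 0
    while True:
        attempt += 1
        try:
            return action()
        except TerminalError:
            raise  # terminal failures go straight back to the caller
        except Exception as err:
            if max_attempts is not None and attempt >= max_attempts:
                # Retry budget exhausted: surface a terminal failure.
                raise TerminalError(f"gave up after {attempt} attempts: {err}") from err
            # Otherwise loop and retry (a real system would back off here).


attempts = {"n": 0}


def flaky_account_call() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("endpoint down")
    return "ok"


# No policy: retried until it succeeds (here on the third attempt).
unbounded = run_with_retries(flaky_account_call)

# Policy with two attempts: ends in a terminal failure.
attempts["n"] = 0
try:
    run_with_retries(flaky_account_call, max_attempts=2)
    bounded = "succeeded"
except TerminalError:
    bounded = "terminally failed"
print(unbounded, bounded)
```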

stsffap | 1 year ago

We tried to make the additional usage grant (https://github.com/restatedev/restate/blob/39f34753be0e27af8...) as permissive as possible. Our intention is only to prevent the big cloud service providers from offering Restate as a managed service, as has happened in the past with other open source projects. If you find the additional usage grant still too restrictive, then let us talk about how to adjust it to enable you while still maintaining our initial intention.

stsffap | 1 year ago

We will create a more detailed comparison to Temporal shortly. Until then @sewen gave a nice summarizing comparison here: https://news.ycombinator.com/item?id=40660568.

And yes, Restate does not have any external dependencies. It comes as a single self-contained binary that you can easily deploy and operate wherever you are used to running your code.

stsffap | 1 year ago

Currently, Restate does not support this functionality out of the box. Since Restate does not need access to input/output messages or state (it ships them as bytes to the service endpoint), you could add your own client-side encryption mechanism. In the foreseeable future, Restate will probably add a more integrated solution for it.

stsffap | 1 year ago

While Restate is not optimized for analytical workloads, it should be fast enough to handle simpler ones. Admittedly, it currently lacks a fluent API to express a dataflow graph, but this is something that can be added on top of the existing APIs. As @gvdongen mentioned, a scatter-gather-like pattern can be easily expressed with Restate.

Regarding whether to parallelize or to batch, I think this strongly depends on what the actual operation involves. If it involves some CPU-intensive work like model inference, for example, then running more parallel tasks will probably speed things up.
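For illustration, here is the shape of a scatter-gather in plain Python, with a thread pool standing in for parallel service invocations. `score`, `scatter_gather`, and the chunk size are hypothetical; note that threads in CPython do not speed up pure-Python CPU-bound work, so for real CPU-heavy tasks the parallel units would be separate processes or service invocations.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def score(chunk: List[int]) -> int:
    """Stand-in for the per-chunk work (for example, one inference call)."""
    return sum(x * x for x in chunk)


def scatter_gather(items: List[int], chunk_size: int, workers: int = 4) -> int:
    # Scatter: split the input into independent chunks.
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    # Run the chunks in parallel; pool.map keeps the results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(score, chunks))
    # Gather: combine the partial results.
    return sum(partials)


total = scatter_gather(list(range(10)), chunk_size=3)
print(total)  # 285, the sum of squares of 0..9
```

The chunk size is the knob for the parallelize-versus-batch tradeoff: larger chunks mean fewer, heavier tasks, smaller chunks mean more parallelism and more per-task overhead.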

stsffap | 1 year ago

From a quick glance at what JobRunr does (especially running asynchronous/delayed background tasks), it seems that Restate would be a very good fit for it as well. Restate will also handle persistence for you w/o having to deploy & operate a separate RDBMS or NoSQL store. Note that I am not a JobRunr expert, though.

stsffap | 1 year ago

Restate is built as a sharded replicated state machine similar to how TiKV (https://tikv.org/), Kudu (https://kudu.apache.org/kudu.pdf) or CockroachDB (https://github.com/cockroachdb/cockroach) are designed. Instead of relying on a specific consensus implementation, we have decided to encapsulate this part into a virtual log (inspired by Delos https://www.usenix.org/system/files/osdi20-balakrishnan.pdf) since it makes it possible to tune the system more easily for different deployment scenarios (on-prem, cloud, cost-effective blob storage). Moreover, it allows for some other cool things like seamlessly moving from one log implementation to another. Apart from that the whole system design has been influenced by ideas from stream processing systems such as Apache Flink (https://flink.apache.org/), log storage systems such as LogDevice (https://logdevice.io/) and others.

We plan to publish a more detailed follow-up blog post where we explain why we developed a new stateful system, how we implemented it, and what the benefits are. Stay tuned!

stsffap | 1 year ago

1. There is no maximum execution duration for a Restate workflow. Workflows can run for just a few seconds or span months. One thing to keep in mind for long-running workflows is that you might have to evolve the code over their lifetime. That's why we recommend writing them as a sequence of delayed tail calls (https://news.ycombinator.com/item?id=40659687).

2. Restate currently does not impose a strict size limit for input/output messages by default (it has the option to limit it though to protect the system). Nevertheless, it is recommended to not go overboard with the input/output sizes because Restate needs to send the input messages to the service endpoint in order to invoke it. Thus, the larger the input/output sizes, the longer it takes to invoke a service handler and to send the result back to the user (increasing latency). Right now we do issue a soft warning whenever a message becomes larger than 10 MB.

3. If the user does not specify a timeout for their call to Restate, then the system won't time it out. Of course, for long-running invocations it can happen that the external client fails or its connection gets interrupted. In this case, Restate allows you to re-attach to an ongoing invocation or to retrieve its result if it completed in the meantime.

4. There is no limit on the max number of state transitions of a workflow in Restate.

5. Restate keeps the journal history around for as long as the invocation/workflow is ongoing. Once the workflow completes, we will drop the journal but keep the completed result for 24 hours.
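To illustrate the "sequence of delayed tail calls" idea from point 1, here is a minimal simulation in plain Python. The scheduler, `reminder_step`, and the 30-day delay are hypothetical and only model the pattern: each invocation does one short unit of work and then schedules the next one, instead of one invocation sleeping for months. My reading is that this helps with code evolution because every new invocation can start on whatever code is deployed at that time.

```python
import heapq
from typing import List, Optional, Tuple

log: List[str] = []


def reminder_step(state: dict) -> Optional[Tuple[int, dict]]:
    """One short invocation: do one unit of work, then ask for the next call."""
    sent = state["sent"] + 1
    log.append(f"reminder {sent}")
    if sent >= state["total"]:
        return None  # workflow finished, no further tail call
    # Tail call: run this handler again in 30 (simulated) days with new state.
    return 30, {"sent": sent, "total": state["total"]}


def run_scheduler(first_state: dict) -> int:
    """Toy scheduler that executes delayed calls in due-time order."""
    queue = [(0, 0, first_state)]  # (due_day, tie_breaker, state)
    counter = 1
    now = 0
    while queue:
        now, _, state = heapq.heappop(queue)
        next_call = reminder_step(state)
        if next_call is not None:
            delay, next_state = next_call
            heapq.heappush(queue, (now + delay, counter, next_state))
            counter += 1
    return now


finished_on_day = run_scheduler({"sent": 0, "total": 3})
print(log, finished_on_day)
```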

stsffap | 1 year ago

A special case is if the operation is calling another Restate service. In this case, Restate will make sure that the callee will be executed exactly once and there is no need for the user to pass an idempotency key or something similar. Only when interacting with the external world from a Restate service, the operation needs to be idempotent.
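For the external-world case, a common way to make an operation idempotent is to send a stable idempotency key that the external system deduplicates on. The sketch below is plain Python with a hypothetical `PaymentProvider`; it only shows why a retried call is then harmless.

```python
from typing import Dict


class PaymentProvider:
    """Hypothetical external system that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self.charges: Dict[str, int] = {}

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        # Apply the side effect only the first time a key is seen.
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = amount_cents
        return f"charge-{idempotency_key}"


provider = PaymentProvider()

# The key is derived from stable business data, so a retry reuses the same key.
first = provider.charge("order-42", 1999)
retry = provider.charge("order-42", 1999)  # e.g. re-executed after a crash

total_charged = sum(provider.charges.values())
print(first == retry, total_charged)
```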

stsffap | 1 year ago

We are actively looking for feedback on which SDK to develop next. Quite a few people have voiced interest in Python so far, which makes it more likely that we will tackle it soonish. We'll keep you posted.