top | item 32124333

arunaugustine | 3 years ago

Yes, please give us more info about using control theory and how one might think about building such a system.

Too|3 years ago

This is how Kubernetes works in many ways. Crash a pod and the control loop inside the ReplicaSet will create a new pod for you. Scaling nodes is based on similar principles of desired vs actual values.

wsc981|3 years ago

My guess would be to add assertions everywhere instead of throwing exceptions.

Karellen|3 years ago

Why not throw exceptions, and just never use try/catch? That way, all exceptions are uncaught and should terminate the program, in a way that takes advantage of the programming language's native error reporting facilities.

roeles|3 years ago

I don't know of a way to test this behavior (I mainly code C++ and unit test with Google Test). One could spawn a process and capture the output and return value, but that sounds a bit heavy for just testing if your error handling still works as intended.

angarg12|3 years ago

These kinds of systems are not always appropriate, but when they are, they work wonderfully.

Our use case was to build a service to manage the dynamic part of our infrastructure. These are infra pieces that are created/deleted/modified on the fly according to some policies, instead of being defined statically as code. The implementation is simply a lambda function that runs every minute, loads a policy, compares it to the current state of the system, and then creates/deletes/modifies resources as needed.

I am currently in the process of writing a talk that I will deliver to the rest of my org. This will help me crystallize my thoughts, but here are some pointers on why I think it worked in this case:

* The service is stateless. On each run it just loads a policy, compares it to the current state of the system, and acts accordingly. This avoids handling complicated state or coordinating executions. In theory two policies could contradict each other, but in practice we partition our policies in such a way that overlap is not possible.

* Operations are idempotent. This is one of the reasons the system converges to a desired state. This makes the service resilient to both failures and eventual consistency.

* Deviation from policies doesn't affect correctness. We are fortunate that our system is not directly customer facing. Deviation from policies affects only performance. The system can run for several minutes (if not hours) outside the policy band without consequences. This would probably be a critical blocker for most production systems.
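Concretely, a single stateless, idempotent sense/act pass like the one described above might be sketched as follows (all types and names are illustrative, not the actual service):

```typescript
// Minimal sketch of one stateless reconcile pass. A real implementation
// would call actual infra APIs from the scheduled lambda.
interface Resource { id: string }
interface Policy { desired: Resource[] }

interface Infra {
  list(): Resource[];        // sense: actual state
  create(r: Resource): void; // act: idempotent creation
  remove(r: Resource): void; // act: idempotent deletion
}

function reconcile(policy: Policy, infra: Infra): void {
  const actual = infra.list();
  const actualIds = new Set(actual.map(r => r.id));
  const desiredIds = new Set(policy.desired.map(r => r.id));

  // Converge toward the desired state; because each operation is
  // idempotent, a failed run is simply retried on the next tick.
  for (const r of policy.desired) {
    if (!actualIds.has(r.id)) infra.create(r);
  }
  for (const r of actual) {
    if (!desiredIds.has(r.id)) infra.remove(r);
  }
}
```

Running the pass again when actual already matches desired is a no-op, which is what makes blind retries safe.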

Besides control theory and let-it-fail, I had the chance to play with other cool concepts while implementing this service. Here are some of them:

* Parse, don't validate/anti-corruption layer: The service downloads and parses a policy at the beginning of each run. If parsing fails, it errors out. Otherwise, it passes the policy object to the rest of the execution. This makes the system easy to test, and avoids the anti-pattern of peppering your code with input-reading and validation logic, only to find mid-execution that the policy was invalid.
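The shape of that pattern, in a hypothetical TypeScript sketch (the real policy schema is of course different):

```typescript
// "Parse, don't validate": the untyped blob is converted into a typed
// Policy exactly once, at the boundary. Field names are made up.
interface Policy {
  name: string;
  maxInstances: number;
}

function parsePolicy(raw: unknown): Policy {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("policy must be an object");
  }
  const o = raw as Record<string, unknown>;
  if (typeof o.name !== "string") {
    throw new Error("policy.name must be a string");
  }
  if (typeof o.maxInstances !== "number" || o.maxInstances < 0) {
    throw new Error("policy.maxInstances must be a non-negative number");
  }
  // From here on, the rest of the run receives a Policy, never `unknown`.
  return { name: o.name, maxInstances: o.maxInstances };
}
```

Everything downstream takes a `Policy`, so an invalid input can only fail here, at the start of the run.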

* Pluggable policies: The main body of the service is a very simple sense/act loop. For the actors, we use a strategy pattern, where policies can choose what strategy to use. This approach has helped us introduce new behavior with minimal code changes.
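A minimal sketch of what pluggable strategies can look like (all names and behaviors hypothetical):

```typescript
// Each policy names a strategy; the loop body stays unchanged when a
// new strategy is added to the registry.
interface Strategy {
  act(current: number, desired: number): string;
}

const strategies: Record<string, Strategy> = {
  // Jump straight to the target.
  immediate: { act: (_cur, des) => `set capacity to ${des}` },
  // Move one unit per run toward the target.
  gradual: {
    act: (cur, des) =>
      cur === des ? "no-op" : `set capacity to ${cur + Math.sign(des - cur)}`,
  },
};

function step(policy: { strategy: string }, current: number, desired: number): string {
  return strategies[policy.strategy].act(current, desired);
}
```

Adding a new behavior means adding one entry to the registry; the sense/act loop never changes.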

* TypeScript as configuration language: This service replaces an older, less flexible one. A major pain of the old service was that policies were defined as Jinja templates over plain text files. This became unmaintainable as the number and complexity of policies grew. Our new service defines policies in TypeScript. Policies are statically typed, and we use regular programming constructs (functions, loops, variables...) to build them at compile time. The output is still a plain JSON file.
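For readers who haven't seen the pattern, here is a toy version (schema and values invented for illustration):

```typescript
// Policies are ordinary TypeScript values, so the compiler type-checks
// them, and loops/functions replace copy-pasted template blocks.
interface ScalingPolicy {
  service: string;
  region: string;
  minInstances: number;
}

const regions = ["us-east-1", "eu-west-1"];

function policyFor(region: string): ScalingPolicy {
  return {
    service: "resizer",
    region,
    minInstances: region.startsWith("us") ? 4 : 2,
  };
}

// The build step emits the same artifact the old system produced:
// a plain JSON file.
const output = JSON.stringify(regions.map(policyFor), null, 2);
```

A typo in a field name fails the build instead of producing a subtly broken policy at runtime.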

Hope that helps.

t0mek|3 years ago

This approach is very similar to the way Kubernetes custom resource reconciliation works (and Kubernetes in general, but custom resources are how you bring your own logic into it).

In Kubernetes you can define your own types, Custom Resources (basically JSONs with schema) and deploy "operators" - services that should handle these new types. Every time you create or modify your custom resource, the operator is triggered and it should "reconcile" your resource.

Now this reconciliation process is stateless. It doesn't know what exactly changed in your resource, so it should just go through the list of all the things that it needs to do (create or remove pods, services, configmaps, etc.) and if something is not right (e.g. a missing service), try to make it right or fail. In any case, the output should be written in the custom resource's .status section.

There's no active waiting - if the operator sees that some other resource is not ready yet (a required pod is still starting), it should just mark your resource as not ready and finish. If the pod state changes, the next reconciliation will notice it. It should do as much as it can to bring reality in line with the expectation, but no more.
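The "record what you saw and finish" behavior might look like this (a schematic sketch with a simplified resource shape, not real client-go/operator-framework code):

```typescript
// One non-blocking reconcile pass over a custom resource. The resource
// shape and the pod lookup are simplified for illustration.
interface CustomResource {
  spec: { podName: string };
  status: { ready: boolean; message: string };
}

type PodPhase = "Pending" | "Running";

function reconcileResource(
  cr: CustomResource,
  getPodPhase: (name: string) => PodPhase,
): CustomResource {
  const phase = getPodPhase(cr.spec.podName);
  if (phase !== "Running") {
    // No sleeping or polling: note the observation in .status and return.
    // The next reconciliation, triggered by the pod's state change,
    // picks up from here.
    return {
      ...cr,
      status: { ready: false, message: `waiting for pod ${cr.spec.podName}` },
    };
  }
  return { ...cr, status: { ready: true, message: "reconciled" } };
}
```

Each pass is cheap and terminates quickly, so being re-triggered many times costs little.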

If implemented correctly, this is surprisingly resilient. The idempotent nature of the reconciliation loop makes it perfect for error handling. For instance, your reconciliation may fail because some pod is not running correctly. That's nothing your operator can fix. But if the pod auto-heals (maybe the network connectivity was restored or an external service is available again), the operator will auto-heal as well, without manual intervention. The next reconciliation loop will just see the pod is available again and carry on.