TL;DR Use a workflow platform (e.g. Cadence, Amazon Step Functions) to manage your operations.
In the last post, we talked about the convergence loop: a process that performs operations to converge the actual state of the system toward the target state. Let's take a closer look at operations. Operations change the state of the system, and operational management is about managing how operations are performed on your platform.
What's an operation exactly? An operation can be as simple as hitting an RPC endpoint or sending a TCP request to a database node. It's just some code logic that needs to be run. The challenge is to manage the operations in a scalable manner, with proper rate limiting, auditing (who does what at what time), observability (logging/metric), and failure tolerance.
More concretely, examples of operations are: upgrading the OS across the fleet, taking down a host/rack for maintenance, and changing the topology of the Redis cluster to make it more failure tolerant.
Initially and conceptually, all operations are manual: a human starts every operation. Once we are comfortable and confident with an operation, we make it start automatically based on certain signals or conditions. The only difference between manual and automatic operations is whether a human is expected to be in the loop.
The most primitive form of doing an operation is to wrap the logic inside a CLI. This is good enough for simple and quick operations. This is also a good form for read-only operations that check the status of the system.
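As a sketch, a CLI wrapping a hypothetical drain-host operation might look like this (the flag names and the `drainHost` function are made up for illustration):

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// drainHost is a stand-in for the real operation logic: here it only
// validates input and reports what it would do.
func drainHost(host string, dryRun bool) error {
	if host == "" {
		return fmt.Errorf("missing --host")
	}
	if dryRun {
		fmt.Printf("would drain %s\n", host)
		return nil
	}
	fmt.Printf("draining %s\n", host)
	return nil
}

func main() {
	host := flag.String("host", "", "host to drain")
	dryRun := flag.Bool("dry-run", false, "print actions without executing")
	flag.Parse()
	if err := drainHost(*host, *dryRun); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

A `--dry-run` flag like this is a common courtesy for operational CLIs: it lets the operator preview a mutation before committing to it.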
The downsides: the logic runs from whichever machine invokes the CLI, so there's no centralized place for rate limiting, auditing, or observability.

A step up is to put the logic behind a service API endpoint. Now there's a centralized place for you to apply rate limiting and observability, and because the operation runs server-side, you can potentially run an operation that takes hours without tying it to anyone's terminal session.

There's still no failure tolerance, though. If an operation fails in the middle, it can be tricky to figure out the current state. You would mostly make each step of your job idempotent so that it's safe to retry; trying to roll back a partially failed operation would most likely be a nightmare.
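To make the centralization concrete, here is a minimal sketch of an operations endpoint with a crude in-flight cap (the `/ops/drain` route and the cap of 2 are invented for illustration; a real service would use a proper rate limiter plus auth and audit middleware):

```go
package main

import (
	"fmt"
	"net/http"
)

// limiter caps the number of operations running at once at 2.
var limiter = make(chan struct{}, 2)

// acquire reports whether a new operation may start right now.
func acquire() bool {
	select {
	case limiter <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a slot when an operation finishes.
func release() { <-limiter }

func opHandler(w http.ResponseWriter, r *http.Request) {
	if !acquire() {
		http.Error(w, "too many operations in flight", http.StatusTooManyRequests)
		return
	}
	defer release()
	// Centralized audit point: every operation passes through here.
	fmt.Printf("audit: %s called %s\n", r.RemoteAddr, r.URL.Path)
	fmt.Fprintln(w, "operation started")
}

func main() {
	http.HandleFunc("/ops/drain", opHandler)
	http.ListenAndServe(":8080", nil)
}
```

The point is the chokepoint: because every caller goes through one handler, rate limiting and audit logging live in exactly one place instead of on every operator's laptop.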
To make the system more scalable, you can create a pool of workers that pick up operations to run.
This can grow into a dedicated job system for managing the execution of operations.
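The worker-pool idea can be sketched with plain channels (a toy model; a real job system would pull from a durable queue rather than an in-memory slice):

```go
package main

import (
	"fmt"
	"sync"
)

// Job is a unit of operational work; Name is illustrative.
type Job struct {
	Name string
}

// RunPool starts n workers that drain the jobs concurrently and
// returns the names of the completed jobs.
func RunPool(n int, jobs []Job) []string {
	ch := make(chan Job)
	done := make(chan string, len(jobs)) // buffered so workers never block
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range ch {
				// Real operation logic would run here.
				done <- j.Name
			}
		}()
	}
	for _, j := range jobs {
		ch <- j
	}
	close(ch)
	wg.Wait()
	close(done)
	var completed []string
	for name := range done {
		completed = append(completed, name)
	}
	return completed
}

func main() {
	got := RunPool(3, []Job{{"upgrade-os"}, {"drain-rack"}, {"resize-cluster"}})
	fmt.Println(len(got), "jobs completed")
}
```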
A workflow platform is probably the optimal solution for managing operations. It has the benefits of scalability (distributed execution), failure tolerance (automatic checkpointing), context linking, and long running time.
What's a workflow? A workflow is a process with multiple steps. Each step is called an activity.
What's the syntax of a workflow? It depends on the workflow platform. For Cadence, you define your workflow in normal Go/Java code (using the Cadence client library), which is amazing. For Amazon Step Functions, you describe your workflow in a JSON or YAML file, using its own domain-specific language.
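To give a flavor of the Step Functions side, here is a minimal two-step state machine in the Amazon States Language (the Lambda ARNs are placeholders, and DrainHost/VerifyDrained are hypothetical activities):

```json
{
  "Comment": "Sketch: drain a host, then verify (placeholder ARNs)",
  "StartAt": "DrainHost",
  "States": {
    "DrainHost": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:DrainHost",
      "Next": "VerifyDrained"
    },
    "VerifyDrained": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:VerifyDrained",
      "End": true
    }
  }
}
```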
During a workflow execution, a workflow worker manages the end-to-end execution, and activity workers handle individual activities. The state of the workflow is persisted so that if a workflow worker fails, another one can pick it up and continue seamlessly. Ultimately, the execution of a workflow is like the execution of a state machine.
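The checkpoint-and-resume behavior can be sketched without any workflow platform (a toy model, not Cadence's actual implementation — real platforms persist state to a database, not a map):

```go
package main

import "fmt"

// Step is one activity in a toy workflow.
type Step struct {
	Name string
	Run  func() error
}

// Resume executes steps in order, skipping those already recorded in
// checkpoint and recording each success. If a "worker" dies mid-way,
// calling Resume again with the same checkpoint continues seamlessly.
func Resume(steps []Step, checkpoint map[string]bool) error {
	for _, s := range steps {
		if checkpoint[s.Name] {
			continue // already done in a previous execution
		}
		if err := s.Run(); err != nil {
			return err // checkpoint keeps earlier progress
		}
		checkpoint[s.Name] = true
	}
	return nil
}

func main() {
	cp := map[string]bool{}
	steps := []Step{
		{"provision", func() error { return nil }},
		{"migrate", func() error { return fmt.Errorf("transient failure") }},
		{"verify", func() error { return nil }},
	}
	_ = Resume(steps, cp) // fails at "migrate"; "provision" is checkpointed
	steps[1].Run = func() error { return nil }
	_ = Resume(steps, cp) // retries from "migrate", not from scratch
	fmt.Println(cp["verify"]) // true after the second attempt
}
```

This is why idempotent activities matter so much: the platform guarantees at-least-once execution of each step, and the checkpoint only makes re-running safe if each step tolerates being retried.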
Components:
- A database for storing all the pending jobs.
- A scheduler for assigning jobs to workers. It also monitors the heartbeats of workers and marks jobs as failed when needed.
- Workers, which may advertise the kinds of jobs they will take. When a worker finishes a job, it updates the job status in the database.
The process of a workflow execution: