An orchestration platform is faced with the following task: You are given a large number of physical machines (e.g. 10K). Your users want you to run their services; they specify certain resource requirements but otherwise do not care where the services run. Now you have to figure things out.
First, it's worth noting two different kinds of services: stateless and stateful ones.
What's a stateless service? It's one that handles each request independently, keeping no context between requests, so any instance can serve any request.
On the other hand, a stateful service maintains some context across requests.
Why bother distinguishing the two? In the context of an orchestration platform, stateless services are easier to handle because their instances are interchangeable: one can be killed on a host and recreated on another without losing anything. A stateful service's data must move with it, which constrains placement and restarts.
This section discusses a potential architecture of an orchestration platform.
Config store: There is a database or config store that stores the target states of your system. The config describes a hierarchy of objects (e.g. instances, clusters, nodes). It contains general information such as resource requirements for each node and technology-specific settings.
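As a concrete (and purely illustrative) sketch, the target-state hierarchy in the config store could be modeled like this; the class and field names are hypothetical, not from any real platform:

```python
from dataclasses import dataclass, field

# Illustrative shapes for the target-state hierarchy held in the config store.
# All names here are assumptions made for the example.

@dataclass
class Node:
    name: str
    cpu_cores: float                 # resource requirements
    memory_mb: int
    settings: dict = field(default_factory=dict)  # technology-specific settings

@dataclass
class Cluster:
    name: str
    nodes: list                      # list[Node]

# One cluster whose target state is a single MySQL node.
target = Cluster(
    name="payments",
    nodes=[Node(name="mysql-0", cpu_cores=4.0, memory_mb=16384,
                settings={"max_connections": 500})],
)
```

A real config store would persist and version such objects; the point is just that both general resource requirements and technology-specific settings live in one hierarchy.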
Scheduler: The first consumer of your config is a scheduling service. It's a stateless service. It decides which hosts your nodes should run on. The placement information can be persisted somewhere as the source of truth.
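A minimal scheduler can be sketched as first-fit bin packing: walk the nodes, put each one on the first host with enough free resources, and return the placement to persist. This is a toy sketch with made-up data, not how any production scheduler scores hosts:

```python
def schedule(nodes, hosts):
    """Place nodes onto hosts by first fit.

    nodes: list of (name, cpu, mem) requirements.
    hosts: dict of host -> (free_cpu, free_mem).
    Returns {node_name: host}, the placement to persist as the source of truth.
    """
    placement = {}
    free = dict(hosts)
    for name, cpu, mem in nodes:
        for host, (free_cpu, free_mem) in free.items():
            if free_cpu >= cpu and free_mem >= mem:
                placement[name] = host
                free[host] = (free_cpu - cpu, free_mem - mem)  # reserve resources
                break
        else:
            raise RuntimeError(f"no host can fit {name}")
    return placement

placement = schedule(
    nodes=[("mysql-0", 4, 16), ("web-0", 2, 4), ("web-1", 2, 4)],
    hosts={"host-a": (6, 24), "host-b": (4, 8)},
)
# mysql-0 and web-0 fill host-a; web-1 spills over to host-b
```

Real schedulers add scoring, anti-affinity, and rebalancing on top of this basic feasibility check.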
Host agent: On each host, there is a host agent running. The agent controls what happens on the host. It may issue commands to dockerd or systemd to launch or shut down services, managing their lifecycle. The target service that gets started may start other processes as well. The host agent also collects the actual running states of the services.
Convergence loop: This is the key control logic that reconciles any difference between the target states and the actual states.
State machine
The convergence loop works like a state machine. It checks for issues, performs some operation, waits for some signal, and performs the next operation, until the actual state of the system matches the target state.
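The loop above can be sketched in a few lines, assuming dict-shaped states; `detect_issues` and `apply_fix` are illustrative stand-ins for real health checks and operations:

```python
def detect_issues(target, actual):
    """Return the keys whose actual state deviates from the target state."""
    return [key for key, want in target.items() if actual.get(key) != want]

def converge(target, actual, apply_fix, max_rounds=10):
    """Check for issues, perform one operation, re-check, until states match."""
    for _ in range(max_rounds):
        issues = detect_issues(target, actual)
        if not issues:
            return True              # actual state matches target state
        apply_fix(issues[0], target, actual)
    return False                     # gave up; something keeps drifting

target = {"container": "running", "mysql.max_connections": 500}
actual = {"container": "stopped", "mysql.max_connections": 200}

def apply_fix(key, target, actual):
    # Stand-in for starting a container, applying a config variable, etc.
    actual[key] = target[key]

converged = converge(target, actual, apply_fix)
```

The `max_rounds` bound matters in practice: a reconciler that retries forever against an unfixable issue just burns resources; it should eventually escalate.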
Issues
Issues can be defined on the actual states when they deviate from expectations. The issues can serve as signals/triggers for the convergence loop.
There could be issues at different independent layers. Examples:
| Layer | Issue | Reconciliation |
|---|---|---|
| Container | Container is not up. | Start the container and wait for it to become healthy. |
| Application | A MySQL dynamic config variable differs from the target. | Apply the config variable to the running process. |
Not all issues are reconciled immediately and automatically. Some issues are risky, so manual intervention is needed; others are too complicated for an automatic operation.
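One way to model this is a registry that maps each issue type to a reconciliation handler, with a flag for issues deemed too risky to fix automatically. Everything here (the decorator, the issue names, which issue is marked manual) is a hypothetical sketch:

```python
# Hypothetical registry of reconcilers; names and the auto/manual split
# are illustrative choices, not from any real platform.
RECONCILERS = {}

def reconciler(issue_type, auto=True):
    def register(fn):
        RECONCILERS[issue_type] = (fn, auto)
        return fn
    return register

@reconciler("container_down")
def start_container(ctx):
    return f"start container on {ctx['host']}, wait for health"

@reconciler("mysql_config_drift", auto=False)   # example of a risky issue
def apply_mysql_config(ctx):
    return f"apply {ctx['variable']} to the mysql process"

def handle(issue_type, ctx):
    fn, auto = RECONCILERS[issue_type]
    if not auto:
        return f"manual intervention required: {issue_type}"
    return fn(ctx)
```

The registry also hints at the universal vs application-specific split discussed next: generic handlers ship with the platform, while onboarding a new application means registering its own handlers.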
Centralized vs distributed
The convergence loop can be a centralized process—a service that reads the target states, collects actual states from host agents, and issues commands to each host agent to reconcile the differences.
Alternatively, this logic can happen in a distributed fashion at each host agent; each host agent reads the target state of its host and takes actions to fix the differences. This is more scalable and extensible (the host agent can potentially perform more complicated/long-running actions).
Universal vs Application-specific
Some reconciliation logic is universal while others are application-specific.
When a new application/service, especially a stateful one, is onboarded to an orchestration platform, some integration logic needs to be written to tell the platform how to handle certain state deviations.
Terminology: A pod is the smallest and simplest unit in the Kubernetes object model that you create or deploy. A pod represents a single instance of a running process in a cluster and can contain one or more containers. Containers in the same pod can share resources and dependencies, and communicate with each other.
Config store: etcd.
Scheduler: The Kubernetes scheduler does exactly what we described above. It places pods onto hosts/nodes, and the placement info is stored in etcd.
Host agent: Kubelet is the host agent that manages containers through containerd. It reports actual state information to the control plane every 10 seconds (configurable).
Convergence loop: Kubelet monitors the desired pod states by talking to the Kubernetes API server (which in turn reads from etcd) and ensures that all of the pods assigned to its node are running and healthy. It can restart unhealthy pods locally.
For scalability, the API server can shard requests of different kinds to different etcd instances (e.g. storing events in a separate etcd), though this may not be common practice.