Monitor the upgrade
Users will benefit from implementing observability tools for monitoring cluster upgrades and validating the state of the cluster.
Observability tools provide visibility into cluster metrics, logs, and traces, giving users insight into the behavior of the cluster and the applications running inside it. It is standard practice for any production cluster to implement appropriate observability tooling to monitor every aspect of the cluster.
Standard tools include Prometheus, Grafana, FluentBit, AlertManager, and Jaeger. The overall objective of these tools is to ensure that users can analyze their clusters in any way necessary to maintain operational hygiene. Common use cases include audit logging for forensic analysis and security, analyzing performance bottlenecks, detecting failures and breakages in cluster infrastructure, and aggregating log data from applications.
In the context of cluster upgrades, observability tools provide valuable data related to the following:
- Obtaining an application performance baseline. Having a performance baseline before executing an upgrade allows users to compare the performance impact after the upgrade is complete. Any drop in performance may require investigation and further analysis (see the example queries after this list).
- Verifying the availability/uptime of the cluster’s applications. Users may want data to verify whether cluster upgrades are causing any application downtime. Observing running application behavior, such as error messages, dropped requests, and latency spikes, helps determine whether cluster upgrades are impacting applications. Analyzing the impact on running applications will help determine a root cause and implement a mitigation plan.
- Verifying the cluster’s overall health during and after an upgrade. A cluster upgrade can only be validated as a success or failure if there’s observability data to confirm whether any problems occurred during the upgrade. Implementing tools to gather cluster data will help users gain confidence that their clusters were upgraded successfully without adverse impact.
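As a concrete illustration of the kind of data worth capturing, the following Python sketch queries Prometheus’s HTTP API for a before-and-after snapshot covering latency, error rate, and node readiness. The Prometheus URL and the metric names (http_request_duration_seconds_bucket, http_requests_total, and the kube-state-metrics kube_node_status_condition series) are assumptions that will vary by environment.

```python
import json

import requests  # third-party HTTP client: pip install requests

# Assumed Prometheus endpoint; adjust for your environment.
PROMETHEUS_URL = "http://prometheus.example.com:9090"

# Example PromQL queries. The metric names are assumptions: they presume
# applications expose standard HTTP histograms/counters and that
# kube-state-metrics is installed for node conditions.
QUERIES = {
    "p95_latency_seconds": (
        "histogram_quantile(0.95, "
        "sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
    ),
    "error_rate": (
        'sum(rate(http_requests_total{code=~"5.."}[5m])) '
        "/ sum(rate(http_requests_total[5m]))"
    ),
    "nodes_ready": 'sum(kube_node_status_condition{condition="Ready", status="true"})',
}


def snapshot() -> dict:
    """Run each query against Prometheus's instant-query API and return
    a name -> value mapping that can be compared across snapshots."""
    results = {}
    for name, query in QUERIES.items():
        resp = requests.get(
            f"{PROMETHEUS_URL}/api/v1/query",
            params={"query": query},
            timeout=10,
        )
        resp.raise_for_status()
        samples = resp.json()["data"]["result"]
        # Instant queries return a vector; take the first sample's value.
        results[name] = float(samples[0]["value"][1]) if samples else None
    return results


if __name__ == "__main__":
    # Capture a baseline before the upgrade, store the output, and rerun
    # the script afterwards to compare the two snapshots.
    print(json.dumps(snapshot(), indent=2))
```

Running the same script before and after the upgrade produces two snapshots that can be diffed to judge whether latency, error rates, or node availability regressed.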
Implementing observability is a crucial aspect of running a production cluster and is particularly useful for validating cluster upgrades. Users will benefit from the ability to investigate potential upgrade-related issues, confirm whether upgrades were completed successfully, and obtain a baseline for what to expect during the cluster upgrade process.
Develop a disaster recovery plan
Despite all the planning and testing users may implement to prepare for an upgrade, the upgrade process may still cause issues for the cluster and its applications. This situation may require disaster recovery to revert the cluster to a working, usable, and stable state. A disaster recovery plan is vital for users running managed Kubernetes on cloud providers because providers typically don’t allow downgrading the cluster control plane version.
A disaster recovery plan aims to restore the cluster and its applications to a working state as quickly and accurately as possible to mitigate downtime and potential data loss. The key elements of a recovery plan include the following:
- Backing up the cluster state. There are many tools available for backing up Kubernetes clusters. Velero is an example of an open-source project that can back up the Kubernetes objects running in a cluster and the data in any Persistent Volume resources. Deploying the cluster via infrastructure-as-code tools like Terraform also makes it easier to back up the cluster’s configuration. Overall, the objective for backups is to ensure that the entire cluster can be replicated easily in an identical state, if necessary. Backups should be periodically tested by restoring them to a separate cluster to verify that they are valid (a backup sketch follows this list).
- Documenting the current state of the cluster. Combining documentation with backups will help users reassemble clusters in a disaster recovery situation. Useful items to document include any manual customizations applied to the cluster, architectural design decisions that explain why the cluster is set up with a particular strategy, historical benchmarks from observability tools to validate the recovered cluster’s state, the client-side tools and commands used to set up and configure the cluster, and any other details required to set up an exact replica of the broken cluster.
- Monitoring observability tools to determine where problems originate. Examining data to determine the root cause of a problem caused by an upgrade may enable easier disaster recovery than trying to reset the cluster’s state. Observability tools can provide insight into which infrastructure components or applications in a cluster are failing and can enable users to focus their investigations accordingly.
- Recording any learnings from disaster recovery situations. Documenting these findings ensures that they lead to improvements in addressing similar problems in the future.
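As one hedged sketch of how the pre-upgrade backup step might be automated, the script below wraps the Velero CLI with Python’s subprocess module. It assumes the velero CLI is installed and configured against the target cluster; the backup name and namespace list are placeholders, and the flags should be verified against the Velero version in use.

```python
import datetime
import subprocess

# Placeholder values; adjust for your environment.
NAMESPACES = "team-a,team-b"  # namespaces to include in the backup
BACKUP_NAME = "pre-upgrade-" + datetime.datetime.now(datetime.timezone.utc).strftime(
    "%Y%m%d-%H%M%S"
)


def run(cmd: list[str]) -> str:
    """Run a CLI command, raising if it fails, and return its stdout."""
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return completed.stdout


if __name__ == "__main__":
    # Create a backup of the selected namespaces and wait for it to finish.
    run([
        "velero", "backup", "create", BACKUP_NAME,
        "--include-namespaces", NAMESPACES,
        "--wait",
    ])

    # Record the backup's status alongside the upgrade documentation.
    print(run(["velero", "backup", "describe", BACKUP_NAME]))

    # In a disaster recovery scenario, the matching restore would be
    # created with: velero restore create --from-backup <BACKUP_NAME>
```

Pairing a script like this with a periodic restore into a scratch cluster is one way to exercise the advice above about testing backups without touching production.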
Designing a proper rollback plan is essential for ensuring that cluster issues can be mitigated effectively. Whether recovery means performing root cause analysis on the existing cluster or replicating the cluster’s infrastructure from backups, a disaster recovery plan is especially important for production environments that are sensitive to downtime.