Analytical Platform Compute Maintenance
On the first day of the month, the workflow schedule-issue-compute-infrastructure.yml automatically raises a ticket, for example "Maintenance - Analytical Platform Compute".
The maintenance ticket covers an EKS cluster upgrade, if applicable, and patching to ensure all components are up to date.
Check for new release
Check whether a new release of Amazon EKS has been made available.
Upgrade and patch the EKS Control Plane, EKS Nodes, EKS add-ons and all components where new releases are available.
The Approach
- Create a new branch / Pull Request.
- Make changes to the code.
- Create one Pull Request for all the changes.
- Check the Terraform plan is as expected for each environment.
- Request approval.
- Once approved, release the apply workflow gate for each environment in turn, testing before proceeding to the next environment.
Order
Apply to the Development, Test and Production environments in that order, resolving any issues before progressing to the next.
Working on Analytical Platform Compute in Modernisation Platform
Be aware that, due to restrictions with the state file, multiple people cannot work on the same environment at once. Running a Terraform plan is fine, but once you have released the apply workflow the state file cannot be used by anyone else until the changes are complete. This only affects the environment in which you are carrying out the apply.
Make the team aware and check before starting the work to avoid conflicts.
Workflows
Once you create a Pull Request in the Modernisation Platform Environments repository, a workflow is triggered to carry out the required checks. If you subsequently push any changes to the branch, you will need to go into GitHub Actions and cancel the previous workflow so the new one can start.
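The cancellation step can also be done from the command line. The following is a minimal sketch, assuming the GitHub CLI (`gh`) is installed and authenticated and that you are in a clone of the repository with your latest commit already pushed. The function name `cancel_stale_runs` is hypothetical and not part of the repository tooling.

```shell
#!/usr/bin/env bash
# Cancel unfinished workflow runs on a branch that belong to an older commit.
# Assumption: run from a clone of the repository with the branch checked out,
# so that HEAD is the commit you have just pushed.
cancel_stale_runs() {
  local branch="$1"
  local head_sha
  head_sha="$(git rev-parse HEAD)" || return 1

  # List recent runs for the branch and keep only the IDs of runs that are
  # not completed and were started for a commit other than HEAD.
  gh run list --branch "$branch" --limit 50 \
    --json databaseId,headSha,status \
    --jq ".[] | select(.status != \"completed\" and .headSha != \"$head_sha\") | .databaseId" |
  while read -r run_id; do
    echo "Cancelling run $run_id"
    gh run cancel "$run_id" </dev/null
  done
}
```

For example, `cancel_stale_runs my-maintenance-branch` (a placeholder branch name) leaves the run for the current commit alone and cancels the older ones.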
Assumptions
- You are operating in the `modernisation-platform-environments` repository Development Container.
- To interrogate the cluster, you are exec’d into the same account as the cluster you are operating on: `aws-sso exec --profile analytical-platform-compute-test:modernisation-platform-developer`.
- Use account `modernisation-platform-developer` for Test and Production and `modernisation-platform-sandbox` for Development.
- If necessary, update `~/.kube/config` as follows: `aws eks update-kubeconfig --region eu-west-2 --name analytical-platform-compute-test`.
- Set the context as follows: `kubectl config use-context arn:aws:eks:eu-west-2:767397661611:cluster/analytical-platform-compute-test`.
Note: amend above appropriately for the environment you are working in.
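To illustrate how the commands change per environment, here is a small sketch that prints the set-up commands rather than running them. It assumes the profile and cluster names follow the `analytical-platform-compute-<environment>` pattern shown in the Test example; the function names are hypothetical. The `kubectl config use-context` step is left out because the context ARN contains an account ID that differs per environment; `kubectl config get-contexts` lists the available context names.

```shell
#!/usr/bin/env bash
# Map an environment name to the account role named in the assumptions above.
apc_role() {
  case "$1" in
    development) echo "modernisation-platform-sandbox" ;;
    test|production) echo "modernisation-platform-developer" ;;
    *) echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# Build the aws-sso profile name (assumed pattern, based on the Test example).
apc_profile() {
  local role
  role="$(apc_role "$1")" || return 1
  echo "analytical-platform-compute-$1:$role"
}

# Print the commands for an environment so they can be reviewed before use.
apc_print_setup() {
  local env="$1" profile
  profile="$(apc_profile "$env")" || return 1
  echo "aws-sso exec --profile $profile"
  echo "aws eks update-kubeconfig --region eu-west-2 --name analytical-platform-compute-$env"
}
```

Running `apc_print_setup development` prints the two commands with the sandbox role substituted.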
Impact on Users
As this is a live service, there could be an impact on users, so this must be taken into consideration when planning the work.
The impact on users depends on what is planned to be upgraded/patched.
For Example
If upgrading the cloudwatch logs agent, the user impact is minimal: applications will run, logs might be delayed, and you will not require a maintenance window.
If upgrading karpenter, the user impact is potentially higher because jobs might not schedule as expected, so you will have to agree when to schedule a maintenance window.
Schedule a Maintenance Window
To schedule a maintenance window for Test and Production, go to the PagerDuty Maintenance Page and use the Post Maintenance button.
Example Pull Requests
Upgrade the EKS Control Plane
- Update the `eks_cluster_version` to the new version in `terraform/environments/analytical-platform-compute/environment-configuration.tf`.
- Commit and push your results to the branch.
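Before changing `eks_cluster_version`, it can help to confirm the current control plane version and that the target is exactly one minor version ahead, since Amazon EKS upgrades the control plane one minor version at a time. This is a sketch only: the function names are hypothetical, and versions are assumed to be in `major.minor` form such as 1.30.

```shell
#!/usr/bin/env bash
# Read the running control plane version for a cluster.
# Assumes you are already exec'd into the matching account.
current_eks_version() {
  aws eks describe-cluster --name "$1" \
    --query 'cluster.version' --output text
}

# Succeed only when the target is exactly one minor version above the current.
# Assumes both arguments look like "1.30" (no patch component).
is_single_minor_step() {
  local cur_major="${1%%.*}"
  local cur_minor="${1#*.}"
  local new_major="${2%%.*}"
  local new_minor="${2#*.}"
  [ "$cur_major" = "$new_major" ] && [ "$((new_minor - cur_minor))" -eq 1 ]
}
```

A possible use is `is_single_minor_step "$(current_eks_version analytical-platform-compute-test)" 1.31 && echo "ok to proceed"`, where 1.31 is a placeholder target.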
Upgrade the EKS Nodes
- Check the `eks_node_version` from `environment-configuration.tf` against the Bottlerocket changelog to see if a new version is available.
- If so, the `eks_node_version` is formed from `'${BOTTLEROCKET_OS_RELEASE}-${FIRST_EIGHT_CHARACTERS_OF_RELEASE_SHA}'`, e.g. 12.5.0-388e1050. Tip: go to the Bottlerocket Releases page and find the `latest` release. To the left is the link to the commit; follow this and look at the commit URL, for example `https://github.com/bottlerocket-os/bottlerocket/commit/388e1050a669dd2544007f2af336832b68fa0d64`, and copy the first eight characters of the SHA, in this case `388e1050`.
- Update `eks_node_version` in `environment-configuration.tf` with the value from above.
- Commit and push your results to the branch.
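The way the node version string is put together can be sketched as follows, using the release and commit URL from the example above. The helper names are hypothetical and only illustrate the string handling.

```shell
#!/usr/bin/env bash
# Take the last path segment of a commit URL, which is the full commit SHA.
sha_from_commit_url() {
  echo "${1##*/}"
}

# Join the Bottlerocket OS release and the first eight characters of the SHA.
eks_node_version() {
  local release="$1" sha="$2"
  echo "${release}-$(printf '%s' "$sha" | cut -c1-8)"
}

# Example, using the values quoted in the steps above:
# sha="$(sha_from_commit_url "https://github.com/bottlerocket-os/bottlerocket/commit/388e1050a669dd2544007f2af336832b68fa0d64")"
# eks_node_version 12.5.0 "$sha"
```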
Upgrade the EKS add-ons
- Run the following command and interpret the results to understand what version the add-ons should be upgraded to: `aws eks describe-addon-versions > file.txt`. Search the file for each add-on name, e.g. `aws-ebs-csi-driver`; the version is in the field `addonVersion`.
- Check the `addonVersion` against the appropriate values in the `eks_cluster_addon_version` block in `terraform/environments/analytical-platform-compute/environment-configuration.tf` and amend if needed. There are 3 blocks, one for each environment.
- Commit and push your results to the branch.
Source: Describe EKS Add-on versions
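As an alternative to searching the full output file, the AWS CLI can filter the result to a single add-on and Kubernetes version. This is a minimal sketch; the function name is hypothetical, and the Kubernetes version you pass should match the `eks_cluster_version` for the environment.

```shell
#!/usr/bin/env bash
# List the addonVersion values offered for one add-on on one Kubernetes version.
list_addon_versions() {
  local addon="$1" k8s_version="$2"
  aws eks describe-addon-versions \
    --addon-name "$addon" \
    --kubernetes-version "$k8s_version" \
    --query 'addons[].addonVersions[].addonVersion' \
    --output text
}
```

For example, `list_addon_versions aws-ebs-csi-driver 1.30` (1.30 is a placeholder) prints the candidate versions to compare against the `eks_cluster_addon_version` block.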
Patch Terraform modules
Patching is a manual process. This means you will have to check each module in each file as follows.
- Open each `.tf` file in the `terraform/environments/analytical-platform-compute` directory.
- Check each module, e.g. `source = "terraform-aws-modules/eks/aws//modules/karpenter"` in `eks-cluster.tf`, and `cmd + click` to follow the link.
- Also check any `helm_release`, for example in `helm-charts-system.tf`, for any new versions.
- Amend the version if appropriate.
- Commit and push your results to the branch.
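To get an overview before opening each file, the `source`, `version` and `chart` lines can be listed in one pass. This sketch only finds the lines to review; checking for newer releases remains a manual step. The function name is hypothetical, and the default directory assumes you are at the repository root.

```shell
#!/usr/bin/env bash
# Print every source, version and chart assignment in the .tf files of a
# directory, prefixed with the file name and line number.
list_module_versions() {
  local dir="${1:-terraform/environments/analytical-platform-compute}"
  grep -HnE '^[[:space:]]*(source|version|chart)[[:space:]]*=' "$dir"/*.tf
}
```

Each output line has the form `file:line:content`, which makes it straightforward to jump to the module or `helm_release` that needs checking.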
Applying/Releasing the Changes
Once the Terraform plan has been checked and is as expected, the changes can be applied by the workflow. This needs approving via Review pending deployments on the apply job for the environment.
Development - The changes can be applied prior to the Pull Request approval.
Test - If the apply in the Development environment has completed as expected, the changes can be applied, and this can also be carried out prior to Pull Request approval. This should be carried out in the agreed maintenance window for Test.
Production - If the apply in the Test environment has completed as expected, seek approval for the Pull Request and merge into main. The changes can then be applied by approving the workflow apply process. This should be carried out in the agreed maintenance window for Production.