
Analytical Platform Compute Maintenance

On the first day of each month, the workflow schedule-issue-compute-infrastructure.yml automatically raises a ticket, for example Maintenance - Analytical Platform Compute.

This maintenance ticket covers an EKS cluster upgrade, if applicable, and patching to ensure all components are up to date.

Check for new release

Check if a new release of Amazon EKS has been made available here.

Upgrade and patch the EKS Control Plane, EKS Nodes, EKS add-ons and all components where new releases are available.

The Approach

  1. Create a new branch.
  2. Make changes to the code.
  3. Raise one Pull Request for all the changes.
  4. Check the Terraform plan is as expected for each environment.
  5. Request approval.
  6. Once approved, release the apply workflow gate for each environment in turn, testing before proceeding to the next environment.

Order

Apply to the Development, Test and Production environments in that order, resolving any issues before progressing to the next.

Working on Analytical Platform Compute in Modernisation Platform

Be aware that, due to restrictions with the state file, multiple people cannot work on the same environment at once. Running a Terraform plan is fine, but once you have released the apply workflow the state file cannot be used by anyone else until the changes are complete. This only affects the environment in which you are carrying out the apply.

Make the team aware and check before starting the work to avoid conflicts.

Workflows

Once you create a Pull Request in the Modernisation Platform Environments repository, a workflow is triggered to carry out the required checks. If you subsequently push changes to the branch, you will need to go into GitHub Actions and cancel the previous workflow so the new one can start.
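As an alternative to the GitHub Actions web interface, the cancellation can be sketched with the GitHub CLI. This is a minimal sketch that assumes gh is installed, authenticated and run from inside a checkout of the repository. The function name cancel_branch_runs is hypothetical. It cancels every in-progress run on the branch, so check the output of gh run list first if the new run may already have started.

```shell
# Hypothetical helper: cancel in-progress workflow runs for one branch.
# Assumes the GitHub CLI (gh) is installed and authenticated.
cancel_branch_runs() {
  branch="$1"
  # List the numeric IDs of runs on the branch that are still in progress.
  gh run list --branch "$branch" --status in_progress \
    --json databaseId --jq '.[].databaseId' |
    while read -r run_id; do
      # Cancel each run; stdin is detached so the loop input is not consumed.
      gh run cancel "$run_id" </dev/null
    done
}

# Example (branch name is a placeholder):
#   cancel_branch_runs my-maintenance-branch
```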

Assumptions

  • You are operating in the modernisation-platform-environments repository Development Container.
  • To interrogate the cluster, you are exec’d into the same account as the cluster you are operating on, for example: aws-sso exec --profile analytical-platform-compute-test:modernisation-platform-developer.
  • Use account modernisation-platform-developer for Test and Production and modernisation-platform-sandbox for Development.
  • If necessary, update ~/.kube/config as follows: aws eks update-kubeconfig --region eu-west-2 --name analytical-platform-compute-test.
  • Set the context as follows: kubectl config use-context arn:aws:eks:eu-west-2:767397661611:cluster/analytical-platform-compute-test.

Note: amend the commands above as appropriate for the environment you are working in.
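Because the commands above differ only by environment, the names can be derived from a single argument. The following is a minimal sketch, assuming the profile and cluster names for Development and Production follow the same pattern as the Test examples. The function names apc_profile and apc_context_arn are hypothetical, and the AWS account ID must be supplied by the caller because only the Test ID appears on this page.

```shell
# Hypothetical helpers: derive per-environment names from the patterns above.

# AWS SSO profile for an environment (development | test | production).
apc_profile() {
  env_name="$1"
  role="modernisation-platform-developer"
  # Development uses the sandbox role, per the assumptions listed above.
  if [ "$env_name" = "development" ]; then
    role="modernisation-platform-sandbox"
  fi
  printf 'analytical-platform-compute-%s:%s\n' "$env_name" "$role"
}

# kubectl context ARN for an environment and AWS account ID.
apc_context_arn() {
  env_name="$1"
  account_id="$2"
  printf 'arn:aws:eks:eu-west-2:%s:cluster/analytical-platform-compute-%s\n' \
    "$account_id" "$env_name"
}

# Example, using the Test values from this page:
#   kubectl config use-context "$(apc_context_arn test 767397661611)"
```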

Impact on Users

As this is a live service, there could be an impact on users, so this must be taken into consideration when planning the work.

The impact on users depends on what is planned to be upgraded/patched.

For Example

If upgrading the CloudWatch Logs agent, the user impact is minimal: applications will run, logs might be delayed, and you will not require a maintenance window.

If upgrading Karpenter, the user impact is potentially higher because jobs might not schedule as expected, so you will have to agree when to schedule a maintenance window.

Schedule a Maintenance Window

To schedule a maintenance window for Test and Production, go to the PagerDuty Maintenance Page and use the Post Maintenance button.

Example Pull Requests

Upgrade the EKS Control Plane

  1. Update the eks_cluster_version to the new version in terraform/environments/analytical-platform-compute/environment-configuration.tf.
  2. Commit and push your results to the branch.

Upgrade the EKS Nodes

  1. Check the eks_node_version from environment-configuration.tf against the Bottlerocket changelog to see if a new version is available.
  2. If so, form the new eks_node_version as '${BOTTLEROCKET_OS_RELEASE}-${FIRST_EIGHT_CHARACTERS_OF_RELEASE_SHA}', for example 12.5.0-388e1050. Tip: go to the Bottlerocket Releases page and find the latest release. To the left of it is a link to the commit. Follow this and look at the commit URL, for example https://github.com/bottlerocket-os/bottlerocket/commit/388e1050a669dd2544007f2af336832b68fa0d64, then copy the first eight characters of the SHA, in this case 388e1050.
  3. Update eks_node_version in environment-configuration.tf with the value from above.
  4. Commit and push your results to the branch.
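The version string from step 2 can be built mechanically from the release number and the full commit SHA. This is a minimal sketch using the example values from this page. The function name eks_node_version is hypothetical and only mirrors the Terraform variable name.

```shell
# Hypothetical helper: build eks_node_version from a Bottlerocket release
# number and the full commit SHA of that release.
eks_node_version() {
  release="$1"
  sha="$2"
  # %.8s keeps only the first eight characters of the SHA.
  printf '%s-%.8s\n' "$release" "$sha"
}

eks_node_version 12.5.0 388e1050a669dd2544007f2af336832b68fa0d64
# prints 12.5.0-388e1050
```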

Upgrade the EKS add-ons

  1. Run the following command and interpret the results to understand which version each add-on should be upgraded to: aws eks describe-addon-versions > file.txt. Search the file for each add-on name, for example aws-ebs-csi-driver. The version is in the addonVersion field.
  2. Check the addonVersion against the corresponding values in the eks_cluster_addon_version block in terraform/environments/analytical-platform-compute/environment-configuration.tf and amend if needed. There are 3 blocks, one for each environment.
  3. Commit and push your results to the branch.

Source: Describe EKS Add-on versions
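Instead of searching the full output file, the same command can be narrowed to one add-on and one Kubernetes version. This is a minimal sketch, assuming valid AWS credentials are available in the shell. The function name addon_versions is hypothetical, and the Kubernetes version in the example is a placeholder, not the version the clusters actually run.

```shell
# Hypothetical helper: list the available versions of one EKS add-on
# for a given Kubernetes version.
addon_versions() {
  addon_name="$1"
  k8s_version="$2"
  aws eks describe-addon-versions \
    --addon-name "$addon_name" \
    --kubernetes-version "$k8s_version" \
    --query 'addons[].addonVersions[].addonVersion' \
    --output text
}

# Example (needs AWS credentials; 1.30 is a placeholder version):
#   addon_versions aws-ebs-csi-driver 1.30
```

Compare the versions returned with the values in the eks_cluster_addon_version block for the environment.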

Patch Terraform modules

Patching is a manual process. This means you will have to check each module in each file as follows.

  1. Open each .tf file in the terraform/environments/analytical-platform-compute directory.
  2. Check each module, for example source = "terraform-aws-modules/eks/aws//modules/karpenter" in eks-cluster.tf, and cmd + click to follow the link.
  3. Also check any helm_release, for example in helm-charts-system.tf, for new versions.
  4. Amend the version if appropriate.
  5. Commit and push your results to the branch.
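To avoid missing a module or chart while opening each file by hand, the pinned sources and versions can be listed in one pass. This is a minimal sketch, assuming modules and helm_release resources declare source, chart and version as simple one-line assignments. The function name list_version_pins is hypothetical, and the output still has to be compared against upstream releases manually.

```shell
# Hypothetical helper: print every source, chart and version assignment
# in the .tf files of a directory, prefixed with file name and line number.
list_version_pins() {
  dir="$1"
  grep -H -n -E '^[[:space:]]*(source|chart|version)[[:space:]]*=' "$dir"/*.tf
}

# Example:
#   list_version_pins terraform/environments/analytical-platform-compute
```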

Applying/Releasing the Changes

Once the Terraform plan has been checked and is as expected, the changes can be applied by the workflow. This needs approving via Review pending deployments on the apply job for the environment.

  • Development - The changes can be applied prior to the Pull Request approval.

  • Test - If the apply in the Development environment has completed as expected, the changes can be applied. This can also be carried out prior to Pull Request approval and should be done in the agreed maintenance window for Test.

  • Production - If the apply in the Test environment has completed as expected, seek approval for the Pull Request and merge into main. The changes can then be applied by approving the workflow apply process. This should be carried out in the agreed maintenance window for Production.

This page was last reviewed on 28 October 2024. It needs to be reviewed again on 28 January 2025 by the page owner #analytical-platform-notifications .