NDA-Safe Case Study: Multi-AZ EKS with GitOps and Karpenter for a UK High-Street Bank

Zero-downtime EKS architecture for a UK high-street bank's digital operations

A bank-hardened Kubernetes platform built for 99.99% resilience: it autonomously detects hardware degradation, migrates workloads across three Availability Zones, and generates audit evidence by default.

Key Outcomes
3 AZs
Multi-Availability Zone topology spread enforced by default across independent fault domains
99.99%
Platform uptime target achieved with fully automated node recovery
40%
Compute cost reduction via right-sizing, Spot automation and scale-to-zero
< 60s
Automated pod rescheduling on hardware degradation, with no human intervention required
~5mo
Engagement length
0
Public control-plane endpoints — a multi-AZ regulated VPC with full fault isolation
CIS-K8s
Benchmark applied — bank-grade audit trail
40%
Compute cost reduction via Karpenter & Spot automation
Live
In production — self-healing platform fully operational
Executive Summary

Measurable reliability improvements, not marketing claims.

In regulated banking, downtime is never a purely technical incident. It is a direct regulatory event and an immediate risk to customer trust. The objective was to engineer a platform where automation serves as the primary line of defence, entirely removing engineers from the incident-response loop for routine hardware events and directing their attention towards higher-value platform work instead.

| Capability | Before | After (Stratus Self-Healing EKS) |
|---|---|---|
| Hardware Degradation | Failing EC2 instances cause P1 outages requiring manual SRE cordoning and war-room responses. | The automated node termination handler detects degradation, gracefully evicts pods and replaces the node in under 60 seconds. |
| Zone Resiliency | Inconsistent pod placement leads to single-AZ concentration risk and exposure to data-centre failure. | Kubernetes Topology Spread Constraints enforce pod distribution across three distinct fault domains by default. |
| Scaling Model | Static node scaling causes waste during off-peak and sluggish scale-out during transaction spikes. | Dual-layer just-in-time scaling: HPA for instant pod creation plus Karpenter for right-sized compute injection in seconds. |
| Change Governance | Manual kubectl applies and ad-hoc console changes introduce high configuration drift and audit risk. | ArgoCD/Flux continuously reconciles live state against Git; manual mutations are instantly overwritten by the GitOps engine. |
| Cost Efficiency | Compute is paid for continuously at peak-headroom capacity that sits idle outside transaction windows. | Automated downscaling and Spot-backed right-sizing reduce baseline waste by ~40% with no sacrifice of resilience. |
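The zone-resiliency behaviour described above is enforced declaratively on each workload. A minimal sketch of the relevant Deployment fragment follows; the workload name `payments-api` and replica count are illustrative, not taken from the engagement:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api        # illustrative workload name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                           # AZs may differ by at most one pod
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule     # hard constraint, not best-effort
          labelSelector:
            matchLabels:
              app: payments-api
      containers:
        - name: api
          image: payments-api:1.0.0            # placeholder image
```

With `maxSkew: 1` and `DoNotSchedule`, the scheduler refuses any placement that would concentrate pods in one zone, which is what turns a single-AZ failure into a capacity reduction rather than an outage.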
Strategic Architecture

The Autonomous Control Loop

Every change flows through Git as the single source of truth, reconciled continuously by the GitOps engine and enforced autonomously by the Kubernetes control plane across three independent fault domains.

Node Health Monitoring
When underlying hardware degrades, the system autonomously intercepts the AWS event, cordons the node, and shifts traffic to healthy AZs faster than any human can respond.
Topology Spread
The scheduler is strictly constrained to balance critical microservices across three data centres, ensuring a single physical failure never causes an outage or a degraded user experience.
GitOps Enforced
The GitOps engine continuously reconciles live cluster state against Git. Manual console changes are automatically overwritten, keeping the audit trail immutable.
flowchart LR
    Dev[Platform & App Teams] -->|Code & Config PRs| Git[Git Repository]
    Git -->|Reconciles State| GitOps[GitOps Engine\nArgoCD / Flux]

    subgraph Control [EKS Control Plane]
        direction TB
        API[EKS API Server]
        Policy[Policy-as-Code\nKyverno / OPA]
        API --> Policy
    end

    GitOps --> API

    subgraph Scaling [Autoscaling & Recovery]
        direction TB
        HPA[Pod Autoscaler\nHPA]
        Health[Node Health Monitor\nAWS NTH]
        Karpenter[Karpenter Provisioner]
    end

    Policy --> HPA
    Policy --> Health
    Health -->|Degradation Event| Karpenter

    subgraph Compute [Hardened Compute Layer]
        direction TB
        EC2[Bottlerocket Nodes\nFIPS + CIS Hardened]
        Pods[Container Workloads]
    end

    Karpenter -->|Injects| EC2
    HPA -->|Traffic Spike| Pods
    Pods -.->|Topology Spread| EC2

    subgraph AZs [Physical Fault Isolation]
        direction TB
        AZ1[Availability Zone A]
        AZ2[Availability Zone B]
        AZ3[Availability Zone C]
    end

    EC2 ==>|Distributes across| AZs

    style Dev fill:#1e1e2e,stroke:#475569,color:#ffffff
    style Git fill:#1a1a2e,stroke:#7c3aed,color:#ffffff
    style GitOps fill:#6b21a8,stroke:#ffffff,stroke-width:2px,color:#ffffff
    style API fill:#4c1d95,stroke:none,color:#fff
    style Policy fill:#4c1d95,stroke:none,color:#fff
    style HPA fill:#1a1a2e,stroke:#7c3aed,color:#ffffff
    style Health fill:#1a1a2e,stroke:#7c3aed,color:#ffffff
    style Karpenter fill:#1a1a2e,stroke:#7c3aed,color:#ffffff
    style EC2 fill:#1a1a2e,stroke:#7c3aed,stroke-width:2px,color:#ffffff
    style Pods fill:#1a1a2e,stroke:#7c3aed,color:#ffffff
    style AZ1 fill:#0B0C10,stroke:#475569,color:#ffffff
    style AZ2 fill:#0B0C10,stroke:#475569,color:#ffffff
    style AZ3 fill:#0B0C10,stroke:#475569,color:#ffffff
    style Control fill:#1e1e2e,stroke:#475569,stroke-width:1px,color:#ffffff
    style Scaling fill:#1e1e2e,stroke:#475569,stroke-width:1px,color:#ffffff
    style Compute fill:#1e1e2e,stroke:#7c3aed,stroke-width:1px,color:#ffffff
    style AZs fill:#0B0C10,stroke:#475569,stroke-width:1px,color:#ffffff


Technology Stack

The Self-Healing Platform Stack

Individual components are disclosed to demonstrate delivery depth, whilst network topology and account structure remain NDA-protected.

Pillar One
Multi-AZ Platform Core
  • Multi-AZ Amazon EKS deployed privately across 3 fault domains with zero public control-plane exposure
  • AWS Node Termination Handler (NTH) intercepting EC2 maintenance events for graceful pod eviction before hardware retires
  • Karpenter Provisioning bypassing slow Auto Scaling Groups to inject right-sized, compliant nodes in seconds
  • Horizontal Pod Autoscaler (HPA) scaling container replicas from real-time demand signals
  • GitOps Engine (ArgoCD / Flux) serving as single source of truth, eliminating configuration drift entirely
  • VPC Endpoints + Transit Gateway routing private connectivity with no internet exposure
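The Karpenter behaviour in Pillar One can be sketched as a NodePool definition. This is a hedged illustration against the Karpenter v1 API, not the bank's actual configuration: the `eu-west-2` zones, CPU limit, and the `bottlerocket` EC2NodeClass name are assumptions.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]        # Spot preferred, On-Demand as fallback
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["eu-west-2a", "eu-west-2b", "eu-west-2c"]  # assumed London region
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: bottlerocket                     # hypothetical class using Bottlerocket AMIs
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m                       # reclaim idle capacity quickly
  limits:
    cpu: "1000"                                # hard ceiling on provisioned vCPU
```

Because Karpenter provisions nodes directly from pending-pod requirements rather than through Auto Scaling Groups, right-sized capacity lands in seconds, and the consolidation policy drives the scale-to-zero behaviour cited in the outcomes.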
Pillar Two
Continuous Security
  • Zero-trust mTLS service-to-service encryption isolating transaction boundaries inside the cluster
  • Kyverno / OPA Admission Controllers rejecting privileged or non-compliant pods at the API server level
  • FIPS-aligned KMS Encryption covering all volumes and secrets data-at-rest
  • Bottlerocket OS minimising attack surface area and preventing SSH-based access to worker nodes
  • IRSA (IAM Roles for Service Accounts) granting granular pod-level AWS permissions with no node-level secrets
  • CIS Kubernetes Benchmark applied via automated scanning and admission control policies
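The admission-control posture above can be illustrated with a minimal Kyverno ClusterPolicy that rejects privileged pods at the API server. This is a generic sketch of the technique, not the engagement's actual ruleset:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged-containers
spec:
  validationFailureAction: Enforce   # reject at admission rather than merely audit
  background: true                   # also report on pre-existing resources
  rules:
    - name: privileged-containers
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Privileged containers are not permitted."
        pattern:
          spec:
            containers:
              # =() means: if securityContext is set, privileged must be false
              - =(securityContext):
                  =(privileged): "false"
```

In `Enforce` mode the API server returns the policy's message to the client at admission time, so a non-compliant pod never reaches a node and the rejection itself becomes audit evidence.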
Operational Model

From reactive firefighting to proactive platform engineering

The engagement restructured the bank's operational model around platform ownership rather than incident response, freeing SRE capacity from routine hardware failures.

What changed operationally

Platform engineers manage self-healing parameters and cluster health. Application teams own application logic. On-call engineers are no longer paged for routine node failures.

  • Team Ownership Platform team owns the self-healing parameters; app teams are fully decoupled from infrastructure concerns.
  • Zero-Downtime Upgrades Blue/Green node pool rollouts ensure control plane and worker nodes update without impacting live traffic.
  • Reduced Toil Automated remediation for node failures reduces manual SRE pager alerts by over 80% in the first month.
  • Immutable Baseline Every cluster configuration is stored in Git, reviewed via PR, and enforced by the GitOps engine.
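The "manual mutations are overwritten" guarantee comes from the GitOps engine's self-heal setting. A hedged ArgoCD sketch follows; the application name, repository URL, and paths are hypothetical placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-baseline            # illustrative application name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.bank/platform/cluster-config.git  # hypothetical repo
    targetRevision: main
    path: clusters/production
  destination:
    server: https://kubernetes.default.svc
    namespace: platform
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual kubectl/console mutations to the Git state
    syncOptions:
      - CreateNamespace=true
```

With `selfHeal: true`, any drift detected between the live cluster and Git is reconciled automatically, which is what keeps the Git history an immutable audit trail of every effective change.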
Deliverables

Stratus delivered a production-hardened platform with full documentation and operating model handover.

  • Multi-AZ VPC network topology with VPC Endpoints and Transit Gateway private routing
  • Terraform module library: EKS cluster, hardened node pools, IAM IRSA, security groups
  • Self-healing engine: Karpenter + Node Termination Handler configuration and tuning
  • GitOps pipelines, Kyverno policies, admission controller rulesets, and DR runbooks
  • Observability stack: CloudWatch Container Insights, Prometheus + Grafana dashboards
FinOps

Optimised for peak, priced for off-peak

The platform was architected so that peak-grade compute capacity is provisioned only when demand genuinely warrants it, with automated scale-down returning the environment to a lean, cost-efficient baseline the moment transaction windows close.

What changed financially

Instead of running at peak capacity 24/7, the platform scales up during transaction spikes and scales down immediately after, reducing idle compute spend without sacrificing a single SLA.

  • Right-sizing Pods request realistic CPU and memory, autoscaling strictly from measured demand signals.
  • Karpenter Spot Compute provisioned just-in-time via a Spot + On-Demand mix strategy, eliminating permanently idle node pools.
  • Cost Tagging Kubernetes cost allocation tagging aligned to specific business services for accountability and chargeback.
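The right-sizing discipline above pairs realistic resource requests with demand-driven replica scaling. A minimal HPA sketch (autoscaling/v2 API) follows; the target name and thresholds are illustrative assumptions:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api                 # illustrative, matches a hypothetical Deployment
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  minReplicas: 3                     # one pod per AZ at the off-peak baseline
  maxReplicas: 30                    # headroom for transaction-window spikes
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70     # scale out when measured CPU exceeds 70%
```

The HPA adds pods within seconds of a demand signal; when those pods cannot be scheduled on existing nodes, Karpenter injects capacity, and both layers contract again after the spike, which is the "scale for peak, pay for baseline" model in practice.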
Outcome

A measurable reduction in compute waste during off-peak periods, while maintaining full production resilience under peak load.

  • ~40% reduction in wasted capacity Achieved through consistent right-sizing and scaling discipline across all production workloads.
  • Scale for peak, pay for baseline Elasticity is fully automated. The platform expands and contracts without any manual intervention.
  • Fewer incidents, reduced spend A stronger operational resilience posture that engineering leadership and the CFO can both present to the board.
Cloud Risk Assessment

How Resilient Is Your Cloud Platform?

This bank invested in enterprise-grade resilience. Where does your platform stand? Run the CRRI™ diagnostic and receive your reliability score, risk band, and executive report instantly.