# Zero-downtime EKS architecture for a UK high-street bank's digital operations
A bank-hardened Kubernetes platform built for 99.99% availability: it autonomously detects hardware degradation, migrates workloads across three Availability Zones, and generates audit evidence by default.
Measurable reliability improvements, not marketing claims.
In regulated banking, downtime is never a purely technical incident. It is a direct regulatory event and an immediate risk to customer trust. The objective was to engineer a platform where automation serves as the primary line of defence, removing engineers entirely from the incident-response loop for routine hardware events and directing their attention towards higher-value platform work instead.
| Capability | Before | After (Stratus Self-Healing EKS) |
|---|---|---|
| Hardware Degradation | Failing EC2 instances cause P1 outages requiring manual SRE cordoning and war-room responses. | Automated node termination handler detects degradation, gracefully evicts pods and replaces the node in under 60 seconds. |
| Zone Resiliency | Inconsistent pod placement leads to single-AZ concentration risk and exposure to datacentre failure. | Kubernetes Topology Spread Constraints enforce pod distribution across 3 distinct fault domains by default (see the manifest sketch below the table). |
| Scaling Model | Static node scaling causes waste during off-peak and sluggish scale-out during transaction spikes. | Dual-layer JIT scaling: HPA for instant pod creation + Karpenter for right-sized compute injection in seconds. |
| Change Governance | Manual kubectl applies and ad-hoc console changes introduce high configuration drift and audit risk. | ArgoCD/Flux continuously reconciles live state against Git. Manual mutations are instantly overwritten by the GitOps engine. |
| Cost Efficiency | Full peak-headroom compute paid for continuously, even while it sits idle outside transaction windows. | Automated downscaling and Spot-backed right-sizing reduce baseline waste by ~40% with zero resilience sacrifice. |
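As a minimal sketch of the zone-spread row above, assuming a hypothetical `payments-api` Deployment (the name, image, and replica count are illustrative): a `maxSkew: 1` constraint with `whenUnsatisfiable: DoNotSchedule` makes even distribution across the three zones a hard scheduling requirement rather than a preference.

```yaml
# Hypothetical Deployment excerpt: spread replicas evenly across zones.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api            # illustrative workload name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                         # at most one pod of imbalance between zones
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule   # hard requirement, not best-effort
          labelSelector:
            matchLabels:
              app: payments-api
      containers:
        - name: payments-api
          image: registry.example.com/payments-api:1.0.0   # placeholder image
```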
## The Autonomous Control Loop
Every change flows through Git as the single source of truth, reconciled continuously by the GitOps engine and enforced autonomously by the Kubernetes control plane across three independent fault domains.
```mermaid
flowchart LR
Dev[Platform & App Teams] -->|Code & Config PRs| Git[Git Repository]
Git -->|Reconciles State| GitOps[GitOps Engine\nArgoCD / Flux]
subgraph Control [EKS Control Plane]
direction TB
API[EKS API Server]
Policy[Policy-as-Code\nKyverno / OPA]
API --> Policy
end
GitOps --> API
subgraph Scaling [Autoscaling & Recovery]
direction TB
HPA[Pod Autoscaler\nHPA]
Health[Node Health Monitor\nAWS NTH]
Karpenter[Karpenter Provisioner]
end
Policy --> HPA
Policy --> Health
Health -->|Degradation Event| Karpenter
subgraph Compute [Hardened Compute Layer]
direction TB
EC2[Bottlerocket Nodes\nFIPS + CIS Hardened]
Pods[Container Workloads]
end
Karpenter -->|Injects| EC2
HPA -->|Traffic Spike| Pods
Pods -.->|Topology Spread| EC2
subgraph AZs [Physical Fault Isolation]
direction TB
AZ1[Availability Zone A]
AZ2[Availability Zone B]
AZ3[Availability Zone C]
end
EC2 ==>|Distributes across| AZs
style Dev fill:#1e1e2e,stroke:#475569,color:#ffffff
style Git fill:#1a1a2e,stroke:#7c3aed,color:#ffffff
style GitOps fill:#6b21a8,stroke:#ffffff,stroke-width:2px,color:#ffffff
style API fill:#4c1d95,stroke:none,color:#fff
style Policy fill:#4c1d95,stroke:none,color:#fff
style HPA fill:#1a1a2e,stroke:#7c3aed,color:#ffffff
style Health fill:#1a1a2e,stroke:#7c3aed,color:#ffffff
style Karpenter fill:#1a1a2e,stroke:#7c3aed,color:#ffffff
style EC2 fill:#1a1a2e,stroke:#7c3aed,stroke-width:2px,color:#ffffff
style Pods fill:#1a1a2e,stroke:#7c3aed,color:#ffffff
style AZ1 fill:#0B0C10,stroke:#475569,color:#ffffff
style AZ2 fill:#0B0C10,stroke:#475569,color:#ffffff
style AZ3 fill:#0B0C10,stroke:#475569,color:#ffffff
style Control fill:#1e1e2e,stroke:#475569,stroke-width:1px,color:#ffffff
style Scaling fill:#1e1e2e,stroke:#475569,stroke-width:1px,color:#ffffff
style Compute fill:#1e1e2e,stroke:#7c3aed,stroke-width:1px,color:#ffffff
style AZs fill:#0B0C10,stroke:#475569,stroke-width:1px,color:#ffffff
```
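As one concrete expression of that reconciliation loop, here is a hedged ArgoCD `Application` sketch; the repository URL, paths, and names are placeholders, and a Flux `Kustomization` would express the same intent. `selfHeal: true` is the setting that causes manual mutations to be overwritten by the Git-defined state.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-baseline        # illustrative application name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/cluster-config.git   # placeholder repo
    targetRevision: main
    path: clusters/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: platform
  syncPolicy:
    automated:
      prune: true      # remove live resources that were deleted from Git
      selfHeal: true   # revert out-of-band kubectl/console changes automatically
```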
## The Self-Healing Platform Stack
Individual components are disclosed to demonstrate delivery depth, whilst network topology and account structure remain NDA-protected.
- Multi-AZ Amazon EKS deployed privately across 3 fault domains with zero public control-plane exposure
- AWS Node Termination Handler (NTH) intercepting EC2 maintenance events for graceful pod eviction before hardware retires
- Karpenter Provisioning bypassing slow Auto Scaling Groups to inject right-sized, compliant nodes in seconds
- Horizontal Pod Autoscaler (HPA) scaling container replicas from real-time demand signals
- GitOps Engine (ArgoCD / Flux) serving as single source of truth, eliminating configuration drift entirely
- VPC Endpoints + Transit Gateway routing private connectivity with no internet exposure
- Zero-trust mTLS service-to-service encryption isolating transaction boundaries inside the cluster
- Kyverno / OPA Admission Controllers rejecting privileged or non-compliant pods at the API server level (see the policy sketch after this list)
- FIPS-aligned KMS Encryption covering all volumes and secrets data-at-rest
- Bottlerocket OS minimising attack surface area and preventing SSH-based access to worker nodes
- IRSA (IAM Roles for Service Accounts) granular pod-level AWS permissions with no node-level secrets
- CIS Kubernetes Benchmark applied via automated scanning and admission control policies
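A hedged sketch of the admission-control layer named above, using Kyverno's validate pattern (the policy name and message are illustrative): any Pod requesting `privileged: true` is rejected at the API server before it ever schedules.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged          # illustrative policy name
spec:
  validationFailureAction: Enforce   # reject at admission rather than merely audit
  background: true
  rules:
    - name: deny-privileged-containers
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Privileged containers are not permitted on this cluster."
        pattern:
          spec:
            containers:
              - =(securityContext):        # if a securityContext is set...
                  =(privileged): "false"   # ...privileged must be false
```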
## From reactive firefighting to proactive platform engineering
The engagement restructured the bank's operational model around platform ownership rather than incident response, freeing SRE capacity from routine hardware failures.
Platform engineers manage self-healing parameters and cluster health. Application teams own application logic. On-call engineers are no longer paged for routine node failures.
- **Team Ownership:** The platform team owns the self-healing parameters; app teams are fully decoupled from infrastructure concerns.
- **Zero-Downtime Upgrades:** Blue/Green node pool rollouts ensure control plane and worker nodes update without impacting live traffic (a disruption-budget sketch follows this list).
- **Reduced Toil:** Automated remediation for node failures reduced manual SRE pager alerts by over 80% within the first month.
- **Immutable Baseline:** Every cluster configuration is stored in Git, reviewed via PR, and enforced by the GitOps engine.
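One building block of those zero-downtime rollouts, sketched under illustrative names: a PodDisruptionBudget bounds how many replicas a node drain may evict at once, so Blue/Green node-pool swaps never drop a workload below its serving floor.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api-pdb   # illustrative name
spec:
  minAvailable: 4          # a drain may never take the workload below 4 ready pods
  selector:
    matchLabels:
      app: payments-api
```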
Stratus delivered a production-hardened platform with full documentation and operating model handover.
- Multi-AZ VPC network topology with VPC Endpoints and Transit Gateway private routing
- Terraform module library: EKS cluster, hardened node pools, IAM IRSA, security groups
- Self-healing engine: Karpenter + Node Termination Handler configuration and tuning
- GitOps pipelines, Kyverno policies, admission controller rulesets, and DR runbooks
- Observability stack: CloudWatch Container Insights, Prometheus + Grafana dashboards
## Optimised for peak, priced for off-peak
The platform was architected so that peak-grade compute capacity is provisioned only when demand genuinely warrants it, with automated scale-down returning the environment to a lean, cost-efficient baseline the moment transaction windows close.
Instead of running at peak capacity 24/7, the platform scales up during transaction spikes and scales down immediately after, reducing idle compute spend without sacrificing a single SLA.
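A hedged sketch of the pod-level half of that elasticity, assuming the `autoscaling/v2` API (workload name and thresholds are illustrative): replicas track measured CPU utilisation between a three-pod baseline, one per zone, and a peak ceiling.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api-hpa       # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  minReplicas: 3               # baseline: one pod per Availability Zone
  maxReplicas: 30              # illustrative peak ceiling
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # scale out well before saturation
```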
- **Right-sizing:** Pods request realistic CPU and memory, and autoscaling acts strictly on measured demand signals.
- **Karpenter Spot Compute:** Nodes are provisioned just-in-time from a Spot + On-Demand mix, eliminating permanently idle node pools (see the NodePool sketch after this list).
- **Cost Tagging:** Kubernetes cost-allocation tags aligned to specific business services enable accountability and chargeback.
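A hedged NodePool sketch, assuming Karpenter's v1 API (the pool name, limits, and `EC2NodeClass` reference are illustrative): listing both capacity types lets Karpenter favour Spot when it is available and fall back to On-Demand, while consolidation reclaims idle nodes after peak.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose            # illustrative pool name
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # Spot-first with On-Demand fallback
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: bottlerocket         # illustrative EC2NodeClass (Bottlerocket AMIs)
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized   # reclaim idle capacity off-peak
    consolidateAfter: 5m
  limits:
    cpu: "500"                     # illustrative hard cap on provisioned vCPU
```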
A measurable reduction in compute waste during off-peak periods, while maintaining full production resilience under peak load.
- **~40% reduction in wasted capacity:** Achieved through consistent right-sizing and scaling discipline across all production workloads.
- **Scale for peak, pay for baseline:** Elasticity is fully automated; the platform expands and contracts without any manual intervention.
- **Fewer incidents, reduced spend:** A stronger operational resilience posture that engineering leadership and the CFO can both present to the board.
## How Resilient Is Your Cloud Platform?
This bank invested in enterprise-grade resilience. Where does your platform stand? Run the CRRI™ diagnostic and receive your reliability score, risk band, and executive report instantly.