# Zero-downtime EKS architecture for a UK high-street bank's digital operations
A bank-hardened Kubernetes platform built for 99.99% availability: it autonomously detects hardware degradation, migrates workloads across three Availability Zones, and generates audit evidence by default.
Measurable reliability improvements, not marketing claims.
In regulated banking, downtime is never a purely technical incident. It is a direct regulatory event and an immediate risk to customer trust. The objective was to engineer a platform where automation serves as the primary line of defence, removing engineers entirely from the incident-response loop for routine hardware events and directing their attention towards higher-value platform work instead.
| Capability | Before | After (Stratus Self-Healing EKS) |
|---|---|---|
| Hardware Degradation | Failing EC2 instances cause P1 outages requiring manual SRE cordoning and war-room responses. | Automated node termination handler detects degradation, gracefully evicts pods and replaces the node in under 60 seconds. |
| Zone Resiliency | Inconsistent pod placement leads to single-AZ concentration risk and exposure to datacentre failure. | Kubernetes Topology Spread Constraints enforce pod distribution across 3 distinct fault domains by default (see the manifest sketch below the table). |
| Scaling Model | Static node scaling causes waste during off-peak and sluggish scale-out during transaction spikes. | Dual-layer JIT scaling: HPA for instant pod creation + Karpenter for right-sized compute injection in seconds. |
| Change Governance | Manual kubectl applies and ad-hoc console changes introduce high configuration drift and audit risk. | ArgoCD/Flux continuously reconciles live state against Git. Manual mutations are instantly overwritten by the GitOps engine. |
| Cost Efficiency | Full peak-headroom compute paid for continuously, even while it sits idle outside transaction windows. | Automated downscaling and Spot-backed right-sizing reduce baseline waste by ~40% with zero resilience sacrifice. |
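As a minimal sketch of the zone-spread row above, assuming a hypothetical `payments-api` Deployment (the name, image, and replica count are illustrative): a `maxSkew: 1` constraint with `whenUnsatisfiable: DoNotSchedule` makes even distribution across the three zones a hard scheduling requirement rather than a preference.

```yaml
# Hypothetical Deployment excerpt: spread replicas evenly across zones.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api            # illustrative workload name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                         # at most one pod of imbalance between zones
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule   # hard requirement, not best-effort
          labelSelector:
            matchLabels:
              app: payments-api
      containers:
        - name: payments-api
          image: registry.example.com/payments-api:1.0.0   # placeholder image
```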
## The Autonomous Control Loop
Every change flows through Git as the single source of truth, reconciled continuously by the GitOps engine and enforced autonomously by the Kubernetes control plane across three independent fault domains.
```mermaid
flowchart LR
Dev[Platform & App Teams] -->|Code & Config PRs| Git[Git Repository]
Git -->|Reconciles State| GitOps[GitOps Engine\nArgoCD / Flux]
subgraph Control [EKS Control Plane]
direction TB
API[EKS API Server]
Policy[Policy-as-Code\nKyverno / OPA]
API --> Policy
end
GitOps --> API
subgraph Scaling [Autoscaling & Recovery]
direction TB
HPA[Pod Autoscaler\nHPA]
Health[Node Health Monitor\nAWS NTH]
Karpenter[Karpenter Provisioner]
end
Policy --> HPA
Policy --> Health
Health -->|Degradation Event| Karpenter
subgraph Compute [Hardened Compute Layer]
direction TB
EC2[Bottlerocket Nodes\nFIPS + CIS Hardened]
Pods[Container Workloads]
end
Karpenter -->|Injects| EC2
HPA -->|Traffic Spike| Pods
Pods -.->|Topology Spread| EC2
subgraph AZs [Physical Fault Isolation]
direction TB
AZ1[Availability Zone A]
AZ2[Availability Zone B]
AZ3[Availability Zone C]
end
EC2 ==>|Distributes across| AZs
style Dev fill:#1e1e2e,stroke:#475569,color:#ffffff
style Git fill:#1a1a2e,stroke:#7c3aed,color:#ffffff
style GitOps fill:#6b21a8,stroke:#ffffff,stroke-width:2px,color:#ffffff
style API fill:#4c1d95,stroke:none,color:#fff
style Policy fill:#4c1d95,stroke:none,color:#fff
style HPA fill:#1a1a2e,stroke:#7c3aed,color:#ffffff
style Health fill:#1a1a2e,stroke:#7c3aed,color:#ffffff
style Karpenter fill:#1a1a2e,stroke:#7c3aed,color:#ffffff
style EC2 fill:#1a1a2e,stroke:#7c3aed,stroke-width:2px,color:#ffffff
style Pods fill:#1a1a2e,stroke:#7c3aed,color:#ffffff
style AZ1 fill:#0B0C10,stroke:#475569,color:#ffffff
style AZ2 fill:#0B0C10,stroke:#475569,color:#ffffff
style AZ3 fill:#0B0C10,stroke:#475569,color:#ffffff
style Control fill:#1e1e2e,stroke:#475569,stroke-width:1px,color:#ffffff
style Scaling fill:#1e1e2e,stroke:#475569,stroke-width:1px,color:#ffffff
style Compute fill:#1e1e2e,stroke:#7c3aed,stroke-width:1px,color:#ffffff
style AZs fill:#0B0C10,stroke:#475569,stroke-width:1px,color:#ffffff
```
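As one concrete expression of that reconciliation loop, here is a hedged ArgoCD `Application` sketch; the repository URL, paths, and names are placeholders, and a Flux `Kustomization` would express the same intent. `selfHeal: true` is the setting that causes manual mutations to be overwritten by the Git-defined state.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-baseline        # illustrative application name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/cluster-config.git   # placeholder repo
    targetRevision: main
    path: clusters/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: platform
  syncPolicy:
    automated:
      prune: true      # remove live resources that were deleted from Git
      selfHeal: true   # revert out-of-band kubectl/console changes automatically
```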
## The Self-Healing Platform Stack
Individual components are disclosed to demonstrate delivery depth, whilst network topology and account structure remain NDA-protected.
- Multi-AZ Amazon EKS deployed privately across 3 fault domains with zero public control-plane exposure
- AWS Node Termination Handler (NTH) intercepting EC2 maintenance events for graceful pod eviction before hardware retires
- Karpenter Provisioning bypassing slow Auto Scaling Groups to inject right-sized, compliant nodes in seconds
- Horizontal Pod Autoscaler (HPA) scaling container replicas from real-time demand signals
- GitOps Engine (ArgoCD / Flux) serving as single source of truth, eliminating configuration drift entirely
- VPC Endpoints + Transit Gateway routing private connectivity with no internet exposure
- Zero-trust mTLS service-to-service encryption isolating transaction boundaries inside the cluster
- Kyverno / OPA Admission Controllers rejecting privileged or non-compliant pods at the API server level (see the policy sketch after this list)
- FIPS-aligned KMS Encryption covering all volumes and secrets data-at-rest
- Bottlerocket OS minimising attack surface area and preventing SSH-based access to worker nodes
- IRSA (IAM Roles for Service Accounts) granular pod-level AWS permissions with no node-level secrets
- CIS Kubernetes Benchmark applied via automated scanning and admission control policies
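A hedged sketch of the admission-control layer named above, using Kyverno's validate pattern (the policy name and message are illustrative): any Pod requesting `privileged: true` is rejected at the API server before it ever schedules.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged          # illustrative policy name
spec:
  validationFailureAction: Enforce   # reject at admission rather than merely audit
  background: true
  rules:
    - name: deny-privileged-containers
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Privileged containers are not permitted on this cluster."
        pattern:
          spec:
            containers:
              - =(securityContext):        # if a securityContext is set...
                  =(privileged): "false"   # ...privileged must be false
```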
## From reactive firefighting to proactive platform engineering
The engagement restructured the bank's operational model around platform ownership rather than incident response, freeing SRE capacity from routine hardware failures.
Platform engineers manage self-healing parameters and cluster health. Application teams own application logic. On-call engineers are no longer paged for routine node failures.
- **Team Ownership:** The platform team owns the self-healing parameters; app teams are fully decoupled from infrastructure concerns.
- **Zero-Downtime Upgrades:** Blue/Green node pool rollouts ensure control plane and worker nodes update without impacting live traffic (a disruption-budget sketch follows this list).
- **Reduced Toil:** Automated remediation for node failures reduced manual SRE pager alerts by over 80% within the first month.
- **Immutable Baseline:** Every cluster configuration is stored in Git, reviewed via PR, and enforced by the GitOps engine.
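One building block of those zero-downtime rollouts, sketched under illustrative names: a PodDisruptionBudget bounds how many replicas a node drain may evict at once, so Blue/Green node-pool swaps never drop a workload below its serving floor.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api-pdb   # illustrative name
spec:
  minAvailable: 4          # a drain may never take the workload below 4 ready pods
  selector:
    matchLabels:
      app: payments-api
```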
Stratus delivered a production-hardened platform with full documentation and operating model handover.
- Multi-AZ VPC network topology with VPC Endpoints and Transit Gateway private routing
- Terraform module library: EKS cluster, hardened node pools, IAM IRSA, security groups
- Self-healing engine: Karpenter + Node Termination Handler configuration and tuning
- GitOps pipelines, Kyverno policies, admission controller rulesets, and DR runbooks
- Observability stack: CloudWatch Container Insights, Prometheus + Grafana dashboards
## Optimised for peak, priced for off-peak
The platform was architected so that peak-grade compute capacity is provisioned only when demand genuinely warrants it, with automated scale-down returning the environment to a lean, cost-efficient baseline the moment transaction windows close.
Instead of running at peak capacity 24/7, the platform scales up during transaction spikes and scales down immediately after, reducing idle compute spend without sacrificing a single SLA.
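A hedged sketch of the pod-level half of that elasticity, assuming the `autoscaling/v2` API (workload name and thresholds are illustrative): replicas track measured CPU utilisation between a three-pod baseline, one per zone, and a peak ceiling.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api-hpa       # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  minReplicas: 3               # baseline: one pod per Availability Zone
  maxReplicas: 30              # illustrative peak ceiling
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # scale out well before saturation
```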
- **Right-sizing:** Pods request realistic CPU and memory, and autoscaling acts strictly on measured demand signals.
- **Karpenter Spot Compute:** Nodes are provisioned just-in-time from a Spot + On-Demand mix, eliminating permanently idle node pools (see the NodePool sketch after this list).
- **Cost Tagging:** Kubernetes cost-allocation tags aligned to specific business services enable accountability and chargeback.
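A hedged NodePool sketch, assuming Karpenter's v1 API (the pool name, limits, and `EC2NodeClass` reference are illustrative): listing both capacity types lets Karpenter favour Spot when it is available and fall back to On-Demand, while consolidation reclaims idle nodes after peak.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose            # illustrative pool name
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # Spot-first with On-Demand fallback
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: bottlerocket         # illustrative EC2NodeClass (Bottlerocket AMIs)
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized   # reclaim idle capacity off-peak
    consolidateAfter: 5m
  limits:
    cpu: "500"                     # illustrative hard cap on provisioned vCPU
```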
A measurable reduction in compute waste during off-peak periods, while maintaining full production resilience under peak load.
- **~40% reduction in wasted capacity:** Achieved through consistent right-sizing and scaling discipline across all production workloads.
- **Scale for peak, pay for baseline:** Elasticity is fully automated; the platform expands and contracts without any manual intervention.
- **Fewer incidents, reduced spend:** A stronger operational resilience posture that engineering leadership and the CFO can both present to the board.
## How Resilient Is Your Cloud Platform?
This bank invested in enterprise-grade resilience. Where does your platform stand? Run the CRRI™ diagnostic and receive your reliability score, risk band, and executive report instantly.