NDA-safe Lottery & Gaming MLOps SageMaker EMR

The AI Factory: Module-Driven MLOps at Scale.

Engineering a modular, compliance-ready SageMaker and EMR platform for regulated analytics, designed for speed, auditability, and absolute cost control.

Key Outcomes
-60% Provisioning Time
From multi-day IT tickets to under 2 hours via automated Service Catalog vending
1-Click Self-Service
Data science teams provision compliant ML environments without raising tickets
40% FinOps Savings
ML compute spend reclaimed via Spot instance automation and lifecycle cleanup
100% Audit-Ready IAM
Governance and data isolation enforced by default through every vended workspace

"By productising infrastructure into a versioned Terraform module library, we moved model delivery from a multi-day bottleneck to a two-hour automated flow."

~4mo
Engagement length
60%
Reduction in provisioning time — days to under 2 hours
GDPR
Aligned compliance framework, internal audit-ready
40%
ML compute cost reduction via Spot automation & lifecycle cleanup
Live
In production — self-service vending fully operational
Executive Summary

From manual tickets to automated, self-service vending.

In a regulated gaming environment, data isolation and security are non-negotiable. However, the friction of manually provisioning compliant ML environments was creating multi-day deployment lags and substantial compute waste. By productising infrastructure as versioned modules, we shifted the model from reactive ticket-handling to a self-serve platform engineering paradigm where compliance is inherited, not applied after the fact.

| Workstream | Legacy State | Productised State |
| --- | --- | --- |
| Environment Provisioning | 3–5 days: manual IT tickets and handoffs with no self-service path | Under 2 hours: automated Service Catalog vending with no human in the loop |
| Security Posture | Inconsistent, manually stitched IAM roles that differed per team and environment | Standardised, module-driven IAM; security guardrails inherited by default on every workspace |
| Cost Management | Manual, reactive cleanup of idle clusters, often missed, causing budget overruns | Automated lifecycle termination; time-to-live policies on all notebooks and ephemeral clusters |
| Audit Evidence | Ad-hoc, painful log gathering for compliance reviews | Immutable IaC Git trail; pull-request history provides complete change provenance for auditors |
Strategic Architecture Overview

The Automated MLOps Vending Machine

From a data scientist requesting an environment to the automated, secure delivery of hardened SageMaker and EMR tooling. Zero manual intervention.

100% Module-Driven
Every resource, from SageMaker notebooks to global networking, is vended from the versioned Terraform module library. No console mutations permitted.
Compliance-by-Design
Security guardrails like VPC Endpoints and hardened IAM scopes are baked into the modules. Compliance is inherited automatically, not bolted on later.
No Manual Mutations
All infrastructure changes flow through Git PRs, eliminating configuration drift by design. Auditors get a complete, immutable trail of every resource state change.
flowchart LR
    DS[Data Science Teams] -->|Request Environment| Portal[AWS Service Catalog]
    Portal -->|Execute Provisioning| TF

    subgraph IaCEngine [Terraform Module Engine]
        direction TB
        TF[Terraform Root Module] --> SM_Mod[SageMaker Modules]
        TF --> EMR_Mod[EMR Studio Modules]
        TF --> Sec_Mod[Network & Security Modules]
    end

    subgraph Workspace [Hardened ML Workspace]
        direction TB
        SM_W[SageMaker Studio]
        EMR_W[Ephemeral EMR Clusters]
    end

    TF -->|Vends 100% Automated| SM_W
    TF --> EMR_W

    subgraph Gov [Automated Governance]
        direction TB
        Audit[CloudWatch Logs]
        Tags[Cost Tagging]
        Life[Lifecycle Cleanup]
    end

    subgraph Storage [Secured Data Perimeter]
        direction TB
        S3[(Secured S3 Buckets)]
        VPCE[Private VPC Endpoints]
        Pol[Hardened IAM Scopes]
    end

    SM_W --> Gov
    SM_W --> Storage
    EMR_W --> Gov
    SM_W -->|Model Deployment| API[SageMaker Inference APIs]

    style DS fill:#0B0C10,stroke:#475569,stroke-width:1px,color:#ffffff
    style Portal fill:#0B0C10,stroke:#475569,stroke-width:1px,color:#ffffff
    style TF fill:#6b21a8,stroke:#ffffff,stroke-width:2px,color:#ffffff
    style SM_Mod fill:#4c1d95,stroke:none,color:#fff
    style EMR_Mod fill:#4c1d95,stroke:none,color:#fff
    style Sec_Mod fill:#4c1d95,stroke:none,color:#fff
    style SM_W fill:#1a1a2e,stroke:#7c3aed,stroke-width:2px,color:#ffffff
    style EMR_W fill:#1a1a2e,stroke:#7c3aed,stroke-width:1px,color:#ffffff
    style API fill:#1a1a2e,stroke:#7c3aed,stroke-width:1px,color:#ffffff
    style IaCEngine fill:#1e1e2e,stroke:#475569,stroke-width:1px,color:#ffffff
    style Gov fill:#1e1e2e,stroke:#475569,stroke-width:1px,color:#ffffff
    style Storage fill:#1e1e2e,stroke:#475569,stroke-width:1px,color:#ffffff
    style Workspace fill:#1e1e2e,stroke:#7c3aed,stroke-width:1px,color:#ffffff


Architecture Overview

Compliance-Ready MLOps Stack

Aligned with regulated operating models to ensure data integrity and auditability. The Principle of Least Privilege is hardcoded into the versioned Terraform modules.

ML Platform
Core Data & ML Platform
  • Amazon SageMaker Studio and Notebooks: provisioned entirely via code, with no console-based setup permitted.
  • Amazon EMR for heavy analytics: scaled dynamically based on job queues, with Spot instance routing for fault-tolerant jobs.
  • Terraform module library: version-controlled, acting as the single source of truth for every resource in the platform.
  • AWS Service Catalog: the "Vending Machine" that lets Data Science teams self-serve without raising tickets.
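In practice, a vended workspace reduces to a single root-module call against a tagged release of the library. A minimal sketch, assuming a hypothetical module path, release tag, and input names (none confirmed by the source):

```hcl
# Hypothetical consumer of the versioned module library: the Service
# Catalog product executes this root configuration on behalf of the team.
module "sagemaker_workspace" {
  # Pinning to a release tag keeps every vend reproducible and auditable.
  source = "git::ssh://git@example.internal/platform/terraform-modules.git//sagemaker-workspace?ref=v1.4.2"

  team        = "ds-analytics"
  environment = "dev"
  vpc_id      = var.vpc_id
  subnet_ids  = var.private_subnet_ids
  kms_key_arn = var.kms_key_arn

  # Cost-allocation tags and TTLs are mandatory inputs, not an afterthought.
  cost_center = "ML-2041"
  ttl_hours   = 12
}
```

Because the `ref` is a release tag rather than a branch, the exact module version behind every workspace is recorded in Git, which is what makes the audit trail immutable.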
Governance Layer
Security & Governance
  • Private Networking: VPC endpoints ensuring no data or model artefacts traverse the public internet at any point.
  • S3 Access Points: Per-team delegation eliminates monolithic bucket-policy sprawl and over-permissive data access.
  • Encryption-at-rest: Enforced KMS keys for all EBS volumes, EMR clusters, S3 data, and SageMaker model artefacts.
  • Hardened Sessions: Tightened IAM session boundaries for experimental zones. No persistent long-lived credentials.
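The data perimeter described above can be sketched in two resources. This is an illustrative fragment, not the production module; resource names, region, and variables are assumptions:

```hcl
# A gateway endpoint keeps S3 traffic on the AWS network,
# so data and model artefacts never traverse the public internet.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.eu-west-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.private_route_table_ids
}

# A per-team access point replaces sprawling statements in one monolithic
# bucket policy; requests are only honoured from inside the workspace VPC.
resource "aws_s3_access_point" "team" {
  bucket = aws_s3_bucket.data.id
  name   = "ds-analytics-ap"

  vpc_configuration {
    vpc_id = var.vpc_id
  }
}
```

Delegating access decisions to per-team access points is what prevents the central bucket policy from growing into an unauditable list of exceptions.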
Operating Model

Platform Ownership & Deliverables

Transitioning the organisation from reactive management to a proactive platform model with clear ownership boundaries.

Operating Model
The Operational Model
  • Ownership: Platform team owns the "Module Vending Machine"; Data Science teams own the machine learning models — a clean separation of concerns.
  • Change Flow: All modules are versioned, peer-reviewed, and promoted via environments (dev → staging → prod) before reaching production.
  • Cost Cadence: Automated monthly reporting on ML unit economics and waste reduction delivered to engineering leadership as a FinOps dashboard.
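Under this change flow, "promotion" is nothing more than a peer-reviewed bump of a pinned version in a Git PR. A sketch under assumed file layout and tag names:

```hcl
# environments/dev/main.tf
# Dev tracks a release candidate of the module under evaluation.
module "emr_studio" {
  source = "git::ssh://git@example.internal/platform/terraform-modules.git//emr-studio?ref=v2.1.0-rc1"
}
```

```hcl
# environments/prod/main.tf
# Production only ever pins tags that have already survived dev and staging.
module "emr_studio" {
  source = "git::ssh://git@example.internal/platform/terraform-modules.git//emr-studio?ref=v2.0.3"
}
```

The diff between the two files is the entire promotion artefact, which is why the PR history doubles as audit evidence.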
Deliverables
What We Handed Over
  • Terraform module library: SageMaker, EMR Studio, VPC, and IAM modules with versioned release tags.
  • Secure data perimeter: S3 access points, VPC endpoints, and hardened bucket policies per team scope.
  • FinOps engine: Spot instance scheduling, lifecycle cleanup jobs, and tagging compliance enforcement.
  • Operating model guardrails: Runbooks, IAM session boundaries, and a Data Science onboarding guide.
FinOps & Automation

Reclaiming 40% of ML Compute Spend

Machine learning experimentation is notoriously expensive. We engineered automated cost controls into the baseline platform to prevent budget overrun before it happens.

What Changed
Automated Cost Controls
  • Spot Orchestration: Automated routing of non-critical training jobs to deeply discounted EC2 Spot instances. Significant savings with zero manual intervention.
  • Lifecycle Hooks: Python-based cleanup logic targeting idle SageMaker notebooks and orphaned EBS volumes that were previously left running indefinitely.
  • Tagging Compliance: Infrastructure cannot be vended without strict cost-allocation tags. Every resource is attributable to a team and project from day one.
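The Spot and TTL guardrails can be expressed directly in the EMR module. An abridged, illustrative sketch (cluster name, capacities, and variables are assumptions; master fleet and EC2 attributes omitted for brevity):

```hcl
resource "aws_emr_cluster" "training" {
  name          = "ds-analytics-training"
  release_label = "emr-6.15.0"
  applications  = ["Spark"]
  service_role  = var.emr_service_role

  # Fault-tolerant jobs run on Spot; if capacity is unavailable the fleet
  # falls back to On-Demand instead of stalling the job queue.
  core_instance_fleet {
    target_spot_capacity = 8

    instance_type_configs {
      instance_type                              = "m5.xlarge"
      bid_price_as_percentage_of_on_demand_price = 60
    }

    launch_specifications {
      spot_specification {
        allocation_strategy      = "capacity-optimized"
        timeout_action           = "SWITCH_TO_ON_DEMAND"
        timeout_duration_minutes = 10
      }
    }
  }

  # Idle clusters terminate themselves; cleanup is architectural,
  # not something an engineer has to remember.
  auto_termination_policy {
    idle_timeout = 3600 # seconds
  }
}
```

Because these settings live in the module rather than in runbooks, every vended cluster inherits them and the cost control cannot be bypassed.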
Outcome
40% Compute Reduction
  • Result: A 40% reduction in OPEX alongside faster delivery times. Self-serve environments didn't increase costs; they reduced them through enforced guardrails.
  • Principle: "Vended guardrails" over manual policing. Cost control is architectural, not procedural. It cannot be bypassed.
  • Value: Budget reclaimed from waste was redirected toward core ML engineering, model tuning, and data infrastructure investment.
Reliability Assessment

Is Your Cloud Architecture Battle-Ready?

AI workloads demand resilient infrastructure. The CRRI™ assessment evaluates your cloud maturity across five operational domains and delivers a prioritised executive report.