Case Studies

Zero to Production-Grade: Rebuilding Mandolin's Entire Cloud Platform from Scratch

Engineering Leader — Mandolin

Mandolin is a healthcare AI startup ($40M Series A, Greylock), operating in hyper-growth with HIPAA/GDPR compliance requirements and enterprise healthcare tenants.

The Challenge

Four deployment systems with no source of truth. Every service sat on a public endpoint with no segmentation and no mTLS, and static credentials were scattered across codebases. There was no DR and no defined RTO. Operating costs were high, builds took 45 minutes to 1 hour, and developers had zero self-service.

Constraints: Zero production downtime, solo execution, live healthcare enterprise tenants throughout, HIPAA and GDPR compliance from day one.

Strategic Solution

5 GKE clusters across 4 GCP projects: self-hosted ArgoCD, Argo Workflows + BuildKit CI, ARC runners. Replaced all managed SaaS.
Private-by-default networking: internal LBs, per-tenant mTLS, ESO + Secret Manager, Workload Identity. Zero static credentials. MongoDB → Cloud SQL migration.
Resolve.ai for AI-driven incident triage (67% MTTR reduction) and Pulumi Neo for AI-assisted IaC: two tools that multiplied solo engineering leverage.
AI-driven OCR pipeline for automated data extraction from prior auth forms and insurance documents, eliminating manual entry across the specialty drug intake workflow.
Designing and implementing in-house model training infrastructure: HIPAA-compliant GPU workloads, training, and a model registry to support proprietary healthcare AI.
Self-service Build & Deploy portal: one-click deployments, RC tracking, and per-branch TTL preview environments.
Developer portal + Glean AI search: onboarding and knowledge discovery time cut 50%.
Reinvented the developer production workflow using AI-assisted IaC and AI-driven incident triage.
OPA/Kyverno compliance guardrails, Wiz + Snyk vulnerability scanning, Kubecost + BigQuery FinOps.
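
The per-branch TTL preview environments mentioned above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `PreviewEnv` model, the 72-hour TTL, and the branch names are hypothetical and are not taken from the actual portal.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class PreviewEnv:
    """A per-branch preview environment (hypothetical model for illustration)."""
    branch: str
    last_deployed: datetime
    ttl: timedelta


def expired_envs(envs, now):
    """Return the environments whose TTL has elapsed since their last deploy."""
    return [env for env in envs if now - env.last_deployed >= env.ttl]


# Illustrative data: one stale branch, one fresh branch, both with an assumed 72h TTL.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
envs = [
    PreviewEnv("feature/login", now - timedelta(hours=80), timedelta(hours=72)),
    PreviewEnv("fix/typo", now - timedelta(hours=2), timedelta(hours=72)),
]

for env in expired_envs(envs, now):
    print(f"tear down preview for {env.branch}")
```

In this sketch, redeploying a branch resets its clock, so only branches that have gone quiet are reclaimed.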

Impact & Metrics

53% infra cost reduction, 67% MTTR reduction, 53 days, solo execution (while building a team from scratch)
53% reduction in infrastructure costs while simultaneously growing the resource footprint.
67% reduction in MTTR via Resolve.ai-automated incident triage.
CI builds: 45 minutes → 23 minutes (49% faster). Tenant onboarding: 2–3 days → 30 minutes.
DR RTO: undefined → <10 min. Zero public endpoints. HIPAA/GDPR-compliant from day one.
Developer onboarding and knowledge discovery time cut by 50% via Glean AI search.
Educated the engineering org on AI-assisted development using Claude Code and OpenAI/Codex: established HIPAA-safe usage policies, dynamic workflow guardrails, and team-wide training that accelerated code delivery and raised code quality with zero AI-related compliance incidents.

Key Lessons

AI tooling (Resolve.ai, Pulumi Neo, Glean) multiplied solo engineering leverage. Treating AI as a force-multiplier is now a first-class infrastructure strategy.
GitOps-as-the-only-write-path is an architectural constraint, not a configuration. Design it in from day one.
Right-sizing requires 3 months of production samples — 30 days misses peak CPU by up to 44%.
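
The right-sizing lesson can be made concrete with a small sketch that measures how far a short sampling window underestimates the peak seen over a longer period. The data below is synthetic and purely illustrative (a weekly cycle plus one assumed burst on day 60); it is not the production data behind the 44% figure and does not reproduce it. The function name `peak_shortfall` is hypothetical.

```python
def peak_shortfall(daily_peaks, window_days):
    """Fraction by which the first `window_days` samples underestimate the full-period peak."""
    if not daily_peaks or window_days <= 0:
        raise ValueError("need at least one sample and a positive window")
    window_peak = max(daily_peaks[:window_days])
    full_peak = max(daily_peaks)
    return 1 - window_peak / full_peak


# Synthetic daily peak CPU in cores: a mild weekly cycle over 90 days...
daily_peaks = [4.0 + 0.1 * (day % 7) for day in range(90)]
# ...plus one assumed burst on day 60 that a 30-day window never sees.
daily_peaks[60] = 6.0

for window in (30, 90):
    print(f"{window}-day window shortfall: {peak_shortfall(daily_peaks, window):.1%}")
```

Requests and limits sized from the 30-day window in this sketch would be set below the burst, which is the failure mode the lesson warns about.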

GitOps GKE Healthcare AI Tools Platform Engineering

Hyperscale ML/Search Platform at Adobe

Architect & Technical Leader

Adobe's Core Search and Sensei platform serves as the intelligence layer behind flagship products, processing 30B+ daily requests.

The Challenge

AI/ML workloads were outgrowing the existing infrastructure, creating scaling, latency, and cost challenges.

Constraints: Sub‑5ms P95 latency, strict data governance, and legacy systems that couldn’t be disrupted.

Strategic Solution

Designed a hybrid GPU/CPU architecture across AWS, Azure, and on‑prem HPC.
Optimized NVIDIA clusters using RDMA InfiniBand, MIG, and Volcano scheduling.
Unified fragmented Kubernetes environments into a multi‑tenant platform.

Impact & Metrics

Multi-billion requests, GPU utilization +38%
Supported 30B+ daily API requests with >99.98% availability.
Increased GPU utilization by 38% through smarter scheduling.
Reduced cloud storage costs by 65% with tiering and lifecycle policies.

Key Lessons

GPU-aware scheduling is the difference between expensive hardware and efficient hardware.
Hybrid cloud is complex, but it unlocks elasticity and cost control at hyperscale.

AI/ML HPC Kubernetes

Enterprise Elasticsearch Consolidation at Adobe

Architect & Technical Leader

Adobe’s search infrastructure was fragmented across 18+ managed clusters with varying versions, driving high licensing costs and operational complexity.

The Challenge

Managed service lock-in and version fragmentation were creating a multi-million dollar licensing burden without the necessary operational control.

Constraints: 10B+ documents, 6K/sec ingestion, zero-downtime migration requirement.

Strategic Solution

Migrated 18 managed AWS Elasticsearch clusters to a standardized, self-managed hybrid architecture.
Transitioned the entire stack from proprietary managed versions to a unified Open Source version.
Engineered automated data lifecycle and tiering policies for 10B+ documents.
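
As a rough illustration of age-based lifecycle tiering, the sketch below assigns a document index to a tier from its age. The tier names echo the hot/warm/cold/delete phases common in Elasticsearch lifecycle management, but the thresholds and the `tier_for` helper are assumptions; the source does not state the actual policy, and this is not the Elasticsearch API.

```python
from datetime import date

# Hypothetical tier boundaries (upper age limit in days, tier name).
TIERS = [(30, "hot"), (90, "warm"), (365, "cold")]


def tier_for(indexed_on, today):
    """Pick a storage tier from the age of an index; anything past the last tier is deleted."""
    age_days = (today - indexed_on).days
    for max_age_days, tier in TIERS:
        if age_days < max_age_days:
            return tier
    return "delete"


today = date(2024, 6, 1)
for indexed_on in (date(2024, 5, 20), date(2024, 4, 1), date(2023, 9, 1), date(2022, 1, 1)):
    print(indexed_on, "->", tier_for(indexed_on, today))
```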

Impact & Metrics

Millions in annual savings, 30% cost reduction
Reduced annual Elasticsearch licensing costs by millions of dollars (30% net savings).
Achieved full operational control over search performance and security posture.
Maintained sub-5ms P95 search latency while sustaining 6K/sec ingestion throughput.

Key Lessons

Standardization is the prerequisite for cost optimization at scale.
Moving to Open Source provides financial leverage and eliminates vendor lock-in.

Elasticsearch Open Source Cost Optimization

Global SRE Operating Model at F5

Sr. Director of Product Engineering & Head of SRE

F5’s Distributed Cloud platform powers global multi‑cloud networking and security for enterprise customers.

The Challenge

Silos, inconsistent incident response, and burnout were slowing down a platform facing explosive traffic growth.

Constraints: 24/7 global operations, strict compliance (FedRAMP, PCI‑DSS), and security‑critical workloads.

Strategic Solution

Rebuilt SRE into a follow‑the‑sun model with clear ownership and escalation paths.
Implemented policy‑as‑code and zero‑trust architecture across the platform.
Modernized observability with ML‑driven anomaly detection and unified tracing.

Impact & Metrics

55+ engineers, MTTR −73%
Reduced MTTR by 73% and improved incident consistency.
Lowered attrition by 10% by eliminating hero culture.
Absorbed 400% growth in attack traffic without degradation.

Key Lessons

Culture eats tools for breakfast: fix trust and clarity first.
Compliance-as-code is the only sustainable path at enterprise velocity.

SRE Transformation Compliance

FedRAMP High & Zero-Trust Architecture at F5

Sr. Director of Product Engineering & Head of SRE

F5's Distributed Cloud platform required the highest levels of security to serve federal and highly regulated enterprise customers.

The Challenge

Achieving and sustaining high-bar compliance (FedRAMP High) while maintaining rapid feature velocity in a multi-cloud environment.

Constraints: Global footprint (25+ PoPs), complex multi-tenant networking, and zero-trust requirements.

Strategic Solution

Architected a zero-trust platform using mutual TLS (mTLS), identity-based policy, and runtime eBPF monitoring.
Implemented 'Compliance-as-Code' to automate security guardrails across the entire CI/CD lifecycle.
Collaborated with the CISO to deliver FedRAMP High, PCI-DSS, and SOC 2 compliance through automated evidence collection.

Impact & Metrics

FedRAMP High, PCI-DSS, SOC 2
Successfully achieved FedRAMP High, PCI-DSS, and SOC 2 certifications.
Accelerated feature velocity by 40% by shifting security and compliance left.
Delivered a unified security posture across 25+ global points of presence.

Key Lessons

Security is not a checkbox; it's a foundational architectural property.
Automated compliance is the only way to scale security in a distributed system.

Security FedRAMP Zero-Trust

Platform Modernization & FinOps at Arkose Labs

Director of Engineering & SRE

Arkose Labs fights fraud at internet scale, requiring real‑time decisioning under unpredictable attack traffic.

The Challenge

Cloud spend was rising faster than revenue, and technical debt was slowing delivery.

Constraints: High‑volume DDoS traffic, strict latency SLAs, and rapid enterprise growth.

Strategic Solution

Re‑architected the platform into EKS microservices with an eBPF service mesh.
Established FinOps governance, dashboards, and cost‑aware engineering practices.
Introduced SLO‑based release gates to balance reliability and velocity.
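
One common way to express an SLO-based release gate is through the remaining error budget. The sketch below is an assumption about how such a gate could work, not a description of the actual implementation: the 99.9% target, the 25% minimum budget, and both function names are illustrative.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        # A 100% target (or no traffic) leaves no budget to spend.
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1 - failed_requests / allowed_failures


def release_allowed(slo_target, total_requests, failed_requests, min_budget=0.25):
    """Gate a release: allow it only while enough error budget remains."""
    remaining = error_budget_remaining(slo_target, total_requests, failed_requests)
    return remaining >= min_budget


# Illustrative window: 1M requests against an assumed 99.9% SLO.
print(release_allowed(0.999, 1_000_000, 400))  # healthy window
print(release_allowed(0.999, 1_000_000, 900))  # budget nearly spent
```

The gate trades velocity for reliability automatically: teams ship freely while the budget is healthy and slow down once it is nearly spent.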

Impact & Metrics

22% cloud spend reduction
Reduced cloud spend by 22% while supporting 7x transaction growth.
Maintained 99.9% SLA even during attack spikes.
Improved release quality and reduced customer‑impacting incidents.

Key Lessons

FinOps only works when engineering owns the cost model.
eBPF unlocks observability without the tax of sidecars.

FinOps Modernization eBPF

Enterprise CI/CD & Platform Modernization at Macys.com

Architect & Technical Leader

Macy’s needed a modern deployment platform to support rapid retail innovation and peak‑season reliability.

The Challenge

Deployments were slow, manual, and risky, causing downtime during revenue‑critical periods.

Constraints: High‑traffic retail workloads, legacy on‑prem systems, and multi‑cloud fragmentation.

Strategic Solution

Built a modern CI/CD platform using Jenkins, Spinnaker, and Kubernetes.
Implemented blue‑green and canary strategies for safe, incremental rollouts.
Designed a hybrid cloud architecture across GCP, AWS, and VMware Tanzu.
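
The canary strategy above can be sketched as a simple promotion rule: shift traffic in steps while the canary's error rate stays close to the baseline, and roll back otherwise. The step sizes, tolerance, and function name are hypothetical; the source does not describe the actual Spinnaker pipeline configuration.

```python
def next_canary_weight(current_weight, canary_error_rate, baseline_error_rate,
                       steps=(5, 25, 50, 100), tolerance=0.001):
    """Return the next canary traffic percentage, or 0 to signal a rollback."""
    # Roll back as soon as the canary is measurably worse than the baseline.
    if canary_error_rate > baseline_error_rate + tolerance:
        return 0
    # Otherwise advance to the next configured step.
    for step in steps:
        if step > current_weight:
            return step
    return 100  # already fully promoted


print(next_canary_weight(5, canary_error_rate=0.002, baseline_error_rate=0.002))
print(next_canary_weight(25, canary_error_rate=0.010, baseline_error_rate=0.002))
```

A blue-green rollout is the degenerate case of this rule with a single step of 100.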

Impact & Metrics

Near-zero downtime releases
Achieved near‑zero downtime releases across the e‑commerce stack.
Cut deployment time from days to under an hour.
Enabled consistent workloads across hybrid environments.

Key Lessons

Standardized pipelines are the backbone of developer velocity.
Canary releases are non‑negotiable for retail reliability.

CI/CD Kubernetes Hybrid Cloud

Early Career of Engineering

Various Engineering Roles (2004-2016).

Engineering roles at Workday, Chegg, RocketFuel, Adobe, and others.

The Challenge

Building early distributed systems during a period of rapid cloud evolution.

Constraints: Fast‑moving product requirements and emerging cloud technologies.

Strategic Solution

Built and scaled early SaaS and distributed systems.
Developed deep expertise in reliability, performance, and platform design.

Impact & Metrics

Foundational Distributed Systems
Delivered core components for high‑growth SaaS products.
Maintained and evolved cloud‑native systems across multiple companies.
Mentored engineers and led technical initiatives.

Key Lessons

Distributed systems fundamentals stay constant; the ecosystem around them evolves.

Distributed Systems SaaS
