Engineering Case Studies
Strategic technical initiatives delivered at enterprise scale.
Zero to Production-Grade: Rebuilding Mandolin's Entire Cloud Platform from Scratch
Engineering Leader, Mandolin
Mandolin is a healthcare AI startup ($40M Series A, Greylock) operating in hyper-growth, with HIPAA/GDPR compliance requirements and enterprise healthcare tenants.
Four deployment systems with no source of truth. Every service sat on a public endpoint: no segmentation, no mTLS, and static credentials scattered across codebases. No DR and no defined RTO. High operating costs, 45 minutes to 1 hour per build, and zero developer self-service.
Constraints: Zero production downtime, solo execution, live healthcare enterprise tenants throughout, HIPAA and GDPR compliance from day one.
- 5 GKE clusters across 4 GCP projects: self-hosted ArgoCD, Argo Workflows + BuildKit CI, ARC runners. Replaced all managed SaaS.
- Private-by-default networking: internal LBs, per-tenant mTLS, ESO + Secret Manager, Workload Identity. Zero static credentials. MongoDB → Cloud SQL migration.
- Resolve.ai for AI-driven incident triage (67% MTTR reduction); Pulumi Neo for AI-assisted IaC. Two tools that multiplied solo engineering leverage.
- AI-driven OCR pipeline for automated data extraction from prior auth forms and insurance documents, eliminating manual entry across the specialty drug intake workflow.
- Designing and implementing in-house model training infrastructure: HIPAA-compliant GPU workloads, training, and a model registry to support proprietary healthcare AI.
- Self-service Build & Deploy portal: one-click deployments, RC tracking, and per-branch TTL preview environments.
- Developer portal + Glean AI search: onboarding and knowledge discovery time cut 50%.
- Reinvented the developer production workflow using AI-assisted IaC and AI-driven incident triage.
- OPA/Kyverno compliance guardrails, Wiz + Snyk vulnerability scanning, Kubecost + BigQuery FinOps.
- 53% infra cost reduction, 67% MTTR reduction, 53 days, solo execution (while building a team from scratch)
- 53% reduction in infrastructure costs while simultaneously growing the resource footprint.
- 67% reduction in MTTR via Resolve.ai-automated incident triage.
- CI builds: 45 minutes → 23 minutes (49% faster). Tenant onboarding: 2–3 days → 30 minutes.
- DR RTO: undefined → <10 min. Zero public endpoints. HIPAA/GDPR-compliant from day one.
- Developer onboarding and knowledge discovery time cut by 50% via Glean AI search.
- Educated the engineering org on AI-assisted development using Claude Code and OpenAI/Codex; established HIPAA-safe usage policies, dynamic workflow guardrails, and team-wide training that accelerated code delivery and improved code quality, with zero AI-related compliance incidents.
- AI tooling (Resolve.ai, Pulumi Neo, Glean) multiplied solo engineering leverage. Treating AI as a force multiplier is now a first-class infrastructure strategy.
- GitOps-as-the-only-write-path is an architectural constraint, not a configuration. Design it in from day one.
- Right-sizing requires 3 months of production samples — 30 days misses peak CPU by up to 44%.
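The right-sizing lesson above can be illustrated with a small sketch. It compares the peak seen in the least representative 30-day window against the peak over the full sample. The function name `worst_window_shortfall` and the CPU series are assumptions made up for illustration; they are not Mandolin's production data.

```python
from typing import Sequence


def worst_window_shortfall(daily_peaks: Sequence[float], window: int = 30) -> float:
    """Largest fraction by which a `window`-day sample understates the full-period peak."""
    if window <= 0 or len(daily_peaks) < window:
        raise ValueError("need at least `window` daily samples")
    full_peak = max(daily_peaks)
    if full_peak <= 0:
        raise ValueError("peaks must be positive")
    # Peak observed inside each contiguous window of `window` days.
    window_peaks = [
        max(daily_peaks[start:start + window])
        for start in range(len(daily_peaks) - window + 1)
    ]
    # The least lucky window is the one with the lowest observed peak.
    return (full_peak - min(window_peaks)) / full_peak


# Synthetic series (hypothetical): 90 days at 2 cores with one spike to 4 cores.
samples = [2.0] * 90
samples[75] = 4.0
shortfall = worst_window_shortfall(samples)
```

In this synthetic series, any 30-day window that misses the spike understates the peak by half. The real shortfall depends entirely on the workload's seasonality.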
Hyperscale ML/Search Platform at Adobe
Architect & Technical Leader
Adobe's Core Search and Sensei platform serves as the intelligence layer behind flagship products, processing 30B+ daily requests.
AI/ML workloads were outgrowing the existing infrastructure, creating scaling, latency, and cost challenges.
Constraints: Sub‑5ms P95 latency, strict data governance, and legacy systems that couldn’t be disrupted.
- Designed a hybrid GPU/CPU architecture across AWS, Azure, and on‑prem HPC.
- Optimized NVIDIA clusters using RDMA InfiniBand, MIG, and Volcano scheduling.
- Unified fragmented Kubernetes environments into a multi‑tenant platform.
- Multi-billion daily requests, GPU utilization +38%
- Supported 30B+ daily API requests with >99.98% availability.
- Increased GPU utilization by 38% through smarter scheduling.
- Reduced cloud storage costs by 65% with tiering and lifecycle policies.
- GPU-aware scheduling is the difference between expensive hardware and efficient hardware.
- Hybrid cloud is complex, but it unlocks elasticity and cost control at hyperscale.
Enterprise Elasticsearch Consolidation at Adobe
Architect & Technical Leader
Adobe’s search infrastructure was fragmented across 18+ managed clusters with varying versions, driving high licensing costs and operational complexity.
Managed service lock-in and version fragmentation were creating a multi-million dollar licensing burden without the necessary operational control.
Constraints: 10B+ documents, 6K/sec ingestion, zero-downtime migration requirement.
- Migrated 18 managed AWS Elasticsearch clusters to a standardized, self-managed hybrid architecture.
- Transitioned the entire stack from proprietary managed versions to a unified Open Source version.
- Engineered automated data lifecycle and tiering policies for 10B+ documents.
- Millions in annual savings, 30% cost reduction
- Reduced annual Elasticsearch licensing costs by millions of dollars (30% net savings).
- Achieved full operational control over search performance and security posture.
- Maintained sub-5ms P95 search latency while sustaining 6K/sec ingestion throughput.
- Standardization is the prerequisite for cost optimization at scale.
- Moving to Open Source provides financial leverage and eliminates vendor lock-in.
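The automated lifecycle and tiering policies mentioned above follow a simple idea: an index moves to cheaper storage as it ages. A minimal sketch of that decision, assuming hypothetical age thresholds and tier names (the actual policies and cutoffs used at Adobe are not stated here):

```python
from datetime import date

# Hypothetical cutoffs: (maximum age in days, tier). Not the production values.
TIER_THRESHOLDS_DAYS = [(7, "hot"), (30, "warm"), (365, "cold")]


def tier_for(index_created: date, today: date) -> str:
    """Pick a storage tier from index age; anything past the last cutoff is deleted."""
    age_days = (today - index_created).days
    if age_days < 0:
        raise ValueError("index creation date is in the future")
    for max_age, tier in TIER_THRESHOLDS_DAYS:
        if age_days < max_age:
            return tier
    return "delete"
```

Keeping the thresholds in one table makes the policy reviewable and easy to apply uniformly across clusters, which is the point of standardizing before optimizing cost.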
Global SRE Operating Model at F5
Sr. Director of Product Engineering & Head of SRE
F5’s Distributed Cloud platform powers global multi‑cloud networking and security for enterprise customers.
Silos, inconsistent incident response, and burnout were slowing down a platform facing explosive traffic growth.
Constraints: 24/7 global operations, strict compliance (FedRAMP, PCI‑DSS), and security‑critical workloads.
- Rebuilt SRE into a follow‑the‑sun model with clear ownership and escalation paths.
- Implemented policy‑as‑code and zero‑trust architecture across the platform.
- Modernized observability with ML‑driven anomaly detection and unified tracing.
- 55+ engineers, MTTR −73%
- Reduced MTTR by 73% and improved incident consistency.
- Lowered attrition by 10% by eliminating hero culture.
- Absorbed 400% growth in attack traffic without degradation.
- Culture eats tools for breakfast; fix trust and clarity first.
- Compliance-as-code is the only sustainable path at enterprise velocity.
FedRAMP High & Zero-Trust Architecture at F5
Sr. Director of Product Engineering & Head of SRE
F5's Distributed Cloud platform required the highest levels of security to serve federal and highly regulated enterprise customers.
The challenge was achieving and sustaining high-bar compliance (FedRAMP High) while maintaining rapid feature velocity in a multi-cloud environment.
Constraints: Global footprint (25+ PoPs), complex multi-tenant networking, and zero-trust requirements.
- Architected a zero-trust platform using mutual TLS (mTLS), identity-based policy, and runtime eBPF monitoring.
- Implemented 'Compliance-as-Code' to automate security guardrails across the entire CI/CD lifecycle.
- Collaborated with the CISO to deliver FedRAMP High, PCI-DSS, and SOC 2 compliance through automated evidence collection.
- FedRAMP High, PCI-DSS, SOC 2
- Successfully achieved FedRAMP High, PCI-DSS, and SOC 2 certifications.
- Accelerated feature velocity by 40% by shifting security and compliance left.
- Delivered a unified security posture across 25+ global points of presence.
- Security is not a checkbox; it's a foundational architectural property.
- Automated compliance is the only way to scale security in a distributed system.
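To make the compliance-as-code idea concrete, here is a minimal sketch of a guardrail that a CI step could run against a service definition. It is written in plain Python as a stand-in for a real policy engine; the rule names, the `service` dictionary fields, and the `violations` helper are all hypothetical and do not describe F5's actual controls.

```python
from typing import Any, Callable

Rule = tuple[str, Callable[[dict[str, Any]], bool]]

# Hypothetical guardrails: each rule returns True when the service complies.
RULES: list[Rule] = [
    ("mtls-required", lambda svc: svc.get("mtls") is True),
    ("no-public-endpoint", lambda svc: not svc.get("public", False)),
    ("image-pinned-by-digest", lambda svc: "@sha256:" in svc.get("image", "")),
]


def violations(service: dict[str, Any]) -> list[str]:
    """Return the names of all rules the service definition fails, in rule order."""
    return [name for name, check in RULES if not check(service)]
```

A pipeline would fail the build when `violations(...)` returns a non-empty list, and the same output can be stored as audit evidence.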
Platform Modernization & FinOps at Arkose Labs
Director of Engineering & SRE
Arkose Labs fights fraud at internet scale, requiring real‑time decisioning under unpredictable attack traffic.
Cloud spend was rising faster than revenue, and technical debt was slowing delivery.
Constraints: High‑volume DDoS traffic, strict latency SLAs, and rapid enterprise growth.
- Re‑architected the platform into EKS microservices with an eBPF service mesh.
- Established FinOps governance, dashboards, and cost‑aware engineering practices.
- Introduced SLO‑based release gates to balance reliability and velocity.
- 22% cloud spend reduction
- Reduced cloud spend by 22% while supporting 7x transaction growth.
- Maintained 99.9% SLA even during attack spikes.
- Improved release quality and reduced customer‑impacting incidents.
- FinOps only works when engineering owns the cost model.
- eBPF unlocks observability without the tax of sidecars.
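An SLO-based release gate, as mentioned above, usually comes down to error-budget arithmetic: releases proceed only while enough of the budget remains. A minimal sketch, assuming a request-based SLO and a hypothetical 25% minimum remaining budget (the actual gate criteria used at Arkose Labs are not stated here):

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # Failures the SLO tolerates over this window.
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


def release_allowed(slo_target: float, total_requests: int, failed_requests: int,
                    min_budget: float = 0.25) -> bool:
    """Gate a release on having at least `min_budget` of the error budget left."""
    return error_budget_remaining(slo_target, total_requests, failed_requests) >= min_budget
```

For example, with a 99.9% target and 1,000,000 requests the budget is about 1,000 failures: 400 failures leaves roughly 60% of the budget and the gate opens, while 900 failures leaves roughly 10% and the gate stays closed.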
Enterprise CI/CD & Platform Modernization at Macys.com
Architect & Technical Leader
Macy’s needed a modern deployment platform to support rapid retail innovation and peak‑season reliability.
Deployments were slow, manual, and risky, causing downtime during revenue‑critical periods.
Constraints: High‑traffic retail workloads, legacy on‑prem systems, and multi‑cloud fragmentation.
- Built a modern CI/CD platform using Jenkins, Spinnaker, and Kubernetes.
- Implemented blue‑green and canary strategies for safe, incremental rollouts.
- Designed a hybrid cloud architecture across GCP, AWS, and VMware Tanzu.
- Near-zero downtime releases
- Achieved near‑zero downtime releases across the e‑commerce stack.
- Cut deployment time from days to under an hour.
- Enabled consistent workloads across hybrid environments.
- Standardized pipelines are the backbone of developer velocity.
- Canary releases are non‑negotiable for retail reliability.
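The canary strategy described above can be sketched as a promotion decision: raise the canary's traffic share one step at a time, and roll back if it errors noticeably more than the baseline. The step sizes, the tolerance, and the `next_canary_weight` helper are hypothetical; this is not the Spinnaker configuration that was used.

```python
from typing import Sequence


def next_canary_weight(current_weight: int,
                       canary_error_rate: float,
                       baseline_error_rate: float,
                       steps: Sequence[int] = (5, 25, 50, 100),
                       tolerance: float = 0.005) -> int:
    """Return the next traffic percentage for the canary, or 0 to roll back."""
    # Roll back when the canary is worse than baseline by more than the tolerance.
    if canary_error_rate > baseline_error_rate + tolerance:
        return 0
    # Otherwise promote to the first step above the current weight.
    for step in steps:
        if step > current_weight:
            return step
    # Already at the final step: stay fully promoted.
    return steps[-1]
```

Small early steps limit the blast radius of a bad release, which matters most during revenue-critical periods.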
Early Engineering Career
Various Engineering Roles (2004-2016)
Engineering roles at Workday, Chegg, RocketFuel, Adobe, and others.
Building early distributed systems during a period of rapid cloud evolution.
Constraints: Fast‑moving product requirements and emerging cloud technologies.
- Built and scaled early SaaS and distributed systems.
- Developed deep expertise in reliability, performance, and platform design.
- Foundational Distributed Systems
- Delivered core components for high‑growth SaaS products.
- Maintained and evolved cloud‑native systems across multiple companies.
- Mentored engineers and led technical initiatives.
- Distributed systems fundamentals stay constant; the ecosystem around them evolves.