Production-ready in 14 days. Fully cut over in 22.
Executive Summary
In just over three weeks, we migrated our production infrastructure from AWS to GCP. More importantly, the platform was functionally production-ready in 14 days, with the remaining time focused on validation, replication, and a no-downtime cutover.
This was not a lift-and-shift migration. We rebuilt our infrastructure foundation using modular, strongly typed infrastructure as code.
Critically, the migration did not pause or relax our regulatory obligations. Throughout the process, we continued to meet our security, audit, and operational controls, including those required for SOC 2 Type II compliance.
This post focuses on why and how the migration happened.
Part 2 dives deep into the hardest technical problems we solved.
Background
This effort started as a cost and scalability review.
Our AWS infrastructure had grown organically over time. While functional, it reflected years of incremental decisions that made cost optimization, consistency, and auditability increasingly difficult. Any significant platform change had to improve—not jeopardize—our compliance posture.
After evaluating alternatives, GCP stood out for managed databases, Kubernetes integration, and networking capabilities, while still providing strong primitives for security, logging, access control, and audit trails.
The original plan assumed roughly 90 days.
That timeline changed.
Instead of asking whether we could simply move faster, we reframed the question:
What would need to be true for this to be possible without cutting corners—or compromising compliance?
Compliance Was a Hard Requirement, Not a Phase
One constraint remained non-negotiable throughout the project:
This migration could not weaken our security controls or compliance posture—even temporarily.
That meant:
- No production access outside existing approval paths
- No long-lived credentials introduced “just for the migration”
- No loss of audit logs, monitoring, or alerting
- No undocumented infrastructure or manual drift
Every decision—architecture, tooling, and sequencing—was filtered through that lens.
This constraint significantly shaped how we approached speed.
Reframing the Problem
We chose not to migrate infrastructure as it existed.
Instead, we rebuilt the platform intentionally with:
- Consistent patterns aligned to compliance controls
- Strong abstractions that enforce least privilege
- Safe defaults for logging, encryption, and deletion protection
- Reusable components that encode policy, not bypass it
Speed would come from structure, not shortcuts.
Within 7 days, we built over 15 reusable Pulumi components—work that normally takes months—while embedding compliance expectations directly into the system.
Leveraging AI to Compress Time Without Cutting Corners
One of the most important accelerators in this migration was how we used AI-assisted development—specifically Claude and purpose-built subagents.
This wasn’t about asking an LLM to “write infrastructure.” It was about amplifying experienced engineers by offloading mechanical work while keeping architectural decisions firmly human-driven.
How We Used Claude (and Subagents)
We treated Claude as a force multiplier in three specific areas:
1. Component Scaffolding at Scale
Once we designed a canonical Pulumi component template (inputs, outputs, defaults, error handling), Claude was used to:
- Generate initial component scaffolding in Go
- Apply consistent structure across 15+ components
- Enforce naming conventions and documentation patterns
This eliminated hours of repetitive setup work while preserving consistency.
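To make that concrete, the sketch below shows the general shape of such a template: typed inputs, typed outputs, input validation, safe defaults, and child resources registered under a single parent. The type names, resource token, and defaulting logic are illustrative only, not our production code.

```go
package components

import (
	"fmt"

	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

// ExampleComponentArgs is the typed input surface for the component.
// Field names are illustrative, not our production schema.
type ExampleComponentArgs struct {
	ProjectID pulumi.StringInput
	Region    pulumi.StringInput
	Labels    pulumi.StringMap
	// Safe default: deletion protection stays on unless explicitly disabled.
	DeletionProtection *bool
}

// ExampleComponent groups its child resources under one logical parent
// and exposes typed outputs.
type ExampleComponent struct {
	pulumi.ResourceState

	// SelfLink would be populated from a child resource (omitted here).
	SelfLink pulumi.StringOutput
}

// NewExampleComponent validates inputs, applies safe defaults, and registers
// the component so that children inherit a consistent parent.
func NewExampleComponent(ctx *pulumi.Context, name string, args *ExampleComponentArgs, opts ...pulumi.ResourceOption) (*ExampleComponent, error) {
	if args == nil || args.ProjectID == nil {
		return nil, fmt.Errorf("%s: ProjectID is required", name)
	}

	comp := &ExampleComponent{}
	if err := ctx.RegisterComponentResource("acme:gcp:ExampleComponent", name, comp, opts...); err != nil {
		return nil, err
	}

	// Default deletion protection to true when the caller does not set it.
	protect := true
	if args.DeletionProtection != nil {
		protect = *args.DeletionProtection
	}
	_ = protect // consumed by child resources, which are omitted in this sketch

	// Child resources would be created here with pulumi.Parent(comp) so that
	// labels, logging, and protection flags are applied uniformly.

	return comp, ctx.RegisterResourceOutputs(comp, pulumi.Map{})
}
```

Because every component followed the same shape, AI-assisted scaffolding stayed reliable: the model filled in a known pattern rather than inventing structure.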
2. Subagents for Parallel Workstreams
We used multiple focused subagents in parallel, each with a narrow responsibility:
- One agent focused on GCP networking and IAM primitives
- Another generated Pulumi components from existing tickets
- Another validated assumptions against provider documentation
- Another helped draft runbooks and troubleshooting notes
Each agent operated within strict boundaries and fed results back for human review.
This allowed us to parallelize work that would normally be serialized across days or weeks.
3. Accelerated Iteration, Not Autonomous Decisions
All critical decisions—architecture, security boundaries, compliance constraints—remained human-owned.
Claude’s role was to:
- Reduce context-switching
- Speed up iteration
- Catch obvious mistakes early
- Turn written intent into working code faster
Nothing shipped without review. Nothing bypassed controls.
Why This Mattered
The biggest time sink in large migrations isn’t decision-making—it’s translation:
- Translating intent into boilerplate
- Translating tickets into code
- Translating patterns across services
AI collapsed that translation layer.
By combining:
- Strong upfront design
- Opinionated templates
- AI-assisted generation
- Strict review and guardrails
We compressed months of work into days without increasing risk.
A Key Takeaway
AI didn’t replace engineering judgment—it removed drag.
The faster we could move from “this is the pattern” to “this is implemented everywhere,” the more time we had to focus on the hard problems: networking, identity, compliance, and cutover safety.
This approach was a major reason we were able to be production-ready in 14 days while maintaining regulatory and operational rigor.
Timeline: Ready in 14 Days, Cut Over in 22
Days 1–7
- Core Pulumi component library implemented
- Guardrails for IAM, networking, logging, and deletion protection baked in
Days 8–11
- GCP projects provisioned with baseline security controls
- Core networking and HA VPN configured
- GKE clusters deployed with workload identity
- Base services installed using GitOps (ArgoCD)
Days 12–14
- CI/CD pipelines updated using federated identity
- Applications deployed into GCP
- End-to-end validation completed
- Platform deemed production-ready
At this point, the system met both functional and compliance requirements.
The remaining time focused on reducing cutover risk:
Days 15–21
- Live database replication from AWS RDS to GCP AlloyDB
- Extended testing, monitoring, and audit verification
Day 22
- Sandbox and production environments cut over
- No customer-visible downtime
- No regression in compliance posture
Architecture Philosophy: Components Over Monoliths
Infrastructure was modeled as reusable building blocks rather than a single monolith.
devops-tools/google/
├── pulumi-gcp-project-component
├── pulumi-gcp-network-component
├── pulumi-gcp-gke-component
├── pulumi-gcp-cloudvpn-component
├── pulumi-gcp-cloudrouter-component
├── pulumi-gcp-alloydb-component
├── pulumi-gcp-memorystore-component
├── pulumi-gcp-oidc-component
└── gcp-core
Each component enforced:
- Encryption by default
- Least-privilege IAM
- Centralized logging and metrics
- Safe deletion and lifecycle controls
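As a hedged illustration of what those defaults look like inside a component, a network component might unconditionally enable flow logs, private Google access, and Pulumi's protect option for every subnet it creates, with no opt-out in its public API. The helper name, region, CIDR, and pulumi-gcp import version below are assumptions:

```go
package components

import (
	"github.com/pulumi/pulumi-gcp/sdk/v7/go/gcp/compute"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

// newLoggedSubnet is a hypothetical helper from a network component: callers
// cannot opt out of flow logs, private Google access, or deletion protection.
func newLoggedSubnet(ctx *pulumi.Context, name string, network pulumi.StringInput, cidr string, parent pulumi.Resource) (*compute.Subnetwork, error) {
	return compute.NewSubnetwork(ctx, name, &compute.SubnetworkArgs{
		Network:               network,
		Region:                pulumi.String("us-central1"),
		IpCidrRange:           pulumi.String(cidr),
		PrivateIpGoogleAccess: pulumi.Bool(true), // safe default: always on
		LogConfig: &compute.SubnetworkLogConfigArgs{ // safe default: flow logs always on
			AggregationInterval: pulumi.String("INTERVAL_5_MIN"),
			FlowSampling:        pulumi.Float64(0.5),
			Metadata:            pulumi.String("INCLUDE_ALL_METADATA"),
		},
	}, pulumi.Parent(parent), pulumi.Protect(true)) // safe default: protected in production
}
```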
Conclusion (Part 1)
We didn’t move fast by ignoring constraints.
We moved fast because the constraints were encoded into the system.
By rebuilding our cloud platform with modular components, strong typing, and compliance-aware defaults, we reached a production-ready state in 14 days and completed a clean cutover in 22—without weakening our security or audit posture.
In Part 2, we’ll dive into the hard technical problems that made this possible, including:
- HA VPN and BGP failure modes
- Pulumi orchestration patterns
- Cross-cloud identity with OIDC
- GitOps and CI/CD design decisions
👉 Continue to Part 2: The Hard Technical Problems We Solved
Compliance Mapping Appendix (Controls → Architecture Choices)
This appendix summarizes how common regulatory and SOC 2 Type II–aligned controls were maintained throughout the migration.
It is intentionally high-level and non-exhaustive, focusing on architectural decisions rather than policy language.
Identity & Access Management
Control intent: Least privilege, strong authentication, auditable access
Architecture choices:
- OIDC-based federation for:
  - GitHub Actions → GCP
  - GKE workloads → AWS
- No long-lived access keys or static secrets
- Environment-scoped service accounts
- IAM roles defined and versioned in infrastructure as code
Outcome:
Access is short-lived, scoped, traceable, and reviewed through code changes.
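For illustration, the GitHub Actions → GCP side of that federation can be expressed as a workload identity pool plus an OIDC provider, so CI jobs exchange GitHub-issued tokens for short-lived GCP credentials. The pool IDs, organization name, and attribute condition below are hypothetical:

```go
package components

import (
	"github.com/pulumi/pulumi-gcp/sdk/v7/go/gcp/iam"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

// setupGithubFederation sketches OIDC federation so CI can impersonate a
// service account with short-lived tokens instead of static keys.
func setupGithubFederation(ctx *pulumi.Context) error {
	pool, err := iam.NewWorkloadIdentityPool(ctx, "github-pool", &iam.WorkloadIdentityPoolArgs{
		WorkloadIdentityPoolId: pulumi.String("github-actions"),
		DisplayName:            pulumi.String("GitHub Actions"),
	})
	if err != nil {
		return err
	}

	_, err = iam.NewWorkloadIdentityPoolProvider(ctx, "github-provider", &iam.WorkloadIdentityPoolProviderArgs{
		WorkloadIdentityPoolId:         pool.WorkloadIdentityPoolId,
		WorkloadIdentityPoolProviderId: pulumi.String("github-oidc"),
		// Only tokens issued by GitHub for our org are accepted (hypothetical org name).
		AttributeCondition: pulumi.String(`assertion.repository_owner == "acme-org"`),
		AttributeMapping: pulumi.StringMap{
			"google.subject":       pulumi.String("assertion.sub"),
			"attribute.repository": pulumi.String("assertion.repository"),
		},
		Oidc: &iam.WorkloadIdentityPoolProviderOidcArgs{
			IssuerUri: pulumi.String("https://token.actions.githubusercontent.com"),
		},
	})
	return err
}
```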
Change Management
Control intent: Authorized, reviewed, and traceable changes
Architecture choices:
- All infrastructure defined via Pulumi (no console changes)
- Git-based workflows with pull request review
- Deterministic deployments via gcp-core orchestration
- Environment-specific stacks with explicit configuration
Outcome:
Every infrastructure change is reviewed, logged, and reproducible.
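As a sketch of what explicit, per-stack configuration looks like in Pulumi's Go SDK: required keys fail fast when a stack omits them, and the values themselves live in version-controlled stack files that go through the same pull-request review. The key names here are hypothetical:

```go
package main

import (
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi/config"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		cfg := config.New(ctx, "")

		// Required, explicit per-stack values, stored in Pulumi.<stack>.yaml
		// (e.g. `pulumi config set region us-central1` in the prod stack).
		// A missing key fails the deployment immediately.
		region := cfg.Require("region")
		env := cfg.Require("environment")

		ctx.Export("region", pulumi.String(region))
		ctx.Export("environment", pulumi.String(env))
		return nil
	})
}
```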
Logging & Monitoring
Control intent: Detect, investigate, and respond to security events
Architecture choices:
- Centralized cloud logging enabled by default
- GKE audit logs retained
- Network activity observable through VPC flow logs
- No ephemeral or undocumented infrastructure
Outcome:
Operational and security visibility was preserved throughout the migration.
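One hedged example of "centralized logging by default": a component can route Cloud Audit Logs to a central destination through a project-level sink. The bucket name and filter below are illustrative, not our actual configuration:

```go
package components

import (
	"github.com/pulumi/pulumi-gcp/sdk/v7/go/gcp/logging"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

// newAuditSink sketches routing Cloud Audit Logs to a central bucket for
// long-term retention (destination and filter are hypothetical).
func newAuditSink(ctx *pulumi.Context) (*logging.ProjectSink, error) {
	return logging.NewProjectSink(ctx, "audit-sink", &logging.ProjectSinkArgs{
		Destination:          pulumi.String("storage.googleapis.com/acme-central-audit-logs"),
		Filter:               pulumi.String(`logName:"cloudaudit.googleapis.com"`),
		UniqueWriterIdentity: pulumi.Bool(true),
	})
}
```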
Data Protection & Encryption
Control intent: Protect sensitive data at rest and in transit
Architecture choices:
- Managed services with encryption enabled by default
- TLS enforced for service-to-service communication
- HA VPN with IPSec for cross-cloud traffic
- No plaintext credentials in pipelines or manifests
Outcome:
Data remained encrypted end-to-end during migration and cutover.
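Part 2 goes deep on the HA VPN and BGP details; as a minimal, hedged sketch, the GCP side of the IPsec path starts with an HA VPN gateway attached to the VPC. The name and region are assumptions, and the tunnels, Cloud Router, and BGP sessions are omitted here:

```go
package components

import (
	"github.com/pulumi/pulumi-gcp/sdk/v7/go/gcp/compute"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

// newHaVpnGateway sketches the GCP side of the encrypted path to AWS.
// Tunnels, the Cloud Router, and BGP peering are intentionally omitted.
func newHaVpnGateway(ctx *pulumi.Context, network pulumi.StringInput) (*compute.HaVpnGateway, error) {
	return compute.NewHaVpnGateway(ctx, "aws-ha-vpn", &compute.HaVpnGatewayArgs{
		Region:  pulumi.String("us-central1"),
		Network: network,
	})
}
```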
Availability & Resilience
Control intent: Maintain service availability and fault tolerance
Architecture choices:
- Multi-tunnel HA VPN with automatic failover
- Managed regional database services
- Gradual cutover with live replication
- Deletion protection enabled for production resources
Outcome:
The migration introduced no customer-visible downtime or data loss.
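As a small sketch of one layer of deletion protection, Pulumi's protect resource option makes destroys and replacements fail until a resource is explicitly unprotected. The bucket below is purely illustrative; the same option applies to databases and other stateful production resources:

```go
package components

import (
	"github.com/pulumi/pulumi-gcp/sdk/v7/go/gcp/storage"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

// newProtectedBucket is a hypothetical example: with pulumi.Protect(true),
// `pulumi destroy` and any plan that would delete or replace the resource
// fail until the resource is explicitly unprotected.
func newProtectedBucket(ctx *pulumi.Context, name string) (*storage.Bucket, error) {
	return storage.NewBucket(ctx, name, &storage.BucketArgs{
		Location: pulumi.String("US"),
	}, pulumi.Protect(true))
}
```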
Asset Inventory & Configuration Management
Control intent: Know what exists and how it is configured
Architecture choices:
- Single source of truth via infrastructure as code
- Explicit component boundaries
- No unmanaged or “temporary” resources
Outcome:
The platform remained fully inventoried and auditable at all times.
Key Takeaway
Compliance was not validated after the migration—it was enforced continuously by design.
By embedding controls directly into infrastructure primitives, we avoided the common trade-off between speed and assurance. The system itself made it difficult to do the wrong thing.