
Rebuilding Our Cloud Platform: An AWS to GCP Migration in 22 Days (Part 1)

Production-ready in 14 days. Fully cut over in 22.


Executive Summary

In just over three weeks, we migrated our production infrastructure from AWS to GCP. More importantly, the platform was functionally production-ready in 14 days, with the remaining time focused on validation, replication, and a no-downtime cutover.

This was not a lift-and-shift migration. We rebuilt our infrastructure foundation using modular, strongly typed infrastructure as code.

Critically, the migration did not pause or relax our regulatory obligations. Throughout the process, we continued to meet our security, audit, and operational controls, including those required for SOC 2 Type II compliance.

This post focuses on why and how the migration happened.
Part 2 dives deep into the hardest technical problems we solved.


Background

This effort started as a cost and scalability review.

Our AWS infrastructure had grown organically over time. While functional, it reflected years of incremental decisions that made cost optimization, consistency, and auditability increasingly difficult. Any significant platform change had to improve—not jeopardize—our compliance posture.

After evaluating alternatives, GCP stood out for managed databases, Kubernetes integration, and networking capabilities, while still providing strong primitives for security, logging, access control, and audit trails.

The original plan assumed ~90 days.

That timeline changed.

Instead of asking whether we could simply move faster, we reframed the question:

What would need to be true for this to be possible without cutting corners—or compromising compliance?


Compliance Was a Hard Requirement, Not a Phase

One constraint remained non-negotiable throughout the project:

This migration could not weaken our security controls or compliance posture—even temporarily.

That meant:

  • No production access outside existing approval paths
  • No long-lived credentials introduced “just for the migration”
  • No loss of audit logs, monitoring, or alerting
  • No undocumented infrastructure or manual drift

Every decision—architecture, tooling, and sequencing—was filtered through that lens.

This constraint significantly shaped how we approached speed.


Reframing the Problem

We chose not to migrate infrastructure as it existed.

Instead, we rebuilt the platform intentionally with:

  • Consistent patterns aligned to compliance controls
  • Strong abstractions that enforce least privilege
  • Safe defaults for logging, encryption, and deletion protection
  • Reusable components that encode policy, not bypass it

Speed would come from structure, not shortcuts.

Within 7 days, we built over 15 reusable Pulumi components—work that normally takes months—while embedding compliance expectations directly into the system.
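
To make "safe defaults" concrete, here is a minimal sketch of the pattern in Pulumi Go. The component name, SDK version, and the choice of a Cloud Storage bucket are illustrative rather than our production code; the point is that callers get a resource while the compliance-relevant settings are not theirs to change.

package components

import (
	"github.com/pulumi/pulumi-gcp/sdk/v7/go/gcp/storage"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

// NewSecureBucket is a hypothetical wrapper illustrating the pattern:
// callers choose a name and location, while uniform access, versioning,
// and public-access prevention are enforced by the component itself.
func NewSecureBucket(ctx *pulumi.Context, name, location string) (*storage.Bucket, error) {
	return storage.NewBucket(ctx, name, &storage.BucketArgs{
		Location:                 pulumi.String(location),
		UniformBucketLevelAccess: pulumi.Bool(true),
		PublicAccessPrevention:   pulumi.String("enforced"),
		Versioning: &storage.BucketVersioningArgs{
			Enabled: pulumi.Bool(true),
		},
	}, pulumi.Protect(true)) // deletion protection applied at the Pulumi layer
}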


Leveraging AI to Compress Time Without Cutting Corners

One of the most important accelerators in this migration was how we used AI-assisted development—specifically Claude and purpose-built subagents.

This wasn’t about asking an LLM to “write infrastructure.” It was about amplifying experienced engineers by offloading mechanical work while keeping architectural decisions firmly human-driven.

How We Used Claude (and Subagents)

We treated Claude as a force multiplier in three specific areas:

1. Component Scaffolding at Scale

Once we had designed a canonical Pulumi component template (inputs, outputs, defaults, error handling), we used Claude to:

  • Generate initial component scaffolding in Go
  • Apply consistent structure across 15+ components
  • Enforce naming conventions and documentation patterns

This eliminated hours of repetitive setup work while preserving consistency.
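
For context, here is a stripped-down sketch of what that canonical template looks like as a Pulumi ComponentResource in Go. The type token, argument fields, and placeholder output are hypothetical; the real components create actual resources where the comments indicate.

package components

import (
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

// NetworkArgs is the typed input surface. Field names are illustrative.
type NetworkArgs struct {
	ProjectId pulumi.StringInput
	Region    pulumi.StringInput
}

// Network is the component; only registered outputs escape it.
type Network struct {
	pulumi.ResourceState

	NetworkId pulumi.StringOutput
}

// NewNetwork registers the component, creates its children, and exposes
// a curated set of outputs. Error handling follows one uniform shape.
func NewNetwork(ctx *pulumi.Context, name string, args *NetworkArgs, opts ...pulumi.ResourceOption) (*Network, error) {
	comp := &Network{}
	if err := ctx.RegisterComponentResource("acme:gcp:Network", name, comp, opts...); err != nil {
		return nil, err
	}

	// Real resources (VPC, subnets, routers) are created here with
	// pulumi.Parent(comp) so they roll up under the component.
	// Placeholder output; in practice this comes from the created VPC.
	comp.NetworkId = pulumi.String(name).ToStringOutput()

	if err := ctx.RegisterResourceOutputs(comp, pulumi.Map{
		"networkId": comp.NetworkId,
	}); err != nil {
		return nil, err
	}
	return comp, nil
}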

2. Subagents for Parallel Workstreams

We used multiple focused subagents in parallel, each with a narrow responsibility:

  • One agent focused on GCP networking and IAM primitives
  • Another generated Pulumi components from existing tickets
  • Another validated assumptions against provider documentation
  • Another helped draft runbooks and troubleshooting notes

Each agent operated within strict boundaries and fed results back for human review.

This allowed us to parallelize work that would normally be serialized across days or weeks.

3. Accelerated Iteration, Not Autonomous Decisions

All critical decisions—architecture, security boundaries, compliance constraints—remained human-owned.

Claude’s role was to:

  • Reduce context-switching
  • Speed up iteration
  • Catch obvious mistakes early
  • Turn written intent into working code faster

Nothing shipped without review. Nothing bypassed controls.

Why This Mattered

The biggest time sink in large migrations isn’t decision-making—it’s translation:

  • Translating intent into boilerplate
  • Translating tickets into code
  • Translating patterns across services

AI collapsed that translation layer.

By combining:

  • Strong upfront design
  • Opinionated templates
  • AI-assisted generation
  • Strict review and guardrails

we compressed months of work into days without increasing risk.

A Key Takeaway

AI didn’t replace engineering judgment—it removed drag.

The faster we could move from “this is the pattern” to “this is implemented everywhere,” the more time we had to focus on the hard problems: networking, identity, compliance, and cutover safety.

This approach was a major reason we were able to be production-ready in 14 days while maintaining regulatory and operational rigor.


Timeline: Ready in 14 Days, Cut Over in 22

  • Days 1–7
    • Core Pulumi component library implemented
    • Guardrails for IAM, networking, logging, and deletion protection baked in
  • Days 8–11
    • GCP projects provisioned with baseline security controls
    • Core networking and HA VPN configured
    • GKE clusters deployed with workload identity
    • Base services installed using GitOps (ArgoCD)
  • Days 12–14
    • CI/CD pipelines updated using federated identity
    • Applications deployed into GCP
    • End-to-end validation completed
    • Platform deemed production-ready
At this point, the system met both functional and compliance requirements.
The remaining time focused on reducing cutover risk:

  • Days 15–21
    • Live database replication from AWS RDS to GCP AlloyDB
    • Extended testing, monitoring, and audit verification
  • Day 22
    • Sandbox and production environments cut over
    • No customer-visible downtime
    • No regression in compliance posture

Architecture Philosophy: Components Over Monoliths

Infrastructure was modeled as reusable building blocks rather than a single monolith.

devops-tools/google/
├── pulumi-gcp-project-component
├── pulumi-gcp-network-component
├── pulumi-gcp-gke-component
├── pulumi-gcp-cloudvpn-component
├── pulumi-gcp-cloudrouter-component
├── pulumi-gcp-alloydb-component
├── pulumi-gcp-memorystore-component
├── pulumi-gcp-oidc-component
└── gcp-core

Each component enforced:

  • Encryption by default
  • Least-privilege IAM
  • Centralized logging and metrics
  • Safe deletion and lifecycle controls
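
To illustrate how these blocks compose inside gcp-core, here is a hedged sketch; the module path, constructor names, and output fields are hypothetical and the real components differ. The useful property is that each layer consumes the typed outputs of the layer below it, so mis-wiring fails at compile time rather than at deploy time.

package main

import (
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"

	// Hypothetical import path; the real components live under devops-tools/google/.
	"example.com/devops-tools/google/components"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		// Hypothetical constructors mirroring the component repos above.
		project, err := components.NewProject(ctx, "prod")
		if err != nil {
			return err
		}
		network, err := components.NewNetwork(ctx, "prod-net", &components.NetworkArgs{
			ProjectId: project.ProjectId,
			Region:    pulumi.String("us-central1"),
		})
		if err != nil {
			return err
		}
		// The GKE component can only be wired to a real network output.
		_, err = components.NewGkeCluster(ctx, "prod-gke", &components.GkeClusterArgs{
			ProjectId: project.ProjectId,
			NetworkId: network.NetworkId,
		})
		return err
	})
}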

Conclusion (Part 1)

We didn’t move fast by ignoring constraints.

We moved fast because the constraints were encoded into the system.

By rebuilding our cloud platform with modular components, strong typing, and compliance-aware defaults, we reached a production-ready state in 14 days and completed a clean cutover in 22—without weakening our security or audit posture.

In Part 2, we’ll dive into the hard technical problems that made this possible, including:

  • HA VPN and BGP failure modes
  • Pulumi orchestration patterns
  • Cross-cloud identity with OIDC
  • GitOps and CI/CD design decisions

👉 Continue to Part 2: The Hard Technical Problems We Solved


Compliance Mapping Appendix (Controls → Architecture Choices)

This appendix summarizes how common regulatory and SOC 2 Type II–aligned controls were maintained throughout the migration.
It is intentionally high-level and non-exhaustive, focusing on architectural decisions rather than policy language.

Identity & Access Management

Control intent: Least privilege, strong authentication, auditable access
Architecture choices:

  • OIDC-based federation for:
    • GitHub Actions → GCP
    • GKE workloads → AWS
  • No long-lived access keys or static secrets
  • Environment-scoped service accounts
  • IAM roles defined and versioned in infrastructure as code

Outcome:
Access is short-lived, scoped, traceable, and reviewed through code changes.
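
As a sketch of the GitHub Actions → GCP half of this federation, the following shows a workload identity pool and OIDC provider in Pulumi Go. The repository name and resource names are placeholders; our production attribute mappings and conditions were more granular.

package main

import (
	"github.com/pulumi/pulumi-gcp/sdk/v7/go/gcp/iam"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		pool, err := iam.NewWorkloadIdentityPool(ctx, "github-pool", &iam.WorkloadIdentityPoolArgs{
			WorkloadIdentityPoolId: pulumi.String("github-actions"),
			DisplayName:            pulumi.String("GitHub Actions"),
		})
		if err != nil {
			return err
		}
		// Trust GitHub's OIDC issuer; tokens are short-lived and scoped
		// to a single repository via the attribute condition.
		_, err = iam.NewWorkloadIdentityPoolProvider(ctx, "github-provider", &iam.WorkloadIdentityPoolProviderArgs{
			WorkloadIdentityPoolId:         pool.WorkloadIdentityPoolId,
			WorkloadIdentityPoolProviderId: pulumi.String("github"),
			AttributeMapping: pulumi.StringMap{
				"google.subject":       pulumi.String("assertion.sub"),
				"attribute.repository": pulumi.String("assertion.repository"),
			},
			AttributeCondition: pulumi.String(`attribute.repository == "acme/platform"`),
			Oidc: &iam.WorkloadIdentityPoolProviderOidcArgs{
				IssuerUri: pulumi.String("https://token.actions.githubusercontent.com"),
			},
		})
		return err
	})
}

With the provider in place, a workflow exchanges its GitHub-issued OIDC token for a short-lived GCP credential at run time; there is no static key to rotate, leak, or audit separately.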


Change Management

Control intent: Authorized, reviewed, and traceable changes
Architecture choices:

  • All infrastructure defined via Pulumi (no console changes)
  • Git-based workflows with pull request review
  • Deterministic deployments via gcp-core orchestration
  • Environment-specific stacks with explicit configuration

Outcome:
Every infrastructure change is reviewed, logged, and reproducible.
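
A small sketch of what "explicit configuration" means for environment-specific stacks (the key names are illustrative): required values must appear in each stack's config file, such as Pulumi.prod.yaml, so a missing setting fails the deployment instead of silently inheriting a default.

package main

import (
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi/config"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		cfg := config.New(ctx, "")
		// Require fails the update if the stack omits a value;
		// there are no implicit defaults to drift between environments.
		env := cfg.Require("environment")
		region := cfg.Require("region")
		ctx.Export("environment", pulumi.String(env))
		ctx.Export("region", pulumi.String(region))
		return nil
	})
}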


Logging & Monitoring

Control intent: Detect, investigate, and respond to security events
Architecture choices:

  • Centralized cloud logging enabled by default
  • GKE audit logs retained
  • Network activity observable through VPC flow logs
  • No ephemeral or undocumented infrastructure

Outcome:
Operational and security visibility was preserved throughout the migration.
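
For example, flow logs were part of the network layer's non-negotiable defaults. A simplified sketch in Pulumi Go (resource names, CIDR, and sampling values are placeholders):

package main

import (
	"github.com/pulumi/pulumi-gcp/sdk/v7/go/gcp/compute"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		network, err := compute.NewNetwork(ctx, "prod-net", &compute.NetworkArgs{
			AutoCreateSubnetworks: pulumi.Bool(false),
		})
		if err != nil {
			return err
		}
		_, err = compute.NewSubnetwork(ctx, "prod-subnet", &compute.SubnetworkArgs{
			Network:     network.ID(),
			IpCidrRange: pulumi.String("10.10.0.0/20"),
			Region:      pulumi.String("us-central1"),
			// Flow logs always on: the component offers no way to disable them.
			LogConfig: &compute.SubnetworkLogConfigArgs{
				AggregationInterval: pulumi.String("INTERVAL_5_MIN"),
				FlowSampling:        pulumi.Float64(0.5),
				Metadata:            pulumi.String("INCLUDE_ALL_METADATA"),
			},
		})
		return err
	})
}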


Data Protection & Encryption

Control intent: Protect sensitive data at rest and in transit
Architecture choices:

  • Managed services with encryption enabled by default
  • TLS enforced for service-to-service communication
  • HA VPN with IPSec for cross-cloud traffic
  • No plaintext credentials in pipelines or manifests

Outcome:
Data remained encrypted end-to-end during migration and cutover.
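
As one concrete instance, a Memorystore component can bake in encryption in transit. A minimal sketch (resource names and sizing are placeholders):

package main

import (
	"github.com/pulumi/pulumi-gcp/sdk/v7/go/gcp/redis"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		_, err := redis.NewInstance(ctx, "prod-cache", &redis.InstanceArgs{
			MemorySizeGb: pulumi.Int(1),
			Region:       pulumi.String("us-central1"),
			Tier:         pulumi.String("STANDARD_HA"),
			// TLS and AUTH are not optional inputs of the component.
			TransitEncryptionMode: pulumi.String("SERVER_AUTHENTICATION"),
			AuthEnabled:           pulumi.Bool(true),
		})
		return err
	})
}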


Availability & Resilience

Control intent: Maintain service availability and fault tolerance
Architecture choices:

  • Multi-tunnel HA VPN with automatic failover
  • Managed regional database services
  • Gradual cutover with live replication
  • Deletion protection enabled for production resources

Outcome:
The migration introduced no customer-visible downtime or data loss.
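
Deletion protection can be enforced at the Pulumi layer rather than remembered per resource. A sketch of the kind of shared helper our components could use (the helper name is hypothetical): a protected resource makes pulumi destroy, or an accidental delete during a refactor, fail until the resource is explicitly unprotected, which is exactly the friction we want in production.

package components

import "github.com/pulumi/pulumi/sdk/v3/go/pulumi"

// ProdOpts is appended to every resource a component creates, so
// deletion protection in production is a default rather than a choice.
func ProdOpts(isProd bool) []pulumi.ResourceOption {
	if isProd {
		return []pulumi.ResourceOption{pulumi.Protect(true)}
	}
	return nil
}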


Asset Inventory & Configuration Management

Control intent: Know what exists and how it is configured
Architecture choices:

  • Single source of truth via infrastructure as code
  • Explicit component boundaries
  • No unmanaged or “temporary” resources

Outcome:
The platform remained fully inventoried and auditable at all times.


Key Takeaway

Compliance was not validated after the migration—it was enforced continuously by design.

By embedding controls directly into infrastructure primitives, we avoided the common trade-off between speed and assurance. The system itself made it difficult to do the wrong thing.

Moose is a Chief Information Security Officer specializing in cloud security, infrastructure automation, and regulatory compliance. With 15+ years in cybersecurity and 25+ years in hacking and signal intelligence, he leads cloud migration initiatives and DevSecOps for fintech platforms.