
Rebuilding Our Cloud Platform: The Hard Technical Problems (Part 2)

This post is the technical deep dive.
If you haven’t read Part 1, start there for the context and architecture story:
👉 Rebuilding Our Cloud Platform: An AWS to GCP Migration in 22 Days (Part 1)

This article focuses on the engineering challenges that determined whether the migration would succeed or fail—and the lessons that only show up once things break.


Why This Part Exists

Most cloud migration posts stop at architecture diagrams and service lists.

This one exists for a different reason:

  • To document failure modes
  • To explain why things broke
  • To show how we encoded the fixes so they never happen again

If you’re planning a cross-cloud migration, this is the part that saves you days—or weeks.


Infrastructure as Reusable Components (Pulumi + Go)

Every GCP service was implemented as a Pulumi component written in Go. This wasn’t about preference—it was about leverage.

Design Principles

Each component:

  • Exposes a single, strongly typed input struct
  • Encodes safe defaults
  • Hides provider complexity
  • Fails fast at compile time

Example (simplified):

type GKEClusterArgs struct {
    Name               pulumi.StringInput
    Region             pulumi.StringInput
    NodePools          []NodePoolArgs
    Autopilot          pulumi.BoolInput
    DeletionProtection pulumi.BoolInput
}

This approach prevented entire classes of runtime failures and made intent explicit in code review.
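
For a sense of how the defaults are encoded, here is a trimmed constructor sketch. The component type string, default region, and defaulting rules are illustrative rather than our exact implementation:

type GKECluster struct {
    pulumi.ResourceState
}

// NewGKECluster applies safe defaults before any provider resource is created,
// so a caller can't accidentally ship an unprotected or misplaced cluster.
func NewGKECluster(ctx *pulumi.Context, name string, args *GKEClusterArgs, opts ...pulumi.ResourceOption) (*GKECluster, error) {
    if args == nil {
        args = &GKEClusterArgs{}
    }
    if args.DeletionProtection == nil {
        args.DeletionProtection = pulumi.Bool(true) // safe default: protected unless explicitly opted out
    }
    if args.Region == nil {
        args.Region = pulumi.String("us-east4") // illustrative default region
    }

    comp := &GKECluster{}
    if err := ctx.RegisterComponentResource("gcp-core:container:GKECluster", name, comp, opts...); err != nil {
        return nil, err
    }

    // container.Cluster, node pools, etc. are created here, parented to comp,
    // using the validated args.

    return comp, nil
}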


Centralized Orchestration with gcp-core

All components are orchestrated from a single project: gcp-core.

The challenge with deploying 15+ interconnected components is managing dependencies. Deploy a GKE cluster before its VPC exists? Failure. Deploy AlloyDB before Private Service Access is configured? Failure.

We needed a system that:

  1. Declared dependencies explicitly
  2. Deployed in the correct order automatically
  3. Conditionally enabled components per environment
  4. Failed fast with clear error messages

The Component Registry Pattern

The registry acts as a deployment orchestrator:

// registry.go
type DependencyType int

const (
    DependsOnNone DependencyType = iota
    DependsOnProject
    DependsOnNetwork
    DependsOnRouter
    DependsOnPrivateServiceAccess
    DependsOnGKE
)

type ComponentDefinition struct {
    Name         string
    Dependencies []DependencyType
    DeployFunc   func(*pulumi.Context, *Config) error
    Enabled      bool
}

type ComponentRegistry struct {
    components map[string]*ComponentDefinition
    deployed   map[string]bool
}

// NewComponentRegistry initializes both maps so Register and DeployAll can be
// called safely.
func NewComponentRegistry() *ComponentRegistry {
    return &ComponentRegistry{
        components: make(map[string]*ComponentDefinition),
        deployed:   make(map[string]bool),
    }
}

func (r *ComponentRegistry) Register(name string, deps []DependencyType, deployFunc func(*pulumi.Context, *Config) error) {
    r.components[name] = &ComponentDefinition{
        Name:         name,
        Dependencies: deps,
        DeployFunc:   deployFunc,
        Enabled:      false, // Must be explicitly enabled
    }
}

Topological Sorting for Deployment Order

The registry performs a topological sort to determine deployment order:

func (r *ComponentRegistry) DeployAll(ctx *pulumi.Context, cfg *Config) error {
    // Build dependency graph
    graph := make(map[string][]string)
    for name, comp := range r.components {
        if !comp.Enabled {
            continue
        }
        graph[name] = r.getDependencyNames(comp.Dependencies)
    }

    // Topological sort
    sorted, err := topologicalSort(graph)
    if err != nil {
        return fmt.Errorf("circular dependency detected: %w", err)
    }

    // Deploy in order
    for _, name := range sorted {
        comp := r.components[name]
        if err := comp.DeployFunc(ctx, cfg); err != nil {
            return fmt.Errorf("failed to deploy %s: %w", name, err)
        }
        r.deployed[name] = true
    }

    return nil
}
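
The topologicalSort helper isn't shown above. A minimal Kahn's-algorithm version (illustrative rather than the exact code we shipped; it uses the standard fmt and sort packages) looks like this:

// topologicalSort orders components so every dependency deploys before its
// dependents, and fails when the graph contains a cycle or a missing dependency.
func topologicalSort(graph map[string][]string) ([]string, error) {
    inDegree := make(map[string]int, len(graph))
    dependents := make(map[string][]string, len(graph))

    for name, deps := range graph {
        if _, ok := inDegree[name]; !ok {
            inDegree[name] = 0
        }
        for _, dep := range deps {
            if _, ok := graph[dep]; !ok {
                return nil, fmt.Errorf("component %q requires %q, which is not enabled", name, dep)
            }
            inDegree[name]++
            dependents[dep] = append(dependents[dep], name)
        }
    }

    // Start with components that depend on nothing (sorted for a stable order).
    var queue []string
    for name, degree := range inDegree {
        if degree == 0 {
            queue = append(queue, name)
        }
    }
    sort.Strings(queue)

    sorted := make([]string, 0, len(graph))
    for len(queue) > 0 {
        name := queue[0]
        queue = queue[1:]
        sorted = append(sorted, name)

        // "Remove" this node and release any dependents it was blocking.
        for _, dependent := range dependents[name] {
            inDegree[dependent]--
            if inDegree[dependent] == 0 {
                queue = append(queue, dependent)
            }
        }
    }

    if len(sorted) != len(graph) {
        return nil, fmt.Errorf("%d component(s) stuck in a dependency cycle", len(graph)-len(sorted))
    }
    return sorted, nil
}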

Configuration-Driven Enablement

Each environment’s stack config determines which components deploy:

# Pulumi.production.yaml
config:
  gcp-core:components:
    - name: "network"
      enabled: true
    - name: "gke"
      enabled: true
      config:
        nodeCount: 3
        machineType: "e2-standard-4"
    - name: "alloydb"
      enabled: true
      config:
        cpuCount: 4
        availabilityType: "REGIONAL"
    - name: "vpn"
      enabled: true

# Pulumi.sandbox.yaml - minimal setup
config:
  gcp-core:components:
    - name: "network"
      enabled: true
    - name: "gke"
      enabled: true
      config:
        nodeCount: 1
        machineType: "e2-small"
    - name: "alloydb"
      enabled: false  # Use CloudSQL instead
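
Parsing that list into Go is a small amount of glue around Pulumi's config package (github.com/pulumi/pulumi/sdk/v3/go/pulumi/config). A minimal sketch, with illustrative struct names:

// config.go
type ComponentConfig struct {
    Name    string                 `json:"name"`
    Enabled bool                   `json:"enabled"`
    Config  map[string]interface{} `json:"config,omitempty"`
}

type Config struct {
    Components []ComponentConfig
}

func LoadConfig(ctx *pulumi.Context) *Config {
    cfg := config.New(ctx, "gcp-core")

    var components []ComponentConfig
    // "components" is the structured key under the gcp-core namespace above.
    cfg.RequireObject("components", &components)

    return &Config{Components: components}
}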

Registration in main.go

func main() {
    pulumi.Run(func(ctx *pulumi.Context) error {
        cfg := LoadConfig(ctx)
        registry := NewComponentRegistry()

        // Register all components with dependencies
        registry.Register("project", []DependencyType{DependsOnNone}, deployProject)
        registry.Register("network", []DependencyType{DependsOnProject}, deployNetwork)
        registry.Register("router", []DependencyType{DependsOnNetwork}, deployRouter)
        registry.Register("vpn", []DependencyType{DependsOnNetwork, DependsOnRouter}, deployVPN)
        registry.Register("private-service-access", []DependencyType{DependsOnNetwork}, deployPrivateServiceAccess)
        registry.Register("gke", []DependencyType{DependsOnNetwork}, deployGKE)
        registry.Register("alloydb", []DependencyType{DependsOnPrivateServiceAccess}, deployAlloyDB)
        registry.Register("memorystore", []DependencyType{DependsOnPrivateServiceAccess}, deployMemorystore)

        // Enable components based on config
        for _, compCfg := range cfg.Components {
            if comp, exists := registry.components[compCfg.Name]; exists {
                comp.Enabled = compCfg.Enabled
            }
        }

        // Deploy everything in correct order
        return registry.DeployAll(ctx, cfg)
    })
}

Benefits

This pattern gave us:

  • Predictable deployments - Same order every time
  • Environment flexibility - Production gets AlloyDB, sandbox falls back to Cloud SQL
  • Fast failure - Circular dependencies caught at startup
  • Clear errors - “Component X requires Y” vs cryptic provider errors

HA VPN: The Hardest Problem

Target Architecture

We required full bi-directional connectivity between AWS and GCP during migration.

Requirements:

  • No single point of failure
  • Support live database replication
  • Zero-downtime cutover

High-Level Topology

flowchart LR
    AWS[AWS VPC<br/>172.16.0.0/16]
    GCP[GCP VPC<br/>10.0.0.0/20]

    AWS -- IPSec + BGP --> GCP
    AWS -- IPSec + BGP --> GCP
    AWS -- IPSec + BGP --> GCP
    AWS -- IPSec + BGP --> GCP

HA VPN with Four Tunnels

AWS VPC (172.16.0.0/16)
  ├─ VPN Connection 0
  │   ├─ Tunnel 0 ──▶ GCP Gateway Interface 0
  │   └─ Tunnel 1 ──▶ GCP Gateway Interface 0
  └─ VPN Connection 1
      ├─ Tunnel 0 ──▶ GCP Gateway Interface 1
      └─ Tunnel 1 ──▶ GCP Gateway Interface 1

This configuration provides:

  • GCP 99.99% HA VPN SLA
  • Active/active traffic
  • Automatic failover
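
In Pulumi, the gateway, the AWS peer gateway, and the first of the four tunnels look roughly like this. Names, IPs, and values are illustrative; vpc, router, and presharedKey are assumed to come from other components:

// HA VPN gateway with two interfaces on the GCP side.
haGateway, err := compute.NewHaVpnGateway(ctx, "aws-ha-vpn-gw", &compute.HaVpnGatewayArgs{
    Region:  pulumi.String("us-east4"),
    Network: vpc.ID(), // VPC from the network component
})
if err != nil {
    return err
}

// One AWS Site-to-Site VPN connection, modeled as an external gateway with
// its two public tunnel endpoints.
awsConn0, err := compute.NewExternalVpnGateway(ctx, "aws-vpn-conn-0", &compute.ExternalVpnGatewayArgs{
    RedundancyType: pulumi.String("TWO_IPS_REDUNDANCY"),
    Interfaces: compute.ExternalVpnGatewayInterfaceArray{
        &compute.ExternalVpnGatewayInterfaceArgs{Id: pulumi.Int(0), IpAddress: pulumi.String("3.220.0.10")},
        &compute.ExternalVpnGatewayInterfaceArgs{Id: pulumi.Int(1), IpAddress: pulumi.String("3.220.0.11")},
    },
})
if err != nil {
    return err
}

// First of the four tunnels; the other three differ only in interface indexes
// and shared secrets.
_, err = compute.NewVpnTunnel(ctx, "aws-tunnel-0-0", &compute.VpnTunnelArgs{
    Region:                       pulumi.String("us-east4"),
    VpnGateway:                   haGateway.ID(),
    VpnGatewayInterface:          pulumi.Int(0),
    PeerExternalGateway:          awsConn0.ID(),
    PeerExternalGatewayInterface: pulumi.Int(0),
    Router:                       router.ID(),  // Cloud Router from the router component
    SharedSecret:                 presharedKey, // Pulumi secret from the AWS tunnel config
    IkeVersion:                   pulumi.Int(2),
})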

The BGP Failure That Cost Us Hours

The tunnels came up. IPSec was healthy. BGP stayed DOWN.

On the GCP side, the Cloud Router showed the BGP session as DOWN. On the AWS side, the Site-to-Site VPN status showed IPSEC IS UP but BGP IS DOWN.

The Root Cause: IP Math is Hard

AWS provides two BGP IP addresses for each tunnel: one for the Virtual Private Gateway (VGW) and one for your Customer Gateway (CGW). These belong to a tiny /30 subnet (e.g., 169.254.201.32/30).

169.254.201.32/30
├─ .32  Network address (RESERVED)
├─ .33  AWS VGW BGP IP (The Peer IP)
├─ .34  AWS CGW BGP IP (The Local IP for GCP)
└─ .35  Broadcast (RESERVED)

The mistake:
We set the GCP “Local BGP IP” to .32. We assumed GCP would just take the first available address in the range. It didn’t. It used exactly what we told it, which was the network address. BGP doesn’t log “hey, you used a reserved IP.” It just doesn’t establish.

The Debugging Process

  1. Verify IPSec: Traffic was reaching the gateway, so the phase 1/2 negotiation was fine.
  2. Ping the Peer: We couldn’t ping .33 from a GKE pod. This confirmed a routing/layer 3 issue.
  3. Inspect the Config: We compared the Pulumi code against the AWS-generated configuration file.
// The fix in our VPN component
RouterInterfaceArgs: &compute.RouterInterfaceArgs{
    IpRange: pulumi.String("169.254.201.34/30"), // Fixed from .32
},
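
In context, the interface and its BGP peer for one tunnel look roughly like this (ASNs and resource names are illustrative; router and tunnel come from the surrounding component):

// Router interface bound to tunnel 0, using the CGW inside address (.34).
iface, err := compute.NewRouterInterface(ctx, "aws-tunnel-0-0-iface", &compute.RouterInterfaceArgs{
    Router:    router.Name,
    Region:    pulumi.String("us-east4"),
    IpRange:   pulumi.String("169.254.201.34/30"),
    VpnTunnel: tunnel.Name,
})
if err != nil {
    return err
}

// BGP peer pointing at the AWS VGW inside address (.33).
_, err = compute.NewRouterPeer(ctx, "aws-tunnel-0-0-peer", &compute.RouterPeerArgs{
    Router:        router.Name,
    Region:        pulumi.String("us-east4"),
    Interface:     iface.Name,
    PeerIpAddress: pulumi.String("169.254.201.33"),
    PeerAsn:       pulumi.Int(64512), // AWS-side ASN (illustrative)
})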

The lesson:

When debugging BGP, always verify the /30 boundaries. BGP is a protocol built on trust and exact matches; if your peering IPs are off by even one bit, it will fail silently.


Private Service Access: The Second Trap

Services like AlloyDB and Memorystore live in Google-managed networks—not your VPC CIDRs.

By default:

  • These ranges are not advertised over BGP
  • AWS cannot route to them

The Fix: Custom Route Advertisements

advertisedIpRanges:
  - range: 10.229.0.0/24   # AlloyDB
  - range: 10.229.1.0/29   # Memorystore
  - range: 10.188.0.0/14   # GKE Pod CIDR

This single configuration change unblocked:

  • Cross-cloud database access
  • Cache access
  • Pod-to-AWS service communication
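
In Pulumi terms, this is custom advertisement on the Cloud Router. A sketch with the same ranges as the YAML above (ASN and resource names are illustrative):

router, err := compute.NewRouter(ctx, "vpn-router", &compute.RouterArgs{
    Region:  pulumi.String("us-east4"),
    Network: vpc.ID(),
    Bgp: &compute.RouterBgpArgs{
        Asn:           pulumi.Int(64514), // GCP-side ASN (illustrative)
        AdvertiseMode: pulumi.String("CUSTOM"),
        // Keep advertising the VPC subnets, then add the Google-managed ranges.
        AdvertisedGroups: pulumi.StringArray{pulumi.String("ALL_SUBNETS")},
        AdvertisedIpRanges: compute.RouterBgpAdvertisedIpRangeArray{
            &compute.RouterBgpAdvertisedIpRangeArgs{Range: pulumi.String("10.229.0.0/24"), Description: pulumi.String("AlloyDB PSA")},
            &compute.RouterBgpAdvertisedIpRangeArgs{Range: pulumi.String("10.229.1.0/29"), Description: pulumi.String("Memorystore PSA")},
            &compute.RouterBgpAdvertisedIpRangeArgs{Range: pulumi.String("10.188.0.0/14"), Description: pulumi.String("GKE pod CIDR")},
        },
    },
})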

Verifying VPN & BGP Health

These commands became part of our runbook:

# Tunnel status
gcloud compute vpn-tunnels describe <tunnel> \
  --region us-east4 --format="value(status)"

# BGP peer status
gcloud compute routers get-status <router> \
  --region us-east4 \
  --format="table(result.bgpPeerStatus[].name,result.bgpPeerStatus[].status)"

If BGP isn’t UP, nothing else matters.


Cross-Cloud Identity with OIDC

Static AWS Access Keys or GCP JSON keys are an anti-pattern. They are hard to rotate, easy to leak, and a nightmare to audit.

We implemented Workload Identity Federation in both directions.

GKE → AWS (Accessing KMS/S3)

We wanted GKE pods to be able to decrypt SOPS-encrypted secrets stored in AWS KMS without having an AWS user.

  1. GKE Side: Every GKE cluster is an OIDC issuer.
  2. AWS Side: We registered the GKE OIDC issuer in IAM.
  3. Trust Policy: We created an AWS IAM role that trusts the GKE issuer.
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::123456789012:oidc-provider/container.googleapis.com/v1/projects/my-project/locations/us-east1/clusters/my-cluster"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "container.googleapis.com/v1/projects/my-project/locations/us-east1/clusters/my-cluster:sub": "system:serviceaccount:my-namespace:my-service-account"
      }
    }
  }]
}
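
From the pod's point of view, the projected service-account token is exchanged for short-lived AWS credentials via AssumeRoleWithWebIdentity. A sketch using the AWS SDK for Go v2; the role ARN and token mount path are assumptions, not values from our setup:

package main

import (
    "context"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    awsconfig "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/credentials/stscreds"
    "github.com/aws/aws-sdk-go-v2/service/kms"
    "github.com/aws/aws-sdk-go-v2/service/sts"
)

func main() {
    ctx := context.Background()

    base, err := awsconfig.LoadDefaultConfig(ctx, awsconfig.WithRegion("us-east-1"))
    if err != nil {
        log.Fatal(err)
    }

    // Exchange the projected GKE service-account token for temporary AWS credentials.
    creds := stscreds.NewWebIdentityRoleProvider(
        sts.NewFromConfig(base),
        "arn:aws:iam::123456789012:role/gke-sops-decrypt",                    // role carrying the trust policy above (illustrative)
        stscreds.IdentityTokenFile("/var/run/secrets/tokens/gke-oidc-token"), // projected token volume (illustrative path)
    )

    // KMS client that authenticates via the federated role; Decrypt calls can
    // unwrap SOPS data keys without any stored AWS secret.
    kmsClient := kms.NewFromConfig(base, func(o *kms.Options) {
        o.Credentials = aws.NewCredentialsCache(creds)
    })
    _ = kmsClient
}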

GitHub Actions → GCP (CI/CD)

For deployments, GitHub Actions generates a short-lived OIDC token. Our Pulumi code uses this token to assume a GCP Service Account.

// Example Pulumi code for GCP OIDC Provider
provider, _ := iam.NewWorkloadIdentityPoolProvider(ctx, "github-provider", &iam.WorkloadIdentityPoolProviderArgs{
    WorkloadIdentityPoolId:         pool.WorkloadIdentityPoolId,
    WorkloadIdentityPoolProviderId: pulumi.String("github-actions"),
    Oidc: &iam.WorkloadIdentityPoolProviderOidcArgs{
        IssuerUri: pulumi.String("https://token.actions.githubusercontent.com"),
    },
    AttributeMapping: pulumi.StringMap{
        "google.subject":       pulumi.String("assertion.sub"),
        "attribute.repository": pulumi.String("assertion.repository"),
    },
})
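
The provider by itself grants nothing; the federated identity still needs permission to impersonate the deployer service account. A sketch of that binding, where the repository and the deployerSA variable are illustrative:

// Allow workloads from our repository, mapped through the pool, to impersonate
// the Pulumi deployer service account.
_, err := serviceaccount.NewIAMMember(ctx, "github-actions-wif", &serviceaccount.IAMMemberArgs{
    ServiceAccountId: deployerSA.Name, // fully-qualified name of pulumi-deployer@my-project.iam.gserviceaccount.com
    Role:             pulumi.String("roles/iam.workloadIdentityUser"),
    Member: pulumi.Sprintf(
        "principalSet://iam.googleapis.com/%s/attribute.repository/my-org/my-repo",
        pool.Name,
    ),
})
if err != nil {
    return err
}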

Why This Matters

  • Zero Secrets: No keys are stored in GitHub Secrets or Kubernetes Secrets.
  • Top-Tier Auditing: CloudTrail shows exactly which Kubernetes Service Account assumed which role.
  • Automatic Rotation: Tokens are short-lived and expire automatically.

GitOps & CI/CD: The Operational Glue

Scaling to 15+ components requires more than just good code—it requires a predictable release cycle.

ArgoCD for Infrastructure Apps

We used ArgoCD to manage everything inside the GKE clusters (Ingress, ExternalDNS, Cert-Manager).

  • Application-of-Applications Pattern: One root app manages child apps.
  • Self-Healing: If a developer manually changes a service, ArgoCD reverts it to the Git state within seconds.

GitHub Actions for Pulumi

Every PR triggers a pulumi preview. Once merged, a pulumi up runs.

# Simplified GHA Workflow
jobs:
  pulumi:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write # Required for OIDC
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: "projects/1234567/locations/global/workloadIdentityPools/my-pool/providers/my-provider"
          service_account: "pulumi-deployer@my-project.iam.gserviceaccount.com"
      - run: pulumi up --stack production --yes

Critical Flow

  1. Developer pushes to main.
  2. GHA assumes GCP Service Account via OIDC.
  3. Pulumi updates GCP core infrastructure.
  4. GKE cluster state changes.
  5. ArgoCD detects the change (e.g., a new namespace or configmap) and syncs the internal apps.

Cost Impact (Order-of-Magnitude)

Savings came from:

  • Managed database pricing
  • Kubernetes control plane differences
  • Reduced auxiliary service costs

These estimates are intentionally conservative and reflect small-to-medium environments, not enterprise scale.


Conference Takeaways (TL;DR Slides)

If this were a conference talk, these would be the slides:

  1. Speed comes from structure, not urgency
  2. Encode failure modes into components
  3. Networking will be your hardest problem
  4. BGP does not forgive mistakes
  5. OIDC should be the default, not the exception

Final Thoughts

Part 1 told the story.
Part 2 showed the scars.

This migration worked because we treated infrastructure as a system, not a collection of services—and we encoded what we learned so the same mistakes can’t happen twice.

👉 Back to Part 1 for the full context and architecture story.


End of series.

Moose is a Chief Information Security Officer specializing in cloud security, infrastructure automation, and regulatory compliance. With 15+ years in cybersecurity and 25+ years in hacking and signal intelligence, he leads cloud migration initiatives and DevSecOps for fintech platforms.