This post is the technical deep dive.
If you haven’t read Part 1, start there for the context and architecture story:
👉 Rebuilding Our Cloud Platform: An AWS to GCP Migration in 22 Days (Part 1)
This article focuses on the engineering challenges that determined whether the migration would succeed or fail—and the lessons that only show up once things break.
Why This Part Exists
Most cloud migration posts stop at architecture diagrams and service lists.
This one exists for a different reason:
- To document failure modes
- To explain why things broke
- To show how we encoded the fixes so they never happen again
If you’re planning a cross-cloud migration, this is the part that saves you days—or weeks.
Infrastructure as Reusable Components (Pulumi + Go)
Every GCP service was implemented as a Pulumi component written in Go. This wasn’t about preference—it was about leverage.
Design Principles
Each component:
- Exposes a single, strongly typed input struct
- Encodes safe defaults
- Hides provider complexity
- Fails fast at compile time
Example (simplified):
type GKEClusterArgs struct {
    Name               pulumi.StringInput
    Region             pulumi.StringInput
    NodePools          []NodePoolArgs
    Autopilot          pulumi.BoolInput
    DeletionProtection pulumi.BoolInput
}
This approach prevented entire classes of runtime failures and made intent explicit in code review.
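For a sense of the call site, here is an illustrative sketch (the NewGKECluster constructor name and the values are assumptions, not our exact API):
// Illustrative call site: a wrong type or missing field fails at compile
// time, before Pulumi ever talks to the provider.
cluster, err := NewGKECluster(ctx, "primary", &GKEClusterArgs{
    Name:               pulumi.String("primary"),
    Region:             pulumi.String("us-east4"),
    Autopilot:          pulumi.Bool(false),
    DeletionProtection: pulumi.Bool(true),
})
if err != nil {
    return err
}
_ = cluster // passed to downstream components in the real code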
Centralized Orchestration with gcp-core
All components are orchestrated from a single project: gcp-core.
The challenge with deploying 15+ interconnected components is managing dependencies. Deploy a GKE cluster before its VPC exists? Failure. Deploy AlloyDB before Private Service Access is configured? Failure.
We needed a system that:
- Declared dependencies explicitly
- Deployed in the correct order automatically
- Conditionally enabled components per environment
- Failed fast with clear error messages
The Component Registry Pattern
The registry acts as a deployment orchestrator:
// registry.go
type DependencyType int

const (
    DependsOnNone DependencyType = iota
    DependsOnProject
    DependsOnNetwork
    DependsOnRouter
    DependsOnPrivateServiceAccess
    DependsOnGKE
)

type ComponentDefinition struct {
    Name         string
    Dependencies []DependencyType
    DeployFunc   func(*pulumi.Context, *Config) error
    Enabled      bool
}

type ComponentRegistry struct {
    components map[string]*ComponentDefinition
    deployed   map[string]bool
}

func NewComponentRegistry() *ComponentRegistry {
    return &ComponentRegistry{
        components: make(map[string]*ComponentDefinition),
        deployed:   make(map[string]bool),
    }
}

func (r *ComponentRegistry) Register(name string, deps []DependencyType, deployFunc func(*pulumi.Context, *Config) error) {
    r.components[name] = &ComponentDefinition{
        Name:         name,
        Dependencies: deps,
        DeployFunc:   deployFunc,
        Enabled:      false, // Must be explicitly enabled
    }
}
Topological Sorting for Deployment Order
The registry performs a topological sort to determine deployment order:
func (r *ComponentRegistry) DeployAll(ctx *pulumi.Context, cfg *Config) error {
    // Build dependency graph
    graph := make(map[string][]string)
    for name, comp := range r.components {
        if !comp.Enabled {
            continue
        }
        graph[name] = r.getDependencyNames(comp.Dependencies)
    }

    // Topological sort
    sorted, err := topologicalSort(graph)
    if err != nil {
        return fmt.Errorf("circular dependency detected: %w", err)
    }

    // Deploy in order
    for _, name := range sorted {
        comp := r.components[name]
        if err := comp.DeployFunc(ctx, cfg); err != nil {
            return fmt.Errorf("failed to deploy %s: %w", name, err)
        }
        r.deployed[name] = true
    }
    return nil
}
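The topologicalSort helper isn't shown above; a minimal version using Kahn's algorithm, plus sorting for deterministic order, could look like this (standard library only):
import (
    "fmt"
    "sort"
)

// topologicalSort orders component names so every component appears after
// its dependencies. graph maps a component to the components it depends on.
func topologicalSort(graph map[string][]string) ([]string, error) {
    inDegree := make(map[string]int, len(graph))
    dependents := make(map[string][]string) // dep -> components that need it
    for name := range graph {
        inDegree[name] = 0
    }
    for name, deps := range graph {
        for _, dep := range deps {
            if _, enabled := graph[dep]; !enabled {
                continue // dependency disabled for this environment
            }
            inDegree[name]++
            dependents[dep] = append(dependents[dep], name)
        }
    }
    // Start with components that depend on nothing.
    var queue []string
    for name, d := range inDegree {
        if d == 0 {
            queue = append(queue, name)
        }
    }
    var sorted []string
    for len(queue) > 0 {
        sort.Strings(queue) // map iteration is random; keep deploys predictable
        name := queue[0]
        queue = queue[1:]
        sorted = append(sorted, name)
        for _, m := range dependents[name] {
            inDegree[m]--
            if inDegree[m] == 0 {
                queue = append(queue, m)
            }
        }
    }
    if len(sorted) != len(graph) {
        return nil, fmt.Errorf("%d components form a cycle", len(graph)-len(sorted))
    }
    return sorted, nil
}
(getDependencyNames, also referenced above, simply maps the DependencyType enum back to registered component names.)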
Configuration-Driven Enablement
Each environment’s stack config determines which components deploy:
# Pulumi.production.yaml
config:
  gcp-core:components:
    - name: "network"
      enabled: true
    - name: "gke"
      enabled: true
      config:
        nodeCount: 3
        machineType: "e2-standard-4"
    - name: "alloydb"
      enabled: true
      config:
        cpuCount: 4
        availabilityType: "REGIONAL"
    - name: "vpn"
      enabled: true

# Pulumi.sandbox.yaml - minimal setup
config:
  gcp-core:components:
    - name: "network"
      enabled: true
    - name: "gke"
      enabled: true
      config:
        nodeCount: 1
        machineType: "e2-small"
    - name: "alloydb"
      enabled: false # Use CloudSQL instead
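LoadConfig (used in main.go below) unmarshals this structured config via Pulumi's config package. A minimal sketch, with the type and field names assumed to mirror the YAML keys:
// Sketch of the types behind LoadConfig.
type ComponentConfig struct {
    Name    string                 `json:"name"`
    Enabled bool                   `json:"enabled"`
    Config  map[string]interface{} `json:"config"`
}

type Config struct {
    Components []ComponentConfig
}

func LoadConfig(ctx *pulumi.Context) *Config {
    cfg := config.New(ctx, "gcp-core")
    var components []ComponentConfig
    // RequireObject fails fast if the key is missing or malformed.
    cfg.RequireObject("components", &components)
    return &Config{Components: components}
}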
Registration in main.go
func main() {
    pulumi.Run(func(ctx *pulumi.Context) error {
        cfg := LoadConfig(ctx)
        registry := NewComponentRegistry()

        // Register all components with dependencies
        registry.Register("project", []DependencyType{DependsOnNone}, deployProject)
        registry.Register("network", []DependencyType{DependsOnProject}, deployNetwork)
        registry.Register("router", []DependencyType{DependsOnNetwork}, deployRouter)
        registry.Register("vpn", []DependencyType{DependsOnNetwork, DependsOnRouter}, deployVPN)
        registry.Register("private-service-access", []DependencyType{DependsOnNetwork}, deployPrivateServiceAccess)
        registry.Register("gke", []DependencyType{DependsOnNetwork}, deployGKE)
        registry.Register("alloydb", []DependencyType{DependsOnPrivateServiceAccess}, deployAlloyDB)
        registry.Register("memorystore", []DependencyType{DependsOnPrivateServiceAccess}, deployMemorystore)

        // Enable components based on config
        for _, compCfg := range cfg.Components {
            if comp, exists := registry.components[compCfg.Name]; exists {
                comp.Enabled = compCfg.Enabled
            }
        }

        // Deploy everything in correct order
        return registry.DeployAll(ctx, cfg)
    })
}
Benefits
This pattern gave us:
- Predictable deployments - Same order every time
- Environment flexibility - Production gets AlloyDB; sandbox falls back to CloudSQL
- Fast failure - Circular dependencies caught at startup
- Clear errors - “Component X requires Y” vs cryptic provider errors
HA VPN: The Hardest Problem
Target Architecture
We required full bi-directional connectivity between AWS and GCP during migration.
Requirements:
- No single point of failure
- Support live database replication
- Zero-downtime cutover
High-Level Topology
flowchart LR
    AWS[AWS VPC<br/>172.16.0.0/16]
    GCP[GCP VPC<br/>10.0.0.0/20]
    AWS -- "Tunnel 0: IPSec + BGP" --> GCP
    AWS -- "Tunnel 1: IPSec + BGP" --> GCP
    AWS -- "Tunnel 2: IPSec + BGP" --> GCP
    AWS -- "Tunnel 3: IPSec + BGP" --> GCP
HA VPN with Four Tunnels
AWS VPC (172.16.0.0/16)
├─ VPN Connection 0
│ ├─ Tunnel 0 ──▶ GCP Gateway Interface 0
│ └─ Tunnel 1 ──▶ GCP Gateway Interface 0
└─ VPN Connection 1
├─ Tunnel 0 ──▶ GCP Gateway Interface 1
└─ Tunnel 1 ──▶ GCP Gateway Interface 1
This configuration provides:
- GCP 99.99% HA VPN SLA
- Active/active traffic
- Automatic failover
The BGP Failure That Cost Us Hours
The tunnels came up. IPSec was healthy. BGP stayed DOWN.
On the GCP side, the Cloud Router showed the BGP session as DOWN. On the AWS side, the Site-to-Site VPN status showed IPSEC IS UP but BGP IS DOWN.
The Root Cause: IP Math is Hard
AWS provides two BGP IP addresses for each tunnel: one for the Virtual Private Gateway (VGW) and one for your Customer Gateway (CGW). These belong to a tiny /30 subnet (e.g., 169.254.201.32/30).
169.254.201.32/30
├─ .32 Network address (RESERVED)
├─ .33 AWS VGW BGP IP (The Peer IP)
├─ .34 AWS CGW BGP IP (The Local IP for GCP)
└─ .35 Broadcast (RESERVED)
The mistake:
We set the GCP “Local BGP IP” to .32. We assumed GCP would just take the first available address in the range. It didn’t. It used exactly what we told it, which was the network address. BGP doesn’t log “hey, you used a reserved IP.” It just doesn’t establish.
The Debugging Process
- Verify IPSec: Traffic was reaching the gateway, so the phase 1/2 negotiation was fine.
- Ping the Peer: We couldn't ping .33 from a GKE pod. This confirmed a routing/Layer 3 issue.
- Inspect the Config: We compared the Pulumi code against the AWS-generated configuration file.
// The fix in our VPN component
RouterInterfaceArgs: &compute.RouterInterfaceArgs{
    IpRange: pulumi.String("169.254.201.34/30"), // Fixed from .32
},
The lesson:
When debugging BGP, always verify the /30 boundaries. BGP is a protocol built on trust and exact matches; if your peering IPs are off by even one bit, it will fail silently.
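Per this post's thesis, the fix gets encoded so it can't recur. A sketch of a guard (our own helper, standard library only) that the VPN component could run before creating the router interface:
import (
    "fmt"
    "net"
)

// validateBGPIp rejects the network and broadcast addresses of an IPv4 /30,
// exactly the class of mistake that left our BGP sessions DOWN.
func validateBGPIp(cidr string) error {
    ip, ipnet, err := net.ParseCIDR(cidr)
    if err != nil {
        return err
    }
    ones, bits := ipnet.Mask.Size()
    if bits != 32 || ones != 30 {
        return fmt.Errorf("%s: expected an IPv4 /30 link range", cidr)
    }
    switch ip.To4()[3] & 0x03 { // host offset within the /30
    case 0:
        return fmt.Errorf("%s is the reserved network address of its /30", cidr)
    case 3:
        return fmt.Errorf("%s is the reserved broadcast address of its /30", cidr)
    }
    return nil
}
With this in place, 169.254.201.32/30 fails loudly at preview time instead of as a silent BGP DOWN, while .33 and .34 pass.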
Private Service Access: The Second Trap
Services like AlloyDB and Memorystore live in Google-managed networks—not your VPC CIDRs.
By default:
- These ranges are not advertised over BGP
- AWS cannot route to them
The Fix: Custom Route Advertisements
advertisedIpRanges:
  - range: 10.229.0.0/24   # AlloyDB
  - range: 10.229.1.0/29   # Memorystore
  - range: 10.188.0.0/14   # GKE Pod CIDR
This single configuration change unblocked:
- Cross-cloud database access
- Cache access
- Pod-to-AWS service communication
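In our Pulumi router component, the equivalent change looks roughly like this (the ASN and resource names are illustrative; the ranges are the ones above):
// Inside the router component's deploy function. CUSTOM advertise mode
// replaces the default behavior, so subnets must be re-added explicitly.
router, err := compute.NewRouter(ctx, "vpn-router", &compute.RouterArgs{
    Network: network.ID(),
    Region:  pulumi.String("us-east4"),
    Bgp: &compute.RouterBgpArgs{
        Asn:              pulumi.Int(64514), // illustrative private ASN
        AdvertiseMode:    pulumi.String("CUSTOM"),
        AdvertisedGroups: pulumi.StringArray{pulumi.String("ALL_SUBNETS")},
        AdvertisedIpRanges: compute.RouterBgpAdvertisedIpRangeArray{
            &compute.RouterBgpAdvertisedIpRangeArgs{Range: pulumi.String("10.229.0.0/24")}, // AlloyDB
            &compute.RouterBgpAdvertisedIpRangeArgs{Range: pulumi.String("10.229.1.0/29")}, // Memorystore
            &compute.RouterBgpAdvertisedIpRangeArgs{Range: pulumi.String("10.188.0.0/14")}, // GKE Pods
        },
    },
})
if err != nil {
    return err
}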
Verifying VPN & BGP Health
These commands became part of our runbook:
# Tunnel status
gcloud compute vpn-tunnels describe <tunnel> \
  --region us-east4 --format="value(status)"

# BGP peer status
gcloud compute routers get-status <router> \
  --region us-east4 \
  --format="table(result.bgpPeerStatus[].name,result.bgpPeerStatus[].status)"
If BGP isn’t UP, nothing else matters.
Cross-Cloud Identity with OIDC
Static AWS Access Keys or GCP JSON keys are an anti-pattern. They are hard to rotate, easy to leak, and a nightmare to audit.
We implemented Workload Identity Federation in both directions.
GKE → AWS (Accessing KMS/S3)
We wanted GKE pods to be able to decrypt SOPS-encrypted secrets stored in AWS KMS without having an AWS user.
- GKE Side: Every GKE cluster is an OIDC issuer.
- AWS Side: We registered the GKE OIDC issuer in IAM.
- Trust Policy: We created an AWS IAM role that trusts the GKE issuer.
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::123456789012:oidc-provider/container.googleapis.com/v1/projects/my-project/locations/us-east1/clusters/my-cluster"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "container.googleapis.com/v1/projects/my-project/locations/us-east1/clusters/my-cluster:sub": "system:serviceaccount:my-namespace:my-service-account"
      }
    }
  }]
}
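On the pod side, the AWS SDK exchanges the projected service-account token for temporary credentials. A sketch with aws-sdk-go-v2; the role ARN, region, and token path are illustrative, and the token must be projected with an audience matching the IAM provider:
import (
    "context"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/credentials/stscreds"
    "github.com/aws/aws-sdk-go-v2/service/sts"
)

// awsConfigFromGKE builds an AWS config whose credentials come from the
// pod's projected Kubernetes token via AssumeRoleWithWebIdentity.
func awsConfigFromGKE(ctx context.Context) (aws.Config, error) {
    cfg, err := config.LoadDefaultConfig(ctx, config.WithRegion("us-east-1"))
    if err != nil {
        return aws.Config{}, err
    }
    provider := stscreds.NewWebIdentityRoleProvider(
        sts.NewFromConfig(cfg),
        "arn:aws:iam::123456789012:role/gke-sops-decrypt",               // role carrying the trust policy above (name illustrative)
        stscreds.IdentityTokenFile("/var/run/secrets/tokens/aws-token"), // projected SA token (path illustrative)
    )
    cfg.Credentials = aws.NewCredentialsCache(provider)
    return cfg, nil
}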
GitHub Actions → GCP (CI/CD)
For deployments, GitHub Actions generates a short-lived OIDC token. Our Pulumi code uses this token to assume a GCP Service Account.
// Example Pulumi code for the GCP OIDC provider (error handling elided)
provider, _ := iam.NewWorkloadIdentityPoolProvider(ctx, "github-provider", &iam.WorkloadIdentityPoolProviderArgs{
    WorkloadIdentityPoolId:         pool.WorkloadIdentityPoolId,
    WorkloadIdentityPoolProviderId: pulumi.String("github-actions"),
    Oidc: &iam.WorkloadIdentityPoolProviderOidcArgs{
        IssuerUri: pulumi.String("https://token.actions.githubusercontent.com"),
    },
    AttributeMapping: pulumi.StringMap{
        "google.subject":       pulumi.String("assertion.sub"),
        "attribute.repository": pulumi.String("assertion.repository"),
    },
})
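The provider alone grants nothing. The deployer Service Account also needs a binding that lets identities from the pool impersonate it, scoped via the mapped attribute.repository. A sketch (pool, account, and repo names are illustrative):
// Grants identities from the pool the right to mint tokens for the deployer
// SA, but only when the mapped attribute.repository matches our repo.
_, err := serviceaccount.NewIAMMember(ctx, "github-deployer-binding", &serviceaccount.IAMMemberArgs{
    ServiceAccountId: deployer.Name, // the pulumi-deployer service account resource
    Role:             pulumi.String("roles/iam.workloadIdentityUser"),
    Member: pulumi.Sprintf(
        "principalSet://iam.googleapis.com/%s/attribute.repository/my-org/gcp-core",
        pool.Name, // projects/NUMBER/locations/global/workloadIdentityPools/POOL_ID
    ),
})
if err != nil {
    return err
}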
Why This Matters
- Zero Secrets: No keys are stored in GitHub Secrets or Kubernetes Secrets.
- Top-Tier Auditing: CloudTrail shows exactly which Kubernetes Service Account assumed which role.
- Automatic Rotation: Tokens are short-lived and expire automatically.
GitOps & CI/CD: The Operational Glue
Scaling to 15+ components requires more than just good code—it requires a predictable release cycle.
ArgoCD for Infrastructure Apps
We used ArgoCD to manage everything inside the GKE clusters (Ingress, ExternalDNS, Cert-Manager).
- Application-of-Applications Pattern: One root app manages child apps.
- Self-Healing: If a developer manually changes a service, ArgoCD reverts it to the Git state within seconds.
GitHub Actions for Pulumi
Every PR triggers a pulumi preview. Once merged, a pulumi up runs.
# Simplified GHA Workflow
jobs:
  pulumi:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write # Required for OIDC
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: "projects/1234567/locations/global/workloadIdentityPools/my-pool/providers/my-provider"
          service_account: "pulumi-deployer@my-project.iam.gserviceaccount.com"
      - run: pulumi up --stack production --yes
Critical Flow
- Developer pushes to main.
- GHA assumes the GCP Service Account via OIDC.
- Pulumi updates GCP core infrastructure.
- GKE cluster state changes.
- ArgoCD detects the change (e.g., a new namespace or ConfigMap) and syncs the internal apps.
Cost Impact (Order-of-Magnitude)
Savings came from:
- Managed database pricing
- Kubernetes control plane differences
- Reduced auxiliary service costs
These savings are stated conservatively and at order-of-magnitude precision; they reflect small-to-medium environments, not enterprise scale.
Conference Takeaways (TL;DR Slides)
If this were a conference talk, these would be the slides:
- Speed comes from structure, not urgency
- Encode failure modes into components
- Networking will be your hardest problem
- BGP does not forgive mistakes
- OIDC should be the default, not the exception
Final Thoughts
Part 1 told the story.
Part 2 showed the scars.
This migration worked because we treated infrastructure as a system, not a collection of services—and we encoded what we learned so the same mistakes can’t happen twice.
👉 Back to Part 1 for the full context and architecture story.
End of series.