A couple of weeks ago, Kiip completed a transition from EC2 to VPC. In a previous post, I talked about the benefits and basic terminology of VPC. In this post, I’ll cover the planning that went into it, some software we’re using with our VPC, and executing our planned migration with zero downtime.
Transitioning from EC2 to VPC is not like transitioning from one set of EC2 machines to another. It’s more like moving from one hosting provider to a completely new one. The main reason is that nodes outside the VPC cannot talk to nodes inside the VPC (with exceptions, of course). Therefore, you need to bring up an almost entirely new cluster alongside your existing cluster and make a big switch to the new machines, all while making sure everything continues working smoothly and that you lose no data in the process.
Determine your VPC Layout
Before moving, it is important to decide on the layout of your private network. Amazon does a good job documenting standard VPC scenarios. At Kiip, we decided to go with a standard layout: VPC with a single public and a single private subnet. This gives us 251 usable IPs per subnet (Amazon reserves five addresses in each subnet for itself), far more than enough for our infrastructure.
At this point, I recommend spending a day or two creating VPCs and building some scripts around launching nodes into a VPC. This will get you comfortable with the new layout and lets you pre-plan all of your network ACLs, security groups, routing tables, etc. It is important to be very comfortable with VPC when you’re ready to execute a move.
Our exact setup is the following, using example CIDR blocks:
- VPC: 10.101.0.0/16
- Two subnets: Public (10.101.0.0/24), Private (10.101.1.0/24)
- Two routing tables: One that is able to route through an internet gateway, and one that only talks to the internet through a NAT device for the private subnet.
- Allow-all network ACLs and security groups to begin with. We use iptables on our external nodes to whitelist traffic, and we decided this was enough to start.
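The layout above can be sanity-checked with a short script. This is a sketch using Python’s `ipaddress` module with the example CIDR blocks from the list; the “five reserved addresses” are the ones AWS holds back in every subnet (network address, VPC router, DNS, one reserved for future use, and broadcast):

```python
from ipaddress import ip_network

# Example CIDR blocks from the layout above.
vpc = ip_network("10.101.0.0/16")
public = ip_network("10.101.0.0/24")   # routes through an internet gateway
private = ip_network("10.101.1.0/24")  # routes out via the NAT device

# Both subnets must fall inside the VPC block and must not overlap.
assert public.subnet_of(vpc) and private.subnet_of(vpc)
assert not public.overlaps(private)

# A /24 has 256 addresses; AWS reserves 5 per subnet, leaving 251 usable.
usable = public.num_addresses - 5
print(usable)  # 251
```

Running checks like this against your planned routing tables and ACLs is a cheap way to catch an overlapping or mis-sized subnet before anything is launched.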
Create Migration Plans for Nodes
At this point we went through each of our nodes and separated the “stateful” nodes from the “non-stateful” nodes. Nodes without state include app servers and load balancers. Non-stateful nodes require no real migration planning, since they can be created in parallel without fear of data loss.
Left with only stateful nodes, the real planning begins. The following is a list of some of our stateful nodes and how we decided to handle the migration for each. Note that I won’t go over every node, since some are very similar to others. Instead, I’ll highlight a few nodes that cover different cases.
- Memcached – We weighed the cost/benefit and decided that degraded response times while our caches rewarmed were a decent tradeoff. Our caches are able to warm back to 90% of their hit rate within 5 minutes. For those 5 minutes, our response times would go from around 20ms up to 200ms, an order of magnitude. But in terms of migration ease and time, we decided this would be tolerable.
- Graphite – For Graphite, we would restore from a recent EBS snapshot in the VPC. Bringing up a new node with an EBS snapshot attached takes around 5 minutes. We would lose 5 minutes of server statistical data, but this wouldn’t cause any downtime externally. Tolerable.
- RabbitMQ – We decided to bring up a brand new MQ and point the new servers at it to fill up a queue of jobs while the old MQ drained; only once the old MQ drained would we take it down and start draining the new MQ. This would cause some level of backlog in our MQ, but again would cause no external slowdown or downtime, except that some analytical data for our advertiser dashboard would be delayed by a few minutes. This is perfectly okay since we make no hard real-time guarantees about our analytics.
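The “wait for the old MQ to drain, then cut over” step is just a polling loop. A minimal sketch, with the caveat that `queue_depth` is a hypothetical stand-in for however you read the queue length (e.g. RabbitMQ’s management interface); it is injected as a callable so the loop itself stays generic and testable:

```python
import time

def wait_until_drained(queue_depth, poll_interval=5.0, sleep=time.sleep):
    """Block until queue_depth() reports an empty queue.

    queue_depth: zero-argument callable returning the number of messages
    still waiting (hypothetical; wire it up to your MQ's API).
    """
    while True:
        if queue_depth() == 0:
            return
        sleep(poll_interval)

# Example with a fake queue that drains over three polls:
depths = iter([120, 30, 0])
wait_until_drained(lambda: next(depths), sleep=lambda _: None)
print("old MQ drained; safe to take it down and start draining the new MQ")
```

In practice you would also want an upper bound on how long you are willing to wait, so a stuck consumer does not stall the cutover indefinitely.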
MongoDB was the hardest migration to plan. Our instance constantly sees around 2000 updates per second, and any downtime would cause externally visible errors and inconsistencies, which was unacceptable. The migration had to be instant. Our plan, therefore, was to bring up a new DB instance in the VPC, set it up as a slave of the primary outside the VPC, bring it up to date, and then promote the slave in the VPC to primary. At a high level, this plan works. The devil, in this case, is in the details:
- All nodes in a MongoDB replica set need to be able to communicate with each other. Unfortunately, nodes outside a VPC cannot easily talk to nodes within a VPC (since they can’t even be addressed!).
- If we don’t have at least 3 nodes in our replica set, then we’re not safe against node failures. In the midst of a complicated migration, we wanted to minimize what could go wrong, so even having a few minutes where we were vulnerable to node failure and data loss was unacceptable.
- In order to switch the primary node over to a VPC node, all the nodes that communicate with MongoDB must be in the VPC!
Based on these requirements, the MongoDB migration became tricky, but doable:
- Prerequisite: All nodes that must communicate with the DB must already be in the VPC.
- Bring up 3 nodes in the public subnet of the VPC, make them slaves, bring them up to date.
- Promote a public VPC node to primary and take down the EC2 nodes.
- Bring up 3 nodes in the private subnet of the VPC, make them slaves, bring them up to date.
- Promote a private VPC node to primary and take down the public VPC nodes.
Ouch! The room for error in the above was scary, but we wrote down every failure case we could think of and how we would handle them. Our best defense here would be our ability to identify and react to any problems quickly and predictably.
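Both “bring them up to date” steps hinge on knowing when a new slave has caught up to the primary before promoting it. A minimal sketch of that check, assuming a status document shaped like the output of MongoDB’s `replSetGetStatus` command (the member names and `optimeDate` values here are illustrative):

```python
from datetime import datetime, timedelta

def caught_up(status, member_name, max_lag=timedelta(seconds=10)):
    """True if member_name's optime is within max_lag of the primary's.

    status is assumed to look like the dict returned for the
    replSetGetStatus admin command: {"members": [{"name": ...,
    "stateStr": ..., "optimeDate": datetime}, ...]}.
    """
    members = {m["name"]: m for m in status["members"]}
    primary = next(m for m in status["members"] if m["stateStr"] == "PRIMARY")
    return primary["optimeDate"] - members[member_name]["optimeDate"] <= max_lag

# Illustrative status: the new VPC slave is 3 seconds behind the primary.
now = datetime(2012, 7, 1, 12, 0, 0)
status = {"members": [
    {"name": "ec2-primary:27017", "stateStr": "PRIMARY", "optimeDate": now},
    {"name": "vpc-slave:27017", "stateStr": "SECONDARY",
     "optimeDate": now - timedelta(seconds=3)},
]}
print(caught_up(status, "vpc-slave:27017"))  # True
```

Only once this check passes for the new nodes is it reasonably safe to step down the old primary and let the replica set elect one of them.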
Practice and Execute
Once the plan was in place, I spent a day practicing the transition on a staging environment. This was mainly to identify major issues with my transition checklists, but also to build confidence for the real migration. If you can afford the time, I highly recommend it, since that time is surely cheaper than extended downtime and potential data loss if your migration goes poorly.
Finally, the migration was executed. The exact order of our migration was the following:
- Non-stateful nodes and stateful nodes that didn’t require data migration came up first. Load balancers, app servers, memcached, RabbitMQ, etc.
- The load balancer IP was switched over to the VPC load balancer. Since VPC nodes can talk to external nodes through the NAT, the VPC nodes in our case simply continued talking to the EC2 stateful nodes.
- Stateful nodes transitioned one at a time, according to their plans, with the database last.
Tips and Tricks
Now, some of our useful tips and tricks for migrating. Note that we use Chef, so some of these tips are around that.
- Have the VPC nodes in a separate Chef environment, and use attributes on each Chef role to gradually point nodes at the new VPC environment. For example: while the app servers were transitioned to VPC right away, they still pointed at the non-VPC stateful nodes for some time. To make this easy to switch, we had Chef attributes such as `node[:app][:memcached_environment] = "production"`, which we could then flip to `"production-vpc"` the moment we wanted to make the switch.
- Prior to transitioning, make sure all your configuration management scripts use the internal IP for nodes, since VPC nodes do not initially have a public IP or FQDN.
- Use network ACLs and routing tables to make sure separate environments in the VPC (staging, QA, etc.) cannot talk to each other. Since you can guarantee the subnets of each environment, it’s easy to blacklist entire CIDR blocks.
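Because each environment owns known CIDR blocks, the isolation rules can be derived mechanically. A sketch, using made-up per-environment blocks (the `/20`s below are hypothetical examples, not our real layout): each environment’s ACL denies every other environment’s block, and the scheme is only sound if the blocks are disjoint.

```python
from ipaddress import ip_network

# Hypothetical per-environment CIDR blocks (example values only).
environments = {
    "production": ip_network("10.101.0.0/20"),
    "staging": ip_network("10.101.16.0/20"),
    "qa": ip_network("10.101.32.0/20"),
}

# Each environment's ACL should deny every *other* environment's block.
deny_rules = {
    env: sorted(str(cidr) for other, cidr in environments.items() if other != env)
    for env in environments
}

# The scheme only works if the blocks are guaranteed disjoint.
blocks = list(environments.values())
for i, a in enumerate(blocks):
    for b in blocks[i + 1:]:
        assert not a.overlaps(b)

print(deny_rules["staging"])  # ['10.101.0.0/20', '10.101.32.0/20']
```

Generating the deny lists from one source of truth, rather than hand-writing ACL entries per environment, keeps a new environment from silently being reachable by the others.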