Tuesday, April 30, 2013

AWS Summit Liveblog: Cloud Backup and DR

Usual Liveblog Disclaimer: This is typed as fast as I can, so the blog may contain typing and formatting errors, sorry about that

Session: Technical Lessons on how to do Backup and Disaster Recovery in the Cloud (whew, long title)

Presenter: Simone Brunozzi, Technology Evangelist

Simone presented in the morning keynote on the Enterprise demo, good presenter

3 parts = HA -> Backup -> Disaster Recovery

HA = Keeping Services Alive

Backup = Process of keeping a copy

DR = Recover using a backup

(Simone is using great examples involving churches and monasteries, but they're too long to type out here.)

5 Concepts of DR

1. My backup should be accessible - AWS uses APIs, Direct Connect, customer owns the data, redundancy is built in, AWS has import/export capabilities

AWS Storage Gateway as an example, using a gateway-cached volume on-premises that will replicate to a volume in AWS public cloud, S3, snapshots, etc.  Can be gateway-cached or gateway-stored (cached keeps only frequently accessed data locally with the primary copy in S3; stored keeps the full copy locally and sends snapshots to S3). Secure tunnel for transport over AWS Direct Connect or Internet

2. My backup should be able to scale - "Infinite scale" with S3 and Glacier, scale to multiple regions, seamless, no need to provision, cost tiers (cheaper storage options and lower pricing at scale are available)

3. My backup should be safe - SSL Endpoints, signed API calls, stored encrypted files, server-side encryption, durability: multiple copies across different data centers, local/cloud with AWS Storage Gateway

4. My backup should work with my DR policy (I don't want to wait 10 years to recover) - easy to integrate within AWS or Hybrid, AWS Storage Gateway: Run services on Amazon EC2 for DR, clear costs, reduced costs, you decide the redundancy/availability in relation to costs.

5. Someone should care about it - Need clear ownership, permissions can be set in IAM with roles, monitor logs
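To make concepts 3 and 5 a bit more concrete, here is a minimal sketch (mine, not from the slides) of what a narrowly scoped IAM policy for a backup-writer role could look like, built as a Python dict. The bucket name and the `backup_writer_policy` helper are hypothetical, and requiring `AES256` server-side encryption on uploads is my assumption about how you might tie "safe" and "owned" together, not something Simone showed.

```python
import json

# Hypothetical bucket name, used only for illustration.
BACKUP_BUCKET = "example-backup-bucket"


def backup_writer_policy(bucket: str) -> dict:
    """Build an IAM policy document for a role that may only write backups.

    Assumption: uploads must request S3 server-side encryption (AES256),
    enforced through the s3:x-amz-server-side-encryption condition key.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Lets the role see what is already in the backup bucket.
                "Sid": "ListBackupBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Lets the role upload objects, but only encrypted ones.
                "Sid": "WriteEncryptedBackups",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
                "Condition": {
                    "StringEquals": {
                        "s3:x-amz-server-side-encryption": "AES256"
                    }
                },
            },
        ],
    }


if __name__ == "__main__":
    # Print the JSON you would attach to the role.
    print(json.dumps(backup_writer_policy(BACKUP_BUCKET), indent=2))
```

The idea is that the role has no delete or read-object permission at all, so a compromised backup job can't wipe or exfiltrate existing backups. A separate restore role would get `s3:GetObject`.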

Now a customer story:

Shaw Media - Canadian Media Company, before AWS - multiple datacenters, lot of equipment, downtime, different technologies across datacenters - they were told to change everything and become more agile and cost effective in the next 9 months to better serve the business

Solved the issue with AWS, fast deployment of servers, network rules, and ELB on AWS, first site in only 4 weeks, after that a full migration of 29 sites from a physical DC in 9 months - This was Phase One (the main websites)

Phase Two - Other web services migration was next (check out the picture for the details), impressive stats.  Typical web servers, apps servers, database servers, etc.

Lessons Learned - went too fast, didn't catch it... damnit

DR - Learn from your outages (test your policy on a regular basis and refine the document)

(Sorry, he's going too fast to type or even take pictures of the slides.... Really wish he would have gone slower in this section, the content was really good grrrrrrr)

Lessons to learn from DR

1. You NEED a DR plan in place - how will you recover?  Can your business survive without it?  For AWS, across Availability Zones (AZs) or App DR with Standby (see pictures).  The second option is cheaper to implement but will take a little longer to recover from.


Perform a business analysis of RTO & RPO (if you don't know what those are, Google them, you need to know).  In a nutshell: RTO (Recovery Time Objective) is how long to get it back, RPO (Recovery Point Objective) is how much data can I lose.  This is the typical cost vs. performance trade-off.  Take the various AWS services as an example:

2. Test your DR - Many may say Duh! to this one but I'm always surprised how few customers actually do this.  The ability to spin up capacity just for DR testing helps to minimize cost, and not having a DR site to manage is pretty cool. Data transfer speeds (Data Gravity) could be an issue in this kind of scenario

3. Reducing Costs - Took a screenshot, it was easier
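The RTO/RPO trade-off and the data transfer worry from the lessons above can be put into rough numbers. This is just a back-of-the-envelope sketch of my own, not anything from the session: the `DrPlan` fields, the example figures, and the assumption that all data has to cross a single link before the service is back are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DrPlan:
    name: str
    backup_interval_hours: float  # time between backups or replication points
    restore_hours: float          # time to bring services up, excluding data transfer


def transfer_hours(data_gb: float, link_mbps: float) -> float:
    """Hours needed to move data_gb over a link of link_mbps (decimal units)."""
    megabits = data_gb * 8000  # 1 GB = 8000 megabits
    return megabits / link_mbps / 3600


def meets_objectives(plan: DrPlan, rpo_hours: float, rto_hours: float,
                     data_gb: float, link_mbps: float) -> bool:
    """Check a plan against RPO and RTO.

    Worst-case data loss is taken to be one full backup interval, and
    recovery time is restore time plus the time to transfer the data.
    """
    worst_case_loss = plan.backup_interval_hours
    recovery = plan.restore_hours + transfer_hours(data_gb, link_mbps)
    return worst_case_loss <= rpo_hours and recovery <= rto_hours


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    standby = DrPlan("standby", backup_interval_hours=1, restore_hours=2)
    for data_gb, link_mbps in [(100, 1000), (1000, 100)]:
        ok = meets_objectives(standby, rpo_hours=4, rto_hours=8,
                              data_gb=data_gb, link_mbps=link_mbps)
        print(f"{data_gb} GB over {link_mbps} Mbps: "
              f"transfer {transfer_hours(data_gb, link_mbps):.1f} h, meets RTO/RPO: {ok}")
```

Even this crude version shows why Data Gravity matters: the same plan can pass or fail its RTO depending only on how much data has to move and how fat the pipe is.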

Overall - great presentation although I wish he would have spent more time on the customer slides as there was some good technical content there...
