
Wednesday, February 27, 2013

ApacheCon LiveBlog: Powering CloudStack w/ Ceph RBD


This is a live blog from ApacheCon that I'm attending this week.  This session is with Patrick McGarry.

Usual Live Blog Disclaimer: This is more of a brain dump, typed as fast as I can, so please excuse the typos, formatting, and coherent thought process in general.

(No title slide picture this time - missed it)

  • What is Ceph? Storage that does object, block, and file all in one; block is thin-provisioned with snapshots and cloning; object has a REST API
  • RADOS (Google it) is the object store at the lowest level
  • Why object at the lowest level? More useful than blocks, single namespace, scales better, simple API, and the workload is easily parallelized
  • Because of this you define pools (one to hundreds), each an independent namespace and object collection
  • (Topic change) - Architecture
  • aggregate a bunch of different machines so that you have a "large enough" front end to handle a large number of incoming requests
  • In this "pile" you will have monitors. Monitors provide consensus for decisions; there is always an odd number of them, and they do not store data; they act as traffic controllers for the storage nodes (OSD nodes)
  • On an OSD node -> physical disk -> file system -> OSD layer
  • CRUSH - pseudo-random placement algorithm for data placement, the Ceph "secret sauce"; allows a stable mapping and uniform distribution with additional rule configuration (you can apply weights and topology rules) - see the toy placement sketch after this list
  • How does it work? Take an object, talk to the monitors, and CRUSH breaks it up and places it around the cluster according to the rules
  • What happens when something breaks? If an OSD node is lost, the nodes holding copies of its data replicate the blocks somewhere else according to the CRUSH rules and move on
  • How to talk to it? LIBRADOS - library for RADOS, with support for C, C++, Java, Python, Ruby, and PHP (a minimal Python sketch follows this list)
  • Also RADOSGW - REST gateway compatible with S3 & Swift
  • CEPH FS - a POSIX-compliant distributed file system with a Linux kernel client
  • RBD - reliable and fully-distributed block device sitting on top of the object store
  • RADOS Block Device (RBD) - storage of disk images in RADOS; allows decoupling of the VM from the host; images are striped across the pool; snapshots; copy-on-write clones
  • What does this look like? VMs are now split across the cluster, which is great for large-capacity as well as high-I/O VM instances
  • same model as Amazon EBS
  • it is a shared environment, so you can migrate running instances across the cluster
  • Copy-On-Write Cloning (he gets lots of questions on this) - think of a golden master VM image that you want 100 copies of - you spin up the 100 instantly, and they take up additional storage only as needed as the VMs grow (see the clone sketch after this list)
  • Question: Is there a performance impact to this? A: No, but as usual it depends on the architecture (how many devices are hitting it)
  • CloudStack 4.0 and RBD? Works via KVM; no Xen or VMware support today
  • Live migrations are supported
  • No snapshots yet
  • NFS still required for system vm's
  • Can be added easily as RBD Primary storage in CloudStack
  • snapshot and backup support should be coming in version 4.2, cloning is coming, and secondary (backup) storage support is coming in 4.2
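
As a concrete taste of LIBRADOS, here is a minimal Python sketch of writing and reading one object in a pool. It assumes a running Ceph cluster, the stock /etc/ceph/ceph.conf, the python-rados bindings, and a pool named "rbd"; adjust the names for your environment.

    import rados

    # Connect to the cluster using the local Ceph configuration file.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # A pool is an independent namespace and object collection.
    ioctx = cluster.open_ioctx('rbd')
    ioctx.write_full('hello-object', b'hello ceph')  # store an object
    print(ioctx.read('hello-object'))                # read it back

    ioctx.close()
    cluster.shutdown()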
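
CRUSH itself layers weights, topology rules, and failure domains on top of its placement function, but the core idea of a stable, calculated mapping can be illustrated with a toy rendezvous-hash sketch (an illustration only, not the CRUSH algorithm): every client computes the same placement independently, and losing a node only remaps that node's objects.

    import hashlib

    def placement(obj_name, nodes, replicas=2):
        # Score every node for this object; the highest-scoring nodes win.
        score = lambda node: hashlib.md5((obj_name + node).encode()).hexdigest()
        return sorted(nodes, key=score, reverse=True)[:replicas]

    nodes = ['osd0', 'osd1', 'osd2', 'osd3']
    print(placement('disk-image-1', nodes))       # stable: same answer every time
    print(placement('disk-image-1', nodes[:-1]))  # drop osd3: only its objects move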
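
The golden-image cloning flow maps to a few calls in the Python rbd bindings, sketched below under some assumptions: the pool and image names are made up, and layered cloning requires a format-2 image and a protected snapshot.

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    r = rbd.RBD()
    # Create a 10 GB format-2 "golden" master image (format 2 supports cloning).
    r.create(ioctx, 'golden', 10 * 1024 ** 3, old_format=False)

    with rbd.Image(ioctx, 'golden') as golden:
        golden.create_snap('base')    # snapshot the golden image
        golden.protect_snap('base')   # clones require a protected snapshot

    # Spin up 100 clones instantly; each stores only its own changes.
    for i in range(100):
        r.clone(ioctx, 'golden', 'base', ioctx, 'vm-%03d' % i,
                features=rbd.RBD_FEATURE_LAYERING)

    ioctx.close()
    cluster.shutdown()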



Tuesday, February 26, 2013

ApacheCon LiveBlog: Object Storage with CloudStack & Hadoop


This is a live blog from ApacheCon that I'm attending this week.  This session is with Chiradeep Vittal.

Usual Live Blog Disclaimer: This is more of a brain dump, typed as fast as I can, so please excuse the typos, formatting, and coherent thought process in general.


  • How does Amazon build a cloud:
    • Commodity Hardware -> open-source Xen -> AWS Orchestration Software -> AWS API -> Amazon eCommerce Platform
    • How would YOU build the same cloud on CloudStack? You can, in much the same way: Hardware -> Hypervisor -> CloudStack -> API -> Customer Solution
  • CloudStack is built on the concept of a Zone (much like an AWS Zone)
    • Under the zone is a logical unit called a Pod (think of it as a rack)
  • Secondary Storage is used for templates, snapshots, etc. (items that are stored, not changed often, and need to be shared across pods)
  • Cloud Style Workloads = low cost, standardized hardware, highly automated & efficient (it's the Pets vs. Cattle analogy)
  • At scale, everything breaks eventually
  • Regions and Zones - e.g. Region "West"; the hope is that one Region will not go down when another Region goes down - replication from one Region to another Region is the norm
  • Secondary Storage in CloudStack 4.0 today
    • NFS is the server default - it can be mounted by any CloudStack hypervisor and is easy to set up
    • BUT - it doesn't scale well, it's "chatty", and you may need WAN optimization. What if 1000 hypervisors talk to one NFS share?
    • At large scale NFS shows some issues
    • One solution is use object storage for secondary storage
  • Object storage typically has redundancy, replication, and auditing built into the technology
  • In addition, this technology enables other applications: put an API server in front of the object store and you now have "Dropbox", etc. - typically static content and archival kinds of applications
  • Object storage offers 99.9% availability and 99.999999999% (eleven 9's) durability according to Amazon S3, at massive scale (1.3 trillion objects in AWS today, serving 800K requests per second)
  • To be scalable, objects can not be modified, only deleted (this is called an immutable object)
  • Simple API with a flat namespace - think KISS principle
  • CloudStack S3 API Server - understands the Amazon S3 API with a pluggable backend; the default backend is a POSIX filesystem (not very useful in production); Caringo was mentioned as a replacement, as was HDFS (a client sketch follows this list)
  • Question - Does CloudStack handle all the ACL's? / Answer: Yes
  • FollowUp - Does that mean the SQL server is a possible constraint? / Answer: Yes
  • Integrations are available with Riak CS and OpenStack Swift
  • Upcoming in CloudStack 4.2 - Framework to expand this much more
  • Given all of this, what could we build? (Topic switch)
  • Want: open source, scales to 1 billion objects, reliability & durability on par with S3, an S3 API
  • This is now a theoretical design (hasn't been tested)
  • (See picture for architecture)
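
Since the CloudStack S3 API server (like RADOSGW) speaks the Amazon S3 dialect, any standard S3 client should work against it. A minimal boto3 sketch, with a hypothetical endpoint URL and made-up credentials and names:

    import boto3

    # Point a stock S3 client at the S3-compatible API server.
    s3 = boto3.client(
        's3',
        endpoint_url='http://s3.cloud.example.com:8080',  # hypothetical endpoint
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
    )

    s3.create_bucket(Bucket='templates')
    s3.put_object(Bucket='templates', Key='template-1.qcow2', Body=b'...')
    obj = s3.get_object(Bucket='templates', Key='template-1.qcow2')
    print(obj['Body'].read())

    # Objects are immutable: putting the same key again replaces the whole
    # object; there is no partial in-place modification.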

  • Hadoop meets all of these requirements and is proven to work (200 million objects in one cluster, 100PB in one cluster); if you need to scale, just add a node - very easy
  • BUT - Name Node scalability (at 100's of millions of blocks you could run into GC issues); the Name Node is a SPOF (Single Point of Failure), which is being worked on currently; cross-zone replication (Hadoop has rack awareness, but what if nodes are further apart?) isn't really tested today; and where do you store metadata (ACL's, for instance)?
  • take a 1-billion-object example (with a bunch of assumptions here): needs about 450GB of memory on the name node, and at 16TB per node that is about 1,000 data nodes (see the arithmetic sketch at the end of this list)
  • Name Node management is federated (sorry this is vague, getting beyond my knowledge of Hadoop architecture at this point). Name Node HA really hasn't been tested to date
  • NameSpace shards - how do you shard them? Do you need a DB just to store this? What about rebalancing between name nodes?
  • Replication over lossy/slower links (solution really breaks down here today)
    • Async replication - how do you handle master/slave relationships?
    • Sync - not very feasible; if you lose a zone, writes are never acknowledged, so they will not continue
  • Where do you store Metadata?
    • Store it in HDFS along with the object - but reads become expensive, and metadata is mutable (it needs to be edited), so this needs a layer on top of HDFS
    • Use another storage system (like HBase) - required for Name Node federation anyway, but it is ANOTHER system to manage
    • Modify the Name Node to store the metadata
      • high performance (doesn't exist today)
      • not extensible and not easy to just "plug in"
  • What can you do with Object Store in HDFS today?
    • Viable for small-size deployments - up to 100-200 million objects (Facebook does this) with datacenters close together
    • Larger deployments need development, and there is really no effort around this today
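
Roughly reconstructing the 1-billion-object back-of-the-envelope math from the talk; every constant below (name node memory per entry, blocks per file, average object size, replication factor, disk per data node) is an assumption for illustration, not a measurement:

    # All constants here are assumptions, not measurements.
    objects = 1_000_000_000
    blocks_per_object = 2        # assume each object spans ~2 HDFS blocks
    bytes_per_entry = 150        # rough name node memory per file/block entry

    entries = objects * (1 + blocks_per_object)
    namenode_ram_gb = entries * bytes_per_entry / 1e9
    print("name node memory: ~%d GB" % namenode_ram_gb)   # ~450 GB

    avg_object_mb = 5            # assumed average object size
    replication = 3              # HDFS default replication factor
    raw_tb = objects * avg_object_mb * replication / 1e6
    node_tb = 16                 # usable disk per data node
    print("data nodes: ~%d" % (raw_tb / node_tb))         # ~940, on the order of 1,000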