
Wednesday, February 27, 2013

ApacheCon LiveBlog: Powering CloudStack w/ Ceph RBD


This is a live blog from ApacheCon that I'm attending this week.  This session is with Patrick McGarry.

Usual Live Blog Disclaimer: This is more of a brain dump, typed as fast as I can, so please excuse the typos, formatting, and coherent thought process in general.

(No title slide picture this time - missed it)

  • What is Ceph? Storage that does object, block, and file all in one; block is thin-provisioned with snapshots and cloning; object has a REST API
  • RADOS (Google it) is the object store at the lowest level
  • Why object at the lowest level? More useful than blocks, single namespace, scales better, simple API, and the workload is easily parallelized
  • Because of this you define pools (one to hundreds), each an independent namespace and object collection
  • (Topic change) - Architecture
  • aggregate a bunch of different machines so that you have a "large enough" front end to handle a large number of incoming requests
  • In this "pile" you will have monitors. Monitors provide consensus for decisions; there is always an odd number of them, and they do not store data; they act as traffic controllers for the storage nodes (OSD nodes)
  • On an OSD node -> physical disk -> file system -> OSD layer
  • CRUSH - pseudo-random placement algorithm for data placement, the Ceph "secret sauce"; allows a stable mapping and uniform distribution with additional rule configuration (you can apply weights and topology rules) - see the toy placement sketch after this list
  • How does it work? Take an object, talk to the monitors, and CRUSH breaks it up and places it around the cluster according to the rules
  • What happens when something breaks? If an OSD node is lost, the nodes holding copies of its data replicate the blocks somewhere else according to the CRUSH rules and move on
  • How to talk to it? LIBRADOS - library for RADOS, with support for C, C++, Java, Python, Ruby, and PHP (a minimal Python sketch follows this list)
  • Also RADOSGW - REST gateway compatible with S3 & Swift
  • CEPH FS - a POSIX-compliant distributed file system with a Linux kernel client
  • RBD - reliable and fully-distributed block device sitting on top of the object store
  • RADOS Block Device (RBD) - storage of disk images in RADOS; allows decoupling of the VM from the host; images are striped across the pool; snapshots; copy-on-write clones
  • What does this look like? VMs are now split across the cluster, which is great for large-capacity as well as high-I/O VM instances
  • same model as Amazon EBS
  • it is a shared environment, so you can migrate running instances across the cluster
  • Copy-On-Write Cloning (he gets lots of questions on this) - think of a golden master VM image that you want 100 copies of - you spin up the 100 instantly, and they take up additional storage only as needed as the VMs grow (see the clone sketch after this list)
  • Question: Is there a performance impact to this? A: No, but as usual it depends on the architecture (how many devices are hitting it)
  • CloudStack 4.0 and RBD? Works via KVM; no Xen or VMware support today
  • Live migrations are supported
  • No snapshots yet
  • NFS still required for system vm's
  • Can be added easily as RBD Primary storage in CloudStack
  • snapshot and backup support should be coming in version 4.2, cloning is coming, and secondary (backup) storage support is coming in 4.2
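
As a concrete taste of LIBRADOS, here is a minimal Python sketch of writing and reading one object in a pool. It assumes a running Ceph cluster, the stock /etc/ceph/ceph.conf, the python-rados bindings, and a pool named "rbd"; adjust the names for your environment.

    import rados

    # Connect to the cluster using the local Ceph configuration file.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # A pool is an independent namespace and object collection.
    ioctx = cluster.open_ioctx('rbd')
    ioctx.write_full('hello-object', b'hello ceph')  # store an object
    print(ioctx.read('hello-object'))                # read it back

    ioctx.close()
    cluster.shutdown()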
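
CRUSH itself layers weights, topology rules, and failure domains on top of its placement function, but the core idea of a stable, calculated mapping can be illustrated with a toy rendezvous-hash sketch (an illustration only, not the CRUSH algorithm): every client computes the same placement independently, and losing a node only remaps that node's objects.

    import hashlib

    def placement(obj_name, nodes, replicas=2):
        # Score every node for this object; the highest-scoring nodes win.
        score = lambda node: hashlib.md5((obj_name + node).encode()).hexdigest()
        return sorted(nodes, key=score, reverse=True)[:replicas]

    nodes = ['osd0', 'osd1', 'osd2', 'osd3']
    print(placement('disk-image-1', nodes))       # stable: same answer every time
    print(placement('disk-image-1', nodes[:-1]))  # drop osd3: only its objects move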
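
The golden-image cloning flow maps to a few calls in the Python rbd bindings, sketched below under some assumptions: the pool and image names are made up, and layered cloning requires a format-2 image and a protected snapshot.

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    r = rbd.RBD()
    # Create a 10 GB format-2 "golden" master image (format 2 supports cloning).
    r.create(ioctx, 'golden', 10 * 1024 ** 3, old_format=False)

    with rbd.Image(ioctx, 'golden') as golden:
        golden.create_snap('base')    # snapshot the golden image
        golden.protect_snap('base')   # clones require a protected snapshot

    # Spin up 100 clones instantly; each stores only its own changes.
    for i in range(100):
        r.clone(ioctx, 'golden', 'base', ioctx, 'vm-%03d' % i,
                features=rbd.RBD_FEATURE_LAYERING)

    ioctx.close()
    cluster.shutdown()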



Tuesday, February 26, 2013

ApacheCon LiveBlog: Object Storage with CloudStack & Hadoop


This is a live blog from ApacheCon that I'm attending this week.  This session is with Chiradeep Vittal.

Usual Live Blog Disclaimer: This is more of a brain dump, typed as fast as I can, so please excuse the typos, formatting, and coherent thought process in general.


  • How does Amazon build a cloud:
    • Commodity Hardware -> open-source Xen -> AWS Orchestration Software -> AWS API -> Amazon eCommerce Platform
    • How would YOU build the same cloud on CloudStack? You can, in much the same way: Hardware -> Hypervisor -> CloudStack -> API -> Customer Solution
  • CloudStack is built on the concept of a Zone (much like an AWS Zone)
    • Under the zone is a logical unit called a Pod (think of it as a rack)
  • Secondary Storage is used for templates, snapshots, etc. (items that are stored, not changed often, and need to be shared across pods)
  • Cloud Style Workloads = low cost, standardized hardware, highly automated & efficient (it's the Pets vs. Cattle analogy)
  • At scale, everything breaks eventually
  • Regions and Zones - e.g. Region "West"; the hope is that one Region will not go down when another Region goes down - replication from one Region to another Region is the norm
  • Secondary Storage in CloudStack 4.0 today
    • NFS is the server default - it can be mounted by any CloudStack hypervisor and is easy to set up
    • BUT - it doesn't scale well, it's "chatty", and you may need WAN optimization. What if 1000 hypervisors talk to one NFS share?
    • At large scale NFS shows some issues
    • One solution is use object storage for secondary storage
  • Object storage typically has redundancy, replication, and auditing built into the technology
  • In addition, this technology enables other applications: put an API server in front of the object store and you now have "Dropbox", etc. - typically static content and archival kinds of applications
  • Object storage offers 99.9% availability and 99.999999999% (eleven 9's) durability according to Amazon S3, at massive scale (1.3 trillion objects in AWS today, serving 800K requests per second)
  • To be scalable, objects can not be modified, only deleted (this is called an immutable object)
  • Simple API with a flat namespace - think KISS principle
  • CloudStack S3 API Server - understands the Amazon S3 API with a pluggable backend; the default backend is a POSIX filesystem (not very useful in production); Caringo was mentioned as a replacement, as was HDFS (a client sketch follows this list)
  • Question - Does CloudStack handle all the ACL's? / Answer: Yes
  • FollowUp - Does that mean the SQL server is a possible constraint? / Answer: Yes
  • Integrations are available with Riak CS and OpenStack Swift
  • Upcoming in CloudStack 4.2 - Framework to expand this much more
  • Given all of this, what could we build? (Topic switch)
  • Want: open source, scales to 1 billion objects, reliability & durability on par with S3, an S3 API
  • This is now a theoretical design (hasn't been tested)
  • (See picture for architecture)
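
Since the CloudStack S3 API server (like RADOSGW) speaks the Amazon S3 dialect, any standard S3 client should work against it. A minimal boto3 sketch, with a hypothetical endpoint URL and made-up credentials and names:

    import boto3

    # Point a stock S3 client at the S3-compatible API server.
    s3 = boto3.client(
        's3',
        endpoint_url='http://s3.cloud.example.com:8080',  # hypothetical endpoint
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
    )

    s3.create_bucket(Bucket='templates')
    s3.put_object(Bucket='templates', Key='template-1.qcow2', Body=b'...')
    obj = s3.get_object(Bucket='templates', Key='template-1.qcow2')
    print(obj['Body'].read())

    # Objects are immutable: putting the same key again replaces the whole
    # object; there is no partial in-place modification.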

  • Hadoop meets all of these requirements and is proven to work (200 million objects in one cluster, 100PB in one cluster); if you need to scale, just add a node - very easy
  • BUT - Name Node scalability (at 100's of millions of blocks you could run into GC issues); the Name Node is a SPOF (Single Point of Failure), which is being worked on currently; cross-zone replication (Hadoop has rack awareness, but what if nodes are further apart?) isn't really tested today; and where do you store metadata (ACL's, for instance)?
  • take a 1-billion-object example (with a bunch of assumptions here): needs about 450GB of memory on the name node, and at 16TB per node that is about 1,000 data nodes (see the arithmetic sketch at the end of this list)
  • Name Node management is federated (sorry this is vague, getting beyond my knowledge of Hadoop architecture at this point). Name Node HA really hasn't been tested to date
  • NameSpace shards - how do you shard them? Do you need a DB just to store this? What about rebalancing between name nodes?
  • Replication over lossy/slower links (solution really breaks down here today)
    • Async replication - how do you handle master/slave relationships?
    • Sync - not very feasible; if you lose a zone, writes are never acknowledged, so they will not continue
  • Where do you store Metadata?
    • Store it in HDFS along with the object - but reads become expensive, and metadata is mutable (it needs to be edited), so this needs a layer on top of HDFS
    • Use another storage system (like HBase) - required for Name Node federation anyway, but it is ANOTHER system to manage
    • Modify the Name Node to store the metadata
      • high performance (doesn't exist today)
      • not extensible and not easy to just "plug in"
  • What can you do with Object Store in HDFS today?
    • Viable for small-size deployments - up to 100-200 million objects (Facebook does this) with datacenters close together
    • Larger deployments need development, and there is really no effort around this today
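
Roughly reconstructing the 1-billion-object back-of-the-envelope math from the talk; every constant below (name node memory per entry, blocks per file, average object size, replication factor, disk per data node) is an assumption for illustration, not a measurement:

    # All constants here are assumptions, not measurements.
    objects = 1_000_000_000
    blocks_per_object = 2        # assume each object spans ~2 HDFS blocks
    bytes_per_entry = 150        # rough name node memory per file/block entry

    entries = objects * (1 + blocks_per_object)
    namenode_ram_gb = entries * bytes_per_entry / 1e9
    print("name node memory: ~%d GB" % namenode_ram_gb)   # ~450 GB

    avg_object_mb = 5            # assumed average object size
    replication = 3              # HDFS default replication factor
    raw_tb = objects * avg_object_mb * replication / 1e6
    node_tb = 16                 # usable disk per data node
    print("data nodes: ~%d" % (raw_tb / node_tb))         # ~940, on the order of 1,000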