my gluster setup, described

For the last two or so years I’ve played with and tested gluster on and off, and I’ve been hanging out in the awesome #gluster channel on Freenode. In case you haven’t heard, gluster was acquired by Red Hat back in October 2011. This post describes my current setup. I urge you to send your comments and suggestions for improvement. I’ll update this as needed.

Hardware:
Ideology: I wanted to build individual self-contained storage hosts. I didn’t want servers with separate serial-attached (SAS) storage, like Dell is often pushing. Supermicro fit the design goal, and sold me when I realized I could have the OS drives swappable out the back.

  • 4 x Supermicro 6047R-E1R24N
  • 4 x 24 x 2TiB, 3.5″ HDD (front, hot swappable main storage)
  • 4 x 2 x 600GiB, 2.5″ HDD (rear, hot swappable OS drives, awesome feature!)
  • 2 x quality stacked switches (with one leg of each bond device out to each switch)
  • IPMI: absolutely required (It seems it’s a bit buggy! I’ve had problems where the SOL console stops responding when dealing with a big stream of data, and I can only rescue it with a cold reset of the BMC.) Overall it’s been sufficient to get me up and running.

OS:

  • CentOS 6.3+. I would consider using RHEL if their sales department could get organized, and once RHEL integrates with my cobbler+puppet build system.
  • Bonded (eth0,eth1 -> bond0) ethernet for each machine. Possible upgrade to bonded 10GbE if ever needed. Interface eth0 on each machine plugs into switch0 and eth1 on each machine plugs into switch1.
  • The 24 storage HDDs are split into two separate RAID 6 arrays per machine.
  • OS HDDs in software RAID 1. Unfortunately anaconda/kickstart doesn’t support RAID 1 for the EFI boot partitions. Maybe someone could fix this! (HINT, HINT)
  • The machines pxeboot, kickstart and configure themselves automatically with cobbler+puppet.
  • The LSI MSM tool (for monitoring the RAID) gives me a lot of trouble with false positive warnings about temperature thresholds. Apart from being stuck with proprietary crapware, it does actually email me when drives fail. Alternatives welcome! I deploy this with a puppet module that I wrote. If it weren’t for that, this step would drive me insane.

Gluster:

  • Each host has its drives split into two bricks. A gluster engineer recommended this for the type of setup I’m running.
  • Each RAID6 set is formatted with xfs.
  • Keepalived maintains a VIP (will replace with cman/corosync one day) which serves as the client hostname to connect to. This makes my setup a bit more highly available if one or more nodes go down.
  • I have a puppet module which I use to describe/build my gluster setup. It’s not perfect, but it works for me ™. I’m cleaning it up, and will post it shortly.
  • I’m using a distributed-replicate setup, with eight bricks (2 per node).
  • I originally used the official packages to get my gluster RPMs, but recently I switched to using kkeithle’s. Thanks for your hard work!
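
One detail worth spelling out for a distributed-replicate volume: with "replica 2", gluster groups the bricks listed on the "gluster volume create" command line into replica sets in the order they are given, so consecutive bricks should sit on different hosts. Here is a minimal sketch of one ordering that achieves this for four hosts with two bricks each. The hostnames (gluster1 to gluster4), the brick paths, and the volume name gv0 are made up for illustration, and the replica count of two is the one mentioned in the comments below. This is not necessarily the exact layout used on this cluster.

```python
# Sketch only: hostnames, brick paths and the volume name are hypothetical.

def brick_order(hosts, bricks_per_host, replica):
    """Order bricks so every consecutive group of `replica` bricks spans distinct hosts."""
    if len(hosts) % replica != 0:
        raise ValueError("number of hosts must be a multiple of the replica count")
    # Outer loop over the brick index, inner loop over hosts: neighbours in
    # the resulting list always live on different machines.
    return [
        f"{host}:/data/brick{index}"
        for index in range(bricks_per_host)
        for host in hosts
    ]


def replica_sets(bricks, replica):
    """Split the ordered brick list into the groups gluster would replicate across."""
    return [bricks[i:i + replica] for i in range(0, len(bricks), replica)]


def create_command(volume, bricks, replica):
    """Build the volume create command line as a single string."""
    parts = ["gluster", "volume", "create", volume,
             "replica", str(replica), "transport", "tcp"]
    return " ".join(parts + list(bricks))


HOSTS = ["gluster1", "gluster2", "gluster3", "gluster4"]  # assumed names
BRICKS = brick_order(HOSTS, bricks_per_host=2, replica=2)

for group in replica_sets(BRICKS, 2):
    print("replica set:", ", ".join(group))
print(create_command("gv0", BRICKS, 2))
```

With this ordering, losing any single host still leaves one copy of every replica set online. It is worth comparing the printed replica sets against the output of "gluster volume info" after creating a volume.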

Conclusion:

Let me know what other nitpick details you want to know about and I’ll post them. A lot of things can also be inferred by reading my puppet module.

Happy Hacking,
James

14 thoughts on “my gluster setup, described”

  1. Great article. I was curious to know the arguments behind the choice of using 2 bricks/box and choosing RAID6 instead of increasing the replication factor in Gluster.

    I am guessing that disk failures (relatively frequent) are taken care of by RAID6 and node failures (relatively rare) by the Gluster replication (what is the replication factor?). This way you don’t have to keep the replication factor too high (and make the Gluster client do all that replication work). Is that correct?

    • Thanks!

      Since there are 24 storage drives per host, splitting them into two separate RAID6 volumes allows me to withstand more drive failures, and is easier to deal with performance-wise during a rebuild. A RAID5 is just too dangerous when a gluster server rebuild can be very expensive/time consuming. There are other good reasons too!

      I have a replication factor of two. Some might prefer even higher, however this more than doubled the redundancy of our previous storage solution, so an incremental step made sense. Remember that backups are still required :)
      Hope this answers it all,

      Cheers,
      James

      • Thanks James, that answers my question about the layout you chose. I was also looking at building a 100TB+ storage cluster. I’d love to hear about your experience on the client side (which applications behave well and any tuning to keep in mind), now that you’ve used the Gluster backend for some time. In another nice blog post perhaps! Cheers.

      • Figuring out which client applications perform well might be tricky. I’d recommend you first build a prototype of the setup as virtual machines before purchasing hardware. This will at least give you some ideas and the time to solve problems before you have hardware sitting around.

        I am not an expert at “gluster tuning”, but as you learn more, maybe you can contribute your learnings to my puppet-gluster module, so that it can be more efficient “out of the box”.

        Good luck,
        James

  2. I am deploying a similar setup. 4 gluster storage nodes.

    Every server is configured with:
    2 XFS bricks (12 x 2TB each) RAID 6 (19TB usable)
    Block size 4k
    Stripe size 64KB
    cache mode: Write through

    When running a simple dd on the 19TB XFS I only get 170 MB/s max. How did you configure your RAID6?

    Thanks

    • I can’t speak to the performance numbers, because I no longer remember the specifics. What information about the RAID6 were you looking for? I was using LSI hardware RAID for this setup. I wrote a puppet module to manage it. I can release it shortly if it would help you a lot.

  3. Hi,

    What is the total available space of the cluster? Is it about 80T? My calculation is:
    12 x 2TB in RAID6 => about 20TB
    8 bricks of 20TB => 160TB
    replica factor 2 of 160TB => 80TB

    Thanks,
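
As a rough check of the arithmetic above, here is a small sketch. It assumes each RAID6 array gives up two drives' worth of capacity to parity and it ignores filesystem overhead (another commenter reports about 19TB usable per brick after formatting). The function and parameter names are made up for illustration.

```python
# Sketch only: names are hypothetical, sizes are in decimal TB.

def raid6_brick_tb(drives_per_array, drive_tb):
    """Capacity of one RAID6 array: two drives' worth of space goes to parity."""
    if drives_per_array < 4:
        raise ValueError("RAID6 needs at least four drives")
    return (drives_per_array - 2) * drive_tb


def cluster_usable_tb(hosts, arrays_per_host, drives_per_array, drive_tb, replica):
    """Usable space of a distributed-replicate volume before filesystem overhead."""
    brick_tb = raid6_brick_tb(drives_per_array, drive_tb)
    total_tb = hosts * arrays_per_host * brick_tb
    # Every file is stored `replica` times, so divide the summed brick capacity.
    return total_tb / replica


brick = raid6_brick_tb(drives_per_array=12, drive_tb=2)
usable = cluster_usable_tb(hosts=4, arrays_per_host=2,
                           drives_per_array=12, drive_tb=2, replica=2)
print("per brick:", brick, "TB")
print("usable:", usable, "TB")
```

With four hosts, two 12-drive arrays per host, 2TB drives and a replication factor of two, this gives 20TB per brick, 160TB across the eight bricks, and about 80TB usable, which matches the estimate in the comment.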

  4. Hi James, I’m working on a gluster solution pretty similar to this one. I’ve got a few questions for you:

    -Why did you set up RAID 6 instead of using gluster replication feature? Is it better for performance?
    -How many CPUs do your nodes have? 1 or 2?
    -How much RAM per node?

    Great posts by the way!

    • Hi Juan,

      This is an old post, and I don’t use this particular cluster anymore. Anyways, some answers:

      I used RAID6 _and_ gluster replication.
      I had 2 CPU’s I think, but this was chosen because it was a hybrid solution, and not necessarily dedicated gluster hosts.
      I forget how much RAM.

      You should contact Red Hat for Gluster support. They have supported hardware configurations and recommendations that are up to date and in line with recent GlusterFS/RHS releases. There is also a gluster-users mailing list and IRC channel.

      HTH

  5. You can skip the first question :P, I’ve got a better one:

    -Since you are already using RAID 6, so you are covered from a disk failure, why would you use replication too?

    Just to level up the HA even more? Replication would cover you from a node failure when RAID 6 wouldn’t, is that it?

    Thanks again!

    • Correct!

      Also, keep in mind that if you have *many* servers, disk failures are more common. Having more redundancy (across servers) can be very useful! You want to keep the whole pool up, even if multiple servers fail.

  6. You are right, didn’t notice the date haha, sorry. Still one of the best posts related to gluster implementations.

    Thanks again!

  7. Hi,

    I know it’s a very old topic for you and you don’t use the configuration anymore, but I have a question. Why did you use keepalived? Did you mount the volume via NFS?

    Thanks!
