How to avoid cluster race conditions or: How to implement a distributed lock manager in puppet

I’ve been working on a puppet module for gluster. This module, my puppet-gfs2 module, and other puppet clustering modules all share a common problem: how does one make sure that only certain operations happen on one node at a time?

The inelegant solutions are simple:

  1. Specify manually (in puppet) which node the “master” is, and have it carry out all the special operations. Downside: Single point of failure for your distributed cluster, and you’ve also written ugly asymmetrical code. Build a beautiful, decentralized setup instead.
  2. Run all your operations on all nodes. Ensure they’re idempotent, and that they check the cluster state for success first. Downside: You have to hope that they don’t run simultaneously and race somehow. This is actually how I engineered my first version of puppet-gluster! It was low risk, and I just wanted to get the cluster up; my module was just a tool, not a product.
  3. Use the built-in puppet DLM to coordinate running of these tasks. Downside: Oh wait, puppet can’t do this; you misunderstand the puppet architecture. I too thought this was possible, for a short while. Whoops! There is no DLM.

Note: I know this is ironic, since by default puppet requires a master node for coordination; however, you can have multiple masters, and if they’re down, puppetd still runs on the clients, it just doesn’t receive new information. (You could also reconfigure your manifests to work around these downsides as they arise, but that takes the point out of puppet: keeping it automatic.)

Mostly elegant: Thoughts of my other cluster crept into my head, and out of nowhere I realized the solution: VRRP! You may use a different mechanism if you like, but at the moment I’m using keepalived. Keepalived runs on my gluster pool to provide a VIP for the cluster. This gives my clients a highly available IP address to download volume files from (the mount operation); otherwise, if that particular server were down, they wouldn’t be able to mount. The trick: I tell each node what the expected VIP for the cluster is, and if that IP is present in one of the node’s facter $ipaddress_* facts, then I let that node execute!
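
To make the trick concrete, here is a minimal sketch in puppet. The class, parameter, interface and volume names are hypothetical placeholders (not necessarily what my module actually uses); the point is just the guard condition around the “run once, somewhere” operations:

# Hypothetical sketch: only the node that currently holds the VIP runs
# the "one node at a time" operations. All names here are placeholders.
class gluster_example::elect (
    $vip = ''    # the keepalived VIP, a static "master" IP, or '' to run everywhere
) {
    # Addresses this node currently holds, straight from facter.
    # eth0/bond0 are placeholders for whatever interfaces you use.
    $mine = [$::ipaddress, $::ipaddress_eth0, $::ipaddress_bond0]

    if ($vip == '') or ($vip in $mine) {
        # Idempotent, cluster-wide operation, guarded so that only the
        # current VIP holder ever executes it.
        exec { 'create-examplevol':
            command => '/usr/sbin/gluster volume create examplevol replica 2 annex1:/data/brick1 annex2:/data/brick1',
            unless  => '/usr/sbin/gluster volume info examplevol',
        }
    }
}

If keepalived moves the VIP because the holder died, a different node simply passes the test on its next puppet run, so there’s no single point of failure baked into the manifests.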

The code is now available, please have a look, and let me know what you think.

Happy hacking,
James

PS: Inelegant modes 1 and 2 are still available. For mode 1, set “vip” in your config to the master node IP address you’d like to use. For mode 2, leave the “vip” value at the empty string default.
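
In terms of the hypothetical sketch from earlier, the three modes would be selected roughly like this (class name and addresses are placeholders):

# Hypothetical usage; substitute your own class name and addresses.
node 'annex1.example.com' {
    # Elegant mode: point "vip" at the keepalived VIP; whichever node
    # currently holds it gets to run the special operations.
    class { 'gluster_example::elect': vip => '203.0.113.42' }

    # Mode 1: hard-code the IP address of a chosen "master" node.
    #class { 'gluster_example::elect': vip => '203.0.113.101' }

    # Mode 2: leave the empty string default and rely on idempotence.
    #class { 'gluster_example::elect': vip => '' }
}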

PPS: I haven’t had a chance to thoroughly test this, so be warned of any dust bunnies you might find.

puppet gluster module now in git

The thoughtful bodepd has been kind enough to help me get my puppet-gluster module off the ground, and to publicize it a bit too. My first few commits have all been cleanup, to get my initial hacking up to snuff with the puppet style guidelines. Sadly, I love indenting my code with tabs, and this is against the puppet rules :(

I’ll be accepting patches by email, but I’d prefer discussion first, especially since I’ve got a few obvious things brewing in my mental queue that should hit master shortly.

Are you a gluster expert who’s weak at puppet? I’m keen to implement many of the common RAID, file system and gluster performance optimizations directly in the module, so that the out-of-the-box experience for new users is fast and streamlined.

Are you a puppet expert who knows a bit of gluster? I’m not sure what the best way is to handle large config changes, such as expanding volumes or replacing bricks. I can imagine a large state diagram that would be very hard to implement wholly in puppet. So for now, I’m missing a few edge cases, but hopefully this module will be able to solve more of them over time.

I’ve included an examples/ directory in the repository, to give you an idea of how this works for now. Stay tuned for more commits!

git clone https://github.com/purpleidea/puppet-gluster.git

Happy hacking,
James

a puppet module for gluster

I am an avid cobbler+puppet user. This allows me to rely on my cobbler server and puppet manifests to describe how servers/workstations are set up. I only back up my configs and data, and I regenerate failed machines as needed (PRN).

I’ll be publishing my larger cobbler+puppet infrastructure in the future once it’s been cleaned up a bit, but for now I figured I’d post my work-in-progress “puppet-gluster” module, since it seems there’s a real interest.

Warning: there are some serious issues with this module! I’ve used this as an easy way to build out many servers with cobbler+puppet automatically. It’s not necessarily the best long-term solution, and it certainly can’t handle certain scenarios yet, but it is a stepping stone if someone would like to think about writing such a module this way.

For lack of better hosting, it’s now available here: https://dl.dropbox.com/u/48553683/puppet-gluster.tar.bz2 Once I finish cleaning up a bit of cruft, I’ll post my git tree somewhere sane. All of this code is AGPL3+ so share and enjoy!

What’s next? My goal is to find the interested parties and provoke a bit of discussion as to whether this is useful and where to go next. It makes sense to me that the gluster experts chime in and add gluster-specific optimizations to this module, so that it becomes a sort of de facto documentation on how to set up gluster properly.

I believe that Dan Bode and other gluster developers are already interested in the “puppet module” project, and that work is underway. I spoke to him briefly about collaborating. He is most likely a more experienced puppet user than I, and so I look forward to the community getting a useful puppet-gluster module from somewhere. Maybe even native gluster types?

Happy hacking,
James


now syndicated on “planet gluster”

Many thanks to johnmark in #gluster for syndicating my “gluster” tagged blog posts on http://www.gluster.org/blog/

I aim to keep these posts technical and informative, written mostly for other sysadmins and gluster users. Please don’t be shy about commenting on my writing style, or letting me know if you need more information about a particular subject. If you have any ideas about things you’d like me to write about, let me know and I’ll try to do my best. I like feedback!

Happy hacking,
James

PS: My main blog (https://ttboj.wordpress.com/) will still contain other technical articles not relating to gluster, should that be useful to you.

building intel nic driver (igb) for gluster on centos

I’ve been having some strange networking issues with gluster. “Eco__” from #gluster suggested I try an up-to-date Intel NIC driver. Here are the steps I followed to make that happen. No news yet on whether that solved the problem.

Currently my system is using the igb (Intel gigabit) driver. To find out what version you are running:

# modinfo -F version igb
3.2.10-k

I found a newer version on the Supermicro FTP site. A download and a decompress later, I found an igb-3.4.7.tar.gz file hiding out. Thanks to the kind people at Intel, this was fairly easy to compile and install. First install some deps:

# yum install /usr/bin/{rpmbuild,gcc,make} kernel-devel

Use rpmbuild to make yourself an rpm:

# rpmbuild -tb igb-3.4.7.tar.gz
[snip]

Your rpm package should appear in rpmbuild/RPMS/.
In my case, I added this to my local cobbler repo, and pushed it to all my gluster nodes. You might prefer a simple:

 # yum install igb-3.4.7-1.x86_64.rpm

Please note that I believe it’s important to build this module on the same kind of OS/Hardware that you’re using it for. Since my storage hosts are all identical, this wasn’t a problem.
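
Since the rest of my setup is puppet-managed anyway, another option is to let puppet pull the package in. A minimal sketch, assuming you’ve already published the rpm in a yum repository your nodes can reach (the version string is whatever your rpmbuild produced):

# Sketch: install the locally built igb rpm from an already configured repo.
package { 'igb':
    ensure => '3.4.7-1',    # match the version of the rpm you built
}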

Happy hacking!
James

EDIT: tru_tru from #gluster pointed out that this module actually exists in elrepo, the wonderful people who also provide drbd modules. I haven’t tested it, but I’m sure it’s excellent.

my gluster setup, described

For the last two or so years I’ve played with and tested gluster on and off, while hanging out in the awesome #gluster channel on Freenode. In case you haven’t heard, gluster was acquired by Red Hat back in October 2011. This post describes my current setup. I urge you to send your comments and suggestions for improvement. I’ll update this as needed.

Hardware:
Ideology: I wanted to build individual, self-contained storage hosts. I didn’t want servers with separate (serial) attached storage (SAS) like Dell often pushes. Supermicro fit the design goal, and sold me when I realized I could have the OS drives swappable out the back.

  • 4 x Supermicro 6047R-E1R24N
  • 4 x 24 x 2TiB, 3.5″ HDD (front, hot swappable main storage)
  • 4 x 2 x 600GiB, 2.5″ HDD (rear, hot swappable OS drives, awesome feature!)
  • 2 x quality stacked switches (with one leg of each bond device out to each switch)
  • IPMI: absolutely required (It seems it’s a bit buggy! I’ve had problems where the SOL console stops responding when dealing with a big stream of data, and I can only rescue it with a cold reset of the BMC.) Overall it’s been sufficient to get me up and running.

OS:

  • CentOS 6.3+. I would consider using RHEL if their sales department could get organized, and once RHEL integrates into my cobbler+puppet build system.
  • Bonded (eth0,eth1 -> bond0) ethernet for each machine, with a possible upgrade to bonded 10GbE if ever needed. Interface eth0 on each machine plugs into switch0, and eth1 on each machine plugs into switch1. (A rough sketch of the ifcfg files follows this list.)
  • The 24 storage HDDs are split into two separate RAID 6 sets per machine.
  • The OS HDDs are in software RAID 1. Unfortunately anaconda/kickstart doesn’t support RAID 1 for the EFI boot partitions. Maybe someone could fix this! (HINT, HINT)
  • The machines pxeboot, kickstart and configure themselves automatically with cobbler+puppet.
  • The LSI MSM tool (for monitoring the RAID) seems to give me a lot of trouble with false-positive warnings about temperature thresholds. Apart from being stuck with proprietary crapware, it does actually email me when drives fail. Alternatives welcome! I deploy this with a puppet module that I wrote. If it weren’t for that, this step would drive me insane.
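
As promised in the bonding bullet above, here’s roughly what those interface configs look like on CentOS 6, written as puppet file resources. This is only a sketch: the address, netmask and bonding mode are placeholders (pick the mode that matches your switches), and ifcfg-eth1 is identical to ifcfg-eth0 apart from DEVICE=eth1.

# Sketch of the RHEL/CentOS 6 bonding config; values are placeholders.
file { '/etc/sysconfig/network-scripts/ifcfg-bond0':
    content => "DEVICE=bond0\nBOOTPROTO=none\nONBOOT=yes\nIPADDR=203.0.113.11\nNETMASK=255.255.255.0\nBONDING_OPTS=\"mode=active-backup miimon=100\"\n",
}

file { '/etc/sysconfig/network-scripts/ifcfg-eth0':
    content => "DEVICE=eth0\nBOOTPROTO=none\nONBOOT=yes\nMASTER=bond0\nSLAVE=yes\n",
}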

Gluster:

  • Each host has its drives split into two bricks. A gluster engineer recommended this for the type of setup I’m running.
  • Each RAID6 set is formatted with xfs.
  • Keepalived maintains a VIP (which I’ll replace with cman/corosync one day) that serves as the client hostname to connect to. This makes my setup a bit more highly available if one or more nodes go down.
  • I have a puppet module which I use to describe/build my gluster setup. It’s not perfect, but it works for me ™. I’m cleaning it up, and will post it shortly. (A rough sketch of what such a description might look like follows this list.)
  • I’m using a distributed-replicate setup, with eight bricks (2 per node).
  • I originally used the official packages to get my gluster RPMs, but recently I switched to using kkeithle’s. Thanks for your hard work!
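
To give a flavour of what “describe/build my gluster setup” means, here’s a very rough sketch of the kind of declaration I have in mind. The defined types below are stubs invented purely for illustration (they are not the module’s real interface); the layout matches the setup above: four hosts, two bricks each, replica 2.

# Stub definitions so this sketch stands alone; the real module does real work.
define gluster_example::host () {
    notify { "gluster host: ${name}": }
}
define gluster_example::volume ($replica, $bricks) {
    notify { "gluster volume: ${name} (replica ${replica})": }
}

# Four hosts, eight bricks (two per node), one distributed-replicate volume.
gluster_example::host { ['annex1', 'annex2', 'annex3', 'annex4']: }

gluster_example::volume { 'examplevol':
    replica => 2,
    bricks  => [
        'annex1:/data/brick1', 'annex2:/data/brick1',
        'annex3:/data/brick1', 'annex4:/data/brick1',
        'annex1:/data/brick2', 'annex2:/data/brick2',
        'annex3:/data/brick2', 'annex4:/data/brick2'
    ],
}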

Conclusion:

Let me know what other nitpick details you want to know about and I’ll post them. A lot of things can also be inferred by reading my puppet module.

Happy Hacking,
James

IPMI for linux professionals

Serial console servers, KVMs and switched PDUs are hopefully nothing more than nostalgia in your server room by now. If they’re not, you should definitely start playing catch-up. Please forgive my ignorance, but these things might still be common for big Windows shops; however, if that’s the case, you’ve got an entirely different set of problems ;)

IPMI is an IP-based protocol that allows you to talk directly to a little computer (the BMC), usually built into your server. It lets you remotely manage power (on, off, reboot, cycle…), get a serial console, collect sensor readings like temperatures, and do other magical things too, if you care to figure them out.

The web talks a lot about all this. I’ll give you the short “need to know” list to get you going.

  1. It probably makes sense to have the IPMI device of your DHCP server (or whatever network dependencies you have) set statically, so that this works if DHCP is down. I’ve actually never heard of anyone who had this problem, but it seems logical enough that I figured I’d mention it.
  2. Set an IPMI password and put the device on a separate layer two network behind your router and firewall. Most servers bond the IPMI device to your “eth0” by default (at layer 2), or let you split it off to a separate physical interface if so desired. Do the split and plug it into your management network. Remind me to talk about my dual-router topology one day.
  3. When you use cobbler to kickstart your machines, you’ll need this in your kopts:
    console=ttyS1,115200
    Don’t bother wasting your time configuring that manually when anaconda takes care of this for you :)
  4. Almost all server hardware uses the second serial device (ttyS1) as the one that is linked to the IPMI hardware. In some insane default BIOSes you might have to enable this.
  5. Once installed, that kopt usually ensures the correct magic gets added to grub, and also to whatever spawns your serial tty. Feel free to grep to see what your $OS did.
  6. ipmitool -I lanplus -H <ip address of ipmi device> -U ADMIN sol activate
    If this ever gets stuck, run a ‘deactivate’ first.
  7. Learn the ~. disconnect sequence. If you’re connected over ssh to your ipmi client (which I always am, since it’s my router) you’ll need to type “~~” to skip “through” the ssh escape character, and then a period “.”, exactly as you would to disconnect from ssh. The same logic applies if you’re insane and run screen -> ssh -> screen.
  8. You might need to do a “reset+clear” if the BIOS throws crap down the wire at you. I haven’t found a way to avoid this. It’s generally not a big problem for me, because it only happens if I’m watching the BIOS at boot, which only really happens if I’m bored on first install.

Happy Hacking!

James