How to avoid cluster race conditions or: How to implement a distributed lock manager in puppet

I’ve been working on a puppet module for gluster. This module, my puppet-gfs2 module, and other puppet clustering modules all share a common problem: how does one make sure that certain operations happen on only one node at a time?

The inelegant solutions are simple:

  1. Specify manually (in puppet) which node the “master” is, and have it carry out all the special operations. Downside: Single point of failure for your distributed cluster, and you’ve also written ugly asymmetrical code. Build a beautiful, decentralized setup instead.
  2. Run all your operations on all nodes. Ensure they’re idempotent, and that they check the cluster state for success first. Downside: you have to hope that they don’t run simultaneously and race somehow. This is actually how I engineered my first version of puppet-gluster! It was low risk, and I just wanted to get the cluster up; my module was just a tool, not a product.
  3. Use the built-in puppet DLM to coordinate running of these tasks. Downside: Oh wait, puppet can’t do this; if you think it can, you misunderstand the puppet architecture. I too thought this was possible for a short while. Whoops! There is no DLM.
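
To make option 2 concrete, here is a minimal Python sketch of a "check the cluster state first" operation. It is not code from puppet-gluster: FakeCluster, volume_exists, create_volume and ensure_volume are all hypothetical names, and the cluster is just an in-memory set, so treat it as an illustration of the idea only.

```python
class FakeCluster:
    """Hypothetical in-memory stand-in for real cluster state."""

    def __init__(self):
        self.volumes = set()
        self.create_calls = 0

    def volume_exists(self, name):
        return name in self.volumes

    def create_volume(self, name):
        self.create_calls += 1
        self.volumes.add(name)


def ensure_volume(cluster, name):
    """Create the volume only if the cluster does not already have it.

    Returns True if this call changed anything, False if it was a no-op.
    """
    if cluster.volume_exists(name):
        return False
    # Nothing stops another node from running the same check right here,
    # which is exactly the race that option 2 has to hope never happens.
    cluster.create_volume(name)
    return True


cluster = FakeCluster()
first = ensure_volume(cluster, "gv0")   # does the work
second = ensure_volume(cluster, "gv0")  # finds it done, changes nothing
```

Running it twice is harmless, which is the idempotent part. The gap between the check and the create is the part that still needs some kind of coordination.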

Note: I know this is ironic, since by default puppet requires a master node for coordination. However, you can have multiple masters, and if they’re all down, puppetd still runs on the clients; it just doesn’t receive new information. (You could also reconfigure your manifests to work around these downsides as they arise, but that defeats the point of puppet: keeping it automatic.)

Mostly elegant: Thoughts of my other cluster crept into my head. Out of nowhere, I realized the solution was: VRRP! You may use a different mechanism if you like, but at the moment I’m using keepalived. Keepalived runs on my gluster pool to provide a VIP for the cluster. This gives my clients a highly available IP address from which to download volume files (the mount operation). Without it, if the particular server a client pointed at were down, that client wouldn’t be able to mount. The trick: I tell each node what the expected VIP for the cluster is, and if that IP is present in one of that node’s facter $ipaddress_ values, then I let that node execute!
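
The check itself is tiny. Here is a rough Python sketch of the logic, assuming the node's facts are available as a plain dict. The function name holds_vip, the interface names and the addresses are all made up for illustration; the real module does this inside puppet, not Python.

```python
def holds_vip(facts, vip):
    """Return True if any ipaddress_<interface> fact on this node equals the VIP."""
    if not vip:
        return False
    return any(
        value == vip
        for name, value in facts.items()
        if name.startswith("ipaddress_")
    )


# Hypothetical facts for two nodes; only node_a currently holds the VIP.
vip = "192.168.1.10"
node_a = {"hostname": "a", "ipaddress_eth0": "192.168.1.11", "ipaddress_eth0_1": "192.168.1.10"}
node_b = {"hostname": "b", "ipaddress_eth0": "192.168.1.12"}

a_runs = holds_vip(node_a, vip)
b_runs = holds_vip(node_b, vip)
```

Since VRRP is meant to keep the VIP on one node at a time, at most one node should pass this check during a given run. If the VIP moves between fact collection and execution, the answer can be stale, so this is a coordination hint rather than a real lock.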

The code is now available; please have a look, and let me know what you think.

Happy hacking,

PS: Inelegant modes 1 and 2 are still available. For mode 1, set “vip” in your config to the master node IP address you’d like to use. For mode 2, leave the “vip” value at the empty string default.
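
Put another way, the single “vip” setting covers all three behaviours. A small Python sketch of that decision, assuming the node already knows its own list of IP addresses (may_execute and local_ips are hypothetical names, not part of the module):

```python
def may_execute(vip, local_ips):
    """Decide whether this node runs the special cluster operations.

    vip == ""       -> mode 2: every node runs them.
    vip == some IP  -> only the node that currently has that IP runs them.
                       This covers both mode 1 (a fixed master IP) and the
                       keepalived VIP case.
    """
    if vip == "":
        return True
    return vip in local_ips


# Made-up addresses for illustration.
everyone = may_execute("", ["10.0.0.2"])                  # mode 2
master = may_execute("10.0.0.1", ["10.0.0.1"])            # mode 1, on the master
other = may_execute("10.0.0.1", ["10.0.0.2"])             # mode 1, on another node
holder = may_execute("10.0.0.9", ["10.0.0.2", "10.0.0.9"])  # VIP currently here
```

Mode 1 and the VIP mode are the same test; the only difference is whether the address can move.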

PPS: I haven’t had a chance to thoroughly test this, so be warned of any dust bunnies you might find.

6 thoughts on “How to avoid cluster race conditions or: How to implement a distributed lock manager in puppet”

    • You can definitely use the mount option to add more servers, but I don’t think this makes it irrelevant, no.
      Once the mount connects, the volfile it downloads contains the complete list of servers…
      If you were particularly crazy about this, then you could use keepalived to set up two VIPs and specify them both with the mount…
      I forget how many servers can be listed, but technically you could list as many as are allowed too.
      Different options for different choices.

      • Nice writeup.. and apologies for posting five years later (it’s probably no longer relevant).

        I’ve used rrDNS to solve the gluster mount problem. Register all the gluster nodes under the same A record/alias (configure basic rrDNS) then mount the gluster endpoints using this rrDNS record rather than the host’s hostname. You can also implement this locally if you use dnsmasq on each node and let puppet populate a local dnsmasq hosts file (dnsmasq supports rrDNS configured locally in files) using exported resources of the cluster members… this means the gluster component doesn’t need to depend on external DNS for the rrDNS component.

        I found this post attempting to solve another problem (not gluster related).. which is… In specific circumstances I almost want the puppet run to block (but not finish) until the cluster nodes are all available (ie. block until 5 nodes have exported resources registered then continue). The current problem is it can take several (n) puppet runs of each node to achieve a full cluster configuration using exported resources based on discrepancies in node deployment times. I’m sure it just needs a rethink.. as the variables are populated at the start of the run. Dodgy bash loop to the rescue.

        Indeed you could use rrdns instead of vrrp, but there are some disadvantages, including the need to now manage DNS, and the need to take hosts out of the rr list on failure. Again, that’s easier to do with DNS if you’re good at automating it, and you don’t mind waiting for the ~5 min DNS cache to time out.

        Your question about puppet blocking is more interesting. I worked around that issue with this: — After thinking about this problem and being ultimately dissatisfied with that as a solution, I ended up writing mgmt. I noticed you found that blog post! Here’s the project link where you can find a video presentation and more info at the very bottom: Cheers!

      • @purpleidea

        That’s the benefit of using dnsmasq, you can configure rrDNS locally using a dnsmasq hosts file (dnsmasq hosts file supports basic rrDNS config). dnsmasq then uses this file to resolve before going out to real DNS. I populate this file using file_line from exported resources of the cluster members (all with the same name.. eg. name of the cluster).

        I find gluster works through the (dead) rrDNS records fairly quickly.. quick enough to not be a major issue on remount if you configure the retry/timeout in your gluster environment.
