Automatic clustering in mgmt

In mgmt, deploying and managing your clustered config management infrastructure needs to be as automatic as the infrastructure you’re using mgmt to manage. Instead of relying on a centralized data store, mgmt functions as a distributed system, built on top of etcd and the raft consensus protocol.

In this article, I’ll cover how this feature works.

Foreword:

Mgmt is a next-generation configuration management project. If you haven’t heard of it yet, or you don’t remember why we use a distributed database, start by reading the previous articles on this blog.

Embedded etcd:

Since mgmt and etcd are both written in golang, the etcd code can be built into the same binary as mgmt. As a result, etcd can be managed directly from within mgmt. Unfortunately, there’s currently no recommended API to do this, but I’ve tried to get such a feature upstream to avoid code duplication in mgmt. If you can help out here, I’d really appreciate it! In the meantime, I’ve had to copy+paste the necessary portions into mgmt.
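
If you’re curious what an etcd server running in-process looks like, here is a minimal Go sketch using the embed package that etcd now ships. To be clear, this is not the code path mgmt uses (as mentioned, no suitable upstream API existed, which is why the relevant portions were copied in); it only illustrates the general idea of running etcd inside another Go program:

package main

import (
	"log"
	"time"

	"go.etcd.io/etcd/server/v3/embed" // import path varies by etcd version
)

func main() {
	// Configure and start an etcd server inside this process.
	cfg := embed.NewConfig()
	cfg.Dir = "/tmp/embedded-etcd" // where etcd keeps its data

	e, err := embed.StartEtcd(cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer e.Close()

	select {
	case <-e.Server.ReadyNotify():
		log.Println("embedded etcd server is ready")
	case <-time.After(60 * time.Second):
		e.Server.Stop() // trigger a shutdown if startup hangs
		log.Fatal("embedded etcd server took too long to start")
	}

	// Block until the server exits with an error.
	log.Fatal(<-e.Err())
}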

Clustering mechanics:

You can deploy an automatically clustered mgmt setup by following these three steps:

1) If no mgmt servers exist, you can start one up by running mgmt normally:

./mgmt run --file examples/graph0.yaml

2) To add any subsequent mgmt server, run mgmt normally, but point it at any number of existing mgmt servers with the --seeds flag:

./mgmt run --file examples/graph0.yaml --seeds <ip address:port>

3) Profit!

We internally implement a clustering algorithm which does the hard work of building and managing the etcd cluster for you, so that you don’t have to. If you’re interested, keep reading to find out how it works!

Clustering algorithm:

The clustering algorithm works as follows:

If you aren’t given any seeds, then assume you are the first etcd server (peer) and start up. If you are given a seeds argument, then connect to that peer to join the cluster as a client. If you’d like to be promoted to a server, then you can “volunteer” by setting a special key in the cluster data store.

The existing cluster of peers will decide if they want additional peers, and if so, they can “nominate” someone from the pool of volunteers. If you have been nominated, you can start up an etcd server and peer with the rest of the cluster. Similarly, the cluster can decide to un-nominate a peer, and if you’ve been un-nominated, then you should shut down your etcd server.

All cluster decisions are made by consensus using the raft algorithm. In practice this means that the elected cluster leader looks at the state of the system, and makes the necessary nomination changes.

Lastly, if you don’t want to be a peer any more, you can revoke your volunteer message. The cluster will see this, and if you were running a server, you should receive an un-nominate message in response, which will let you shut down cleanly.
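
To make the volunteer and nominate exchange more concrete, here is a minimal Go sketch of the idea using the public etcd v3 client API. The key paths (/_mgmt/volunteers/... and /_mgmt/nominated/...) are hypothetical placeholders chosen for illustration; they are not the exact keys mgmt uses internally:

package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3" // import path varies by etcd version
)

func main() {
	// Connect to an existing peer as a plain client (the --seeds step).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// "Volunteer" by publishing our peer URL under a well-known key.
	// NOTE: this key layout is a hypothetical example, not mgmt's real schema.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	_, err = cli.Put(ctx, "/_mgmt/volunteers/h2", "http://127.0.0.1:2382")
	cancel()
	if err != nil {
		log.Fatal(err)
	}

	// Watch for nominations: a PUT means "start an etcd server and peer up",
	// a DELETE means "you've been un-nominated, shut your server down".
	for wresp := range cli.Watch(context.Background(), "/_mgmt/nominated/h2") {
		for _, ev := range wresp.Events {
			switch ev.Type {
			case clientv3.EventTypePut:
				log.Println("nominated: start the embedded etcd server")
			case clientv3.EventTypeDelete:
				log.Println("un-nominated: shut the etcd server down")
			}
		}
	}
}

The point is that the coordination layer is just a handful of keys and watches on top of etcd; each host reacts to what the elected leader writes.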

Disclaimer:

It’s probably worth mentioning that the current implementation has a few issues, and at least one race. The goal is to have it polished up by the time etcd v3 is released, but it’s perfectly usable for testing and experimentation today! If you don’t want to automatically cluster, you can always use the --no-server flag, and point mgmt at a manually managed mgmt cluster using the --seeds flag.

Testing:

Testing this feature on a single machine makes development and experimentation easier, so there are a few flags that make this possible.

--hostname <hostname>
With this flag, you can force your mgmt client to pretend it is running on a host with the given name. You can use this to specify --hostname h1, --hostname h2, and so on: one for each mgmt agent you want to run on the same machine.

--server-urls <ip:port>
With this flag you can specify which IP address and port the etcd server will listen on for peer requests. By default this will use 127.0.0.1:2380, but when running multiple mgmt agents on the same machine you’ll need to specify this manually to avoid collisions. You can specify as many IP address and port pairs as you’d like by separating them with commas or semicolons. The --peer-urls flag is an alias which does the same thing.

--client-urls <ip:port>
This flag specifies which IP address and port the etcd server will listen on for client connections. It defaults to 127.0.0.1:2379, but you’ll occasionally want to specify this manually for the same reasons as mentioned above. You can specify as many IP address and port pairs as you’d like by separating them with commas or semicolons. This is the address that will be used by the --seeds flag when joining an existing cluster.

Elastic clustering:

In the future, you’ll be able to specify a much more elaborate method to decide how many hosts should be promoted into peers, and which hosts should be nominated or un-nominated when growing or shrinking the cluster.

At the moment, we do the grow or shrink operation when the current peer count does not match the requested cluster size. This value has a default of 5, and can even be changed dynamically. To do so you can run:

ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2379 put /_mgmt/idealClusterSize 3

You can also set it at start-up by using the --ideal-cluster-size flag.
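
If you’re wondering what that check looks like in code, here is a hedged Go sketch: the elected leader reads idealClusterSize, compares it against the current etcd member count, and grows or shrinks accordingly. The nominateOneVolunteer and unNominateOnePeer helpers are hypothetical stand-ins for mgmt’s internal nomination logic, not real functions from the project:

package main

import (
	"context"
	"log"
	"strconv"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3" // import path varies by etcd version
)

// Hypothetical helpers standing in for mgmt's internal nomination logic.
func nominateOneVolunteer(ctx context.Context, cli *clientv3.Client) error { return nil }
func unNominateOnePeer(ctx context.Context, cli *clientv3.Client) error    { return nil }

// maintainClusterSize is a simplified sketch of the grow/shrink check that
// the elected leader performs; the real logic lives inside mgmt.
func maintainClusterSize(ctx context.Context, cli *clientv3.Client) error {
	// Read the requested cluster size from the shared data store.
	resp, err := cli.Get(ctx, "/_mgmt/idealClusterSize")
	if err != nil {
		return err
	}
	ideal := 5 // the default mentioned above
	if len(resp.Kvs) == 1 {
		if n, err := strconv.Atoi(string(resp.Kvs[0].Value)); err == nil {
			ideal = n
		}
	}

	// Count the peers that are currently members of the etcd cluster.
	members, err := cli.MemberList(ctx)
	if err != nil {
		return err
	}
	current := len(members.Members)

	switch {
	case current < ideal:
		return nominateOneVolunteer(ctx, cli) // grow: promote a volunteer
	case current > ideal:
		return unNominateOnePeer(ctx, cli) // shrink: demote a peer
	}
	return nil // already at the requested size
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := maintainClusterSize(ctx, cli); err != nil {
		log.Fatal(err)
	}
}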

Example:

Here’s a real example if you want to dive in. Try running the following four commands in separate terminals:

./mgmt run --file examples/etcd1a.yaml --hostname h1 --ideal-cluster-size 3
./mgmt run --file examples/etcd1b.yaml --hostname h2 --seeds http://127.0.0.1:2379 --client-urls http://127.0.0.1:2381 --server-urls http://127.0.0.1:2382
./mgmt run --file examples/etcd1c.yaml --hostname h3 --seeds http://127.0.0.1:2379 --client-urls http://127.0.0.1:2383 --server-urls http://127.0.0.1:2384
./mgmt run --file examples/etcd1d.yaml --hostname h4 --seeds http://127.0.0.1:2379 --client-urls http://127.0.0.1:2385 --server-urls http://127.0.0.1:2386

Once you’ve done this, you should have a three-host cluster (the fourth mgmt agent stays a client, since the ideal cluster size is three)! Check this by running any of these commands:

ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2379 member list
ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2381 member list
ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2383 member list

Note that you’ll need a v3 beta version of the etcdctl command, which you can get by running ./build in the etcd git repo.

To grow your cluster, try increasing the desired cluster size to five:

ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2381 put /_mgmt/idealClusterSize 5

You should see the last host start up an etcd server. If you reduce the idealClusterSize, you’ll see servers shut down! You’re responsible if you destroy the cluster by setting it too low! You can then try growing your cluster again, but unfortunately, due to a bug, hosts can’t be re-used yet, and you’ll get a “bind: address already in use” error. We hope to have this fixed shortly!

Security:

Unfortunately, no authentication or transport security has been implemented yet. We have a great design, but are busy working on other parts of the project at the moment. If you’d like to help out here, please let us know!

Future work:

There’s still a lot of work to do to improve this feature. The biggest challenge has been getting a reasonable embedded server API upstream. It’s not clear whether this patch can be made to work or if something different will need to be written, but at least one other project looks like it could benefit from this as well.

Video:

A recording from the recent CoreOS Fest 2016 in Berlin has been published! I demoed these recent features, but one interesting note is that I was actually presenting an earlier version of the code, which used the etcd v2 API. I’ve since ported the code to v3, but it is functionally similar. It’s probably worth mentioning that I found the v3 API to be more difficult, but also more correct and powerful. I think it is a net improvement to the project.

Community:

I can’t end this blog post without mentioning some of the great stuff that’s been happening in the mgmt community! In particular, Felix has written some great code to run existing Puppet code on mgmt. Check out his work!

Upcoming speaking:

I’ll be speaking about the project in Hong Kong at HKOSCon16 and in Cape Town at DebConf16. Please ping me if you’ll be in one of these cities and would like to hack on mgmt or just chat about the project. I’m happy to give some impromptu demos if you ask!

Thanks for reading!

Happy Hacking,

James

PS: We now have a community-run twitter account. Check us out!

Clustering virtual machines with rgmanager and clusvcadm

This could be a post detailing how to host clustered virtual machines with rgmanager and clusvcadm, but that is a longer story and there is much work to do. For now, I will give you a short version including an informative “gotcha”.

With my cluster up and running, I added a virtual machine entry to my cluster.conf:

<vm name="test1" domain="somedomain" path="/shared/vm/" autostart="0" exclusive="0" recovery="restart" use_virsh="1" />

This goes inside the <rm> block. As a baseline, please note that starting the machine with virsh worked perfectly:

[root@server1 ~]# virsh create /shared/vm/test1.xml --console
(...The operation worked perfectly!)

However, when I attempted to use the cluster aware tools, all I got was failure:

[root@server1 ~]# clusvcadm -e 'vm:test1' -m server1
Member server1 trying to enable vm:test1...Failure

Whenever I think I’ve done everything right, but something is still not working, I first check to see if I can blame someone else. Usually that someone is selinux. Make no mistake, selinux is a good thing; however, it does still cause me pain.

The first clue is to remember that /var/log/ contains other files besides “messages”. Running a tail on /var/log/audit/audit.log while simultaneously running the above clusvcadm command revealed:

type=AVC msg=audit(1357202069.310:10904): avc:  denied  { read } for  pid=15675 comm="virsh" name="test1.xml" dev=drbd0 ino=198628 scontext=unconfined_u:system_r:xm_t:s0 tcontext=unconfined_u:object_r:default_t:s0 tclass=file
type=SYSCALL msg=audit(1357202069.310:10904): arch=c000003e syscall=2 success=no exit=-13 a0=24259e0 a1=0 a2=7ffff03af0d0 a3=7ffff03aee10 items=0 ppid=15609 pid=15675 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=1 comm="virsh" exe="/usr/bin/virsh" subj=unconfined_u:system_r:xm_t:s0 key=(null)

I am not a magician, but if I were, I would probably understand what all of that means. For now, let’s pretend that we do. Closer inspection (or grep) will reveal:

  • “test1.xml” (the definition for the virtual machine)

and:

  • “/usr/bin/virsh” (the command that I expect rgmanager’s /usr/share/cluster/vm.sh script to run)

A quick:

[root@server1 ~]# selinuxenabled && echo t || echo f
t

to confirm that selinux is auditing away, and a short:

[root@server1 ~]# /bin/echo 0 > /selinux/enforce

to temporarily test my theory, and:

[root@server1 ~]# clusvcadm -e 'vm:test1' -m server1
Member server1 trying to enable vm:test1...Success
vm:test1 is now running on server1

Presto change-o, the diagnosis is complete. This is a development system, and so for the time being, I will accept defeat and work around this problem by turning selinux off, but this is most definitely the wrong solution. If you’re an selinux guru who knows the proper fix, please let me know! Until then,

Happy Hacking,

James


How I broke (and fixed) my rgmanager service

Rgmanager, clustat and clusvcadm are useful tools in cluster land. I recently built a custom resource which I added to one of my service chains. Upon inspecting clustat, I noticed:

[root@server1 ~]# clustat
Member Status: Quorate

Member Name                             ID   Status
------ ----                             ---- ------
server1                                 1 Online, Local, rgmanager
server2                                 2 Online, rgmanager

Service Name                   Owner (Last)                   State
------- ----                   ----- ------                   -----
service:service-main-server1   (server1)                      failed

Looking at /var/log/messages, I found:

server1 rgmanager: [script] script:shorewall-reload: start of shorewall-reload.sh failed (returned 2)
server1 rgmanager: start on script "shorewall-reload" returned 1 (generic error)

This was peculiar, because my script didn’t have an exit code of 2 anywhere. It turned out to be a syntax error (whoops)! With the syntax error fixed, I still had trouble getting the service going again. In the logs I found these:

server1 rgmanager: #68: Failed to start service:service-main-server1; return value: 1
server1 rgmanager: Stopping service service:service-main-server1
server1 rgmanager: #12: RG service:service-main-server1 failed to stop; intervention required
server1 rgmanager: Service service:service-main-server1 is failed
server1 rgmanager: #13: Service service:service-main-server1 failed to stop cleanly

Running commands like clusvcadm -e service-main-server1 didn’t help. It turns out that you first have to convince rgmanager that you truly fixed the problem by disabling the service. Then you can safely enable it again and things should work smoothly:

clusvcadm -d service-main-server1
clusvcadm -e service-main-server1

Hopefully you’ve now got your feet wet with this clustering intro! Remember that you can look in the logs for clues and run clustat -i 1 in a screen session to keep tabs on things.

Happy Hacking,

James