Clustering virtual machines with rgmanager and clusvcadm

This could be a post detailing how to host clustered virtual machines with rgmanager and clusvcadm, but that is a longer story and there is much work to do. For now, I will give you a short version including an informative “gotcha”.

With my cluster up and running, I added a virtual machine entry to my cluster.conf:

<vm name="test1" domain="somedomain" path="/shared/vm/" autostart="0" exclusive="0" recovery="restart" use_virsh="1" />

This goes inside the <rm> block of /etc/cluster/cluster.conf. For context, here is a minimal sketch of the surrounding configuration; the failover domain layout is only an illustration (your node and domain names will differ):
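<rm>
    <failoverdomains>
        <failoverdomain name="somedomain" ordered="0" restricted="0">
            <failoverdomainnode name="server1"/>
            <failoverdomainnode name="server2"/>
        </failoverdomain>
    </failoverdomains>
    <vm name="test1" domain="somedomain" path="/shared/vm/" autostart="0" exclusive="0" recovery="restart" use_virsh="1" />
</rm>

As a baseline, please note that starting the machine with virsh directly worked perfectly: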

[root@server1 ~]# virsh create /shared/vm/test1.xml --console
(...The operation worked perfectly!)
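One caveat if you test this way: remember to tear the test instance down again before handing control to the cluster, so that rgmanager doesn't find the domain already running. Something like:

[root@server1 ~]# virsh destroy test1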

However, when I attempted to use the cluster-aware tools, all I got was failure:

[root@server1 ~]# clusvcadm -e 'vm:test1' -m server1
Member server1 trying to enable vm:test1...Failure

Whenever I think I’ve done everything right, but something is still not working, I first check to see if I can blame someone else. Usually that someone is selinux. Make no mistake: selinux is a good thing; however, it does still cause me pain.

The first clue is to remember that /var/log/ contains other files besides “messages”. Running tail -f on /var/log/audit/audit.log while simultaneously re-running the above clusvcadm command revealed:

type=AVC msg=audit(1357202069.310:10904): avc:  denied  { read } for  pid=15675 comm="virsh" name="test1.xml" dev=drbd0 ino=198628 scontext=unconfined_u:system_r:xm_t:s0 tcontext=unconfined_u:object_r:default_t:s0 tclass=file
type=SYSCALL msg=audit(1357202069.310:10904): arch=c000003e syscall=2 success=no exit=-13 a0=24259e0 a1=0 a2=7ffff03af0d0 a3=7ffff03aee10 items=0 ppid=15609 pid=15675 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=1 comm="virsh" exe="/usr/bin/virsh" subj=unconfined_u:system_r:xm_t:s0 key=(null)

I am not a magician, but if I were, I would probably understand what all of that means. For now, let’s pretend that we do. Closer inspection (or grep) reveals two familiar names:

  • “test1.xml” (the definition for the virtual machine)

and:

  • “/usr/bin/virsh” (the command that I expect rgmanager’s /usr/share/cluster/vm.sh script to run)

In other words, virsh, running in the confined xm_t domain, is being denied read access to test1.xml, because that file on the shared drbd0 device is labelled default_t.
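If, like me, you don’t read raw AVC records fluently, the audit2why tool (part of the policycoreutils tool set) will translate denials into something closer to English, and audit2allow can even propose a policy module, though I haven’t gone down that road here:

[root@server1 ~]# audit2why < /var/log/audit/audit.log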

A quick:

[root@server1 ~]# selinuxenabled && echo t || echo f
t

to confirm that selinux is auditing away, and a short:

[root@server1 ~]# /bin/echo 0 > /selinux/enforce

to temporarily test my theory, and:

[root@server1 ~]# clusvcadm -e 'vm:test1' -m server1
Member server1 trying to enable vm:test1...Success
vm:test1 is now running on server1

Presto change-o, the diagnosis is complete. This is a development system, and so for the time being, I will accept defeat and work around this problem by turning selinux off, but this is most definitely the wrong solution.
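Spelled out, the workaround is just the first two lines below; the semanage lines after them are my untested guess at what a proper relabel might look like, and the virt_etc_t type in particular is an assumption on my part, not something I have verified:

[root@server1 ~]# setenforce 0    # permissive until the next reboot
[root@server1 ~]# sed -i 's/^SELINUX=.*/SELINUX=permissive/' /etc/selinux/config    # make it stick (again: wrong solution!)
[root@server1 ~]# semanage fcontext -a -t virt_etc_t "/shared/vm(/.*)?"    # guessed fix: give the VM definitions a readable label...
[root@server1 ~]# restorecon -Rv /shared/vm    # ...and apply it

If you’re an selinux guru who knows the proper fix, or can confirm or correct the guess above, please let me know! Until then,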

Happy Hacking,

James


How I broke (and fixed) my rgmanager service

Rgmanager, clustat and clusvcadm are useful tools in cluster land. I recently built a custom resource which I added to one of my service chains. Upon inspecting clustat, I noticed:

[root@server1 ~]# clustat
Member Status: Quorate

Member Name                             ID   Status
------ ----                             ---- ------
server1                                 1 Online, Local, rgmanager
server2                                 2 Online, rgmanager

Service Name                   Owner (Last)                   State
------- ----                   ----- ------                   -----
service:service-main-server1   (server1)                      failed
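For context, the custom resource is an ordinary <script> resource wrapping my shorewall-reload.sh, sitting inside the service. A simplified sketch (the script path and the service attributes are placeholders):

<service name="service-main-server1" autostart="1" recovery="restart">
    <script name="shorewall-reload" file="/shared/scripts/shorewall-reload.sh"/>
</service>

rgmanager drives a <script> resource by calling it with start, stop and status arguments, so the script needs to behave like an LSB init script.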

Looking at /var/log/messages, I found:

server1 rgmanager: [script] script:shorewall-reload: start of shorewall-reload.sh failed (returned 2)
server1 rgmanager: start on script "shorewall-reload" returned 1 (generic error)

This was peculiar, because my script didn’t exit with a code of 2 anywhere. The culprit was a syntax error (woops!): bash exits with status 2 when it can’t parse a script, which is where the mysterious return code came from. With the syntax error fixed, I still had trouble getting the service going again. In the logs I found these:

server1 rgmanager: #68: Failed to start service:service-main-server1; return value: 1
server1 rgmanager: Stopping service service:service-main-server1
server1 rgmanager: #12: RG service:service-main-server1 failed to stop; intervention required
server1 rgmanager: Service service:service-main-server1 is failed
server1 rgmanager: #13: Service service:service-main-server1 failed to stop cleanly

Running commands like clusvcadm -e service-main-server1 didn’t help. It turns out that once a service has landed in the failed state, you first have to convince rgmanager that you truly fixed the problem by disabling the service; only then can you safely enable it again and things should work smoothly:

clusvcadm -d service-main-server1
clusvcadm -e service-main-server1

Hopefully you’ve now got your feet wet with this clustering intro! Remember that you can look in the logs for clues, and that running clustat -i 1 in a screen session is a handy way to keep tabs on things; for example (the session name is arbitrary):
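[root@server1 ~]# screen -S cluster
[root@server1 ~]# clustat -i 1    # redraw the cluster status every second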

Happy Hacking,

James