The switch as an ordinary GNU/Linux server

The fact that we manage the switches in our data centres differently than any other server is patently absurd, but we do so because we want to harness the power of a tiny bit of silicon which happens to be able to dramatically speed up the switching bandwidth.

absurd

beware of proprietary silicon, it’s absurd!

That tiny bit of silicon is known as an ASIC, or an application specific integrated circuit, and one particularly well performing ASIC (which is present in many commercially available switches) is called the Trident.

None of this should impact the end-user management experience, however, because the big switch companies and chip makers believe that there is some special differentiation in their IP, they’ve ensured that the stacks and software surrounding the hardware is highly proprietary and difficult to replace. This also lets them create and sell bundled products and features that you don’t want, but which you can’t get elsewhere.

This is still true today! System and network engineers know too well the hassles of dealing with the different proprietary switch operating systems and interfaces. Why not standardize on the well-known interface that every GNU/Linux server uses.

We’re talking about iptables of course! (Although nftables would be an acceptable standard too!) This way we could have a common interface for all the networked devices in our server room.

I’ve been able to work around this limitation in the past, by using Linux to do my routing in software, and by building the routers out of COTS 2U GNU/Linux boxes. The trouble with this approach, is that they’re bigger, louder, more expensive, consume more power, and don’t have the port density that a 48 port 1U switch does.

a 48 port, 1U switch

a 48 port, 1U switch

It turns out that there is a company which is actually trying to build this mythical box. It is not perfect, but I think they are on the right track. What follows are my opinions of what they’ve done right, what’s wrong, and what I’d like to see in the future.

Who are they?

They are Cumulus Networks, and I recently got to meet, demo and discuss with one of their very talented engineers, Leslie Carr. I recently attended a talk that she gave on this very same subject. She gave me a rocket turtle. (Yes, this now makes me biased!)

my rocket turtle, the cumulus networks mascot

my rocket turtle, the cumulus networks mascot

What are they doing?

You buy an existing switch from your favourite vendor. You then throw out (flash over) the included software, and instead, pay them a yearly licensing fee to use the “Cumulus” GNU/Linux. It comes as an OS image, based off of Debian.

How does it talk to the ASIC?

The OS image comes with a daemon called switchd that transfers the kernel iptables rules down into the ASIC. I don’t know the specifics on how this works exactly because:

  1. Switchd is proprietary. Apparently this is because of a scary NDA they had to sign, but it’s still unfortunate, and it is impeding my hacking.
  2. I’m not an expert on talking to ASIC’s. I’m guessing that unless you’ve signed the NDA’s, and you’re behind the Trident paywall, then it’s tough to become one!

Problems with packaging:

The OS is only distributed as a single image. This is an unfortunate mistake! It should be available from the upstream project with switchd (and any other add-ons) as individual .deb packages. That way, I know I’m getting a stock OS which is preferably even built and signed by the Debian team! That way I could use the same infrastructure for my servers to keep all my servers up to date.

Problems with OS security:

Unfortunately the OS doesn’t benefit from any of the standard OS security enhancements like SELinux. I’d prefer running a more advanced distro like RHEL or CentOS that have these things out of the box, but if Cumulus will continue using Debian, then they must include some more advanced security measures. I didn’t find AppArmor or grsecurity in use either. It did seem to have all the important bash security updates:

cumulus@switch1$ bash --version
GNU bash, version 4.2.37(1)-release (powerpc-unknown-linux-gnu)
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Use of udev:

This switch does seem to support and use udev, although not being a udev expert I can’t comment on if it’s done properly or not. I’d be interested to hear from the pros. Here’s what I found:

cumulus@switch1$ cd /etc/udev/
cumulus@switch1$ tree
.
`-- rules.d
    |-- 10-cumulus.rules
    |-- 60-bridge-network-interface.rules -> /dev/null
    |-- 75-persistent-net-generator.rules -> /dev/null
    `-- 80-networking.rules -> /dev/null

1 directory, 4 files
cumulus@switch1$ cat rules.d/10-cumulus.rules | tail -n 7
# udev rules specific to Cumulus Linux

# Rule called when the linux-user-bde driver is loaded
ACTION=="add" SUBSYSTEM=="module" DEVPATH=="/module/linux_user_bde" ENV{DEVICE_NAME}="linux-user-bde" ENV{DEVICE_TYPE}="c" ENV{DEVICE_MINOR}="0" RUN="/usr/lib/cumulus/udev-module"

# Quanta LY8 uses RTC1
KERNEL=="rtc1", PROGRAM="/usr/bin/platform-detect", RESULT=="quanta,ly8_rangeley", SYMLINK+="rtc"

Other things:

There seems to be a number of extra things running on the switch. Here’s what I mean:

cumulus@switch1$ ps auxwww 
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   2516   860 ?        Ss   Nov05   0:02 init [3]  
root         2  0.0  0.0      0     0 ?        S    Nov05   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        S    Nov05   0:05 [ksoftirqd/0]
root         5  0.0  0.0      0     0 ?        S    Nov05   0:00 [kworker/u:0]
root         6  0.0  0.0      0     0 ?        S    Nov05   0:00 [migration/0]
root         7  0.0  0.0      0     0 ?        S<   Nov05   0:00 [cpuset]
root         8  0.0  0.0      0     0 ?        S<   Nov05   0:00 [khelper]
root         9  0.0  0.0      0     0 ?        S<   Nov05   0:00 [netns]
root        10  0.0  0.0      0     0 ?        S    Nov05   0:02 [sync_supers]
root        11  0.0  0.0      0     0 ?        S    Nov05   0:00 [bdi-default]
root        12  0.0  0.0      0     0 ?        S<   Nov05   0:00 [kblockd]
root        13  0.0  0.0      0     0 ?        S<   Nov05   0:00 [ata_sff]
root        14  0.0  0.0      0     0 ?        S    Nov05   0:00 [khubd]
root        15  0.0  0.0      0     0 ?        S<   Nov05   0:00 [rpciod]
root        17  0.0  0.0      0     0 ?        S    Nov05   0:00 [khungtaskd]
root        18  0.0  0.0      0     0 ?        S    Nov05   0:00 [kswapd0]
root        19  0.0  0.0      0     0 ?        S    Nov05   0:00 [fsnotify_mark]
root        20  0.0  0.0      0     0 ?        S<   Nov05   0:00 [nfsiod]
root        21  0.0  0.0      0     0 ?        S<   Nov05   0:00 [crypto]
root        34  0.0  0.0      0     0 ?        S    Nov05   0:00 [scsi_eh_0]
root        36  0.0  0.0      0     0 ?        S    Nov05   0:00 [kworker/u:2]
root        41  0.0  0.0      0     0 ?        S    Nov05   0:00 [mtdblock0]
root        42  0.0  0.0      0     0 ?        S    Nov05   0:00 [mtdblock1]
root        43  0.0  0.0      0     0 ?        S    Nov05   0:00 [mtdblock2]
root        44  0.0  0.0      0     0 ?        S    Nov05   0:00 [mtdblock3]
root        49  0.0  0.0      0     0 ?        S    Nov05   0:00 [mtdblock4]
root       362  0.0  0.0      0     0 ?        S    Nov05   0:54 [hwmon0]
root       363  0.0  0.0      0     0 ?        S    Nov05   1:01 [hwmon1]
root       398  0.0  0.0      0     0 ?        S    Nov05   0:04 [flush-8:0]
root       862  0.0  0.1  28680  1768 ?        Sl   Nov05   0:03 /usr/sbin/rsyslogd -c4
root      1041  0.0  0.2   5108  2420 ?        Ss   Nov05   0:00 /sbin/dhclient -pf /run/dhclient.eth0.pid -lf /var/lib/dhcp/dhclient.eth0.leases eth0
root      1096  0.0  0.1   3344  1552 ?        S    Nov05   0:16 /bin/bash /usr/bin/arp_refresh
root      1188  0.0  1.1  15456 10676 ?        S    Nov05   1:36 /usr/bin/python /usr/sbin/ledmgrd
root      1218  0.2  1.1  15468 10708 ?        S    Nov05   5:41 /usr/bin/python /usr/sbin/pwmd
root      1248  1.4  1.1  15480 10728 ?        S    Nov05  39:45 /usr/bin/python /usr/sbin/smond
root      1289  0.0  0.0  13832   964 ?        SNov05   0:00 /sbin/auditd
root      1291  0.0  0.0  10456   852 ?        S
root     12776  0.0  0.0      0     0 ?        S    04:31   0:01 [kworker/0:0]
root     13606  0.0  0.0      0     0 ?        S    05:05   0:00 [kworker/0:2]
root     13892  0.0  0.0      0     0 ?        S    05:13   0:00 [kworker/0:1]
root     13999  0.0  0.0   2028   512 ?        S    05:16   0:00 sleep 30
cumulus  14016  0.0  0.1   3324  1128 pts/0    R+   05:17   0:00 ps auxwww
root     30713 15.6  2.4  69324 24176 ?        Ssl  Nov05 304:16 /usr/sbin/switchd -d
root     30952  0.0  0.3  11196  3500 ?        Ss   Nov05   0:00 sshd: cumulus [priv]
cumulus  30954  0.0  0.1  11196  1684 ?        S    Nov05   0:00 sshd: cumulus@pts/0
cumulus  30955  0.0  0.2   4548  2888 pts/0    Ss   Nov05   0:01 -bash

In particular, I’m referring to ledmgrd, pwmd, smond and others. I don’t doubt these things are necessary and useful, in fact, they’re written in python and should be easy to reverse if anyone is interested, but if they’re a useful part of a switch capable operating system, I hope that they grow proper upstream projects and get appropriate documentation, licensing, and packaging too!

Switch ports are network devices:

Hacking on the device couldn’t feel more native. Anyone remember how to enumerate the switch ports on IOS? … Who cares! Try the standard iproute2 tools on a Cumulus box:

cumulus@switch1$ ip a s swp42
44: swp42: <broadcast,multicast> mtu 1500 qdisc noop state DOWN qlen 500
    link/ether 08:9e:01:f8:96:74 brd ff:ff:ff:ff:ff:ff
cumulus@switch1$ ip a | grep swp | wc -l
52

What about ifup/ifdown?

This one is a bit different:

cumulus@switch1$ file /sbin/ifup
/sbin/ifup: symbolic link to `/sbin/ifupdown'
cumulus@switch1$ file /sbin/ifupdown 
/sbin/ifupdown: Python script, ASCII text executable

The Cumulus team encountered issues with the traditional ifup/ifdown tools found in a stock distro. So they replaced them, with shiny python versions:

https://github.com/CumulusNetworks/ifupdown2

I hope that this project either gets into the upstream distro, or that some upstream writes tools that address the limitations in the various messes of shell scripts. I’m optimistic about networkd being the solution here, but until that’s fully baked, the Cumulus team has built a nice workaround. Additionally, until the Debian team finalizes on the proper technical decision to use SystemD, it has a bleak future.

Kernel:

All the kernel hackers out there will want to know what’s under the hood:

cumulus@switch1$ uname -a
Linux leaf1 3.2.46-1+deb7u1+cl2.2+1 #3.2.46-1+deb7u1+cl2.2+1 SMP Thu Oct 16 14:28:31 PDT 2014 ppc powerpc GNU/Linux

Because this is an embedded chip found in a 1U box, and not an Xeon processor, it’s noticeably slower than a traditional server. This is of course (non-sarcastically) exactly what I want. For admin tasks, it has plenty of power, and this trade-off means it has lower power consumption and heat production than a stock server. While debugging some puppet code that I was running takes longer than normal on this box, I was eventually able to get the job done. This is another reason why this box needs to act like more of an upstream distro — if it did, I’d be able to have a faster machine as my dev box!

Other tools:

Other stock tools like ethtool, and brctl, work out of the box. Bonding, vlan’s and every other feature I tested seems to work the same way you expect from a GNU/Linux system.

Puppet and automation:

Readers of my blog will know that I manage my servers with Puppet. Instead of having the puppet agent connect over an API to the switch, you can directly install and run puppet on this Cumulus Linux machine! Some users might quickly jump to using the firewall module as the solution for consistent management, but a level two user will know that a higher level wrapper around shorewall is the better approach. This is all possible with this switch and seems to work great! The downside was that I had to manually add repositories to get the shorewall packages because it is not a stock distro :(

Why not SDN?

SDN or software-defined networking, is a fantastic and interesting technology. Unfortunately, it’s a parallel problem to what I’m describing in this article, and not a solution to help you work around the lack of a good GNU+Linux switch. If programming the ASIC’s wasn’t an NDA requiring activity, I bet we’d see a lot more innovative networking companies and technologies pop up.

Future?

This product isn’t quite baked yet for me to want to use it in production, but it’s so tantalizingly close that it’s strongly worth considering. I hope that they react positively to my suggestions and create an even more native, upstream environment. Once this is done, it will be my go to, for all my switching!

Thanks:

Thanks very much to the Cumulus team for showing me their software, and giving me access to demo it on some live switches. I didn’t test performance, but I have no doubt that it competes with the market average. Prove me right by trying it out yourself!

Thanks for listening, and Happy hacking!

James

PS: Special thanks to David Caplan for the great networking discussions we had!

Building base images for Vagrant with a Makefile

I needed a base image “box” for my Puppet-Gluster+Vagrant work. It would have been great if good boxes already existed, and even better if it were easy to build my own. As it turns out, I wasn’t able to satisfy either of these conditions, so I’ve had to build one myself! I’ve published all of my code, so that you can use these techniques and tools too!

Status quo:

Having an NIH problem is bad for your vision, and it’s best to benefit from existing tools before creating your own. I first tried using vagrant-cachier, and then veewee, and packer. Vagrant-cachier is a great tool, but it turned out not being very useful because there weren’t any base images available for download that met my needs. Veewee and packer can build those images, but they both failed in doing so for different reasons. Hopefully this situation will improve in the future.

Writing a script:

I started by hacking together a short shell script of commands for building base images. There wasn’t much programming involved as the process was fairly linear, but it was useful to figure out what needed getting done.

I decided to use the excellent virt-builder command to put together the base image. This is exactly what it’s good at doing! To install it on Fedora 20, you can run:

$ sudo yum install libguestfs-tools

It wasn’t available in Fedora 19, but after a lot of pain, I managed to build (mostly correct?) packages. I have posted them online if you are brave (or crazy?) enough to want them.

Using the right tool:

After building a few images, I realized that a shell script was the wrong tool, and that it was time for an upgrade. What was the right tool? GNU Make! After working on this for more hours than I’m ready to admit, I present to you, a lovingly crafted virtual machine base image (“box”) builder:

Makefile

The Makefile itself is quite compact. It uses a few shell scripts to do some of the customization, and builds a clean image in about ten minutes. To use it, just run make.

Customization:

At the moment, it builds x86_64, CentOS 6.5+ machines for vagrant-libvirt, but you can edit the Makefile to build a custom image of your choosing. I’ve gone out of my way to add an $(OUTPUT) variable to the Makefile so that your generated files get saved in /tmp/ or somewhere outside of your source tree.

Download the image:

If you’d like to download the image that I generated, it is being generously hosted by the Gluster community here. If you’re using the Vagrantfile from my Puppet-Gluster+Vagrant setup, then you don’t have to download it manually, this will happen automatically.

Open issues:

The biggest issue with the images is that SELinux gets disabled! You might be okay with this, but it’s actually quite unfortunate. It is disabled to avoid the SELinux relabelling that happens on first boot, as this overhead defeats the usefulness of a fast vagrant deployment. If you know of a way to fix this problem, please let me know!

Example output:

If you’d like to see this in action, but don’t want to run it yourself, here’s an example run:

$ date && time make && date
Mon Jan 20 10:57:35 EST 2014
Running templater...
Running virt-builder...
[   1.0] Downloading: http://libguestfs.org/download/builder/centos-6.xz
[   4.0] Planning how to build this image
[   4.0] Uncompressing
[  19.0] Resizing (using virt-resize) to expand the disk to 40.0G
[ 173.0] Opening the new disk
[ 181.0] Setting a random seed
[ 181.0] Setting root password
[ 181.0] Installing packages: screen vim-enhanced git wget file man tree nmap tcpdump htop lsof telnet mlocate bind-utils koan iftop yum-utils nc rsync nfs-utils sudo openssh-server openssh-clients
[ 212.0] Uploading: files/epel-release-6-8.noarch.rpm to /root/epel-release-6-8.noarch.rpm
[ 212.0] Uploading: files/puppetlabs-release-el-6.noarch.rpm to /root/puppetlabs-release-el-6.noarch.rpm
[ 212.0] Uploading: files/selinux to /etc/selinux/config
[ 212.0] Deleting: /.autorelabel
[ 212.0] Running: yum install -y /root/epel-release-6-8.noarch.rpm && rm -f /root/epel-release-6-8.noarch.rpm
[ 214.0] Running: yum install -y bash-completion moreutils
[ 235.0] Running: yum install -y /root/puppetlabs-release-el-6.noarch.rpm && rm -f /root/puppetlabs-release-el-6.noarch.rpm
[ 239.0] Running: yum install -y puppet
[ 254.0] Running: yum update -y
[ 375.0] Running: files/user.sh
[ 376.0] Running: files/ssh.sh
[ 376.0] Running: files/network.sh
[ 376.0] Running: files/cleanup.sh
[ 377.0] Finishing off
Output: /home/james/tmp/builder/gluster/builder.img
Output size: 40.0G
Output format: qcow2
Total usable space: 38.2G
Free space: 37.3G (97%)
Running convert...
Running tar...
./Vagrantfile
./metadata.json
./box.img

real	9m10.523s
user	2m23.282s
sys	0m37.109s
Mon Jan 20 11:06:46 EST 2014
$

If you have any other questions, please let me know!

Happy hacking,

James

PS: Be careful when writing Makefile‘s. They can be dangerous if used improperly, and in fact I once took out part of my lib/ directory by running one. Woops!

UPDATE: This technique now exists in it’s own repo here: https://github.com/purpleidea/vagrant-builder

New puppet-gluster features before Linuxcon

Hey there,

I’ve done a bit of puppet-gluster hacking lately to try to squeeze some extra features and testing in before Linuxcon. Here’s a short list:

If there are features or bugs that you’d like to see added or removed (respectively) please let me know ASAP so that I can try to get something ready for you before my Linuxcon talk. I also don’t have any hardware RAID or physical hardware to test external storage partitions (bricks) on. If you have any that you can donate or let me hack on for a while, there are some features I’d like to test. Contact me!

I’ve got a few more things queued up too, but you’ll have to wait and see. Until then,

Happy hacking,

James

Upgrading to centos 6.4 with shorewall onboard

In case you upgrade your CentOS 6.x box to version 6.4, the shorewall service might complain. With a scary message:

ERROR: Your kernel/iptables do not include state match support.
No version of Shorewall will run on this system

This is selinux at work, and the problem can easily be solved by running:

# restorecon -Rv /sbin

Thanks shorewall-users and

Happy hacking,

James

 

SElinux causes pain when using puppet 2.x with hiera

So hiera wasn’t working when used through my puppetmaster. It worked perfectly when I was running my scripts manually with: puppet apply site.pp but the moment I switched over to regular puppetmasterd usage, everything went dead.

I realized a while back, that I could always expect some hiera failures from time to time. Whether this is hiera’s fault or not is irrelevant, the relevant part is that I quickly added:

$test = hiera('test', '')    # expects: 'Hiera is working!'
if "${test}" == '' {
    fail('The hiera function is not working as expected!')
}

to my site.pp and now I don’t risk breaking/changing a node because some data is missing. Naturally you’ll need to add a little message in your globals.yaml or similar. If you’re really cautious, you could change the above to an inequality and have your code “expect” a magic message.

To continue with the story, blkperl from #puppet recommended that I try running puppetmasterd manually to watch for any messages it might print. To my surprise, things started working! This just shows you how helpful it is to have a second set of eyes to help you on your way. So why did this work ?

To make a long story short, after following a few dead ends, it hit me, and so I looked in /var/log/audit/audit.log:

type=AVC msg=audit(1358404083.789:55416): avc:  denied  { getattr } for  pid=13830 comm="puppetmasterd" path="/etc/puppet/hiera.yaml" dev=dm-0 ino=18613624 scontext=unconfined_u:system_r:puppetmaster_t:s0 tcontext=unconfined_u:object_r:admin_home_t:s0 tclass=file

to find that selinux was once again causing me pain. When I had started puppetmasterd manually, I was root, which allowed me to bypass selinux rules. Sadly, I like selinux, but since I’m not nearly clever enough to want to learn how to fix this the right way, it just got disabled on another one of my machines.

Running:

restorecon -v /etc/puppet/hiera.yaml

fixed the bad selinux context I had on that file.

Hope this saves you some time and,

Happy hacking,

James

Clustering virtual machines with rgmanager and clusvcadm

This could be a post detailing how to host clustered virtual machines with rgmanager and clusvcadm, but that is a longer story and there is much work to do. For now, I will give you a short version including an informative “gotcha”.

With my cluster up and running, I added a virtual machine entry to my cluster.conf:

<vm name="test1" domain="somedomain" path="/shared/vm/" autostart="0" exclusive="0" recovery="restart" use_virsh="1" />

This goes inside the <rm> block. As a benchmark, please note that starting the machine with virsh worked perfectly:

[root@server1 ~]# virsh create /shared/vm/test1.xml --console
(...The operation worked perfectly!)

However, when I attempted to use the cluster aware tools, all I got was failure:

[root@server1 ~]# clusvcadm -e 'vm:test1' -m server1
Member server1 trying to enable vm:test1...Failure

Whenever I think I’ve done everything right, but something is still not working, I first check to see if I can blame someone else. Usually that someone is selinux. Make no mistake, selinux is a good thing, however it does still cause me pain.

The first clue is to remember that /var/log/ contains other files besides “messages“. Running a tail on /var/log/audit/audit.log while simultaneously running the above clusvcadm command revealed:

type=AVC msg=audit(1357202069.310:10904): avc:  denied  { read } for  pid=15675 comm="virsh" name="test1.xml" dev=drbd0 ino=198628 scontext=unconfined_u:system_r:xm_t:s0 tcontext=unconfined_u:object_r:default_t:s0 tclass=file
type=SYSCALL msg=audit(1357202069.310:10904): arch=c000003e syscall=2 success=no exit=-13 a0=24259e0 a1=0 a2=7ffff03af0d0 a3=7ffff03aee10 items=0 ppid=15609 pid=15675 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=1 comm="virsh" exe="/usr/bin/virsh" subj=unconfined_u:system_r:xm_t:s0 key=(null)

I am not a magician, but if I was, I would probably understand what all of that means. For now, let’s pretend that we do. Closer inspection (or grep) will reveal:

  • test1.xml” (the definition for the virtual machine)

and:

  • “/usr/bin/virsh” (the command that I expect rgmanager’s /usr/share/cluster/vm.sh script to run)

A quick:

[root@server1 ~]# selinuxenabled && echo t || echo f
t

to confirm that selinux is auditing away, and a short:

[root@server1 ~]# /bin/echo 0 > /selinux/enforce

to temporarily test my theory, and:

[root@server1 ~]# clusvcadm -e 'vm:test1' -m server1
Member server1 trying to enable vm:test1...Success
vm:test1 is now running on server1

Presto change-o, the diagnosis is complete. This is a development system, and so for the time being, I will accept defeat and workaround this problem by turning selinux off, but this is most definitely the wrong solution. If you’re an selinux guru who knows the proper fix, please let me know! Until then,

Happy Hacking,

James