Automatic grouping in mgmt

In this post, I’ll tell you about the recently released “automatic grouping” or “AutoGroup” feature in mgmt, a next generation configuration management prototype. If you aren’t already familiar with mgmt, I’d recommend you start by reading the introductory post, and the second post. There’s also an introductory video.

Resources in a graph

Most configuration management systems use something called a directed acyclic graph, or DAG. This is a fancy way of saying that it is a bunch of circles (vertices) which are connected with arrows (edges). The arrows must be connected to exactly two vertices, and you’re only allowed to move along each arrow in one direction (directed). Lastly, if you start at any vertex in the graph, you must never be able to return to where you started by following the arrows (acyclic). If you can, the graph is not fit for our purpose.

A DAG from Wikipedia

An example DAG from Wikipedia

The graphs in configuration management systems usually represent the dependency relationships (edges) between the resources (vertices) which is important because you might want to declare that you want a certain package installed before you start a service. To represent the kind of work that you want to do, different kinds of resources exist which you can use to specify that work.

Each of the vertices in a graph represents a unique resource, and each is backed by an individual software routine or “program” which can check the state of the resource, and apply the correct state if needed. This makes each resource idempotent. If we have many individual programs, this might turn out to be a lot of work to do to get our graph into the desired state!

Resource grouping

It turns out that some resources have a fixed overhead to starting up and running. If we can group resources together so that they share this fixed overhead, then our graph might converge faster. This is exactly what we do in mgmt!

Take for example, a simple graph such as the following:

Simple DAG showing three pkg, two file, and one svc resource

Simple DAG showing one svc, two file, and three pkg resources…

We can logically group the three pkg resources together and redraw the graph so that it now looks like this:

DAG with the three pkg resources now grouped into one.

DAG with the three pkg resources now grouped into one! Overlapping vertices mean that they act as if they’re one vertex instead of three!

This all happens automatically of course! It is very important that the new graph is a faithful, logical representation of the original graph, so that the specified dependency relationships are preserved. What this represents, is that when multiple resources are grouped (shown by overlapping vertices in the graph) they run together as a single unit. This is the practical difference between running:

$ dnf install -y powertop
$ dnf install -y sl
$ dnf install -y cowsay

if not grouped, and:

$ dnf install -y powertop sl cowsay

when grouped. If you try this out you’ll see that the second scenario is much faster, and on my laptop about three times faster! This is because of fixed overhead such as cache updates, and the dnf dep solver that each process runs.

This grouping means mgmt uses this faster second scenario instead of the slower first scenario that all the current generation tools do. It’s also important to note that different resources can implement the grouping feature to optimize for different things besides performance. More on that later…

The algorithm

I’m not an algorithmist by training, so it took me some fiddling to come up with an appropriate solution. I’ve implemented it along with an extensive testing framework and a series of test cases, which it passes of course! If we ever find a graph that does not get grouped correctly, then we can iterate on the algorithm and add it as a new test case.

The algorithm turns out to be relatively simple. I first noticed that vertices which had a relationship between them must not get grouped, because that would undermine the precedence ordering of the vertices! This property is called reachability. I then attempt to group every vertex to every other vertex that has no reachability or reverse reachability to it!

The hard part turned out to be getting all the plumbing surrounding the algorithm correct, and in particular the actual vertex merging algorithm, so that “discarded edges” are reattached in the correct places. I also took a bit of extra time to implement the algorithm as a struct which satisfies an “AutoGrouper” interface. This way, if you’d like to implement a different algorithm, it’s easy to drop in your replacement. I’m fairly certain that a more optimal version of my algorithm is possible for anyone wishing to do the analysis.

A quick note on nomenclature: I’ve actually decided to call this grouping and not merging, because we actually preserve the unique data of each resource so that they can be taken apart and mixed differently when (and if) there is a change in the compiled graph. This makes graph changeovers very cheap in mgmt, because we don’t have to re-evaluate anything which remains constant between graphs. Merging would imply a permanent reduction and loss of unique identity.

Parallelism and user choice

It’s worth noting two important points:

  1. Auto grouping of resources usually decreases the parallelism of a graph.
  2. The user might not want a particular resource to get grouped!

You might remember that one of the novel properties of mgmt, is that it executes the graph in parallel whenever possible. Although the grouping of resources actually removes some of this parallelism, certain resources such as the pkg resource already have an innate constraint on sequential behaviour, namely: the package manager lock. Since these tools can’t operate in parallel, and since each execution has a fixed overhead, it’s almost always beneficial to group pkg resources together.

Grouping is also not mandatory, so while it is a sensible default, you can disable grouping per resource with a simple meta parameter.

Lastly, it’s also worth mentioning that grouping doesn’t “magically” happen without some effort. The underlying resource needs to know how to optimize, watch, check and apply multiple resources simultaneously for it to support the feature. At the moment, only the pkg resource can do any grouping, and even then, there could always be some room for improvement. It’s also not optimal (or even logical) to group certain types of resources, so those will never be able to do any grouping. We also don’t group together resources of different kinds, although mgmt could support this if a valid use case is ever found.

File grouping

As I mentioned, only the pkg resource supports grouping at this time. The file resource demonstrates a different use case for resource grouping. Suppose you want to monitor 10000 files in a particular directory, but they are specified individually. This would require far too many inotify watches than a normal system usually has, so the grouping algorithm could group them into a single resource, which then uses a recursive watcher such as fanotify to reduce the watcher count by a factor of 10000. Unfortunately neither the file resource grouping, nor the fanotify support for this exist at the moment. If you’d like to implement either of these, please let me know!

If you can think of another resource kind that you’d like to write, or in particular, if you know of one which would work well with resource grouping, please contact me!

Science!

I wouldn’t be a very good scientist (I’m actually a Physiologist by training) if I didn’t include some data and a demonstration that this all actually works, and improves performance! What follows will be a good deal of information, so skim through the parts you don’t care about.

Science <3 data

Science <3 data

I decided to test the following four scenarios:

  1. single package, package check, package already installed
  2. single package, package install, package not installed
  3. three packages, package check, packages already installed
  4. three packages, package install, packages not installed

These are the situations you’d encounter when running your tool of choice to install one or more packages, and finding them either already present, or in need of installation. I timed each test, which ends when the tool tells us that our system has converged.

Each test is performed multiple times, and the average is taken, but only after we’ve run the tool at least twice so that the caches are warm.

We chose small packages so that the fixed overhead delays due to bandwidth and latencies are minimal, and so that our data is more representative of the underlying tool.

The single package tests use the powertop package, and the three package tests use powertop, sl, and cowsay. All tests were performed on an up-to-date Fedora 23 laptop, with an SSD. If you haven’t tried sl and cowsay, do give them a go!

The four tools tested were:

  1. puppet
  2. mgmt
  3. pkcon
  4. dnf

The last two are package manager front ends so that it’s more obvious how expensive something is expected to cost, and so that you can discern what amount of overhead is expected, and what puppet or mgmt is causing you. Here are a few representative runs:

mgmt installation of powertop:

$ time sudo ./mgmt run --file examples/pkg1.yaml --converged-timeout=0
21:04:18 main.go:63: This is: mgmt, version: 0.0.3-1-g6f3ac4b
21:04:18 main.go:64: Main: Start: 1459299858287120473
21:04:18 main.go:190: Main: Running...
21:04:18 main.go:113: Etcd: Starting...
21:04:18 main.go:117: Main: Waiting...
21:04:18 etcd.go:113: Etcd: Watching...
21:04:18 etcd.go:113: Etcd: Watching...
21:04:18 configwatch.go:54: Watching: examples/pkg1.yaml
21:04:20 config.go:272: Compile: Adding AutoEdges...
21:04:20 config.go:533: Compile: Grouping: Algorithm: nonReachabilityGrouper...
21:04:20 main.go:171: Graph: Vertices(1), Edges(0)
21:04:20 main.go:174: Graphviz: No filename given!
21:04:20 pgraph.go:764: State: graphStateNil -> graphStateStarting
21:04:20 pgraph.go:825: State: graphStateStarting -> graphStateStarted
21:04:20 main.go:117: Main: Waiting...
21:04:20 pkg.go:245: Pkg[powertop]: CheckApply(true)
21:04:20 pkg.go:303: Pkg[powertop]: Apply
21:04:20 pkg.go:317: Pkg[powertop]: Set: installed...
21:04:25 packagekit.go:399: PackageKit: Woops: Signal.Path: /8442_beabdaea
21:04:25 packagekit.go:399: PackageKit: Woops: Signal.Path: /8443_acbadcbd
21:04:31 pkg.go:335: Pkg[powertop]: Set: installed success!
21:04:31 main.go:79: Converged for 0 seconds, exiting!
21:04:31 main.go:55: Interrupted by exit signal
21:04:31 pgraph.go:796: Pkg[powertop]: Exited
21:04:31 main.go:203: Goodbye!

real    0m13.320s
user    0m0.023s
sys    0m0.021s

puppet installation of powertop:

$ time sudo puppet apply pkg.pp 
Notice: Compiled catalog for computer.example.com in environment production in 0.69 seconds
Notice: /Stage[main]/Main/Package[powertop]/ensure: created
Notice: Applied catalog in 10.13 seconds

real    0m18.254s
user    0m9.211s
sys    0m2.074s

dnf installation of powertop:

$ time sudo dnf install -y powertop
Last metadata expiration check: 1:22:03 ago on Tue Mar 29 20:04:29 2016.
Dependencies resolved.
==========================================================================
 Package          Arch           Version            Repository       Size
==========================================================================
Installing:
 powertop         x86_64         2.8-1.fc23         updates         228 k

Transaction Summary
==========================================================================
Install  1 Package

Total download size: 228 k
Installed size: 576 k
Downloading Packages:
powertop-2.8-1.fc23.x86_64.rpm            212 kB/s | 228 kB     00:01    
--------------------------------------------------------------------------
Total                                     125 kB/s | 228 kB     00:01     
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Installing  : powertop-2.8-1.fc23.x86_64                            1/1 
  Verifying   : powertop-2.8-1.fc23.x86_64                            1/1 

Installed:
  powertop.x86_64 2.8-1.fc23                                              

Complete!

real    0m10.406s
user    0m4.954s
sys    0m0.836s

puppet installation of powertop, sl and cowsay:

$ time sudo puppet apply pkg3.pp 
Notice: Compiled catalog for computer.example.com in environment production in 0.68 seconds
Notice: /Stage[main]/Main/Package[powertop]/ensure: created
Notice: /Stage[main]/Main/Package[sl]/ensure: created
Notice: /Stage[main]/Main/Package[cowsay]/ensure: created
Notice: Applied catalog in 33.02 seconds

real    0m41.229s
user    0m19.085s
sys    0m4.046s

pkcon installation of powertop, sl and cowsay:

$ time sudo pkcon install powertop sl cowsay
Resolving                     [=========================]         
Starting                      [=========================]         
Testing changes               [=========================]         
Finished                      [=========================]         
Installing                    [=========================]         
Querying                      [=========================]         
Downloading packages          [=========================]         
Testing changes               [=========================]         
Installing packages           [=========================]         
Finished                      [=========================]         

real    0m14.755s
user    0m0.060s
sys    0m0.025s

and finally, mgmt installation of powertop, sl and cowsay with autogrouping:

$ time sudo ./mgmt run --file examples/autogroup2.yaml --converged-timeout=0
21:16:00 main.go:63: This is: mgmt, version: 0.0.3-1-g6f3ac4b
21:16:00 main.go:64: Main: Start: 1459300560994114252
21:16:00 main.go:190: Main: Running...
21:16:00 main.go:113: Etcd: Starting...
21:16:00 main.go:117: Main: Waiting...
21:16:00 etcd.go:113: Etcd: Watching...
21:16:00 etcd.go:113: Etcd: Watching...
21:16:00 configwatch.go:54: Watching: examples/autogroup2.yaml
21:16:03 config.go:272: Compile: Adding AutoEdges...
21:16:03 config.go:533: Compile: Grouping: Algorithm: nonReachabilityGrouper...
21:16:03 config.go:533: Compile: Grouping: Success for: Pkg[powertop] into Pkg[cowsay]
21:16:03 config.go:533: Compile: Grouping: Success for: Pkg[sl] into Pkg[cowsay]
21:16:03 main.go:171: Graph: Vertices(1), Edges(0)
21:16:03 main.go:174: Graphviz: No filename given!
21:16:03 pgraph.go:764: State: graphStateNil -> graphStateStarting
21:16:03 pgraph.go:825: State: graphStateStarting -> graphStateStarted
21:16:03 main.go:117: Main: Waiting...
21:16:03 pkg.go:245: Pkg[autogroup:(cowsay,powertop,sl)]: CheckApply(true)
21:16:03 pkg.go:303: Pkg[autogroup:(cowsay,powertop,sl)]: Apply
21:16:03 pkg.go:317: Pkg[autogroup:(cowsay,powertop,sl)]: Set: installed...
21:16:08 packagekit.go:399: PackageKit: Woops: Signal.Path: /8547_cbeaddda
21:16:08 packagekit.go:399: PackageKit: Woops: Signal.Path: /8548_bcaadbce
21:16:16 pkg.go:335: Pkg[autogroup:(cowsay,powertop,sl)]: Set: installed success!
21:16:16 main.go:79: Converged for 0 seconds, exiting!
21:16:16 main.go:55: Interrupted by exit signal
21:16:16 pgraph.go:796: Pkg[cowsay]: Exited
21:16:16 main.go:203: Goodbye!

real    0m15.621s
user    0m0.040s
sys    0m0.038s

Results and analysis

My hard work seems to have paid off, because we do indeed see a noticeable improvement from grouping package resources. The data shows that even in the single package comparison cases, mgmt has very little overhead, which is demonstrated by seeing that the mgmt run times are very similar to the times it takes to run the package managers manually.

In the three package scenario, performance is approximately 2.39 times faster than puppet for installation. Checking was about 12 times faster! These ratios are expected to grow with a larger number of resources.

Sweet graph...

Bigger bars is worse… Puppet is in Red, mgmt is in blue.

The four groups at the bottom along the x axis correspond to the four scenarios I tested, 1, 2 and 3 corresponding to each run of that scenario, with the average of the three listed there too.

Versions

The test wouldn’t be complete if we didn’t tell you which specific version of each tool that we used. Let’s time those as well! ;)

puppet:

$ time puppet --version 
4.2.1

real    0m0.659s
user    0m0.525s
sys    0m0.064s

mgmt:

$ time ./mgmt --version
mgmt version 0.0.3-1-g6f3ac4b

real    0m0.007s
user    0m0.006s
sys    0m0.002s

pkcon:

$ time pkcon --version
1.0.11

real    0m0.013s
user    0m0.006s
sys    0m0.005s

dnf:

$ time dnf --version
1.1.7
  Installed: dnf-0:1.1.7-2.fc23.noarch at 2016-03-17 13:37
  Built    : Fedora Project at 2016-03-09 16:45

  Installed: rpm-0:4.13.0-0.rc1.12.fc23.x86_64 at 2016-03-03 09:39
  Built    : Fedora Project at 2016-02-29 09:53

real    0m0.438s
user    0m0.379s
sys    0m0.036s

Yep, puppet even takes the longest to tell us what version it is. Now I’m just teasing…

Methodology

It might have been more useful to time the removal of packages instead so that we further reduce the variability of internet bandwidth and latency, although since most configuration management is used to install packages (rather than remove), we figured this would be more appropriate and easy to understand. You’re welcome to conduct your own study and share the results!

Additionally, for fun, I also looked at puppet runs where three individual resources were used instead of a single resource with the title being an array of all three packages, and found no significant difference in the results. Indeed puppet runs dnf three separate times in either scenario:

$ ps auxww | grep dnf
root     12118 27.0  1.4 417060 110864 ?       Ds   21:57   0:03 /usr/bin/python3 /usr/bin/dnf -d 0 -e 0 -y install powertop
$ ps auxww | grep dnf
root     12713 32.7  2.0 475204 159840 ?       Rs   21:57   0:02 /usr/bin/python3 /usr/bin/dnf -d 0 -e 0 -y install sl
$ ps auxww | grep dnf
root     13126  0.0  0.7 275324 55608 ?        Rs   21:57   0:00 /usr/bin/python3 /usr/bin/dnf -d 0 -e 0 -y install cowsay

Data

If you’d like to download the raw data as a text formatted table, and the terminal output from each type of run, I’ve posted it here.

Conclusion

I hope that you enjoyed this feature and analysis, and that you’ll help contribute to making it better. Come join our IRC channel and say hello! Thanks to those who reviewed my article and pointed out some good places for improvements!

Happy Hacking,

James

 

Next generation configuration mgmt

It’s no secret to the readers of this blog that I’ve been active in the configuration management space for some time. I owe most of my knowledge to what I’ve learned while working with Puppet and from other hackers working in and around various other communities.

I’ve published, a number, of articles, in an, attempt, to push, the field, forwards, and to, share the, knowledge, that I’ve, learned, with others. I’ve spent many nights thinking about these problems, but it is not without some chagrin that I realized that the current state-of-the-art in configuration management cannot easily (or elegantly) solve all the problems for which I wish to write solutions.

To that end, I’d like to formally present my idea (and code) for a next generation configuration management prototype. I’m calling my tool mgmt.

Design triad

Mgmt has three unique design elements which differentiate it from other tools. I’ll try to cover these three points, and explain why they’re important. The summary:

  1. Parallel execution, to run all the resources concurrently (where possible)
  2. Event driven, to monitor and react dynamically only to changes (when needed)
  3. Distributed topology, so that scale and centralization problems are replaced with a robust distributed system

The code is available, but you may prefer to read on as I dive deeper into each of these elements first.

1) Parallel execution

Fundamentally, all configuration management systems represent the dependency relationships between their resources in a graph, typically one that is directed and acyclic.

directed acyclic graph g1, showing the dependency relationships with black arrows, and the linearized dependency sort order with red arrows.

Directed acyclic graph g1, showing the dependency relationships with black arrows, and the linearized dependency sort order (a topological sort) with red arrows.

Unfortunately, the execution of this graph typically has a single worker that runs through a linearized, topologically sorted version of it. There is no reason that a graph with a number of disconnected parts cannot run each separate section in parallel with each other.

g2

Graph g2 with the red arrows again showing the execution order of the graph. Please note that this graph is composed of two disconnected parts: one diamond on the left and one triplet on the right, both of which can run in parallel. Additionally, nodes 2a and 2b can run in parallel only after 1a has run, and node 3a requires the entire left diamond to have succeeded before it can execute.

Typically, some nodes will have a common dependency, which once met will allow its children to all execute simultaneously.

This is the first major design improvement that the mgmt tool implements. It has obvious improvements for performance, in that a long running process in one part of the graph (eg: a package installation) will cause no delay on a separate disconnected part of the graph which is in the process of converging unrelated code. It also has other benefits which we will discuss below.

In practice this is particularly powerful since most servers under configuration management typically combine different modules, many of which have no inter-dependencies.

An example is the best way to show off this feature. Here we have a set of four long (10 second) running processes or exec resources. Three of them form a linear dependency chain, while the fourth has no dependencies or prerequisites. I’ve asked the system to exit after it has been converged for five seconds. As you can see in the example, it is finished five seconds after the limiting resource should be completed, which is the longest running delay to complete in the whole process. That limiting chain took 30 seconds, which we can see in the log as being from three 10 second executions. The logical total of about 35 seconds as expected is shown at the end:

$ time ./mgmt run --file graph8.yaml --converged-timeout=5 --graphviz=example1.dot
22:55:04 This is: mgmt, version: 0.0.1-29-gebc1c60
22:55:04 Main: Start: 1452398104100455639
22:55:04 Main: Running...
22:55:04 Graph: Vertices(4), Edges(2)
22:55:04 Graphviz: Successfully generated graph!
22:55:04 State: graphStarting
22:55:04 State: graphStarted
22:55:04 Exec[exec4]: Apply //exec4 start
22:55:04 Exec[exec1]: Apply //exec1 start
22:55:14 Exec[exec4]: Command output is empty! //exec4 end
22:55:14 Exec[exec1]: Command output is empty! //exec1 end
22:55:14 Exec[exec2]: Apply //exec2 start
22:55:24 Exec[exec2]: Command output is empty! //exec2 end
22:55:24 Exec[exec3]: Apply //exec3 start
22:55:34 Exec[exec3]: Command output is empty! //exec3 end
22:55:39 Converged for 5 seconds, exiting! //converged for 5s
22:55:39 Interrupted by exit signal
22:55:39 Exec[exec4]: Exited
22:55:39 Exec[exec1]: Exited
22:55:39 Exec[exec2]: Exited
22:55:39 Exec[exec3]: Exited
22:55:39 Goodbye!

real    0m35.009s
user    0m0.008s
sys     0m0.008s
$

Note that I’ve edited the example slightly to remove some unnecessary log entries for readability sake, and I have also added some comments and emphasis, but aside from that, this is actual output! The tool also generated graphviz output which may help you better understand the problem:

 

example1.dot

This example is obviously contrived, but is designed to illustrate the capability of the mgmt tool.

Hopefully you’ll be able to come up with more practical examples.

2) Event driven

All configuration management systems have some notion of idempotence. Put simply, an idempotent operation is one which can be applied multiple times without causing the result to diverge from the desired state. In practice, each individual resource will typically check the state of the element, and if different than what was requested, it will then apply the necessary transformation so that the element is brought to the desired state.

The current generation of configuration management tools, typically checks the state of each element once every 30 minutes. Some do it more or less often, and some do it only when manually requested. In all cases, this can be an expensive operation due to the size of the graph, and the cost of each check operation. This problem is compounded by the fact that the graph doesn’t run in parallel.

g3

In this time / state sequence diagram g3, time progresses from left to right. Each of the three elements (from top to bottom) want to converge on states a, b and c respectively. Initially the first two are in states x and y, where as the third is already converged. At t1 the system runs and converges the graph, which entails a state change of the first and second elements. At some time t2, the elements are changed by some external force, and the system is no longer converged. We won’t notice till later! At this time t3 when we run for the second time, we notice that the second and third elements are no longer converged and we apply the necessary operations to fix this. An unknown amount of time passed where our cluster was in a diverged or unhealthy state. Traditionally this is on the order of 30 minutes.

More importantly, if something diverges from the requested state you might wait 30 minutes before it is noticed and repaired by the system!

The mgmt system is unique, because I realized that an event based system could fulfill the same desired behaviour, and in fact it offers a more general and powerful solution. This is the second major design improvement that the mgmt tool implements.

These events that we’re talking about are inotify events for file changes, systemd events (from dbus) for service changes, packagekit events (from dbus again) for package change events, and events from exec calls, timers, network operations and more! In the inotify example, on first run of the mgmt system, an inotify watch is taken on the file we want to manage, the state is checked and it is converged if need be. We don’t ever need to check the state again unless inotify tells us that something happened!

g4

In this time / state sequence diagram g4, time progresses from left to right. After the initial run, since all the elements are being continuously monitored, the instant something changes, mgmt reacts and fixes it almost instantly.

Astute config mgmt hackers might end up realizing three interesting consequences:

  1. If we don’t want the mgmt program to continuously monitor events, it can be told to exit after the graph converges, and run again 30 minutes later. This can be done with my system by running it from cron with the --converged-timeout=1, flag. This effectively offers the same behaviour that current generation systems do for the administrators who do not want to experiment with a newer model. Thus, the current systems are a special, simplified model of mgmt!
  2. It is possible that some resources don’t offer an event watching mechanism. In those instances, a fallback to polling is possible for that specific resource. Although there currently aren’t any missing event APIs that your narrator knows about at this time.
  3. A monitoring system (read: nagios and friends) could probably be built with this architecture, thus demonstrating that my world view of configuration management is actually a generalized version of system monitoring! They’re the same discipline!

Here is a small real-world example to demonstrate this feature. I have started the agent, and I have told it to create three files (f1, f2, f3) with the contents seen below, and also, to ensure that file f4 is not present. As you can see the mgmt system responds quite quickly:

james@computer:/tmp/mgmt$ ls
f1  f2  f3
james@computer:/tmp/mgmt$ cat *
i am f1
i am f2
i am f3
james@computer:/tmp/mgmt$ rm -f f2 && cat f2
i am f2
james@computer:/tmp/mgmt$ echo blah blah > f2 && cat f2
i am f2
james@computer:/tmp/mgmt$ touch f4 && file f4
f4: cannot open `f4' (No such file or directory)
james@computer:/tmp/mgmt$ ls
f1  f2  f3
james@computer:/tmp/mgmt$

That’s fast!

3) Distributed topology

All software typically runs with some sort of topology. Puppet and Chef normally run in a client server topology, where you typically have one server with many clients, each running an agent. They also both offer a standalone mode, but in general this is not more interesting than running a fancy bash script. In this context, I define interesting as “relating to clustered, multiple machine architectures”.

g5

Here in graph g5 you can see one server which has three clients initiating requests to it.

This traditional model of computing is well-known, and fairly easy to reason about. You typically put all of your code in one place (on the server) and the clients or agents need very little personalized configuration to get working. However, it can suffer from performance and scalability issues, and it can also be a single point of failure for your infrastructure. Make no mistake: if you manage your infrastructure properly, then when your configuration management infrastructure is down, you will be unable to bring up new machines or modify existing ones! This can be a disastrous type of failure, and is one which is seldom planned for in disaster recovery scenarios!

Other systems such as Ansible are actually orchestrators, and are not technically configuration management in my opinion. That doesn’t mean they don’t share much of the same problem space, and in fact they are usually idempotent and share many of the same properties of traditional configuration management systems. They are useful and important tools!

graph6The key difference about an orchestrator, is that it typically operates with a push model, where the server (or the sysadmin laptop) initiates a connection to the machines that it wants to manage. One advantage is that this is sometimes very easy to reason about for multi machine architectures, however it shares the usual downsides around having a single point of failure. Additionally there are some very real performance considerations when running large clusters of machines. In practice these clusters are typically segmented or divided in some logical manner so as to lessen the impact of this, which unfortunately detracts from the aforementioned simplicity of the solution.

Unfortunately with either of these two topologies, we can’t immediately detect when an issue has occurred and respond immediately without sufficiently advanced third party monitoring. By itself, a machine that is being managed by orchestration, cannot detect an issue and communicate back to its operator, or tell the cluster of servers it peers with to react accordingly.

The good news about current and future generation topologies is that algorithms such as the Paxos family and Raft are now gaining wider acceptance and good implementations now exist as Free Software. Mgmt depends on these algorithms to create a mesh of agents. There are no clients and servers, only peers! Each peer can choose to both export and collect data from a distributed data store which lives as part of the cluster of peers. The software that currently implements this data store is a marvellous piece of engineering called etcd.

graph7

In graph g7, you can see what a fully interconnected graph topology might look like. It should be clear that the numbed of connections (or edges) is quite large. Try and work out the number of edges required for a fully connected graph with 128 nodes. Hint, it’s large!

In practice the number of connections required for each peer to connect to each other peer would be too great, so instead the cluster first achieves distributed consensus, and then the elected leader picks a certain number of machines to run etcd masters. All other agents then connect through one of these masters. The distributed data store can easily handle failures, and agents can reconnect seamlessly to a different temporary master should they need to. If there is a lack or an abundance of transient masters, then the cluster promotes or demotes an agent automatically by asking it to start or stop an etcd process on its host.

g8

In graph g8, you can see a tightly interconnected centre of nodes running both their configuration management tasks, but also etcd masters. Each additional peer picks any of them to connect to. As the number of nodes scale, it is far easier to scale such a cluster. Future algorithm designs and optimizations should help this system scale to unprecedented host counts. It should go without saying that it would be wise to ensure that the nodes running etcd masters are in different failure domains.

By allowing hosts to export and collect data from the distributed store, we actually end up with a mechanism that is quite similar to what Puppet calls exported resources. In my opinion, the mechanism and data interchange is actually a brilliant idea, but with some obvious shortcomings in its implementation. This is because for a cluster of N nodes, each wishing to exchange data with one another, puppet must run N times (once on each node) and then N-1 times for the entire cluster to see all of the exchanged data. Each of these runs requires an entire sequential run through every resource, and an expensive check of each resource, each time.

In contrast, with mgmt, the graph is redrawn only when an etcd event notifies us of a change in the data store, and when the new graph is applied, only members who are affected either by a change in definition or dependency need to be re-run. In practice this means that a small cluster where the resources themselves have a negligible apply time, can converge a complete connected exchange of data in less than one second.

An example demonstrates this best.

  • I have three nodes in the system: A, B, C.
  • Each creates four files, two of which it will export.
  • On host A, the two files are: /tmp/mgmtA/f1a and /tmp/mgmtA/f2a.
  • On host A, it exports: /tmp/mgmtA/f3a and /tmp/mgmtA/f4a.
  • On host A, it collects all available (exported) files into: /tmp/mgmtA/
  • It is done similarly with B and C, except with the letters B and C substituted in to the emphasized locations above.
  • For demonstration purposes, I startup the mgmt engine first on A, then B, and finally C, all the while running various terminal commands to keep you up-to-date.

As before, I’ve trimmed the logs and annotated the output for clarity:

james@computer:/tmp$ rm -rf /tmp/mgmt* # clean out everything
james@computer:/tmp$ mkdir /tmp/mgmt{A..C} # make the example dirs
james@computer:/tmp$ tree /tmp/mgmt* # they're indeed empty
/tmp/mgmtA
/tmp/mgmtB
/tmp/mgmtC

0 directories, 0 files
james@computer:/tmp$ # run node A, it converges almost instantly
james@computer:/tmp$ tree /tmp/mgmt*
/tmp/mgmtA
├── f1a
├── f2a
├── f3a
└── f4a
/tmp/mgmtB
/tmp/mgmtC

0 directories, 4 files
james@computer:/tmp$ # run node B, it converges almost instantly
james@computer:/tmp$ tree /tmp/mgmt*
/tmp/mgmtA
├── f1a
├── f2a
├── f3a
├── f3b
├── f4a
└── f4b
/tmp/mgmtB
├── f1b
├── f2b
├── f3a
├── f3b
├── f4a
└── f4b
/tmp/mgmtC

0 directories, 12 files
james@computer:/tmp$ # run node C, exit 5 sec after converged, output:
james@computer:/tmp$ time ./mgmt run --file examples/graph3c.yaml --hostname c --converged-timeout=5
01:52:33 main.go:65: This is: mgmt, version: 0.0.1-29-gebc1c60
01:52:33 main.go:66: Main: Start: 1452408753004161269
01:52:33 main.go:203: Main: Running...
01:52:33 main.go:103: Etcd: Starting...
01:52:33 config.go:175: Collect: file; Pattern: /tmp/mgmtC/
01:52:33 main.go:148: Graph: Vertices(8), Edges(0)
01:52:38 main.go:192: Converged for 5 seconds, exiting!
01:52:38 main.go:56: Interrupted by exit signal
01:52:38 main.go:219: Goodbye!

real    0m5.084s
user    0m0.034s
sys    0m0.031s
james@computer:/tmp$ tree /tmp/mgmt*
/tmp/mgmtA
├── f1a
├── f2a
├── f3a
├── f3b
├── f3c
├── f4a
├── f4b
└── f4c
/tmp/mgmtB
├── f1b
├── f2b
├── f3a
├── f3b
├── f3c
├── f4a
├── f4b
└── f4c
/tmp/mgmtC
├── f1c
├── f2c
├── f3a
├── f3b
├── f3c
├── f4a
├── f4b
└── f4c

0 directories, 24 files
james@computer:/tmp$

Amazingly, the cluster converges in less than one second. Admittedly it didn’t have large amounts of IO to do, but since those are fixed constants, it still shows how fast this approach should be. Feel free to do your own tests to verify.

Code

The code is publicly available and has been for some time. I wanted to release it early, but I didn’t want to blog about it until I felt I had the initial design triad completed. It is written entirely in golang, which I felt was a good match for the design requirements that I had. It is my first major public golang project, so I’m certain there are many things I could be doing better. As a result, I welcome your criticism and patches, just please try and keep them constructive and respectful! The project is entirely Free Software, and I plan to keep it that way. As long as Red Hat is involved, I’m pretty sure you won’t have to deal with any open core compromises!

Community

There’s an IRC channel going. It’s #mgmtconfig on Freenode. Please come hangout! If we get bigger, we’ll add a mailing list.

Caveats

There are a few caveats that I’d like to mention. Please try to keep these in mind.

  • This is still an early prototype, and as a result isn’t ready for production use, or as a replacement for existing config management software. If you like the design, please contribute so that together we can turn this into a mature system faster.
  • There are some portions of the code which are notably absent. In particular, there is no lexer or parser, nor is there a design for what the graph building language (DSL) would look like. This is because I am not a specialist in these areas, and as a result, while I have some ideas for the design, I don’t have any useful code yet. For testing the engine, there is a (quickly designed) YAML graph definition parser available in the code today.
  • The etcd master election/promotion portion of the code is not yet available. Please stay tuned!
  • There is no design document, roadmap or useful documentation currently available. I’ll be working to remedy this, but I first wanted to announce the project, gauge interest and get some intial feedback. Hopefully others can contribute to the docs, and I’ll try to write about my future design ideas as soon as possible.
  • The name mgmt was the best that I could come up with. If you can propose a better alternative, I’m open to the possibility.
  • I work for Red Hat, and at first it might seem confusing to announce this work alongside our existing well-known Puppet and Ansible teams. To clarify, this is a prototype of some work and designs that I started before I was employed at Red Hat. I’m grateful that they’ve been letting me work on interesting projects, and I’m very pleased that my managers have had the vision to invest in future technologies and projects that (I hope) might one day become the de facto standard.

Presentations

It is with great honour, that my first public talk about this project will be at Config Management Camp 2016. I am thrilled to be speaking at such an excellent conference, where I am sure the subject matter will be a great fit for all the respected domain experts who will be present. Find me on the schedule, and please come to my session.

I’m also fortunate enough to be speaking about the same topic, just four days later in Brno, at DevConf.CZ. It’s a free conference, in an excellent city, and you’ll be sure to find many excellent technical sessions and hackers!

I hope to see you at one of these events or at a future conference. If you’d like to have me speak at your conference or event, please contact me!

Conclusion

Thank you for reading this far! I hope you appreciate my work, and I promise to tell you more about some of the novel designs and properties that I have planned for the future of mgmt. Please leave me your comments, even if they’re just +1’s.

Happy hacking!

James

 

Post scriptum

There seems to be a new trend about calling certain types of management systems or designs “choreography”. Since this term is sufficiently overloaded and without a clear definition, I choose to avoid it, however it’s worth mentioning that some of the ideas from some of the definitions of this word as pertaining to the configuration management field match what I’m trying to do with this design. Instead of using the term “choreography”, I prefer to refer to what I’m doing as “configuration management”.

Some early peer reviews suggested this might be a “puppet-killer”. In fact, I actually see it as an opportunity to engage with the puppet community and to share my design and engine, which I hope some will see as a more novel solution. Existing puppet code could be fed through a cross compiler to output a graph that actually runs on my engine. While I plan to offer an easier to use and more powerful DSL language, the 80% of existing puppet code isn’t more than plumbing, package installation, and simple templating, so a gradual migration would be possible, where the multi-system configuration management parts are implemented using my new patterns instead of with slowly converging puppet. The same things could probably also be done with other languages like Chef. Anyone from Puppet Labs, Chef Software Inc., or the greater hacker community is welcome to contact me personally if they’d like to work on this.

Lastly, but most importantly, thanks to everyone who has discussed these ideas with me, reviewed this article, and contributed in so many ways. You’re awesome!

[py]inotify, polling, gtk and gio

i have this software with a gtk mainloop, using dbus and all that fun stuff that seems to play together nicely. i know about the kernel inotify support, and i wanted it to get integrated with that above stack. i thought i was supposed to do it with pyinotify and io_add_watch, but on closer inspection into the pyinotify code it turns out that it seems to actually use polling! (search for: select.poll)

thinking i was confused, i emailed a friend to see if he could confirm my suspicions. we both weren’t 100% sure, a little searching later i was convinced when i found this blog posting. i’m surprised i didn’t find out about this sooner. in any case, my application seems to be happy now.

as a random side effect, it seems that when a file is written, i still see the G_FILE_MONITOR_EVENT_ATTRIBUTE_CHANGED *after* the G_FILE_MONITOR_EVENT_CHANGES_DONE_HINT event, which i would have expected to always come last. maybe this is a bug, or maybe this is something magical that $EDITOR is doing- in any case it doesn’t affect me, i just wasn’t sure if it was a bug or not. to make it harder, different editors save the file in different ways. gedit seems to first delete the file, and then create it again– or at least from what i see in the gio debug.

the code i’m using to test all this is:

#!/usr/bin/python
import gtk
import gio
count = 0
def file_changed(monitor, file, unknown, event):
  global count
  print 'debug: %d' % count
  count = count + 1
  print 'm: %s' % monitor
  print 'f: %s' % file
  print 'u: %s' % unknown
  print 'e: %s' % event
  if event == gio.FILE_MONITOR_EVENT_CHANGES_DONE_HINT:
    print "file finished changing"
  print '#'*79
  print 'n'
myfile = gio.File('/tmp/gio')
monitor = myfile.monitor_file()
monitor.connect("changed", file_changed)
gtk.main()

(very similar to the aforementioned blog post)

and if you want to see how i’m using it in the particular app, the latest version of evanescent is now available. (look inside of evanescent-client.py)