Next generation configuration mgmt

It’s no secret to the readers of this blog that I’ve been active in the configuration management space for some time. I owe most of my knowledge to what I’ve learned while working with Puppet and from other hackers working in and around various other communities.

I’ve published a number of articles in an attempt to push the field forwards, and to share the knowledge that I’ve learned with others. I’ve spent many nights thinking about these problems, but it is not without some chagrin that I realized that the current state-of-the-art in configuration management cannot easily (or elegantly) solve all the problems for which I wish to write solutions.

To that end, I’d like to formally present my idea (and code) for a next generation configuration management prototype. I’m calling my tool mgmt.

Design triad

Mgmt has three unique design elements which differentiate it from other tools. I’ll try to cover these three points, and explain why they’re important. The summary:

  1. Parallel execution, to run all the resources concurrently (where possible)
  2. Event driven, to monitor and react dynamically only to changes (when needed)
  3. Distributed topology, so that scale and centralization problems are replaced with a robust distributed system

The code is available, but you may prefer to read on as I dive deeper into each of these elements first.

1) Parallel execution

Fundamentally, all configuration management systems represent the dependency relationships between their resources in a graph, typically one that is directed and acyclic.

Directed acyclic graph g1, showing the dependency relationships with black arrows, and the linearized dependency sort order (a topological sort) with red arrows.

Unfortunately, the execution of this graph typically has a single worker that runs through a linearized, topologically sorted version of it. There is no reason that a graph with a number of disconnected parts cannot run each separate section in parallel with each other.

g2

Graph g2 with the red arrows again showing the execution order of the graph. Please note that this graph is composed of two disconnected parts: one diamond on the left and one triplet on the right, both of which can run in parallel. Additionally, nodes 2a and 2b can run in parallel only after 1a has run, and node 3a requires the entire left diamond to have succeeded before it can execute.

Typically, several nodes will share a common dependency which, once met, allows all of its children to execute simultaneously.

This is the first major design improvement that the mgmt tool implements. It offers obvious performance benefits, in that a long-running process in one part of the graph (eg: a package installation) causes no delay to a separate, disconnected part of the graph which is busy converging unrelated code. It also has other benefits which we will discuss below.

In practice this is particularly powerful since most servers under configuration management typically combine different modules, many of which have no inter-dependencies.
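
Before we get to a real example below, here is a minimal, hypothetical sketch in golang (the language mgmt is written in) of the dependency-driven scheduling idea. This is not mgmt’s actual scheduler, just an illustration of why an independent resource such as exec4 adds nothing to the total wall-clock time: each vertex waits only for its parents and then runs, so disconnected parts of the graph proceed concurrently.

// Minimal sketch of dependency-driven parallel execution; not mgmt's
// actual scheduler. Each vertex waits for its parents to finish, then
// runs; disconnected parts of the graph proceed independently.
package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    // deps maps each vertex to the vertices it depends on.
    deps := map[string][]string{
        "exec1": {},
        "exec2": {"exec1"},
        "exec3": {"exec2"},
        "exec4": {}, // no dependencies: runs alongside the chain
    }

    // one "done" channel per vertex, closed when that vertex finishes
    done := make(map[string]chan struct{})
    for v := range deps {
        done[v] = make(chan struct{})
    }

    var wg sync.WaitGroup
    for v, parents := range deps {
        wg.Add(1)
        go func(v string, parents []string) {
            defer wg.Done()
            for _, p := range parents {
                <-done[p] // block until each parent has converged
            }
            fmt.Println("start", v)
            time.Sleep(10 * time.Second) // stand-in for the real work
            fmt.Println("done ", v)
            close(done[v]) // signal any children
        }(v, parents)
    }
    wg.Wait() // the whole graph has converged
}

With three chained ten second vertices and one independent one, this sketch finishes in roughly 30 seconds of wall-clock time instead of 40, which is exactly the effect visible in the example log below.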

An example is the best way to show off this feature. Here we have a set of four long-running (10 second) processes or exec resources. Three of them form a linear dependency chain, while the fourth has no dependencies or prerequisites. I’ve asked the system to exit after it has been converged for five seconds. As you can see in the example, it finishes five seconds after the limiting resource completes, that resource being the end of the longest-running chain in the whole process. That chain took 30 seconds, which we can see in the log as three 10-second executions. The expected logical total of about 35 seconds is shown at the end:

$ time ./mgmt run --file graph8.yaml --converged-timeout=5 --graphviz=example1.dot
22:55:04 This is: mgmt, version: 0.0.1-29-gebc1c60
22:55:04 Main: Start: 1452398104100455639
22:55:04 Main: Running...
22:55:04 Graph: Vertices(4), Edges(2)
22:55:04 Graphviz: Successfully generated graph!
22:55:04 State: graphStarting
22:55:04 State: graphStarted
22:55:04 Exec[exec4]: Apply //exec4 start
22:55:04 Exec[exec1]: Apply //exec1 start
22:55:14 Exec[exec4]: Command output is empty! //exec4 end
22:55:14 Exec[exec1]: Command output is empty! //exec1 end
22:55:14 Exec[exec2]: Apply //exec2 start
22:55:24 Exec[exec2]: Command output is empty! //exec2 end
22:55:24 Exec[exec3]: Apply //exec3 start
22:55:34 Exec[exec3]: Command output is empty! //exec3 end
22:55:39 Converged for 5 seconds, exiting! //converged for 5s
22:55:39 Interrupted by exit signal
22:55:39 Exec[exec4]: Exited
22:55:39 Exec[exec1]: Exited
22:55:39 Exec[exec2]: Exited
22:55:39 Exec[exec3]: Exited
22:55:39 Goodbye!

real    0m35.009s
user    0m0.008s
sys     0m0.008s
$

Note that I’ve edited the example slightly to remove some unnecessary log entries for readability’s sake, and I have also added some comments and emphasis, but aside from that, this is actual output! The tool also generated graphviz output, which may help you better understand the problem:

 

example1.dot

This example is obviously contrived, but is designed to illustrate the capability of the mgmt tool.

Hopefully you’ll be able to come up with more practical examples.

2) Event driven

All configuration management systems have some notion of idempotence. Put simply, an idempotent operation is one which can be applied multiple times without causing the result to diverge from the desired state. In practice, each individual resource will typically check the state of the element, and if different than what was requested, it will then apply the necessary transformation so that the element is brought to the desired state.
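
As a rough illustration of that check-then-apply pattern, here is a generic idempotent file resource sketched in golang. This is not mgmt’s resource API; it is just a sketch of the idea, and the path and content used are invented for the example.

// Illustrative only: a generic "check, then apply" idempotent resource.
package main

import (
    "fmt"
    "os"
)

// ensureFileContent makes sure path contains exactly want. Running it
// any number of times leaves the system in the same converged state.
func ensureFileContent(path, want string) (changed bool, err error) {
    current, err := os.ReadFile(path)
    if err == nil && string(current) == want {
        return false, nil // already converged: nothing to do
    }
    // State differs (or the file is missing): apply the transformation.
    if err := os.WriteFile(path, []byte(want), 0644); err != nil {
        return false, err
    }
    return true, nil
}

func main() {
    changed, err := ensureFileContent("/tmp/mgmt/f1", "i am f1\n")
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    fmt.Println("changed:", changed) // false on every run after the first
}

The second and every subsequent run reports changed: false, which is the idempotence property in action.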

The current generation of configuration management tools typically checks the state of each element once every 30 minutes. Some do it more or less often, and some do it only when manually requested. In all cases, this can be an expensive operation due to the size of the graph and the cost of each check operation. This problem is compounded by the fact that the graph doesn’t run in parallel.

g3

In this time / state sequence diagram g3, time progresses from left to right. Each of the three elements (from top to bottom) wants to converge on states a, b and c respectively. Initially the first two are in states x and y, whereas the third is already converged. At t1 the system runs and converges the graph, which entails a state change of the first and second elements. At some time t2, the elements are changed by some external force, and the system is no longer converged. We won’t notice until later! At time t3, when we run for the second time, we notice that the second and third elements are no longer converged, and we apply the necessary operations to fix this. An unknown amount of time passed during which our cluster was in a diverged or unhealthy state. Traditionally this is on the order of 30 minutes.

More importantly, if something diverges from the requested state you might wait 30 minutes before it is noticed and repaired by the system!

The mgmt system is unique, because I realized that an event based system could fulfill the same desired behaviour, and in fact it offers a more general and powerful solution. This is the second major design improvement that the mgmt tool implements.

These events that we’re talking about are inotify events for file changes, systemd events (from dbus) for service changes, packagekit events (from dbus again) for package change events, and events from exec calls, timers, network operations and more! In the inotify example, on first run of the mgmt system, an inotify watch is taken on the file we want to manage, the state is checked and it is converged if need be. We don’t ever need to check the state again unless inotify tells us that something happened!
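
To illustrate the pattern, here is a sketch of the watch-then-react loop using the third-party fsnotify library (github.com/fsnotify/fsnotify); mgmt’s own inotify handling may differ in its details, and the paths used are just the ones from the small demo further below.

// A minimal sketch of event-driven convergence: converge once, then
// only re-check when the kernel reports that the watched path changed.
package main

import (
    "log"

    "github.com/fsnotify/fsnotify"
)

func main() {
    const path = "/tmp/mgmt/f1"
    converge := func() {
        // check the state and apply a fix if needed (see the previous sketch)
        log.Println("converging", path)
    }

    watcher, err := fsnotify.NewWatcher()
    if err != nil {
        log.Fatal(err)
    }
    defer watcher.Close()
    // watch the parent directory so deletes and re-creates are seen too
    if err := watcher.Add("/tmp/mgmt"); err != nil {
        log.Fatal(err)
    }

    converge() // initial run: bring the file to the desired state

    for {
        select {
        case event := <-watcher.Events:
            if event.Name == path { // something touched our file
                converge() // re-check and repair almost instantly
            }
        case err := <-watcher.Errors:
            log.Println("watch error:", err)
        }
    }
}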

g4

In this time / state sequence diagram g4, time progresses from left to right. After the initial run, since all the elements are being continuously monitored, the instant something changes, mgmt reacts and fixes it almost instantly.

Astute config mgmt hackers might end up realizing three interesting consequences:

  1. If we don’t want the mgmt program to continuously monitor events, it can be told to exit after the graph converges, and run again 30 minutes later. This can be done with my system by running it from cron with the --converged-timeout=1 flag. This effectively offers the same behaviour that current generation systems do, for the administrators who do not want to experiment with a newer model. Thus, the current systems are a special, simplified model of mgmt! (A rough sketch of the convergence-detection idea follows this list.)
  2. It is possible that some resources don’t offer an event watching mechanism. In those instances, a fallback to polling is possible for that specific resource, although your narrator doesn’t currently know of any missing event APIs.
  3. A monitoring system (read: nagios and friends) could probably be built with this architecture, thus demonstrating that my world view of configuration management is actually a generalized version of system monitoring! They’re the same discipline!
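
As mentioned in the first point above, here is a rough sketch of how such a convergence timeout can be detected: reset a timer every time a resource reports activity, and exit once the timer is allowed to fire. Again, this is illustrative and not mgmt’s actual implementation; the simulated resources and the five second value are just for the example.

// Sketch of a "converged timeout": exit after N seconds with no events.
package main

import (
    "fmt"
    "time"
)

func main() {
    const convergedTimeout = 5 * time.Second

    activity := make(chan string)

    // Simulated resources reporting that they did some work.
    go func() {
        for _, r := range []string{"exec1", "exec2", "exec3"} {
            time.Sleep(1 * time.Second)
            activity <- r
        }
    }()

    timer := time.NewTimer(convergedTimeout)
    for {
        select {
        case r := <-activity:
            fmt.Println("activity from", r, "- resetting convergence timer")
            if !timer.Stop() {
                <-timer.C // drain the timer if it already fired
            }
            timer.Reset(convergedTimeout)
        case <-timer.C:
            fmt.Printf("Converged for %v, exiting!\n", convergedTimeout)
            return
        }
    }
}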

Here is a small real-world example to demonstrate this feature. I have started the agent, and I have told it to create three files (f1, f2, f3) with the contents seen below, and also to ensure that file f4 is not present. As you can see, the mgmt system responds quite quickly:

james@computer:/tmp/mgmt$ ls
f1  f2  f3
james@computer:/tmp/mgmt$ cat *
i am f1
i am f2
i am f3
james@computer:/tmp/mgmt$ rm -f f2 && cat f2
i am f2
james@computer:/tmp/mgmt$ echo blah blah > f2 && cat f2
i am f2
james@computer:/tmp/mgmt$ touch f4 && file f4
f4: cannot open `f4' (No such file or directory)
james@computer:/tmp/mgmt$ ls
f1  f2  f3
james@computer:/tmp/mgmt$

That’s fast!

3) Distributed topology

All software typically runs with some sort of topology. Puppet and Chef normally run in a client server topology, where you typically have one server with many clients, each running an agent. They also both offer a standalone mode, but in general this is not more interesting than running a fancy bash script. In this context, I define interesting as “relating to clustered, multiple machine architectures”.

g5

Here in graph g5 you can see one server which has three clients initiating requests to it.

This traditional model of computing is well-known, and fairly easy to reason about. You typically put all of your code in one place (on the server) and the clients or agents need very little personalized configuration to get working. However, it can suffer from performance and scalability issues, and it can also be a single point of failure for your infrastructure. Make no mistake: if you manage your infrastructure properly, then when your configuration management infrastructure is down, you will be unable to bring up new machines or modify existing ones! This can be a disastrous type of failure, and is one which is seldom planned for in disaster recovery scenarios!

Other systems such as Ansible are actually orchestrators, and are not technically configuration management in my opinion. That doesn’t mean they don’t share much of the same problem space, and in fact they are usually idempotent and share many of the same properties of traditional configuration management systems. They are useful and important tools!

g6

The key difference about an orchestrator is that it typically operates with a push model, where the server (or the sysadmin’s laptop) initiates a connection to the machines that it wants to manage. One advantage is that this is sometimes very easy to reason about for multi-machine architectures; however, it shares the usual downsides around having a single point of failure. Additionally, there are some very real performance considerations when running large clusters of machines. In practice these clusters are typically segmented or divided in some logical manner so as to lessen the impact of this, which unfortunately detracts from the aforementioned simplicity of the solution.

Unfortunately, with either of these two topologies, we can’t detect when an issue has occurred and respond immediately without sufficiently advanced third-party monitoring. By itself, a machine that is being managed by orchestration cannot detect an issue and communicate back to its operator, or tell the cluster of servers it peers with to react accordingly.

The good news about current and future generation topologies is that algorithms such as the Paxos family and Raft are now gaining wider acceptance and good implementations now exist as Free Software. Mgmt depends on these algorithms to create a mesh of agents. There are no clients and servers, only peers! Each peer can choose to both export and collect data from a distributed data store which lives as part of the cluster of peers. The software that currently implements this data store is a marvellous piece of engineering called etcd.

g7

In graph g7, you can see what a fully interconnected graph topology might look like. It should be clear that the number of connections (or edges) is quite large. Try to work out the number of edges required for a fully connected graph with 128 nodes. Hint: it’s large!
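
(For the curious: a complete graph on n nodes has n(n-1)/2 edges, so 128 nodes would need 128 × 127 / 2 = 8128 connections.)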

In practice the number of connections required for each peer to connect to each other peer would be too great, so instead the cluster first achieves distributed consensus, and then the elected leader picks a certain number of machines to run etcd masters. All other agents then connect through one of these masters. The distributed data store can easily handle failures, and agents can reconnect seamlessly to a different temporary master should they need to. If there is a lack or an abundance of transient masters, then the cluster promotes or demotes an agent automatically by asking it to start or stop an etcd process on its host.

g8

In graph g8, you can see a tightly interconnected centre of nodes running both their configuration management tasks and etcd masters. Each additional peer picks any of them to connect to. As the number of nodes grows, it is far easier to scale such a cluster. Future algorithm designs and optimizations should help this system scale to unprecedented host counts. It should go without saying that it would be wise to ensure that the nodes running etcd masters are in different failure domains.

By allowing hosts to export and collect data from the distributed store, we actually end up with a mechanism that is quite similar to what Puppet calls exported resources. In my opinion, the mechanism and data interchange are actually a brilliant idea, but the implementation has some obvious shortcomings. This is because for a cluster of N nodes, each wishing to exchange data with one another, Puppet must run N times (once on each node), and then N-1 more times for the entire cluster to see all of the exchanged data. Each of these runs requires an entire sequential pass through every resource, and an expensive check of each resource, each time.

In contrast, with mgmt, the graph is redrawn only when an etcd event notifies us of a change in the data store, and when the new graph is applied, only the members that are affected by a change in definition or dependency need to be re-run. In practice this means that a small cluster, where the resources themselves have a negligible apply time, can converge a complete connected exchange of data in less than one second.
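
To make the export/collect mechanism concrete, here is a hypothetical sketch using the etcd v3 client. The import path, endpoint, key prefix and values are assumptions made for illustration and are not mgmt’s actual data layout: each host puts the data it exports under a shared prefix, and watches that prefix so that a change from any peer immediately triggers a graph rebuild.

// Illustrative export/collect over etcd; not mgmt's real key layout.
package main

import (
    "context"
    "fmt"
    "log"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"localhost:2379"}, // assumed local etcd
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        log.Fatal(err)
    }
    defer cli.Close()

    ctx := context.Background()

    // "Export": this host publishes the resources it wants to share.
    if _, err := cli.Put(ctx, "/exported/files/hostA/f3a", "i am f3, exported by host A"); err != nil {
        log.Fatal(err)
    }

    // "Collect": watch the shared prefix; every change from any peer is
    // an event that would trigger a graph rebuild on this host.
    for resp := range cli.Watch(ctx, "/exported/files/", clientv3.WithPrefix()) {
        for _, ev := range resp.Events {
            fmt.Printf("%s %q -> rebuild graph, re-run affected resources\n",
                ev.Type, ev.Kv.Key)
        }
    }
}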

An example demonstrates this best.

  • I have three nodes in the system: A, B, C.
  • Each creates four files, two of which it will export.
  • On host A, the two files are: /tmp/mgmtA/f1a and /tmp/mgmtA/f2a.
  • On host A, it exports: /tmp/mgmtA/f3a and /tmp/mgmtA/f4a.
  • On host A, it collects all available (exported) files into: /tmp/mgmtA/
  • The same is done with B and C, except with the letters B and C substituted into the emphasized locations above.
  • For demonstration purposes, I start up the mgmt engine first on A, then on B, and finally on C, all the while running various terminal commands to keep you up-to-date.

As before, I’ve trimmed the logs and annotated the output for clarity:

james@computer:/tmp$ rm -rf /tmp/mgmt* # clean out everything
james@computer:/tmp$ mkdir /tmp/mgmt{A..C} # make the example dirs
james@computer:/tmp$ tree /tmp/mgmt* # they're indeed empty
/tmp/mgmtA
/tmp/mgmtB
/tmp/mgmtC

0 directories, 0 files
james@computer:/tmp$ # run node A, it converges almost instantly
james@computer:/tmp$ tree /tmp/mgmt*
/tmp/mgmtA
├── f1a
├── f2a
├── f3a
└── f4a
/tmp/mgmtB
/tmp/mgmtC

0 directories, 4 files
james@computer:/tmp$ # run node B, it converges almost instantly
james@computer:/tmp$ tree /tmp/mgmt*
/tmp/mgmtA
├── f1a
├── f2a
├── f3a
├── f3b
├── f4a
└── f4b
/tmp/mgmtB
├── f1b
├── f2b
├── f3a
├── f3b
├── f4a
└── f4b
/tmp/mgmtC

0 directories, 12 files
james@computer:/tmp$ # run node C, exit 5 sec after converged, output:
james@computer:/tmp$ time ./mgmt run --file examples/graph3c.yaml --hostname c --converged-timeout=5
01:52:33 main.go:65: This is: mgmt, version: 0.0.1-29-gebc1c60
01:52:33 main.go:66: Main: Start: 1452408753004161269
01:52:33 main.go:203: Main: Running...
01:52:33 main.go:103: Etcd: Starting...
01:52:33 config.go:175: Collect: file; Pattern: /tmp/mgmtC/
01:52:33 main.go:148: Graph: Vertices(8), Edges(0)
01:52:38 main.go:192: Converged for 5 seconds, exiting!
01:52:38 main.go:56: Interrupted by exit signal
01:52:38 main.go:219: Goodbye!

real    0m5.084s
user    0m0.034s
sys    0m0.031s
james@computer:/tmp$ tree /tmp/mgmt*
/tmp/mgmtA
├── f1a
├── f2a
├── f3a
├── f3b
├── f3c
├── f4a
├── f4b
└── f4c
/tmp/mgmtB
├── f1b
├── f2b
├── f3a
├── f3b
├── f3c
├── f4a
├── f4b
└── f4c
/tmp/mgmtC
├── f1c
├── f2c
├── f3a
├── f3b
├── f3c
├── f4a
├── f4b
└── f4c

0 directories, 24 files
james@computer:/tmp$

Amazingly, the cluster converges in less than one second. Admittedly it didn’t have large amounts of IO to do, but since those are fixed constants, it still shows how fast this approach should be. Feel free to do your own tests to verify.

Code

The code is publicly available and has been for some time. I wanted to release it early, but I didn’t want to blog about it until I felt I had the initial design triad completed. It is written entirely in golang, which I felt was a good match for the design requirements that I had. It is my first major public golang project, so I’m certain there are many things I could be doing better. As a result, I welcome your criticism and patches, just please try and keep them constructive and respectful! The project is entirely Free Software, and I plan to keep it that way. As long as Red Hat is involved, I’m pretty sure you won’t have to deal with any open core compromises!

Community

There’s an IRC channel going. It’s #mgmtconfig on Freenode. Please come hang out! If we get bigger, we’ll add a mailing list.

Caveats

There are a few caveats that I’d like to mention. Please try to keep these in mind.

  • This is still an early prototype, and as a result isn’t ready for production use, or as a replacement for existing config management software. If you like the design, please contribute so that together we can turn this into a mature system faster.
  • There are some portions of the code which are notably absent. In particular, there is no lexer or parser, nor is there a design for what the graph building language (DSL) would look like. This is because I am not a specialist in these areas, and as a result, while I have some ideas for the design, I don’t have any useful code yet. For testing the engine, there is a (quickly designed) YAML graph definition parser available in the code today.
  • The etcd master election/promotion portion of the code is not yet available. Please stay tuned!
  • There is no design document, roadmap or useful documentation currently available. I’ll be working to remedy this, but I first wanted to announce the project, gauge interest and get some initial feedback. Hopefully others can contribute to the docs, and I’ll try to write about my future design ideas as soon as possible.
  • The name mgmt was the best that I could come up with. If you can propose a better alternative, I’m open to the possibility.
  • I work for Red Hat, and at first it might seem confusing to announce this work alongside our existing well-known Puppet and Ansible teams. To clarify, this is a prototype of some work and designs that I started before I was employed at Red Hat. I’m grateful that they’ve been letting me work on interesting projects, and I’m very pleased that my managers have had the vision to invest in future technologies and projects that (I hope) might one day become the de facto standard.

Presentations

It is with great honour that my first public talk about this project will be at Config Management Camp 2016. I am thrilled to be speaking at such an excellent conference, where I am sure the subject matter will be a great fit for all the respected domain experts who will be present. Find me on the schedule, and please come to my session.

I’m also fortunate enough to be speaking about the same topic, just four days later in Brno, at DevConf.CZ. It’s a free conference, in an excellent city, and you’ll be sure to find many excellent technical sessions and hackers!

I hope to see you at one of these events or at a future conference. If you’d like to have me speak at your conference or event, please contact me!

Conclusion

Thank you for reading this far! I hope you appreciate my work, and I promise to tell you more about some of the novel designs and properties that I have planned for the future of mgmt. Please leave me your comments, even if they’re just +1’s.

Happy hacking!

James

 

Post scriptum

There seems to be a new trend of calling certain types of management systems or designs “choreography”. Since this term is sufficiently overloaded and without a clear definition, I choose to avoid it; however, it’s worth mentioning that some of the ideas behind some definitions of this word, as they pertain to the configuration management field, match what I’m trying to do with this design. Instead of using the term “choreography”, I prefer to refer to what I’m doing as “configuration management”.

Some early peer reviews suggested this might be a “puppet-killer”. In fact, I actually see it as an opportunity to engage with the puppet community and to share my design and engine, which I hope some will see as a more novel solution. Existing puppet code could be fed through a cross compiler to output a graph that actually runs on my engine. While I plan to offer an easier-to-use and more powerful DSL, roughly 80% of existing puppet code is little more than plumbing, package installation, and simple templating, so a gradual migration would be possible, where the multi-system configuration management parts are implemented using my new patterns instead of with slowly converging puppet. The same things could probably also be done with other languages like Chef. Anyone from Puppet Labs, Chef Software Inc., or the greater hacker community is welcome to contact me personally if they’d like to work on this.

Lastly, but most importantly, thanks to everyone who has discussed these ideas with me, reviewed this article, and contributed in so many ways. You’re awesome!

42 thoughts on “Next generation configuration mgmt”

  1. Hi James!

    Thanks for putting this all together. I’ll definitely attend your talk in Gent! I agree with you that nowadays configuration management has somehow reached its limits and we need a fully new design to support us for the next 5-10 years.

    I’ll hold an ignite talk at Config Management Camp, namely “the three legs of modern configuration management (…or maybe four)”. I’d be glad if you came and heard my <5m talk, as you seem to have addressed the three legs pretty well and I'm interested in understanding if mgmt can address the fourth, too :-)

    Ciao
    — bronto

  2. Reblogged this on A sysadmin's logbook and commented:
    I’ll be giving an ignite talk at Config Management Camp this year: “the three legs of modern configuration management (…or maybe four)”. James has definitely made a big step in that direction. I recommend you read his blog post and, if you are coming to the conference, attend both his talk and my ignite talk. Who knows, maybe you are really watching the dawn of the next generation of configuration management!!!

  3. This is excellent! I like that etcd is included as part of the solution. Puppet’s exported resources are far from ideal for run time configuration and I was eventually going to have to add etcd anyways to solve those pains.

    I’ll be interested to see how the DSL implementation comes about. I like how Ansible is simply YAML but I also like the flexibility of Puppet but dislike the complexity that comes with it and the higher learning curve for new developers. If a sufficiently usable DSL emerges, I will gladly contribute ports for common applications (apache, mysql, etc).

    One question: the system is designed to be distributed, but where are the YAML graphs expected to originate? Stated differently, if I provision new hosts in a cloud, what is the intended delivery and update method for the YAML graphs?

    • @JeremyC: I’m glad to have you as a future contributor!

      Your question is quite good. Just to re-clarify, the current yaml graph definitions are *temporary* so that users can test the engine without writing golang code.
      As for new hosts and graphs (really, whatever DSL code generates the per-host graph), the delivery will probably be some sort of torrent-like mechanism, but more on that later!

      • Erik — I really appreciate the link, thanks! I’m currently trying to figure out what existing DSL to use, or what features I want when I write one. I’m fairly certain we want something declarative, and there are some other design aspects I’ve considered, including FRP, but nothing concrete has happened yet.

        Care to be part of this discussion?

  4. Pingback: Links 18/1/2016: AsteroidOS With GNU, NetworkManager 1.2 | Techrights

  5. Do you have thoughts on a bridge step between existing CI tools? If you moved ohai/hiera data into a gossipy kv store (etcd / consul come to mind) and triggered converges on attributes your CI tool used, you could get some of the benefit of this system without scrapping existing code bases. Or is fixing the time-to-fix by using inode monitoring your favourite sexy part?

    • We’re actually already using etcd– As for your question, it sounds interesting, but I think I’d need a few more specifics or a bit of discussion to understand what you’re getting at better. Any chance you’ll be at FOSDEM/cfgmgmtcamp/devconf.cz ?

  6. Pingback: issue #12: Zabbix, GitLab, Tcpdive, Pact, Grafana, XKCD and many more - Cron Weekly: a weekly newsletter for Linux and Open Source enthusiasts

  7. I always wanted to apply Petri nets as a formalism for concurrent provisioning. There is a tool called Renew that provides a simulator and a modeling tool, and there is the concept of reference nets that allows you to put nets in nets as stateful tokens. With that in mind you can put together a mighty formalism to model such a flow and be able to check for liveness. Then you can add the concept of agents and a platform (it’s called Mulan) and you even have distribution.

    • Hi Felix,

      Great comment about Petri nets– I’ve looked into them a bit, but I’m not an expert in this kind of formalism, although some light research led me to believe that some of this is possible, depending on what kind of DSL we come up with. Would you be interested in helping explore this more?

  8. Hi James,

    Great article outlining what you’re doing, thank you. I’d love it if you were explicit about the license in your project; the copyright is clear, but the license isn’t.

    We’ve used many of the same concepts and took many of the same paths with RackHD (https://github.com/RackHD/) for doing the bare metal/low level hardware automation. It’s an orchestrator and meant to plug into a larger system, but it may be of interest. It also uses a purely event-driven workflow engine, although one more tightly linked to event sources relevant to bare metal management (provisioning, firmware updating, out-of-band management), which we’ve found to be wonderfully effective.

    We took a slightly different path, making our workflow engine fully declarative so that specific orchestrations could be added via API and the system used as parameterized actions, intentionally not doing scheduling or reaching into the software configuration management space. Anyhow, give it a look if that sounds interesting to you.

    A question as well: how does your system work (or, put better, how do you envision it working) for evolving the services and situations under management? I didn’t spot a service description language like Juju or Nomad in there, and at a quick glance I wasn’t sure if you were intending this to manage a single service, or be a general system used by a team to manage many services together.

    • @heckj — There will be a description language or “DSL (domain specific language)” coming. The design is not finalized, but it will most likely be declarative. Would love to have you as a contributor if you’re interested in participating. While I’m not familiar with RackHD, you could envision writing a native “type” in mgmt for physical server provisioning and thus integrating both provisioning (incl. of metal) and config management in a single language.

  9. Great article. I love the design constraints you have pointed out. I have worked on puppet/cfengine for a few years and now use chef for most cases. I faced pretty much the exact same difficulties as I worked on larger, more distributed systems. We have pushed several of the features you mention here into the chef ecosystem, and I want to mention a couple of things here for those who want to explore them, as well as in the hope that the learning might help you make your tool better:
    1) Parallelism: Chef internally uses threads for a couple of use cases (like artifact fetching etc). There are separate cookbooks (like thread) that allow parallel resource execution. Some of the common config management tasks can’t be parallelized (like package installation, e.g. dpkg holds a flock) due to downstream limitations. In some other cases this can be offloaded completely to downstream tools (chef now has a multi-package resource which offloads multiple package installations with version constraints via a single dpkg/yum/dnf call). There are also limitations with the language (Ruby threads & IO): in concurrent scenarios their performance is sub-optimal. Golang will help you a lot in those cases.
    2) Distributed systems: It is still not clear whether to implement fleet awareness in the core tool or as a plugin, due to rapid developments in distributed systems themselves. Currently we have serf (memberlist with health check) and consul/etcd integration in chef, which allows one to get distributed locking/coordination from chef. I think similar things exist in ansible and puppet as well. Some of these tools can be used to orchestrate as well (etcd/zk watches, serf/consul handlers etc). Currently these are convenient (and advanced) ways of provisioning mongodb replica sets, cassandra or zookeeper clusters etc. Chef 12.5 introduced a handler DSL which allows users to coordinate (locking, membership data acquisition etc) in particular phases of a chef run. This provides a more stable and standardized way of consuming these features. I hope to introduce more such primitives (like pluggable host discovery) into core chef.

    I want to add one last thing, regarding systemd integration. systemd provides a first-class dbus API to start/stop/query services, as well as to execute commands. Until now this has always been done with hacky shellout implementations that differ across distros. They also suffer from perf issues at large scale; a modern config management system for linux should definitely consider using systemd to shave all those yaks. This also means all the exec statements can now have cgroup settings (very important if you want to safeguard that commands invoked by the config management system are not hosing the host or the services running inside it).

    Keep up the good work, I’ll see if I can contribute some code :-)

  10. Pingback: CentOS Dojo, Brussels 2016 - Richii.com

  11. Hi, some questions regarding the parallelism support (something I really wish Puppet had, as it already has an acyclic graph).

    Does it support batching resources? For example package installations often can’t be parallelized due to locks, but you could batch up all package resources whose dependencies are satisfied into a single call to the package manager.

    Does it support prefetching? In many cases a lot of the job of checking a resource state can be parallelized to happen long before the resource dependencies have been satisfied.
    It would be good if the providers get info about all the coming resources in the graph up front so they can decide if they want to start checking the current state of them even before their dependencies have been resolved. It wouldn’t work for all resource types of course, but I think it would for many of them.

    Also I’m wondering if you have given any thought to just operating on compiled Puppet catalogs instead of inventing your own DSL?

    • Great questions;

      Yes it supports what you call batching, but we call it grouping (for now). It’s not merged yet, and I haven’t talked about this publicly yet. Good for you for realizing this is possible!

      RE: prefetching, it’s not clear where this would be generally useful and safety isn’t trivial. (Eg: I said A requires B, but you did something anyways). I’d like to talk about this more though. Depending on what you’re thinking of, it looks like it could be possible. What use case did you have in mind?

      RE: puppet DSL, yes I’ve absolutely considered this, and I mentioned it at the end of the article. It will go in based on whether puppetlabs or someone from the community wants to patch and support it. I think it would provide value to puppet users, but it’s not something I personally will write. If you’d like to talk more about this, let me know!

  12. +1 for PAXOS instead of reinventing a wheel.

    Otherwise, nice ideas, but from a CFEngine/Rudder user’s point of view I don’t see – yet – much to drool over.
    Execution even of a large policy is fast for me. Also, being on a 5 minute interval by default takes away some of the latency pain. If the package policies had better caching, most of my systems would be under 30s for even the initial runs.
    And subsequent runs are usually sub-second anyway. Scalability… yeah, once you hit over 2000 nodes you might need to add another policy hub (1GB ram, Load avg <2)…
    So, a lot of those issues you describe go back to the choice of tool in the first place. Ruby is just a stupid idea for something to run on a large number of systems, since the overhead scales, too.

    But I like what you wrote… mgmt has the main advantage of having collected less dust, and a little more distributed/coordinated nature which is something that is dearly needed.
    And standing on some standards is something which I really very very much appreciate.

    I hope it turns out well, and it's great you're rallying for involvement early on.

    • The goal is to be better at multi-machine things in particular, and while you may not see 100x improvements in run times for single machine things, the parallelism and event based features will be a big win when converging across multiple machines.

      Having said that, it’s only just the beginning and I’ve got a lot more fun things happening! Stay tuned :)

  13. This sounds very similar to Cassandra DB. No one node in a cluster stores all of the data. Also, no one node is tasked with serving queries.

    I didn’t look at your code yet, and I’m definitely not an expert on message transport, but have you thought about RabbitMQ as a message queue in this architecture?

    I love the idea!

  14. Pingback: Leap or die | A sysadmin's logbook

  15. Pingback: In reply to Luke Kanies | A sysadmin's logbook

  16. Pingback: Ce mardi 1er mars à 19 h au Linux-Meetup… | L'Interface, bulletin d'information de la vie étudiante à l'ÉTS

  17. Pingback: issue #17: Mint, GCC, Ansible, Python, Kubernetes, MongoDB & performance - Cron Weekly: a weekly newsletter for Linux and Open Source enthusiasts

  18. Hello James,
    So I just got the git source for mgmt yesterday and compared it with puppet and chef; it’s really fast and has a very good scope. This is truly revolutionary.
    A few things that I think need a look are:
    1. The server which starts when we run the YAML code to get inotify watches over a configuration. The server keeps on running unless we provide a converge timeout or put the whole command in the background (‘&’ magic). If this workaround is not done, anyone with admin access can stop the server or send a SIGKILL or SIGTERM and it won’t restart on its own, which could result in potential data loss or config changes. I had an idea to daemonize this and run the service so that it respawns no matter what, then let the mgmt binary take care of the rest of the work. Now, since I am new to this, I may be making a mistake in seeing this part as a bug, and I am truly sorry for that, but out of curiosity I would like to confirm whether this would create any hassle.
    2. When a server is running and, say, we need to deploy one more YAML, I don’t see a way we could do it without stopping the server.
    3. And most important (I think this should be at #1): installation! The Go binaries fail to install with ‘make deps’ alone and need to be installed manually; again, an automation tool with manual installation needs? I wrote a temporary bash script (well, I am a bash person originally) that takes care of the installation of this package to configure it for first use.

    • Hi Abhishek,

      Thanks for your comment, and I’m glad it seems you like the project. A quick foreword: someone with a similar-sounding name kept joining our IRC channel, asking a similar question, and leaving without waiting for the response! I’m guessing this was you. If so, please understand that people are in different timezones, and it’s common IRC policy to “stay in channel” whenever possible. Usually this means running IRC on a server somewhere, or using an IRCCloud-like service. Personally I prefer screen + irssi but the choice is up to you!

      Now to answer your questions…
      1) I’m not sure this would work– if an admin wants to stop a process, they have the right, whatever the consequences. Mgmt will shut down cleanly on a ^C signal (SIGINT) and try to finish what it was immediately doing first. If this doesn’t happen, please report the specifics as a bug.

      2) Actually you just edit the YAML file, and the server will notice that it changed and use the new one! Please note that the YAML construct is temporary until we have a full DSL (WIP). The same automatic switch-over will happen with the code.

      3) Feel free to share your changes in channel. I’m not understanding the issue you’re having, but if you could paste errors, that would be great.

      Cheers!

        • Thanks for the reply; yes, that was me, not intentionally though, but due to a loss of connectivity while travelling from work to home. Anyway, I was able to work out the errors myself after some tweaks on my machine. I understand everyone is in different timezones and will reply only when they are available to do so.

        Personally, I very much like the project. However, I may not be able to put up my thoughts exactly as in the previous comment, as I just started testing the project a few days back, but whatever part I tested clearly shows it is faster.

        I will continue testing the project and will be sure to share it on the channel.

        Best regards

        Abhishek

  19. Pingback: Mgmt en las charlas de Barcelona Free Software - KDE Blog

  20. Impressive stuff, and an excellent read!

    For someone who has spent a long time in the trenches with puppet, reading this post was like therapy!

  21. purpleidea: This is a very interesting solution, and one that I’d love to dig into and use!
    Had a quick question though: What are your thoughts on saltstack as a contender to mgmt? Its creators built it with the same “unique” design elements in mind.

    • I prefer to avoid comparisons when possible, in particular because I have friends using and building other tools (like Salt) and I don’t want to denigrate their communities or work, but having said that, there are some clear differences. The most obvious that come to mind include:

      * Salt requires a central SPOF to pass events around. Mgmt’s events are either per-host, or, when shared with other hosts, fully redundant in the CP distributed system.

      * Events in Salt need to be built manually, whereas in Mgmt they are built into the core resources. This might not be completely the case with Salt “modules”; however, I think the event primitive should be at the resource level, which is not how it’s done with Salt AIUI.

      * Mgmt is written in a safe language (golang), whereas Salt’s use of Python can allow you to build some dangerous code bases or hide bugs. This is of course a longer discussion which I don’t want to hash out here. One thing that can be said is that to describe things in mgmt, we use a safe DSL (coming soon), which should further reduce the chance of infrastructure bugs.

      * Lastly, Salt is a more mature software product which is usable in production today, whereas mgmt is much newer and git master still lacks feature parity with where Salt is now.

      Join us if you’d like to help us build mgmt!

      PS: And while the two tools might seem quite similar, I think they are drastically different architecturally.
