LISA '03 Configuration Management BoF

Organized and Chaired by Luke Kanies
Notes by Jim Thornton (corrections to <jthornton@parc.com>)
LISA '03 Site


Wednesday Evening, October 29, 2003
Town and Country Resort and Convention Center,
San Diego CA USA


These are notes from a freewheeling and sometimes chaotic BoF discussion, taken in real time. Though I made an effort to record the content of statements accurately, these notes should be considered paraphrases, not verbatim quotes.  Where possible, statements are attributed, but I was not able to get all the names, especially of less frequent speakers. Editorial comments like these are italicized.

Luke opened the discussion by asking people to talk about problems with what they have or use: "Don't say my tool is wonderful, say what is missing or what I'm stuck on or whatever"

Poll: Anyone using a non-open source payware tool?
Survey: make a list of tools we're using (votes out of 76 counted present, multiple voting encouraged)
cfengine                      18
ISconf                        3
psgconf                       1
Radmind                       3
rdist                         2
rsync                         26
Unison                        1
Tivoli                        2
LCFG                          2
BCONFIG                       1
xhier                         1
RPM                           3
Package Managers              6
reimaging/dump/restore        1
kickstart, jumpstart install  34
kickstart, jumpstart clone    7
Install*                      70
version control               43
OS Auto Update tool           36
ad-hoc management system      14
BigFix                        1

Shifted to meta-questions below

Happy (Happy with what they have now: "yeah this is decent")  18
Angry (How many people are really upset at the people who wrote the tools they are using?)  2
Actively looking for a better option  41
Configuration in LDAP?  1
DB e.g. SQL  9
How many people will look at things on this list  35
SmartFrog  0

Question: Mark Burgess was asked what he doesn't like about cfengine.
Mark Burgess: Decays over time.  E.g. version 1 there were compromises:
<discussion ensued about semantics and consistency of backups and value of slow vs. fast and why can't you use cfengine with rsync>

Question: Mark Roth was asked what he doesn't like about psgconf. (Aside from issue Andrew lambasted him about on Sunday)
Mark Roth:  (policy/procedure issue): psgconf has a built-in thing called policy methods (going to abandon this word) that do procedural operations on policy.  Certain of these policy methods need to be invoked in order and the tool does not offer any help with that.  This is considered a design flaw.  He also doesn't like the way it interacts with the package management system; that is an issue of mechanics, not theory.

Question: Paul Anderson was asked about LCFG.
Paul Anderson: Built from the bottom up to replace editing config files on individual hosts, while what you really want is to deal with services/relationships that span multiple hosts:  e.g. say "I would like two DHCP servers on each segment"
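
A service-level statement like the one Paul quotes can be sketched as a check over a host inventory. This is only an illustration of the idea, not LCFG syntax: the inventory layout, the host and segment names, and the function `segments_short_of` are all assumptions made up for the example.

```python
from collections import Counter

# Hypothetical inventory: host -> (network segment, services it runs).
INVENTORY = {
    "host-a": ("seg1", {"dhcp"}),
    "host-b": ("seg1", {"dhcp", "dns"}),
    "host-c": ("seg2", {"dhcp"}),
    "host-d": ("seg2", {"print"}),
}


def segments_short_of(inventory, service, required):
    """Return {segment: shortfall} for every segment that has fewer
    than `required` hosts running `service`."""
    counts = Counter()
    segments = set()
    for segment, services in inventory.values():
        segments.add(segment)
        if service in services:
            counts[segment] += 1
    return {
        seg: required - counts[seg]
        for seg in sorted(segments)
        if counts[seg] < required
    }


if __name__ == "__main__":
    # "I would like two DHCP servers on each segment"
    print(segments_short_of(INVENTORY, "dhcp", 2))
```

The point of the sketch is that the requirement is stated per segment, not per host; which host ends up satisfying it is left open.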

Question: Can somebody compare cfengine and radmind?

Luke Kanies: radmind and cfengine are not very similar because radmind is mostly about pulling down files

Steve Traugott: radmind does nothing at all to preserve ordering and it doesn't violate the Turing paper.  It does that by describing the complete disk state at all times within the constraints that they have defined within the bounds of everything that is not the negative space concept from radmind.

Luke: cfengine is more than pulling things down, it is a lot of doing.  cfengine avoids the need to write a lot of little shell scripts, allowing you to connect a lot of relatively simple things. radmind takes "just pulling files down" and adds a lot of functionality on top of that - you can say this system is these bundles of files and there are tools to detect and capture those subclasses and move that to other machines.  radmind can take a system you like and turn it into a configuration you can replicate.

Wes Craig: [in response to question] process state is not managed by radmind at all, e.g. no repartitioning disk, changing boot loader

Steve: at its base radmind is a list of files with their attributes: mode, SHA-1 sum etc.  You have multiple lists for a machine, and where a file is listed in more than one place you have another file that gives the precedence of the lists you will apply to the machine
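
The model Steve describes (several file lists per machine, with an ordering that decides which entry wins) can be sketched roughly as follows. This is not radmind's actual file format or tooling; the list names, paths, and the helpers `resolve` and `differences` are assumptions for illustration only.

```python
import hashlib


def sha1_of(data: bytes) -> str:
    """Hex SHA-1 digest of some file content."""
    return hashlib.sha1(data).hexdigest()


# Hypothetical file lists: path -> (mode, SHA-1 of the content).
BASE = {
    "/etc/motd": ("0644", sha1_of(b"welcome\n")),
    "/etc/hosts": ("0644", sha1_of(b"127.0.0.1 localhost\n")),
}
WEB = {
    "/etc/motd": ("0644", sha1_of(b"web server\n")),
    "/etc/httpd.conf": ("0600", sha1_of(b"Listen 80\n")),
}


def resolve(lists_in_precedence_order):
    """Merge file lists; a later list overrides an earlier one
    when the same path appears in both."""
    merged = {}
    for file_list in lists_in_precedence_order:
        merged.update(file_list)
    return merged


def differences(desired, actual):
    """Paths whose desired (mode, checksum) entry is missing from,
    or different on, the machine described by `actual`."""
    return [path for path in sorted(desired) if actual.get(path) != desired[path]]


if __name__ == "__main__":
    desired = resolve([BASE, WEB])  # WEB takes precedence over BASE
    # Pretend the machine currently matches BASE only.
    print(differences(desired, BASE))
```

Because every entry carries a checksum, the same merged list serves both for applying a configuration and for detecting files that no longer match it.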

Luke: it has the ability to create these lists also

Kent Skaar: they are (to their credit) the only tool out there I've seen that solves the tripwire problem - managing consistency of checksum DB in a good manner

<missed some stuff>

Luke: can you tell it not to deal with certain files?

Wes: yes

Luke: everyone who builds tools should download "competitors" and compare

Wes: I can't really try other things with a production environment

Paul Anderson: we're talking about tools that do very different kinds of things and we should distinguish tools that move files around from those that deal in semantics

Wes: argues that you can't have something portable that contains semantics

Paul: No it is a modular thing, e.g. psgconf

Mark Burgess: What is the tripwire problem and why doesn't cfengine fix it?

Kent: tripwire is very hard to verify.  I could choose to trust cfengine [...] you can get a checksum state of a remote system

Luke: cfengine is good at changing system state but not at telling you what files are on what machines

Mark: disputes this due to comparison against trusted master copy feature

Paul: copying is good for making things look the same but mostly that is not what you want

Luke: Disputes this to an extent for small sites, etc., copying tools don't know or care what they are managing

?: Xhier - written up in LISA 1991 and a lot different from what is being talked about here.  On the surface it seems like just a tool for packaging and distribution of packages BUT any of these packages can be packages that manage your system, e.g. one that contains a file like the mail alias file.  The core engine part that makes it run can pull these out and apply them in an organized manner and thus configure how your system behaves.  Right now it uses rdist but that is irrelevant.  Seems much different than radmind or cfengine.  What I don't like about it is that the shortcomings listed at the end of the paper haven't been fixed. Also, it tends to drift: it provides several ways of classifying machines: layer of files for entire site, can be overridden by layer for 'region' of machines [...] (4-6 different layers) but it is easy for things to drift out of the state you want because an SA can come and say "just send this package there" and packages themselves can notice they aren't the latest version and tell the system to get the latest.  This can be good or can be a pain if you're trying to keep it the same.

[John Sellens, "Software Maintenance in a Campus Environment: The Xhier Approach", LISA V, 1991]

Does it have facility for central reporting?  Yes
Released recently? tweaking
How to get it? Call me [business cards exchanged]
google.com

Question: Luke asks Paul how he thinks we should talk about these two different kinds of tools, can't have a hard line ...

Paul: Dividing line for me is can you say "I want to make a machine a print server"

Wes: radmind can do that, just not in an abstract language

Paul: How do parameters get adjusted, e.g. "make a DNS server"

Mark Roth: difference is whether it is completely identical on different machines

?: We have a homegrown thing with roles; depending on their roles, machines will get the right files.  Machines know their own roles and we have a config script that does the installations.  We specify certain attributes per host, e.g. what IP address, and try to keep that in the file that specifies the role.
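
A role-based scheme of the kind this speaker describes might look something like the sketch below. The speaker's actual tool was not shown; the role names, file paths, host attributes, and the function `files_for` are hypothetical.

```python
# Hypothetical mapping of role -> files that role requires.
ROLE_FILES = {
    "base": ["/etc/resolv.conf", "/etc/ntp.conf"],
    "dns": ["/etc/named.conf"],
    "print": ["/etc/printcap"],
}

# Hypothetical per-host records: each host knows its roles, and
# host-specific attributes (such as the IP address) sit next to them.
HOSTS = {
    "ns1": {"roles": ["base", "dns"], "ip": "192.0.2.10"},
    "lp1": {"roles": ["base", "print"], "ip": "192.0.2.20"},
}


def files_for(host):
    """Files a host should receive, in role order, without duplicates."""
    wanted = []
    for role in HOSTS[host]["roles"]:
        for path in ROLE_FILES[role]:
            if path not in wanted:
                wanted.append(path)
    return wanted


if __name__ == "__main__":
    for name in sorted(HOSTS):
        print(name, HOSTS[name]["ip"], files_for(name))
```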

Paul: As an SA you get used to thinking from the bottom up, but when you come into a site and ask for what you want you say something high level (I want 2 DNS servers on each segment)

?: You need something to define those terms for your site

Paul: SA work now is like writing in assembly and you want something more like Prolog?

?: More like hardware, these days you use HDL rather than drawing stuff

Rémy Evard: propose a slightly different way of thinking about this. There are tools that have file copying function: at one extreme is something like SystemImager that just clones but does it well.  A lot of people use SystemImager and then run cfengine on top.  A lot of tools combine both.

?: We're moving to a world of clusters etc. where people will need to say something about what services we need.

Paul: Reminds you of moving from assembly to higher-level language.

Steve: I'm coming to the point of believing that we're not really working at cross purposes here: you had to have hardware, then assembly to get to high level.  Been thinking about what the GUI on top of radmind might be as in HDL.  What needs to be resolved is that you need to know what the underlying tool chain is like.

Paul: To get where you need to get inside files

?: But that's not how I put a computer together: I buy a stick of RAM and put it in because it has an interface.

Steve: I see progress, a continuum from people looking at high level and people looking at low level components that will go into the compiler.

Luke: A lot of what we're doing has a goal of working on that continuum but a lot of other tools are focused on solving problems right now even if it has nothing to do with this continuum.

Alva Couch: It is part of the puzzle in the sense that we are now working on tools that synthesize the semantics.  Radmind picks stuff up and throws it around and that is a transfer modality and that's fine no matter how you synthesize it, so if you have a high-level language generate it you can still throw it around.

Steve  ? [not Traugott]: radmind is assembler but has these wonderful macros called overlays

Luke: Paul's focus is a lot on where we're going to be in 10 years. Not judging anyone.

Alva: I'm presenting a paper looking at the view 10 years along.

John Sechrest: As we draw axes through, one difference is between manipulating bits on disk that you use in some way and the manipulation of the system as it is running, ongoing grooming of starting and killing programs.  That is a different class of toolset than making all the bits on disk right.  We can have a high-level language that compiles to a set of bits on disk which is a different problem space from having a high-level language that describes how my system works as it is running.

Robert Au: Wes was talking about problem of not being able to test other people's tools.  Is there a good test environment for this kind of tool and what would go into it?

Luke: I like cfengine because it is more like a language and so it is simple to use it for something small in its home directory:

?: Hard because every tool has a different model.

Mark Roth: Might be like writing a tool for Perl

Rémy: We've been thinking about this question and it is hard for almost all system administration things.  Looked into programming language literature.  There are some papers that compare tools in a principled way.  We have some tests sketched out but nothing ready.

Steve: You're looking for procedures: do this task on these machines using these tools. How long did it take, etc.

Rémy: At Argonne, we have a cluster for research for doing things like this, so if somebody wants to do this seriously talk to me.

Luke: One thing a lot of us need to do more: write down what you are doing and put it on a website, publish in LISA, do something.  It is always better when there is more information.

Luke: I've got a site with goal of making it easy for SAs to come find things like scripts, tutorials, etc.  It doesn't have much because people weren't using it.   URL: madstop.com

Paul: It would be really interesting to try to collect at a high level the tasks that people are doing every day.  What are you doing and why?

Rémy: Let's say we get answers to these:

Paul: I'm interested in an even higher level: "Machine broke and I have to replace it with something equivalent"

Rémy: I found looking at the repository relevant.

Luke: Need more survey papers

Paul: suggest the LSSConf mailing list.  Just google for lssconf

Luke: we're not trying to formalize, just trying to share more.  It's embarrassing that we don't share info more on the Internet.

Steve: I'm starting a restructure of Infrastructures.org and I like the benchmark idea for a section there.

?: Does anyone else have a problem with controlled roll-out of changes?  Does anyone have a tool that enforces process: must test, then promote etc.

Luke: This was good about integrating isconf with cfengine.  In cfengine I would use logic to say: set these 5 as test hosts.  In isconf I would say: if you're a test host, do this stuff.  Use cfengine as execution engine that starts everything.  Details in paper.  I've had success with this.

?: Today the way we do this is roll up groups of files into tarballs and they get delivered first to test, sit for 2 days, then promoted to UAT.
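
The promotion rule in that description (deliver to test, let it sit, then promote) can be sketched as a small function. The stage names and the two-day soak come from the speaker; everything else, including the function name `next_stage` and the idea of timestamping when a bundle entered a stage, is an assumption. Any stages after UAT were not mentioned, so none are modeled.

```python
from datetime import datetime, timedelta

# Stages named in the discussion, in promotion order.
STAGES = ["test", "uat"]

# "sit for 2 days" before promotion.
SOAK = timedelta(days=2)


def next_stage(current, entered_at, now, soak=SOAK):
    """Return the stage a bundle should be in at time `now`, given the
    stage it is in and when it entered that stage. A bundle moves at most
    one stage forward, and only after the soak period has elapsed."""
    index = STAGES.index(current)
    is_last = index + 1 >= len(STAGES)
    if not is_last and now - entered_at >= soak:
        return STAGES[index + 1]
    return current


if __name__ == "__main__":
    delivered = datetime(2003, 10, 29, 12, 0)
    print(next_stage("test", delivered, delivered + timedelta(days=1)))
    print(next_stage("test", delivered, delivered + timedelta(days=2)))
```

Note that this only enforces waiting time. It says nothing about what testing happens during the soak, which is the gap raised in the next question.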

Question: What about what does the testing?  They don't do very deep testing.

Steve: a cron on top of what I'm doing now would do that because the move is just a single command somebody enters manually.  Why would you want to do this?

Answer: Rolling out resolv.conf for network renumbering

Steve: you picked hardest case ...

Luke: I just did that in the way I was describing.  Did all tests, had a bit-flip in cfengine after waiting a week for appropriate class of machines

Mark Roth: Purpose of doing this in a staged way is to validate that it is ok.  Just putting it out there is a weak test.
[...] everybody chooses their risk level [...]

Steve: Luke what are you saying in the resolv.conf case?

Luke: I wrote a script that takes the 5 pieces of data and emits a resolv.conf.
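
A generator of the kind Luke mentions could be as small as the sketch below. The notes do not record what his five pieces of data were or what his script looked like, so the inputs chosen here (search domains, name server addresses, resolver options) and the function name `render_resolv_conf` are assumptions; only the `search`, `nameserver`, and `options` keywords are standard resolv.conf directives.

```python
def render_resolv_conf(search_domains, nameservers, options=()):
    """Build the text of a resolv.conf file from a few per-site values.

    search_domains: list of domain names for the `search` line
    nameservers:    list of resolver IP addresses, one line each
    options:        optional resolver options for the `options` line
    """
    lines = []
    if search_domains:
        lines.append("search " + " ".join(search_domains))
    for address in nameservers:
        lines.append("nameserver " + address)
    if options:
        lines.append("options " + " ".join(options))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example values only (documentation address range).
    print(render_resolv_conf(["example.com"], ["192.0.2.1", "192.0.2.2"]), end="")
```

Generating the file from data means a renumbering changes the inputs in one place, and the staged rollout then decides which class of machines gets the regenerated file first.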

Steve: I got burned in 1997 with a script that generated resolv.conf and it ran into political problems and I never tried again.

Luke: I guess I've been lucky, had a lot of success going from unmanaged to managed.

?: Specific problem is doing this in parallel.  We see things like a new version of nsswitch.conf which we want to change and start rolling out and then half-way through we need to make another change and the two changes need to both progress out through the process.

Luke: I find that doing things with cfengine is really complicated and is not good for that but what it is good for is making decisions and passing the results of that decision on to somebody else.
<others: ... classifying machines based on characteristics ...>
Luke: beyond that I get frustrated

Luke: I want a language that makes it easier to do stuff and not just decide stuff.

John: I'm very interested in how other people feel.  People seem to have a problem taking on a new language vs. something like what BCFG does.  Do people have folks in their environment who are resistant to adopting a new language and would prefer making the change on a machine (touch the bits) and screw the abstraction. Poll [of those still left at this point]: 17.  So it would be fair to say that there is a component of the profession looking for a tool where the interface is to manipulate the bits.

Alva: This is problem of semantic distance, you know well how to do it "for real" and there is a learning curve to do it approximately at a high level.

?: Disagree: you don't always know when you log in and do something that it is going to work.  You have to verify everything anyway, so it is good to have a tool to snapshot.

Rémy: First time you're getting something working you want to do that by hand and not roll through a config language over and over.

Wes: Blackbox capture also allows somebody who, say, uses vendor tools to work.

Paul: Problem is that it is like trying to decompile a program.  In a high level description there is a lot that is missing in the details of the files, e.g. why things are.  Appreciate black box tools but you need input from user.

Luke: Some communities, like Mac, don't want to know anything about computers.

Paul: But you're managing, e.g. DNS

Wes: It's pretty entertaining often on Mac: you get a window and say "serve that stuff"

Luke: A lot of these people are, say, teachers managing for their school in 2 hours a week.

N?: Another thing you get is ability to pull out pieces of data, you can use that data later even if abstract.

Paul: But it is about relationships between machines, e.g. hole in firewall for that machine and you really want to know why that hole is there so that it moves properly when you make a change.

Luke: Yes, but there are tools that will never approach that and we should be encouraging both types of tools.

Steve: Inter-system dependency is right where I am on the bleeding edge of trying to figure out how to do that.

Paul: That's the real problem, single machines are easy.

Wes: I can see the value in a data-center mark-up language but something has to be doing the basic kinds of things.

?: The tools do exist on Mac or RedHat for a person with a single machine and that kind of stuff is easy.  The next step in that is dealing with a small LAN or like early AppleTalk and they built tools for that. We're looking at trying to design these tools not for the little subnets but for the whole infrastructure within an organization.  That is 3 orders of magnitude more difficult than the single LAN.

Alva: We want a portable validation claim.  What we're doing now partially solves that problem.  With radmind we're moving to what you think is valid and hope there isn't a latent variable that breaks it. [...].

Steve: That's why I'm big on finding constraints like ordering to keep that out of the way.

Alva: The great thing with radmind is that it duplicates stuff but
<missed discussion>

Steve: Classification here having to do with static routes and stuff that isn't business data but is not common to machine

Alva: and has relatively deep semantics

Luke: If you go for a tool that doesn't go for that understanding you'll have more latent variables.  If you go for a tool that does, you probably have a less useful tool.

Alva: Not a problem with a small LAN but a problem with many machines.

<missed discussion>

Wes: Some things it is easy to be lucky, e.g. Mac.  Not so lucky on Solaris.  Somewhere in between on Linux.  There are plenty of decisions you could have made that make it so that a particular kernel doesn't work.

<...discussion of where some stuff is stored on Mac/Solaris...>


last modified $Date: 2004/01/06 22:24:56 $