Organized and Chaired by Luke Kanies
Notes by Jim Thornton (corrections to <jthornton@parc.com>)
LISA '04 Site Notes from 2003 BoF
Tuesday Evening, November 16, 2004
Marriott Marquis, Atlanta, GA USA
These are notes from a freewheeling BoF discussion, taken in real time. Though I
made an effort to record the content of statements accurately, these notes
should be considered paraphrases, not verbatim quotes. In most cases statements
are not attributed, though sometimes I got the name and maybe even got it right.
Luke gave a general introduction indicating no structured agenda.
Luke: Would like to (like last year) go through and see what the list of tools is and what people are using. Are there any tools that people are using that nobody knows about?
Narayan: Still working on BCFG but have a release now probably fit for human consumption.
Luke: Just sat through cfengine BoF. [runs through list of packages from last year] Let’s do surveys again (who is using to do work)
Note: count of those present was 65
What caused you to come to the BoF?
A: Probably should figure out if we should be doing something like that, or if we are and don’t know it.
Any other commercial products people are using?
I think that about covers the commercial products. In open source stuff,
people will have e.g. one script running rsync and don’t know if that counts
etc. Last year, out of 76, 70 were using an OS install tool. Who isn’t using
some sort of auto-install tool to get machines from bare metal to some sort of
base OS? 3
?: You can use channels (e.g. RedHat) to send out packages.
Luke: Is that bare metal provisioning though? [answer no]
?: Only recently started to use Solaris Flash: anybody else using?
?: […] state of things is that you need to do 2 pass.
Luke: Flash archive stuff doesn’t do anything with disks itself, you’re still responsible for partitions.
?: Latest documentation of Solaris claims it now supports root partitions for jumpstart
Luke: Flash archive is really take the data and dump it
on disk […] that would be kind of separate.
Almost all OSs out there have some kind of apt-like system that can pull down
packages, resolve dependencies etc. How many people are using that as a
significant portion of how they manage their machines? 28
How many of those people are paying for a service like that, like what used to
be Ximian Red Carpet or whatever? 7
What services:
up2date (RHN)
Red Carpet
technically Solaris support
Luke: That’s not an auto-patch though, you need to download
?: no with PatchPro
MS autoupdate (not a service you have to buy separately)
Other area out there is tools like rdist or rpm that have
open-ended functionality… This is where the survey gets murky, it is a
qualitative decision on your part whether you are using this for config. mgmt.
[…] If you’re using rpm to do more than just install packages, e.g. at Tufts
(Alva Couch) they put all config changes into rpm packages and use that to push
out every change, that’s their config tool.
Who’s using rdist for some kind of config work? 3
Who’s using rsync? 2
ssh/scp to do e.g. parallelized work on systems? 17
Who is using POE? It’s not a tool per se but something in Perl you can use to build various parallel processes: 1
Are there any other tools not on this list that you are using that you think qualify as config mgmt tools:
psgconf: 1 – are you at UIUC? A: No but I was
Does Perl qualify?
Luke: No because you’re just writing programs in Perl
[loose description of ‘macro’ tool]
Luke: I would say that you have written your own tool.
How many people have written their own tool? 18
Other tools:
It’s not cool, but for lack of camaraderie in organization, we
have a bunch of text files that get copied […] e.g. putty, notepad
?: I’ve seen people use that for router config
?: I think that’s how we manage ACLs also
Transarc “Package” don’t know if it is open source: 2
Do you consider telnet/rcp? How many people: 1
Luke: You might think about upgrading
A: It’s on my list
BCFG: 1
Narayan: hasn’t escaped yet so nobody using it offsite as far as I know
LCFG: 1
Quattor from Data Grid in Europe [nobody knows of it, it seems]
?: Of people who don’t use cfengine but tried it, what was it that they didn’t like?
Luke: Generalization: are there tools out there that you
tried and didn’t like, and why? So start with cfengine, how many people have
done it (more than to write a single script) and moved away from it? 4
?: I upgraded from 2.1 to [?] 1.11 and it broke, deleted important
files
Who has moved away from it and why, willing to say?
?: I have machines that change fast and it wasn’t always
as predictable when changes would happen, not fast enough, went back to [?]
Narayan: We have about a dozen administrators and couldn’t get people to
buy-in.
?: Why?
Narayan: Doesn’t fit the problem-solving style. I don’t
like the model personally, like a config mgmt script language, wouldn’t want
other people working on my script.
?: Like any framework (web development etc.) if it fits your brain it
works for you. The thing I didn’t like was the line oriented nature: e.g. add
this line to Apache config, no I want to add a block.
?: I know I’m a minority here, but I would like to hear definitions of
what people think configuration management is. Different people may have
different ideas.
Luke: That’s why you won’t hear definitions, there is very little
agreement except for “using something to manage stuff”
[…]
John S.: Using a tool to manage a bunch of machines so you get some kind
of leverage.
Narayan: Leverage may be apparent at different times:
maybe when you make a change, maybe when you have to rebuild a machine, values
provided by tools vary widely.
?: For us it is consistency, whether John or Mary did it
?: Why we left cfengine: we changed whole concept […] to get machines
always in a consistent state.
?: Management spends money, wants a GUI, ssh should be called MMC and be
sold by Microsoft.
?: Reason I stopped cfengine was because it was too close to the metal
[…] you have to know what you’re doing.
?: Isn’t that the case anyway?
?: No I’ve met people who didn’t know what they were doing but they were
Windows sysadmins, need tools for those people also.
?: Capture not standard but best way to do something.
?: Organizational knowledge management: want one thing for enterprise.
Using BladeLogic, has some bugs but a lot of benefits. 3 platforms.
Tivoli BMR (Christie software)
Luke: I considered whole Tivoli suite as one, probably
not fair since it has so many tools.
?: What do people feel are the low hanging fruit? Like if we had nothing,
what would people say should be the first thing?
?: Take /etc/passwd, thinking about same thing on same machine, then move
to larger idea of doing things on multiple machines. For me, main thing is
avoiding making mistakes again.
Narayan: Configuration management tool may require a complete model
shift: the way you administer your machine is different so as an organization
you need to make a decision about how much change you will accept.
?: In my case we started using cfengine on a few machines, then suddenly
we were looking at updating hundreds of clients […]
Luke: One thing that Alva said in workshop on Sunday: a lot is cost
related: if you have a person managing only 3 machines there is not much you can
save unless you get rid of half a person. If you have a lot of machines you
probably have a lot of money. […] Organizational tools would let you identify
hot spots to focus on.
?: In our case there was no low-hanging fruit and we had
to carefully analyze and take goals to management saying we need configuration
management.
[…]
?: How does imaging fit in all this?
Luke: Imaging is one step: […] kind of an implementation
issue at that point.
?: Also issue of bundling config mgmt in images.
Luke: One approach is get it right at the beginning and don’t manage it
(do ad-hoc)
?: Anybody using organizational tools?
TT [time tracking?]
Narayan: Workflow systems – site-specific, locally developed.
?: Successful configuration management and workflow can be implemented in such a way that you can fill out the sysadmin staff with more junior staff
?: Middle of the road approach, I spend a lot of time
going around saying this. It’s not about cutting senior admin jobs. Anybody who
is just fighting fires […] is not doing their job.
Most of us are losing staff over time, in best case staff is linear. But systems
are complex, geometric at least so there is always this gap, assuming best case,
well funded, I call it the complexity gap. It’s never going to go away so you
have to find a way to do it better. It’s about freeing up our time which is the
only commodity that matters …
John S.: I know there’s a package (cfadmin) that sits on top of cfengine
that derives config from a database so the config tool and current state of
machines in DB work together. How many people are doing that? 19
How many using cfadmin for that?
John S.: Seems to me to close the loop you must have monitoring, so how many people derive monitoring from configuration so when you define config automatically the monitoring happens? 6
Luke: I wrote a tool for cfengine called nagiator[?] and it worked pretty well
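One way to picture "deriving monitoring from configuration" is a small generator that walks the same inventory the config tool uses and emits one check per configured service. This is only a sketch: the inventory layout, the `SERVICE_PORTS` table, and the `derive_checks` function are hypothetical names chosen for illustration, not the format of cfengine, cfadmin, or any tool mentioned in the discussion.

```python
# Hypothetical host inventory: the single source of truth for configuration.
INVENTORY = {
    "web1": {"services": ["http", "ssh"]},
    "db1": {"services": ["postgres", "ssh"]},
}

# Hypothetical mapping from a configured service to the TCP port to probe.
SERVICE_PORTS = {"http": 80, "ssh": 22, "postgres": 5432}


def derive_checks(inventory, service_ports):
    """Return one monitoring check per (host, service) pair in the config."""
    checks = []
    for host in sorted(inventory):
        for service in inventory[host]["services"]:
            checks.append(
                {"host": host, "service": service, "port": service_ports[service]}
            )
    return checks


for check in derive_checks(INVENTORY, SERVICE_PORTS):
    print(check)
```

Because the checks are computed from the inventory rather than maintained by hand, adding a service to a host's configuration adds its check on the next run.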
Narayan: I think there is a third piece: diagnostics and
repair and that is not configuration management.
[discussion about whether this is true if you toss machine and rebuild]
Luke: [argues that killing jobs isn’t a big issue] Main thing config management gives you is higher levels of service. […] You can have dramatic increases in service quality by being able to get work done faster, more consistently, have a log so you know what was done. Second, we’re experiencing a geometric growth in servers […] how many nodes can I fit in a single rack without blowing out my AC, this exacerbates config mgmt problem greatly. Once you get AC fixed, you have a lot of machines where once you had 10.
?: I work for a small cable ISP. To stay in the market, we have to offer more high-value services. I need to leave servers to do their things so I can develop applications so there will be a job. It’s growth in services.
Luke: I know from Covad, they are the one ISP that survived because they have a great provisioning system.
Narayan: 3 axes we use:
node count
configuration count
admin count
?: I think there are several more parameters.
[10 minute break declared for beer fest]
Luke: So, we were about to get into a parameter war, what the axes are
Narayan: No fight, the extra administrative domains is another.
?: My point was: don’t be surprised if there are a lot more parameters.
[…] You will discover new parts as you do your planning. You have a planning
element where you do analysis of what you want to accomplish or what you have in
place, and that is the parameters that you mentioned and probably a bunch more.
Luke: That’s why the cost computation is so complicated… why people write
for-loops
?: We just spent a bunch of money on Tivoli, and some people are (I’ll
say) afraid to use it because to use it we have to write in Tivoli package
language and if we ever decide to move away from it then we’ll be stuck. Does
anybody know of a neutral package approach?
Do you mean something like rpm packages?
?: My developers write something that has these dirs. etc.
?: Have you thought about writing the application’s build system so it can produce these packages, e.g. ‘make Tivoli’?
Andrew H.: Not the right way to do it because then you need 20 targets in every makefile. You need what he said, something neutral, so each makefile produces one thing, and then there’s a tool […]
?: Somewhere you have a tool that makes the package, what I’m saying is why not put it in the build system.
Narayan: There’s a tool called Alien that translates between package formats. I doubt it has a target for Tivoli but […]
?: Initially we started with rpm […]
?: Sounds like what it boils down to is that I need to train my developers to produce something other than just a tarball.
John S.: Yes, there are some sites I know that have a
‘make rpm’
?: There’s a tool you can invoke just before ‘make install’ and it will
watch what install does and make a package out of that.
?: It doesn’t work very well. I’ve seen other tools: CPAN
to rpm and that doesn’t work very well.
Narayan: All packagers are similar but all a little
different.
?: Silly, but what I need is ClearCase to rpm: export to
rpm. If it just spit it out as an rpm I’d be done.
[jokes about bringing your checkbook]
?: Most systems have build step and 'slurp it into a blob' step. Building
it is one part, and figuring out what needs to be slurped is a second part. Most
systems have both parts, but that’s where the abstraction break could be.
Luke: You could fake it, make rpms, give them to Tivoli so the installation is to drop the package, run rpm, remove it. […] BladeLogic does something like this.
Luke: Is there anything anybody is doing that they find exciting? […] When I was doing operational management, I usually learned something whether it went into a tool or not. Is there something you’ve learned recently, insights you can share?
Andrew H.: One of my issues right now is service supervision: need servers of certain kinds up in your cluster. We had fairly ad-hoc methods. We’ve gone to a scheme where we have a shared data space amongst all nodes in a cluster, (some would call it a distributed hash table but it is not) a key-value pair thing verified with protocol analysis, elect a new server if one goes down. It just works. We expect to make the shared data space available, a paper from HotOS this year. The algorithm you put on top to make the service supervision work is trivial code. If people are interested in that we’ll probably be distributing it in a few months. Once you get to abstraction where propagation delay is a few seconds for cluster, these algorithms are not hard and protocol analysis tools to prove properties like deadlock-free are good.
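To illustrate why the supervision logic on top of a shared key-value space can be "trivial code", here is a minimal sketch. It is not Andrew's implementation: the layout of the shared space (a plain dict with `heartbeats` and `servers` keys), the timeout, and the "lowest live node name wins" rule are all assumptions made for the example, and the dict stands in for whatever replication the real shared data space provides.

```python
def live_nodes(space, now, timeout):
    """Nodes whose last heartbeat in the shared space is recent enough."""
    return sorted(
        node
        for node, last_seen in space["heartbeats"].items()
        if now - last_seen <= timeout
    )


def supervise(space, service, now, timeout=5.0):
    """Keep `service` assigned to a live node; re-elect if the holder is dead."""
    alive = live_nodes(space, now, timeout)
    holder = space["servers"].get(service)
    if holder not in alive:
        # Deterministic rule: every node computes the same winner
        # from the same shared data, so no extra coordination is needed.
        holder = alive[0] if alive else None
        space["servers"][service] = holder
    return holder


# Stand-in for the shared data space: n3 holds the service but has gone quiet.
space = {
    "heartbeats": {"n1": 100.0, "n2": 100.0, "n3": 92.0},
    "servers": {"dhcp": "n3"},
}
print(supervise(space, "dhcp", now=101.0))
```

The sketch assumes every node sees the same data within the timeout, which matches the remark that these algorithms become easy once propagation delay across the cluster is a few seconds.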
?: Something that might help people just getting started: Went from a site with a senior sysadmin dictating to my own site where I was trying to figure out the best way. Implemented two tools in separate labs. What I discovered was strengths and weaknesses of tools. By defining service/package in both I understood it better. cfengine, psgconf
?: Did you find many reusable parts that you could share?
?: Wrote a really dirty Perl script that could take one and generate the other. […] Found that going from cfengine to psgconf was easier than the other way.
?: You said you stopped cfengine for psgconf?
?: To clarify, when I changed universities I decided to evaluate both tools. I ended up using cfengine on servers and psgconf on clients because I found with psgconf it is easier to individualize machines, sort of. I’m not done with this and may come back in a year and say this is tripe.
Luke: How many people consider server and client
management to be fundamentally different? 14
How many are from commercial companies of these: 7-8
Educational: 4
Who thinks cluster administration is different from server and client? 9-10
?: now, long term?
Narayan: In general, are there different problems in cluster admin from others?
?: I would argue that you should be able to find a tool that can handle all environments. That should be a goal.
Andrew H.: Tool is not handling clusters: e.g. you need 2 DHCP servers.
?: You actually have exactly the same problem in regular networks.
Narayan: Yes, what I’m saying is: is there a difference between a network of nodes and a cluster?
?: difference is that in network every machine is distinct.
Andrew H.: Clarify: I make no distinction between a network of machines and a cluster. [...]
?: As long as there's no data on the machines, it doesn’t matter whether it is server, client, or cluster. It is when you put important data there that it is different.
?: Purpose is what distinguishes.
?: If all the data is on my network server, then all the client nodes become expendable.
?: It becomes easier but you have to maintain that config somehow.
Andrew H.: I think I’m taking a different slant on what is under config management. I don’t just mean put these files in these places. There is this whole mélange of stuff that seems to hit you in a cluster […] It’s just a far more complicated thing and you rarely see that on an individual node.
Luke: We’re not comparing a cluster to an individual node.
[…]
?: I would say the difference between a network of nodes and a cluster of nodes is like the difference between theory and practice.
?: Define cluster
Andrew H.: I use the Swiss model: everyone has prickly relations with everyone, friendly but not overly so […] Will say flock in future: what characterizes these things for me is the lossy unreliable way you talk to them.
Narayan: […] More deterministic at node level. Question was targeted at idea that clusters are easier.
Luke: I said that, there’s usually a higher degree of consistency across a cluster.
[…]
Andrew: I think of a situation with 10000 nodes which are clones.
?: If you think of things like webfarms and so on as clusters, I think a lot of what Andrew is saying applies there. You have a lot of web servers that are mostly the same, that makes the config job easier, but you also have a few other challenges when not all are alike like a Beowulf cluster. These 20 variables need to be different between each of the nodes, that’s a very different problem from saying that I have 2 DB servers, 2 of this and I need to update them.
?: [from U. Wisconsin] Heading in direction of compute servers, public lab, […] we really look at that as all the same problem (as infrastructures like file servers). We’ve written our own tool. This is really one problem, one tool whether one box or 100.
?: We’ve discovered a delicate dance if you get 4 types of servers serving the outside world but some talking to ATT, some Sprint, so choreography of what to update when to maintain service is what’s hitting us.
?: That’s where you look at a cluster providing a service. […] Several servers in different data centers all are a cluster from user point of view.
Narayan: Distributed runtime.
?: Does everybody know that this conversation is at odds with the marketing side of the industry? Cluster means a bunch of servers sharing load. Another term 'grid' […] means […]
?: Also HA cluster
?: People in this room are better positioned than people writing marketing, but they’re writing marketing.
[…]
John S.: Want to switch to the topic of SLAs. You’re building a cluster to accomplish something; how many of you are building SLA specs with configs?
Luke: [gives some intro to idea of SLAs, different SLAs for production servers, vs. development, vs. test]
Andrew: Sort of an odd way of structuring it. In common parlance it means if you are providing a service (the S) you are contracting to provide a certain level of service. […] John is saying: in order to provide this kind of service I need to provide x (hard to do automatically)
Luke: Who is using tools to do that kind of work?
Anybody?
Declarative language for SLA?
John S.: Paper from IBM about WSLA to codify SLAs.
Andrew: Trivial case I’ve done that’s close. We have a computational model for how fast things run, so if you say workload of size s must complete in t hours, then you have to buy x servers. We have that kind of thing: approximate models for capacity sizing.
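The sizing rule described here can be written down as a one-line model. This is a rough sketch under a stated assumption that the notes do not confirm: work divides evenly across servers and each server processes a fixed amount per hour (`per_server_rate`, a hypothetical parameter). Under that assumption, x = ceil(s / (rate * t)).

```python
import math


def servers_needed(workload_size, deadline_hours, per_server_rate):
    """Servers required to finish `workload_size` units within the deadline.

    Assumes perfectly linear scaling: each server completes
    `per_server_rate` units of work per hour, independent of the others.
    """
    if deadline_hours <= 0 or per_server_rate <= 0:
        raise ValueError("deadline and per-server rate must be positive")
    return math.ceil(workload_size / (per_server_rate * deadline_hours))


# 10,000 units of work, 8 hours, 100 units per server-hour.
print(servers_needed(10_000, 8, 100))
```

Real capacity models are only approximate, as the speaker says; contention and non-parallel work would push the true number higher than this estimate.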
?: We don’t make decisions based on that, but in DCML (Data Center Markup Language) we have a way to say those kinds of things; we just don’t do anything based on it.
?: IBM has ‘orchestration mgr’ that is supposed to do this sort of thing.
Narayan: People said they treat servers differently from clients. Is that a need for a difference in language vs. how you interact with a tool?
?: Clients are down more, laptops carried around, people breaking things. Just problems you encounter are different.
?: Second that
?: Server: make a script to do something and you’re pretty confident it will work. Client: they might have deleted files etc.
?: Or you want to run big computation at night and they turn it off.
Andrew H.: Doesn’t sound like a fundamental difference, just trying to manage configs you don’t have control over.
?: theory vs. practice
Luke: Same problems, different details
?: Illustrate difference in my environment: users in my environment deal in a lot of special software (astrophysics). We’ll get one copy of something with a dongle, so I don’t put it in config mgmt. I wouldn’t do that on a server.
Luke: What if you have to rebuild that server?
[…]
?: Astrophysicist doesn’t expect his machine to be up 24hrs. If he has a problem it will take me at most a day to get him back up. […] With convoluted packages it is much easier to say: if your machine goes down it will take me a day to rebuild so you can use that application.
Luke: I just don’t think that line is the same as server/client
?: [another academic perspective from Vanderbilt] There are a lot of one-off configurations. You have to manage your time: is it worth it to shoehorn this package into the config mgmt system? But on the server side, you’ve got an app and people expect a level of service from it, so it is worth the effort, and if you have to rebuild it quickly you can.
Andrew: This is bogus: everywhere you put ‘server’ or ‘workstation’ you can flip it around: workstation that serves the dean is more important than server for X. What you’re really saying is that the config mgmt tools are so horrible that you will do what you know is wrong just to avoid running the gauntlet.
?: You still document it.
Andrew: By the time you document it you should just put
it in the config mgmt. system.
[asked what are you trying to avoid, answer cfengine]
?: It all comes down to serving your users: you have to know for your site, whether the dean is more important than the NFS box.
Luke: You have to manage expectations also: go into sites where everybody expects to have root.
[…]
Luke: Want to talk about change management vs. config management. Seems to bite everyone. Change management: how you go about getting the work done once you decide something needs to change, with assurance that this is done right. […] Are people using tools to help?
?: Guys at university store everything in cvs. All the sysadmins on personal desktop machines are on testing config. Changes go into that config first. Once problems are fixed, roll out to everyone else.
?: We do essentially the same thing: sysadmins first, beta in the same building, then roll out. That was working until all the principals from other schools brought in laptops on the beta net.
?: You have to have representative samples of everything. We call them ‘canaries’
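The staged rollout described in the last few comments (sysadmin machines first, then a beta group, then everyone) can be sketched as a loop that stops at the first unhealthy stage. The stage names echo the discussion, but `roll_out`, the `apply` callback, and the `healthy` callback are hypothetical stand-ins for whatever a site actually uses to push a change and to judge its canaries.

```python
STAGES = ["sysadmin-desktops", "beta-building", "everyone"]


def roll_out(change, stages, apply, healthy):
    """Apply `change` stage by stage; stop at the first unhealthy stage.

    Returns (completed_stages, failed_stage); failed_stage is None
    when the change reached every stage.
    """
    completed = []
    for stage in stages:
        apply(change, stage)
        if not healthy(stage):
            return completed, stage
        completed.append(stage)
    return completed, None


# Example run with stub callbacks: the beta stage reports a problem,
# so the change never reaches "everyone".
done, failed = roll_out(
    "new-sshd-config",
    STAGES,
    apply=lambda change, stage: print(f"applying {change} to {stage}"),
    healthy=lambda stage: stage != "beta-building",
)
print(done, failed)
```

The laptop anecdote above is a reminder that the early stages only help if they are representative of the machines in the later ones.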
[…]
John S.: We asked how many people have built something and many hands went up. We’re also hearing about how horrible these systems are. What features do you want or what features do you not want to be there?
Luke: I want to have 2 DHCP servers […]
Ed: Want to hear from those who’ve written some tool.
?: One of the things some have brought up is that it is much easier to go to hack the box than to use the config mgmt system. What we need is something that makes it as easy as doing it manually.
?: Fundamentally if it takes so much longer in your tool you’re using the wrong tool.
?: Feature suggestion: change management
?: May take more time to put it in a tool once, but will you do it only once?
Narayan: That’s saying “take your medicine”. Decompose this: need UI.
?: My point is you change the config tool once and you never mess with it again.
?: But people are not doing it.
?: People know they should exercise also…
Andrew: Something else is going on here: if you look at how engineers build bridges, they use standards to avoid doing all the load calculations etc. precisely because it is very complex in general. So the point of putting it in the tool is that if I can wedge it into the tool, the tool will be taking care of a whole lot of things that I might not think of.
[…]
?: Big thing I would love to see: management of dependencies of services and establishing relationships between services so you can determine what needs to be configured based on those relationships. Relationships between entities.
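As a sketch of what "determining what needs to be configured based on relationships" could look like, the example below records which services depend on which, derives a safe configuration order, and finds everything affected by a change. The service names and the `DEPENDS_ON` table are invented for illustration; `graphlib.TopologicalSorter` is from the Python standard library (3.9 and later).

```python
from graphlib import TopologicalSorter

# Hypothetical relationships: each service maps to the services it depends on.
DEPENDS_ON = {
    "web": {"dns", "nfs"},
    "nfs": {"dns"},
    "dhcp": {"dns"},
    "dns": set(),
}


def configuration_order(depends_on):
    """Order services so each one comes after everything it depends on."""
    return list(TopologicalSorter(depends_on).static_order())


def affected_by(changed, depends_on):
    """Services that directly or indirectly depend on `changed`."""
    affected = set()
    frontier = [changed]
    while frontier:
        current = frontier.pop()
        for service, deps in depends_on.items():
            if current in deps and service not in affected:
                affected.add(service)
                frontier.append(service)
    return affected


print(configuration_order(DEPENDS_ON))
print(sorted(affected_by("dns", DEPENDS_ON)))
```

`TopologicalSorter` raises `CycleError` if the relationships are circular, which is itself a useful check on the model.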
?: On installation: Windows is supposed to be really easy to install software [on], but there are 12 different ways to install software depending on what tool you use. Our side of things is even worse: do you just extract a tarball, do you need to run a script afterwards?
?: What makes it so difficult to shoehorn into a config mgr is that I may not know the syntax but I know how to do it. So give me an interface and (like smit used to) show me the commands.
Luke: Part of this is that sysadmins are too used to thinking on this broken level. I can’t write in C because I think in assembler.
?: What we’re saying we want is something declarative: I want 2 DHCP servers etc. That is essentially declarative. The way we’re used to going about providing those things is essentially procedural and not declarative. We need a tool that bridges the gap between a declarative description of what we want and the procedural means of getting there.
?: We have these but people don’t want to use them because they know how to do it procedurally. It’s chicken and egg.
?: This is the argument I get from people who work for me. […] A compiler is a thing for going from declarative to procedural. So we need a configuration compiler of some sort. The other thing that we want is a very large library of useful procedures.
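A toy version of the "configuration compiler" idea: the input is a declarative statement such as "I want 2 DHCP servers", and the output is the procedural list of steps needed to get there from the current state. Everything here is hypothetical (the `compile_plan` function, the role and host names, the start/stop vocabulary); it is meant only to show the shape of the translation, not any existing tool.

```python
def compile_plan(desired, current, spare_hosts):
    """Translate desired role counts into start/stop steps.

    desired:     {"dhcp": 2}          how many servers of each role we want
    current:     {"dhcp": ["hostA"]}  which hosts run each role now
    spare_hosts: hosts available to take on a new role
    """
    plan = []
    spare = list(spare_hosts)
    for role in sorted(desired):
        running = sorted(current.get(role, []))
        shortfall = desired[role] - len(running)
        for _ in range(shortfall):  # too few: start the role on spare hosts
            if not spare:
                raise RuntimeError(f"no spare host left for role {role!r}")
            plan.append(("start", role, spare.pop(0)))
        for host in running[desired[role]:]:  # too many: stop the extras
            plan.append(("stop", role, host))
    return plan


# "I want 2 DHCP servers", and only hostA runs one today.
print(compile_plan({"dhcp": 2}, {"dhcp": ["hostA"]}, ["hostB", "hostC"]))
```

When the current state already matches the declaration, the plan is empty, so running the compiler repeatedly converges rather than redoing work.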
Andrew: What we’ve glossed over: what forces this is the cycle time. Using a config mgmt tool is slower than changing a config and restarting the web server. […] One of the goals I would put there is some way to do really rapid convergence.
Narayan: You should look at my tool because it does this, relates to the last 2 comments. You do commands and it captures within constraints of heuristics and uses totally declarative representation.
Narayan: We’ve just discussed how all these tools are bad. When we sat back and thought about all of these problems, we concluded that it is hard to come up with a tool that does not embed a way of solving problems. […] I’ve been interested in trying to figure out how to attach different types of UI to the same tool.
?: One problem I’ve had: in my environment we do internet banking, a network with lots of segments. Every one of these tools I’ve looked at assumes one central server that can talk freely to every node.
Luke: There are tools out there […] documentation almost exclusively describes a simple environment. That’s one of the problems: tool developers' recommendations based on docs rarely match real environments.
[…]
Andrew H.: Observations about goal, apparently a declarative thing. I have a declarative thing for managing files: FSM. It is a pain. The consequences of not doing something like that are unthinkable. We deal with 2 million files a year. You cannot do it any other way. Even though these methods work very well, there’s a tool that anybody must do in our group: an explain tool that says what is going on. The presence of these ‘explain’ functions, you cannot overstate how important that is. If there’s a magic box that does stuff they don’t understand they will just not use it. […] Think of it as an audit tool.
Narayan: Even important in groups: one guy knew all
platforms, everybody else was terrified to touch anything for fear of breaking
other platforms.
last modified $Date: 2005/02/05 02:00:14 $