I came across a news entry at Phoronix about a new init replacement, systemd, and out of curiosity started reading into this surprisingly heavy matter. Systemd is by no means as simple as upstart. It does far more things, far more directly and in more detail. The differences are so significant that they enforce quite different configuration strategies. One can argue for either, depending on the goal. However, that’s not what I want to write about. After having read what systemd is capable of, and how it does it, I began to question the existence of all system daemons in their present form.
Though the provided services must be available in userspace somehow, this doesn’t mean that the cores of the corresponding system daemons must be implemented redundantly, and in userspace. I no longer even believe that these cores can fulfill their tasks well in userspace. To explain: upstart is mainly a simple event daemon that starts scripts in response to events actively passed to it. In other words, upstart can do the classical tasks of an init daemon and serve as a general-purpose event handler, but not much more. This is because upstart was conceived as a simple drop-in replacement for the SystemV init daemon, which implied a small code base and as few hacks as possible. Because there isn’t much more that upstart could do in a sane way from userspace, it restrains itself. However, it already ships with quite a few substitutes for common system tools so that it behaves well in some problematic cases where it differs too much from the SystemV init daemon. Upstart can’t simply intercept the actions of the original tools and decide differently for them.
Systemd actively observes the whole system. It observes paths to be informed when a program attempts to access a socket. It observes hardware events to be informed when a resource becomes available. It even observes mountpoints to mount resources dynamically on access. This is a lot of knowledge to gather, maintain, classify and decide upon. It also demands a better understanding and exploitation of the possibilities the OS offers. To some extent, existing services like libudev help out. But, as systemd seems to mimic a system supervisor, it breaks with classic init-daemon functionality and heads towards a level of control that only auditing systems have on Linux today, though for a different purpose. The question is whether it can do this well, especially when implemented to run completely in userspace.
For comparison, an auditing system sits inside the kernel so that it cannot be compromised from userspace, but mainly so that it can control every action occurring in the kernel that could alter the security of the system. Though this is a very specialized task compared to that of a userspace supervisor, an auditing system observes all the actions a userspace supervisor is interested in. It thus implements similar core functionality. More than that, it is in principle capable of doing things a userspace supervisor can’t, like stopping an action from happening or pausing it until something else is done. For example, if systemd recognizes a mountpoint access, it can mount the resource immediately. But is that quick enough? Systemd has no influence on the accessing process and thus can’t put it to sleep until the mount has happened. In principle, an auditing system could do this. Systemd faces other obstacles, as described in this entry. In short, inotify is worse because it offers no atomic access and thus creates race conditions, although a weird workaround was found.
Don’t get me wrong. I am not arguing that the auditing system should take over the task of a userspace supervisor. I just conclude that there is redundancy and, because the auditing system is part of the core of the OS, that the redundancy lies with the supervisor. I also conclude that the supervisor can’t really perform its tasks in a sane way, because from userspace it lacks the necessary control. This is why I believe that the auditing system has something the other services should profit from.
So why not turn the observing core of the auditing system into a stand-alone framework (with scheduling and messaging functionality) and place the auditing part on top of it as a guest? Why not turn the system daemon(s) into simple feeders and slaves, i.e. guests, and let them communicate with the new framework via sockets, device nodes, language interfaces or something better? A userspace supervisor would then just feed the framework to create observers, actors and callbacks (for commanding a slave to, for example, start a script). In urgent cases, the slave might turn into a master and advise the framework to obey immediately before going on (or the framework already knows about such urgencies and only checks for a return message).
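To make the proposal a little more concrete, here is a minimal sketch in Python of one shared observing core with several guests registered on it. Everything in it is hypothetical: the class `ObservationFramework`, its `register` and `emit` methods, and the event name `mount-access` are invented for illustration and do not correspond to any existing kernel, audit or systemd interface.

```python
from collections import defaultdict


class ObservationFramework:
    """Hypothetical single observing core that several guests feed and listen to."""

    def __init__(self):
        # event name -> list of (guest name, callback)
        self._observers = defaultdict(list)

    def register(self, guest, event, callback):
        """A guest (auditor, supervisor, ...) asks to be called back on an event."""
        self._observers[event].append((guest, callback))

    def emit(self, event, payload):
        """Deliver one observed event to every registered guest, in registration order."""
        return [(guest, callback(payload))
                for guest, callback in self._observers.get(event, [])]


framework = ObservationFramework()
# The auditing guest only records the access.
framework.register("auditor", "mount-access", lambda path: f"logged {path}")
# The supervisor guest reacts by commanding an actor (here just a string).
framework.register("supervisor", "mount-access", lambda path: f"mount {path}")

results = framework.emit("mount-access", "/home")
```

Both guests see the same event from the same observer, yet each keeps its own policy, which is the separation the paragraph above asks for.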
I believe that such a framework could be a very powerful and helpful generic service for both worlds, kernel space and user space. It could also turn systemd into THE daemon and make lots of other daemons redundant or shrink them to simple slaves (actors). And, just so I am not misunderstood, the auditing guest is still configured independently! There is only one observing framework that all guests rely on.
What do you think? Could someone prove this concept in practice?
About the author:
Dennis Heuer is just an aged Linux enthusiast who maintains his own Linux home system (from pure source) and is confronted with the systems and sub-systems discussed above on a daily basis. However, he doesn’t understand the inner details of the Linux kernel well and can’t tell how easily his thoughts can be implemented. His viewpoint is that of an administrator.
Many of the arguments in the article are based on an incorrect understanding of how SystemD works. Ultimately, an auditing system would not be capable of most of the things that SystemD does (unless it were made into a rampant layering violation).
This is not completely true. SystemD doesn’t have to actively observe and act upon any of the things that you mentioned. Sockets are created ahead of time, and SystemD leaves them alone (the kernel buffers the data). Hardware events are mainly observed by Udev (SystemD has very little hardware logic). And mount points are handled by AutoFS, also in the kernel.
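The point about sockets being created ahead of time can be seen with a few lines of Python. This is only a sketch of the underlying kernel behaviour, not of SystemD's own code: the socket path and the message are made up, and one process plays the supervisor, the client and the late-starting service in turn.

```python
import os
import socket
import tempfile


def queued_request():
    """Show that a client can connect and send before any service calls accept()."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "demo.sock")  # hypothetical socket path

    # The "supervisor" creates the listening socket before the service exists.
    listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    listener.bind(path)
    listener.listen(8)

    # A client connects and sends although nobody is accepting yet.
    # The kernel queues both the connection and the data.
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client.connect(path)
    client.sendall(b"hello service")

    # Only now does the "service" start and pick up what was queued.
    conn, _ = listener.accept()
    data = conn.recv(1024)

    for s in (conn, client, listener):
        s.close()
    os.unlink(path)
    os.rmdir(workdir)
    return data


request = queued_request()
```

Nothing had to watch the socket in the meantime; the listening socket itself holds the pending connection until someone accepts it.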
Actually, SystemD can and does do this. It sets up AutoFS mounts. Any access will cause the process to block until the real file system is mounted. An auditing system would not be able to do this any better than SystemD can.
An auditing system would not be able to fix this problem. What is needed is a transactional file system. I am actually working on a transactional file system layer for the Linux kernel (about which I may write an article for OSAlert someday).
Edited 2010-08-26 00:27 UTC
Fully agree. SystemD is about integration and a lot of basic monitoring. The fact that a subsystem is restarted automatically if it crashes does not make all systems (daemons, services, whatever) obsolete, with the rest being “the service”. A daemon that automatically restarts the X Server, in case it crashes or the user is in graphical mode, does not make the X Server obsolete.
So I think the author correctly gets the fact that SystemD will apply better logic to restarting services and so on, but the final conclusion is wrong.
i believe that SELinux is such a rampant layering violation. it even spreads into user-space libraries and tools. but the real fact SELinux proves is that auditing systems are meant to become omnipotent. so, can you safely state that an auditing system will _not_ implement everything systemd is dreaming of in its observing code?
this rather sounds like you agree with the simple truth that things are better done inside the kernel, and supervisors should only feed and serve. i conclude from this that systemd is pretty much working on the same layer/interface inside the kernel, for just about everything, as the auditing system, and that there _is_, as a result, doubled core functionality.
again, my article targets the cores of the implementations, the observing parts. if the auditing system can do this in the kernel, why do we need it a second time outside the kernel? in other words, if an auditing system can’t do it _better_ than systemd, does that justify a layer in userspace? where should the generic observing interface reside, and how should userspace daemons build on it? that is my question.
this is interesting. could you please explain how it would act (inside the kernel) and why an auditing system would not be interested in it?
That is just to configure it. There is really no way to do that without user space tools. But yeah, I don’t like SELinux very much… it’s way too complicated.
Yes, actually. I highly doubt Linux would ever let an auditing system launch arbitrary daemons. And that’s because it wouldn’t make any sense. The old uevent helper system proved that it’s always better to let user space launch things.
There is absolutely no duplicated functionality. None of the things that SystemD does with the kernel are done by the auditing system, and vice versa. The only possible thing I can think of would be that an auditing system could do the job of AutoFS. But that would be a really bad idea. AutoFS is much better for that purpose.
It’s not outside the kernel. AutoFS is part of the Linux kernel. The reason that SystemD has to set up the AutoFS mounts rather than the kernel is that the kernel has no business reading configuration files. Policy decisions belong in user space.
The “generic observing system” is the auditing system. There is really little reason for observation of processes other than for security or debugging.
A transactional file system would allow programs to have a consistent snapshot of the file system. An entire transaction (which could last an indefinite amount of time) is an atomic operation. For example, a package manager could install software in a transaction. Then, if the power goes out, you will not be left with an inconsistent state. The downside is that performance is slightly decreased, and there can be conflicts (e.g. A writes to a file that B is trying to read). Unlike many transaction systems, there is no blocking. Basically, if A reads something in a transaction, and B writes to that thing in a transaction, the transaction with the lower priority is terminated. Individual, normal file operations are treated as transactions with infinite priority, so normal programs never have to worry about the transaction system. If an auditing system were to maintain all this logic, it would be a huge layering violation.
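The conflict rule described here (no blocking, the lower-priority transaction is terminated, plain file operations count as infinite priority) can be sketched in Python as follows. This is only an illustration of the rule as stated in this comment, not the actual kernel layer: the names `Transaction` and `ConflictTracker` are hypothetical, and the tie-break for equal priorities (the transaction already holding the file loses) is an assumption.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Transaction:
    name: str
    priority: float
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)
    aborted: bool = False


class ConflictTracker:
    """Toy model: on a read/write conflict the lower-priority transaction is terminated."""

    def __init__(self):
        self.active = []

    def begin(self, name, priority):
        tx = Transaction(name, priority)
        self.active.append(tx)
        return tx

    def read(self, tx, path):
        return self._access(tx, path, write=False)

    def write(self, tx, path):
        return self._access(tx, path, write=True)

    def plain_write(self, path):
        # A normal file operation: a one-shot transaction with infinite priority.
        tx = self.begin("plain", math.inf)
        self._access(tx, path, write=True)
        self.active.remove(tx)

    def _access(self, tx, path, write):
        if tx.aborted:
            return False
        for other in list(self.active):
            if other is tx:
                continue
            # Conflict: the other side wrote the path, or we write what it has read.
            if path in other.writes or (write and path in other.reads):
                # Assumption: on equal priority the earlier holder loses.
                loser = tx if tx.priority < other.priority else other
                loser.aborted = True
                self.active.remove(loser)
                if loser is tx:
                    return False
        (tx.writes if write else tx.reads).add(path)
        return True


tracker = ConflictTracker()
a = tracker.begin("A", priority=1)
b = tracker.begin("B", priority=5)
tracker.read(a, "/etc/example.conf")    # A reads first
tracker.write(b, "/etc/example.conf")   # B writes the same file: A is terminated
tracker.plain_write("/etc/example.conf")  # a normal program always wins: B is terminated
```

No caller ever waits: each access either succeeds or ends one of the two transactions immediately, which matches the "no blocking" property described above.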
there are auditing systems that don’t taint coreutils, for example
you seem to misread my article. possibly the term “to observe” creates such a strong association with the auditing system that you think they are the same. but please take a bird’s-eye view and survey the kernel landscape. you will see that, even if systemd is not observing by itself, at some point in the chain there is an observer, because otherwise there would be no action on events. you might rather call this event handling or the like, but “to observe” is used fully correctly here in terms of the english language. think of someone observing the stars. yes, in many cases the observation can be settled very deep in the kernel internals. but that is a different matter. in any case there must be observation of events, and there is always a reason why.
this reason may be defined far outside the kernel in a user script. but the job ticket must get all the way down to the observing unit, being mangled and translated several times on the way. so it is, and both the auditing system and systemd somehow need to create such tickets for an observer, or even create an observer themselves, depending on the kernel interfaces they hook into.
besides that, and here we come to what my article is about, both create a system to parse rules, to create types (structs) of contexts, to pass these contexts as tickets, etc. think in terms of structures. many programs re-invent structures for the very same purpose: scanning rules to create internal contexts, typing and registering them at the correct interfaces, and binding them to chains, an event handler, or whatever.
this generic way of doing things is what i mean. this is what the framework could encapsulate and offer in a way that both the auditing system and systemd, but also udevd and other services, can profit from. i could write my own rules in guile and register them via an ffi, circumventing systemd. but systemd would be notified about the change and could update its state or possibly act against my script, possibly via the auditor.
possibly you now see that what i am aiming at is a more generic, say, job center for kernel observation or instruction that provides principles for simple job creation and allows for even more flexibility because it is accessible arbitrarily and even concurrently, managing the states for the listeners and feeders.
This article talks about upstart and systemd as competing solutions to a problem but did not identify the problem. What is the problem with the current linux init system that necessitated creation of these two new systems?
It’s slow.
If that were the only problem then any one of the init replacements created in the last 15 years would be an improvement.
Speed is secondary. An init replacement primarily needs to solve initialization sequencing. Building an init sequence in which the appropriate things are started at appropriate times, and not before other things that may be needed first, is a highly non-trivial process. Upstart and systemd try to solve this problem, and the different approaches define, more than anything else, the differences between the systems.
After that there are some nice to have things which are lacking on linux. Here I primarily mean service control; it’s embarrassing that Windows does this better (yes, better). Both upstart and systemd try to address this in fairly similar ways.
Both (but systemd in particular) do other things, of course, which I consider nonessential but still worthwhile and improvements on current systems. I have to give a great deal of credit to Lennart for not trying to solve just one tiny technical problem but aiming for a holistic approach, while still not greatly violating the *nix philosophy.
But init already does the sequencing correctly, in a linear fashion. That’s not too difficult at all.
But linear is slow. We have multi-core CPUs now. Booting would be faster if we loaded things in parallel. OK, but what can we load in parallel with what, and what has to remain serialized? That’s the complexity of the sequencing. It’s complex due to the parallelization, which is due to the need for speed.
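The question of what can load in parallel with what can be answered mechanically once the dependencies are written down. A minimal sketch using `graphlib` from the Python standard library (3.9 or later); the service names and their dependencies are invented for the example and do not describe any real distribution:

```python
from graphlib import TopologicalSorter


def boot_waves(deps):
    """Group services into waves; everything within one wave may start in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all services whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical services: each key lists what must be up before it starts.
deps = {
    "udev": set(),
    "syslog": set(),
    "network": {"udev"},
    "dbus": {"syslog"},
    "nfs-mounts": {"network"},
    "display-manager": {"dbus", "udev"},
}

waves = boot_waves(deps)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

Services inside a wave have no dependencies on each other, so they are the candidates for parallel start; the waves themselves stay serialized.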
Linear is not ‘easy’ – linear is hard. Nonlinear is harder, but linear is not trivial! You have a large, unknown set of things to run which must be run in a particular order. What order? If you know you have a good order and want to insert a new item to run, where in the sequence does it go? Can you *safely* alter the order by inserting this new item and can you be sure that doing so does not break anything?
For simple things it’s pretty easy to just “drop it in” and hope it will be fine, but there are many non-simple things. Does your ldap daemon need to be started before your remote filesystems are mounted? What if your /home is mounted via nfs and all users are stored in ldap? How do you know which order to load things in and how do you re-order it when it changes? Even in a purely linear situation this is not simple. If things work well today it’s because of luck and careful engineering over many years.
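The ldap and nfs question can be written down as explicit requirements and checked by a program rather than by luck. A small sketch, again with Python's standard `graphlib`; the unit names are made up, and the point is only that a contradictory set of requirements is detected instead of silently producing a broken order:

```python
from graphlib import CycleError, TopologicalSorter


def start_order(requires):
    """Return one valid linear start order, or fail loudly if none exists."""
    try:
        return list(TopologicalSorter(requires).static_order())
    except CycleError as err:
        # err.args[1] holds the list of nodes forming the cycle.
        raise ValueError(f"no valid start order, cycle: {err.args[1]}") from err


# Hypothetical setup: users live in ldap, /home is mounted via nfs.
requires = {
    "network": set(),
    "ldap-client": {"network"},
    "nfs-home": {"network", "ldap-client"},
    "login": {"nfs-home", "ldap-client"},
}

order = start_order(requires)
print(order)
```

If someone later declares that `ldap-client` also needs `nfs-home`, `start_order` raises a `ValueError` naming the cycle, which is the kind of "does this insertion break anything" check that a hand-maintained linear sequence cannot give you.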
It’s a management nightmare which only becomes worse over time. Designing a system that works on purpose, instead of accidentally, is a worthwhile effort and a tricky problem.
Being faster is nice, sure, but that’s not really a problem that needs to be solved, it’s just a nice side effect. Once we can figure out sequencing properly we can get parallelism “for free” and thus some speedups. But no, it’s not a goal.
That’s why I think SystemD has a chance: if you mess up the ordering, you get a process waiting for the kernel to complete the socket/port/pipe/filesystem connection. In the current topology, you get a hidden failure.
…a microkernel/set of servers?
Except with a larger-than-normal kernel.
Probably is, but shhh don’t tell anyone.
But really, a microkernel would also separate filesystems and drivers, which is (or was at the time Linux was created) slower than doing it all in the same memory space.
so no, not in the strict sense.
For FileSystems it’s an option (FUSE), and yes it is slower for hard I/O, but the convenience is nice.
This article proved incredibly difficult to read, as the English is more than a little mangled. I’m familiar with sysvinit, upstart and systemd but I gave up 1/2 way through paragraph 3, as I just had no idea what I was reading. Sorry.
https://bugzilla.redhat.com/show_bug.cgi?id=615527
If you read this in full, you would notice Eric Paris, the lead developer of fanotify, the replacement for inotify and dnotify. fanotify has none of the issues of inotify.
In fact, fanotify allows you to block, accept and delay file system requests, something the older inotify and dnotify don’t allow. That is why fanotify can support real-time virus scanning and auditing from user space.
Not all forms of auditing can be done from kernel space. Who in their right mind would run a virus scan in kernel space?
“For example, if systemd recognizes a mountpoint access, it can mount the resource immediately. But, is that quick enough? Systemd has no influence on the accessing process and thus can’t turn it into sleep until the mount happened.”
This so-called issue can be solved by fanotify delaying its response, which puts the application to sleep. Once fanotify is feature complete, the problem here is solved.
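The delayed-response idea can be illustrated without touching fanotify itself. The following Python sketch is a pure simulation with threads: the class `PermissionGate` and its methods are hypothetical stand-ins, a thread plays the accessing process, and no real fanotify call is made.

```python
import threading


class PermissionGate:
    """Toy stand-in for a permission event: the accessor sleeps until a verdict arrives."""

    def __init__(self):
        self._decided = threading.Event()
        self._allowed = False

    def access(self, timeout=5.0):
        # Called by the accessing "process"; blocks until the supervisor responds.
        if not self._decided.wait(timeout):
            raise TimeoutError("supervisor never answered")
        return self._allowed

    def respond(self, allow):
        # Called by the supervisor once it has finished its preparation.
        self._allowed = allow
        self._decided.set()


def demo():
    gate = PermissionGate()
    requested = threading.Event()
    log = []

    def accessor():
        log.append("access requested")
        requested.set()
        if gate.access():
            log.append("access granted")

    worker = threading.Thread(target=accessor)
    worker.start()

    requested.wait(5.0)             # the supervisor notices the access attempt
    log.append("resource mounted")  # slow preparation happens while the accessor sleeps
    gate.respond(True)              # the verdict wakes the accessor up
    worker.join()
    return log


events = demo()
```

The accessor cannot proceed between its request and the supervisor's answer, which is the property the quoted paragraph claimed a userspace supervisor could not have.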
Next, systemd uses cgroups to divide tasks. A whole cgroup can also be suspended while waiting for a drive to mount. That is a little over the top; catching the access with fanotify would be far less painful.
There are a few experiments in using fanotify to make file recovery from backup transparent, i.e. when you attempt to access a file that has been sent to backup, the program gets delayed until the file is recovered from the backup location and extracted.
systemd is setting itself up to take advantage of the tech that will become available to userspace for auditing over the next 12 months. This really does leave all the other init systems far behind.
cgroup tech is also always expanding. systemd provides many times more control than all the old systems.