“In May 2007 I ran some benchmarks of DragonFly 1.8 to evaluate progress of its SMP implementation, which was the original focus of the project when it launched in 2003 and is still widely believed to be an area in which they had made concrete progress. This was part of a larger cross-OS multiprocessor performance evaluation comparing improvements in FreeBSD to Linux, NetBSD and other operating systems. The 2007 results showed essentially no performance increase from multiple processors on DragonFly 1.8, in contrast to the performance of FreeBSD 7.0, which scaled to 8 CPUs on the benchmark. Recently DragonFly 1.12 was released, and the question was raised on the dragonfly-users mailing list of how well the OS performs after a further year of development. I performed several benchmarks to study this question.”
How is this different from this document?
http://people.freebsd.org/~kris/scaling/7.0%20Preview.pdf
Edited 2008-03-16 22:09 UTC
The original benchmarks used DragonFly 1.8, whereas this one tests 1.12 and FreeBSD 4.11.
The low scalability DragonFly displays is utterly expected, as the system still needs to run with the giant kernel lock on MP systems.
It’s time for the ‘cluster OS’ to deliver; it should be at least as fast as the old-timer FreeBSD 4.11.
Since it forked from FreeBSD around the time of 4.8, a more interesting comparison would be with FreeBSD 4.8 and 4.9, so we could get a glimpse of the overhead introduced by the clustering infrastructure added since then.
There were no significant performance changes made from 4.8 to 4.11. Even if there were, if DragonFly never picked them up, that’s still a problem.
It’s a benchmark for FreeBSD 4.11 too, because some people keep thinking it’s better than any FreeBSD released since.
Well, it’s different in the sense that it’s not the same
In the PDF you link to, I have a graph comparing FreeBSD to DragonFly 1.8 running MySQL, but in the current tests I compared FreeBSD 7.0 with DragonFly 1.12 and FreeBSD 4.11 on a wider variety of tests.
MySQL performance scales linearly in FreeBSD 7… amazing.
> MySQL performance scales linearly in FreeBSD 7
No. MySQL sysbench.
http://leaf.dragonflybsd.org/mailarchive/users/2008-03/msg00025.htm…
There is also a discussion on the mailing list.
DF BSD is a fork of FreeBSD 4.8, and a lot of work has been done in FreeBSD to remove the GKL (Giant Kernel Lock), which AFAIK was not ported to the DF kernel because Matt Dillon believed this was the wrong approach. I wonder what he has to say in regard to the recent benchmark improvements by FreeBSD?
I believe it would probably be something along the lines that DragonflyBSD is really still in alpha (while it may be usable for some people for daily use, it has not yet implemented all of its promised features, i.e. a transparent cluster). And that as long as the focus is on completing the core feature set, performance improvements are not being worked on beyond an as-needed basis.
Though I wonder: given these performance issues, how will DF deal with them once the clustering work is complete? Or perhaps it doesn’t matter? Maybe the future of DF is in its clustering niche, where perhaps it will be unchallengeable, while being adequate at the rest?
After all, OpenBSD isn’t used for super-computing but has managed to find itself a nice niche in security conscious jobs.
Edited 2008-03-17 01:42 UTC
“DF BSD is a fork of FreeBSD 4.8, and a lot of work has been done in FreeBSD to remove the GKL (Giant Kernel Lock), which AFAIK was not ported to the DF kernel because Matt Dillon believed this was the wrong approach.”
They inherited the MP lock from FreeBSD 4.8, and it is basically where all SMP for any OS gets started (one giant lock protecting the entire kernel for one thread at a time). Matt has/had issues with the way that threading was being implemented in FreeBSD, and more so with the all-out (over)use of fine-grained locks sprinkled about the kernel to replace the Giant lock.
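To make the two models concrete, here is a minimal user-space sketch (illustrative pthread code with made-up counters and function names, not actual kernel source): the “giant” model pushes every subsystem through one mutex, while the fine-grained model gives each subsystem its own lock.

```c
#include <pthread.h>

/* One "giant" lock guarding everything, as in FreeBSD 4.x / early DF. */
static pthread_mutex_t giant = PTHREAD_MUTEX_INITIALIZER;

int net_packets; /* pretend per-subsystem kernel state (illustrative) */
int vfs_lookups;

/* Giant-lock style: a network op and a filesystem op serialize
   against each other even though they touch unrelated state. */
void net_op_giant(void) { pthread_mutex_lock(&giant); net_packets++; pthread_mutex_unlock(&giant); }
void vfs_op_giant(void) { pthread_mutex_lock(&giant); vfs_lookups++; pthread_mutex_unlock(&giant); }

/* Fine-grained style: each subsystem guards only its own state,
   so the two ops can proceed on different CPUs concurrently. */
static pthread_mutex_t net_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t vfs_lock = PTHREAD_MUTEX_INITIALIZER;

void net_op_fine(void) { pthread_mutex_lock(&net_lock); net_packets++; pthread_mutex_unlock(&net_lock); }
void vfs_op_fine(void) { pthread_mutex_lock(&vfs_lock); vfs_lookups++; pthread_mutex_unlock(&vfs_lock); }
```

The trade-off the post describes is visible even in this toy: the fine-grained version removes the false serialization, at the cost of many more locks to get right, which is exactly the “sprinkled about the kernel” complexity Dillon objected to.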
“I wonder what he has to say in regards to recent benchmark improvements by FreeBSD?”
From what I’ve seen of his writings WRT DF scalability ATM, he would not be surprised.
“I believe it would probably be something along the lines that DragonflyBSD is really still in alpha (while it may be usable for some people for daily use, it has not yet implemented all of its promised features, i.e. a transparent cluster). And that as long as the focus is on completing the core feature set, performance improvements are not being worked on beyond an as-needed basis.”
Basically zero optimization has occurred in DF to date, as much of the kernel still requires the MP lock, and in the release notes for 1.12 he says that the biggest part of the kernel that needs more attention for SMP is anything I/O related. Large parts of the kernel have been mostly MP safe for a while (for example, large parts of the network stack), but they end up needing to grab the MP lock because of the non-MP-safe code.
“Though I wonder: with these performance issues how will DF deal when it comes to work on them when the clustering is complete? Or perhaps it doesn’t matter?”
Optimization isn’t likely to be a big issue until the kernel is mostly MP safe and the core clustering work is done; however, issues like the possible namecache flakiness encountered by Kris would definitely be dealt with as soon as the problem can be tracked down.
Before release 1.10, the DF folks spent a bit of time yanking out mounted USB memory sticks and fixing bugs they found in so doing. That isn’t related to either SMP or clustering, and I’m offering it only as an example of the fact that they will take the time to correct obvious problems.
“Maybe the future of DF is in its niche of clustering where perhaps it will be unchallengeable, and then be adequate at the rest?”
Well, DF is a general purpose OS, that has an additional goal to allow native SSI clustering at the kernel level. Clustering is still a ways off, but I find the system usable in the general purpose sense.
That said, as much as I like the DF project, I don’t ever see it being widely deployed. It’s just a case of Windows and Linux etc. being “good enough” for most people.
“After all, OpenBSD isn’t used for super-computing but has managed to find itself a nice niche in security conscious jobs.”
Perhaps DF will come to fill such a niche role. Time will tell.
It’s a shame that one cannot moderate after having posted, for I would have moderated you up for such a well-written, informative post.
I did it for you.
Writing a fine-grained kernel is hard, especially when you need to support some functionality that’s not so friendly for large MP systems.
Maybe Dillon’s approach is right. Instead of trying to coordinate actions between processors in a lock free or finely grained manner while trying to maintain tons of shared state, why not treat your large MP machine as many smaller ones that are coordinated at a higher level?
The Google cluster is a great example of this. They don’t really need a scalable OS at all: a simple kernel that efficiently manages I/O and gets out of the way of the one executing task on a particular node is all that’s needed. Everything else is coordinated from their aggregation servers and their distributed namespace/locking system.
This is the McVoy cache-coherent cluster approach:
http://www.bitmover.com/cc-pitch/
It has always seemed like handwaving to me. The problem with this is: “what higher level?”
What does it buy you to program a multiprocessor system as a set of communicating cluster nodes, that you can’t do as a monolithic kernel? As far as I can see, it only serves to place restrictions on the ways you can communicate and interact between CPUs.
So why do proponents of this approach think they can just assert that a system of communicating nodes — that *must* still be synchronized and share data at some points — can scale as well as a monolithic kernel? I don’t think that is a given at all.
Well, they use Linux, and actually they are pushing some pretty complex functionality into the kernel (to do resource control, for example).
And I don’t know what its requirements are, but you can bet it’s nothing like a regular UNIX filesystem / POSIX API.
Also, I don’t see why you think Google is a great example of this. Google does not have a large MP machine. It has a big, non-cache-coherent cluster. So there is only one way to program it — like a cluster.
Nick,
You have a point that the single-machine cluster approach is going to require a different method of programming than the typical monolithic OS image that you seem to favor. But I think that McVoy has a point about the locking cliff. NT has gained scalability over the releases by breaking up hot locks and moving toward per-processor data structures and lock-free algorithms. But if you look at a CPU count versus throughput graph, all of these improvements lower the throughput slightly on small hardware in order to gain advantages on large hardware.
Right now that’s the correct choice since Intel has been selling multi-core processors for a while now. But what applications need truly gigantic machines, like the 1024 node one that McVoy speaks of? The only ones I can think of right now are scientific, visual, and engineering applications where, for cache and NUMA efficiency’s sake, it’s more of an advantage to divide problems into smaller chunks.
Cache coherency isn’t good in this situation either, because it implies that every time a processor dirties a cache line every other processor must get a chance to know about it (or you use a hierarchy of cache directories which manage the same bookkeeping at the cost of additional latency for every operation).
Having a single system address space becomes problematic as well because every time you change a mapping from valid to invalid, every processor must be informed in case it has cached that mapping in its TLB.
The problem with single-image scaling from my perspective is that it’s an n^2 proposition… When you scale from 1 to n processors, in a given period of time you have n times as many synchronizing events happening and you have n times as many people to inform at these synchronization points. The second n is done in parallel, but this parallelism translates into a factor of n latency due to distances and fan-out.
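The n-squared claim above can be reduced to a back-of-the-envelope formula (the function name and the event rates are assumptions for illustration, not measurements): with n CPUs each producing a fixed rate of synchronizing events, and each event having to inform the other n-1 CPUs, coordination traffic grows roughly quadratically while useful work grows only linearly.

```c
/* Toy model of the scaling argument: total coherence/invalidation
   traffic per unit time for n CPUs, each generating events_per_cpu
   synchronizing events, each of which must reach the other n-1 CPUs.
   This is n * E * (n - 1), i.e. O(n^2), versus O(n) useful work. */
long coherence_events(long n_cpus, long events_per_cpu)
{
    return n_cpus * events_per_cpu * (n_cpus - 1);
}
```

Plugging in McVoy’s 1024-processor example with an assumed 1000 events per CPU per second gives on the order of a billion coordination events per second, against only about a million events of actual work, which is the heart of the argument for partitioning instead.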
If you want to use a 1024 proc machine for something useful (as opposed to having it sitting around spinning on locks or waiting for cache misses), I think I’ve argued that you just have to bite the bullet and accept that you need to do batches of discontiguous work on small parts of your problem and aggregate the result.
I remember Matthew Dillon was pretty harsh and insulting about how the FreeBSD Project was trying to solve the BSD SMP problem. I don’t know if in the future DragonFly will scale better than FreeBSD; however, two things are now certain. 1) As witnessed by the very promising scalability results we are now seeing, the FreeBSD Project was right to push forward and play it safe. 2) As witnessed by the utter lack of progress on SMP scalability in DragonflyBSD, while the road the FreeBSD Project took was hard, the road Matthew Dillon was proposing was no walk in the park either. I really think he should own up and apologize to the project now.
Maybe everyone should just work on their own projects and drop the petty bickering and benchmark-waving. Using benchmarks, and benchmark comparisons with other projects, as tools to guide development is one thing. But using them as a way to taunt other projects is something else, and is substantially less constructive.
Hear! Hear!
Any benchmark of this sort is fairly useless across different systems for the simple reason that everyone has a different set of optimized cases and weak areas.
For instance, people used to use select() on UNIX and find that it performed worse on NT because NT was designed for a different approach as encompassed in the IoCompletionPort API.
Additionally, any application with highly specialized performance requirements, like a database, would probably go to great lengths to avoid paths within the host OS that are necessary in the general case but bad for the specialized application. Hence you see raw disk mode in Oracle and SQL Server.
I think the Linux-CFS vs. FreeBSD-ULE scheduler numbers should be taken with a grain of salt. It may just be that MySQL (or at least the benchmark) needs to be reoptimized for the new Linux scheduler to regain the formerly high throughput numbers. The scheduler shouldn’t matter much to a well-written database, because there should rarely be much more than one work item per CPU being processed at a time.
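The “about one work item per CPU” rule can be sketched in a few lines (the function name is hypothetical; only the `sysconf` query is a real POSIX call): a well-behaved engine sizes its worker pool to the online CPU count, so the OS scheduler rarely has to arbitrate between runnable threads at all.

```c
#include <unistd.h>

/* Sketch of the sizing rule: keep roughly one in-flight work item
   per CPU so the host scheduler's policy choices barely matter. */
long worker_pool_size(void)
{
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN); /* online CPUs */
    if (ncpu < 1)
        ncpu = 1;  /* conservative fallback if the query fails */
    return ncpu;   /* ~one runnable worker per CPU */
}
```

With a pool sized this way, CFS and ULE see essentially the same load: one runnable thread per core, which is why scheduler differences mostly show up only when the benchmark oversubscribes the machine.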
IIRC Matt had some harsh things to say about FreeBSD 5, which is about the time he forked (and he forked from FreeBSD 4). FreeBSD 6/7 was more or less a departure from 5’s way of doing SMP. So in that respect, he was pretty much right; it just so happens that the FreeBSD team corrected the ‘mess’ that was FreeBSD 5 before he could get his DFly work out the door in a state that satisfies him.
In any case, all the BSD’s have bright people working on their teams and, since it’s all open source, we should all benefit from the different ideas eventually.
No, you are quite wrong. FreeBSD 5 was the first release that had the bits of SMPng. The reason FreeBSD 5 has been (quite unfairly) derided is that SMPng simply was not fully completed: some parts were missing, and the work was not in any way focused on performance. FreeBSD 5 (5.0 and 5.1, which BTW were NOT PRODUCTION (as in stable) releases, something often forgotten) was rather about providing a correct implementation to build upon for future releases.

IIRC, what M. Dillon criticized was the basic SMP synchronization model with fine-grained locks/mutexes. That is something that has NOT changed, so no, he was not right on. If anything, FreeBSD 7 has shown that he was wrong about the basic architecture behind SMPng. Now he himself has to prove that he was at least partly right, by making sure DragonflyBSD scales properly and has decent performance (winning on performance is not needed; enough performance is good enough). So far that has not materialized, as too many things are still missing.

It sure is an interesting experiment that he has embarked upon, but it is nowhere near being deployed on production servers. Entry-level servers today have 4 cores and most standard servers ship with 8, and for that you need an SMP implementation that scales. FreeBSD 7 has clearly shown that it can scale and perform.
There was no “mess” to correct. It was just a matter of refining and optimizing, and that is still ongoing, so expect new advancements in performance for FreeBSD.
Yes, they do. Just look at jemalloc, tmpfs as recent examples of BSD-licensed code that has “migrated” outside their *BSD.
Why?
Read the interview with Matt from 2002 and you will see
http://kerneltrap.org/node/8
Maybe just politics …