Microsoft Windows’ next version will feature a database-like file system. However, something similar was done years ago, at BeOS‘ BFS. (BeOS experienced) journalist Andrew Orlowski from TheRegister interviewed Benoit Schillings and Dominic Giampaolo (of the BeOS fame) about BFS and the database-like file systems in general.
I understood about 1% of that whole conversation, but that 1% made me think, again, how cool BeOS really is, and remember why I love using it.
It was designed by people who really understood what makes a user-friendly experience. Aside from all the technical whatnots and blah blah blah, BeOS is so easy to install and so easy to use. Congrats to Benoit and Dominic for being part of that, even if that part was under-appreciated by users like me.
BeOS still has superb technology that still can’t be found on any other OS! (pervasive multithreading, latency, API etc…)
I still hope that Palm can someday hire the other old Be engineers back get the BeOS back on track!
ciao
yc
I thought this was the most interesting sentence:
>while Dominic, we are delighted to learn, has subsequently joined Apple as
>a file system engineer. He started last week).
Apple need a new file system, now they have the guy who wrote BFS!
>Apple need a new file system, now they have the guy who wrote BFS!
QNX needs a better filesystem too. However, Dominic left QNX for Apple… ;o)
too bad they didn’t buy be inc and hire its enginers
>Dominic: They don’t know what’s possible. If someone shows you it – they get it.
>Benoit: I think attributes and queries were really underutilized. Well, most of BeOS was underutilized
I’m still not satisfied.
“The problem now is, and I’ll be frank here, is that can take the Unix community years to solve such problems.”
The real truth is that it takes the whole industry years to solve the problem of shedding legacy stuff. Dominic certainly isn’t going to be able to have the kind influence at Apple that he had at Be. Remember the switch from DR8 to DR9?
How much longer is the antitrust suit going to take?
When will Apple support multiple platforms?
Who is even in line to challenge Mircosoft’s dominance?
Gota say guys, I loved reading that. I gota admit i had to read it twice to make sure my brain caught all of it.
And i want to let you know that the work you put into BeOS touches me (and many others) on a daily basis.
thanks a bunch.
So, I’m wondering if anyone here can explain to me why they were both so down on BTrees? As I understand it, that data structure is responsible for BeFS’s ability to very rapidly search indices. While I understand what they mean about rotational latency, I was under the impression that the commonly used indices (name, etc.) were cached by the OS. Is this not true? If they were, would it nullify the argument against BTrees? And what would be a better solution otherwise?
ELQ/JBQ care to weigh in? Or other lurking ex-Be engineers?
I’m currently working on my own toy OS and I’d like to hear thoughts on the nitty-gritty details of this topic.
>>How much longer is the antitrust suit going to take?<<
This is only the beginning!
>>When will Apple support multiple platforms?<<
Never!!
>>Who is even in line to challenge Mircosoft’s dominance?<<
Marketing?.. or technical??
Marketing?.. NEVER!
technical?.. ALREADY BEEN DONE!!
Some really interesting stuff in there, a lot of which I’d like to see implemented (at least as an influence) in OBOS/GE… I like the stuff about pseudo filesystems, and especially ‘functional namespace’ –
Benoit: You can say give me an image, and give me an image, “gif.small” and it gives you a small version of the image.
Simple, powerful stuff… this integrated with attributes/queries could be really cool
Good question. After all, <a href=”http://www.namesys.com/“>ReiserFS uses balanced trees, and it is a reasonably fast filesystem. Here’s what the ReiserFS web site has to say about it:
<blockquote type=”cite”>
ReiserFS is based on http://www.namesys.com/benchmarks/benchmark-results.html“>fast&l…
BeOS has its ups and downs.
Just a week ago a lame graphics viewer called ‘Peek’, entirely freezed my whole BeOS developer1.0 (beosonline distribution) when I was trying to see some images stored on a CD. Something as silly as that has never happened to me with MSWindows2K nor with my Corel Linux OS. Something similar has also happened a few times in the past with my BeOSPro5.
The good thing with BeOS, of course, is that I didn’t have to fear loosing data when rebooting, neither I had to do long fschecks (solved already with NTFS and ReiserFS).
BeOS has been rightfully praised for many things (multithreading, multitasking, …), but it was far from perfect, I like to see how the OBOS team is starting with a new kernel.
did Microsoft ever have technical dominance?
>>did Microsoft ever have technical dominance?<<
ummm…!@#$%^&*?
“Doug Fulton, actually, he would put everything in a big flat directory and use queries to find the stuff…” I really like this idea, and its been something I’ve been pondering. What if we had a file system where all user-created files/docs were stored in one flat document folder, and instead of having folders and tree systems, we had pre-defined and user created query buttons in a side panel. It would work like this- Open the Files folder, you see everything. Click the “Audio” query button, you see all .wav, .mp3, .wma, etc. files. click “mp3” all you see is .mp3s. Or a another example: “Office” would show all Word, Excel, Powerpoint, etc. “Excel” would show only Excel files, and the user-created “Project 23” would show all documents, spreadsheets, graphics, CAD files, etc, for Project #23. Then it would have a drag and drop GUI that wouldn’t actually move the file, but rather change some of the attributes in the files, so you could move a file from “project #23” to “project #25”, or have a song show up in both the “Aerosmith” and the “’80s Music” query buttons you made. And so on but I’m getting tired of typing.
Maybe this already exits, I don’t know- I know one thing… I couldn’t program it!
“Doug Fulton, actually, he would put everything in a big flat directory and use queries to find the stuff…” I really like this idea, and its been something I’ve been pondering. What if we had a file system where all user-created files/docs were stored in one flat document folder, and instead of having folders and tree systems, we had pre-defined and user created query buttons in a side panel. It would work like this- Open the Files folder, you see everything. Click the “Audio” query button, you see all .wav, .mp3, .wma, etc. files. click “mp3” all you see is .mp3s. Or a another example: “Office” would show all Word, Excel, Powerpoint, etc. “Excel” would show only Excel files, and the user-created “Project 23” would show all documents, spreadsheets, graphics, CAD files, etc, for Project #23. Then it would have a drag and drop GUI that wouldn’t actually move the file, but rather change some of the attributes in the files, so you could move a file from “project #23” to “project #25”, or have a song show up in both the “Aerosmith” and the “’80s Music” query buttons you made. And so on but I’m getting tired of typing.
Maybe this already exits, I don’t know- I know one thing… I couldn’t program it!
I didn’t have to fear loosing data when rebooting
———————–
Keep in mind that journalling does _not_ prevent data loss, it merely guarantees filesystem consistency.
Benoit’s reasoning is that B-trees have a tendency to split into small bits that you’ll access somewhat out of order, and that the performance gets killed mostly by the rotational latency of your hard drive.
Warning : lots of numbers ahead…
First, how big indexes really are…
I checked Eugenia’s hard drive. It contains 38000 files. The 3 primary indexes (name, size, last_mod) are about 2.5MB each; the next ones (in size) are 200kB, 89kB and 56kB, then 3 times 16kB, and 38 indexes are smaller than 10kB. My Windows machine has 90000 files.
Take an IBM 120GXP. 7200RPM (4.2ms rotational latency), 48-24 MB/s, 8.5ms avg seek, 1.2ms track-to-track seek, 14.7ms full-track seek.
Let’s imagine that your index is close to the start of the disk (the fast part) (where it is actually pretty likely to be). Let’s first examine the case of a pure index search, whithout doing anything else. Each time your index is not continuous, you need to wait 5.4ms to find the next part of the data. That’s 256kB of data. Each time you have a gap in your index, you lost the chance to read 256kB of data. You can read a 512kB block in the same time as you can read two 128kB blocks.
Now, let’s imagine that we’re doing something else on the disk while searching the index. Let’s imagine that we’re reading, in sequence, one index block, one data block, one index block, one data block, etc… The access time between the index and the data is about 14.6ms . Each time you move back and forth between index and data, that’s 29.2ms of time lost moving the drive head around… which would have been enough time to read 1.4MB of data from your index.
A state-of-the-art Seagate cheetah X15 gets 15000RPM (2ms rotational latency), 69-52MB/sec, 3.6ms avg seek. 0.5ms track-to-track seek, 6.3ms full-track (extrapolated).
The same numbers become 2.5ms local access time, i.e. 172kB of data. Back-and-forth between index and data is 12.8ms, 883kB of data.
A Sept-96 Quantum Sirocco (which was a reasonably high-end desktop drive at the time) yields 4500RPM (6.7ms rotational latency), 6.9-3.5MB/sec, 11ms avg seek, 2.5ms track-to-track seek, 20ms full-track.
The same numbers become 9.2ms local access time, i.e. 64kB of data. Back-and-forth between index and data is 41.6ms, 288kB.
Seeing those numbers, today’s hardware makes it reasonable to use 1MB blocks for your indexes, and it is reasonable to assume that indexes will fit in RAM entirely. With that in mind, the ideal structure will be as simple as a flat structure or (at most) a simple 2-level structure (yes, both those structures are technically B-trees, but so shallow that they don’t deserve the name). For a filesystem index, I can hardly imagine any need for anything more complex than that.
JBQ
The same applies to memory though you’ll find many programmers are not aware of this.
A modern CPU can read DDR ram at 16 bytes per (memory) bus cycle. However it first has to access that data and there is a latency penalty of 7 bus cycles. If data is split up in different parts of memory the CPU will keep having that latency penalty and this will reduce bandwidth considerably since in 7 bus cycles you can read 112 bytes and you lose them.
More importantly the CPU is kept waiting so, on a modern CPU – say at 1.5GHz you are going to lose CPU cycles.
With DDR-PC2100 the bus speed is 133MHz, 1 cycle every 7 nanoseconds.
The CPU cycles every 0.66 nano seconds. So a 7 cycle memory bus wait incurrs a loss of 49 nanosecods which is 73 CPU cycles or, on a 3IPC (Instruction per cycle) CPU thats 219 instructions.
So, any algorithum which is going to be jumping around RAM a lot is going to be hit by latency quite badly. However if you read in big continious blocks you only get the latency penalty once at the beginning, after this you can read 16 bytes per bus cycle.
Any type of tree structure is going to break things up so a big block is going to be faster to read and process. This applies to all applications…ever wondered why using XML can be so CPU intensive…
If you are reading blocks of data you can also use little tricks like reading it into the CPU core or cache (just assign a load of variables) and then process it. I experimented with this once and the processing speed rocketed.
>I checked Eugenia’s hard drive. It contains 38000 files.
That is, my BeOS partition.
Interesting if you compare my numbers with JBQ’s you see he is talking in ms whereas I’m talking in ns, a difference of a million. So, using a highly fragmented structure on disk is going to lose you millions of cycles and consequently cripple performance.
So if MS do this your PC will seem amazingly slow thus giving Intel the excuse to sell you a new faster CPU in a new PC which of course comes with Windows and thus continues the upgrade cycle…
>any algorithum which is going to be jumping around RAM a lot is going to be hit by latency quite badly
so would that count for multithreaded applications running on one cpu??
JBQ:
*nodding*
At first I thought “Flat memory structure?!? That would crush your L2/L3 cache every time you searched for an index!”
But after reconsideration, I realized that you probably didn’t mean a long, flat list and instead were thinking of a hash-type structure? And also, the clearing of your CPU cache(s) for each index search would probably be so infrequent that the math would turn out to not make a difference… I’d need to sit down and think about it and do some numbers to be sure, though.
This is all very interesting as an example of how the moving target of ‘State of the Art’ can profoundly effect OS design choices.
Nicholas:
But what about this scenario: You have a 65K 16-byte data structures that consist of key/value pairs and you need to find a specific one corresponding to a specific key.
If you are using an unsorted list, you have to iterate over all of them -> 65K memory cycles
If they are in a simple balanced binary tree (I assume that the keys are unique), then I only incur log2(65K) * (7 + 1) -> 128 memory cycles
Depending on the traffic across the structures, arguments can be made toward sorting lists improving performance vs. balancing trees decreasing it, but I don’t think it’s fair to say that a tree structure is always going to slow things down. It will in the case of iterating over the entire thing, but not in the case of searching, which is what indices are for. Also, the arguement against them from a disk context vs against them in a memory context is a little different since non-continuous reads are orders of magnitude slower than continuous on disk whereas the difference is not THAT extreme in memory.
It’s been a while since I really did a lot of hard thinking in this area, so I readily offer up that I might be totally wrong. But lets continue! This is fun!
that would kick but because then transfering of files would be super fast…..I am sure there are a ton of other benies, but I think that the Icon is so engraned that the queries need to be in the form of free standing files in a psudo directory tree.
> Keep in mind that journalling does _not_ prevent data loss,
> it merely guarantees filesystem consistency
Actually EXT3 ensures that in the default-mode “ordered”.
ReiserFS in contrast can ensure a consistent file-system but with the probability of trash-data in the files.
XFS is a little in between, but it doesn’t prevent data-loss like EXT3 can. Just have a look at the numerous feature-comparisons and you will see why I use EXT3 on all of my boxens.
The Proposed Microsoft Filesystem goes far beyond anything done in the BeOS Filesystem. While the BeOS FS was innovative and still has unique features it does not include a true database of the “application” data within the file itself. Microsoft is proposing a revolution that proponents of Object Database systems have desired for two decades, an object databsae for a file system. By using an object database for ALL application data many new capabilities are realized. First you end up with a “data schema” for all files stored in this manner which enables you to use the data for more than the application that created it. This is one of the powerful benefits of the more limited Relational Databases. Second, you end up with the possibilty, if done right, of using fault tollerant database transactions and object versioning at a fine grained level. This can provide a powerful “undo” capability. The revolution of this idea is that currently most applications usually encode their own propieritary storage formats thus making the data inaccessible to other applications. Imagine being able to access the internals of a word processors document, or a spreadsheet.
In anyevent it will be seen if Microsoft’s brilliant teams can pull this off while preserving the legacy of the existing Windows Applications. In many ways the sheer size and internal complexity of the Windows system is the largest limiting factor in Microsoft’s inexorable push forward. The larger they make Windows the more complex it gets. The more technology encrustations they add (com, .NET, …) the more complex the internal interactions. Adding in an application oriented object database file system at this late stage in the history of windows could be the last gasp that brings the complexity issues to a head and their software development pace to a crawl.
If they succeed – and if they don’t make critical design mistakes – they will have a better system for all.
Object database file systems will rule the future of operating systems. Unfortunately they are not simple to build and they require a high performance computing platform to manage all the details.
“There is an effort … to include storage technology that will allow for some manipulation and querying for structured and unstructured data across our products,” Ressler said. “We’re certainly not putting SQL Server into the operating system. Yukon is not going into Windows or any other product in that sense.”
But the storage technology in question allows “queries both with relational syntax and through XML-type syntax,” Ressler said. “There’s source technology under development by teams across the company.”
Microsoft seems to be building a Relational SQL and XML Database system for their applications. Not bad, but not a true object database file system. The XML component is a nice segway into object database capabilities however. Unfortunately Relational SQL and XML are only about a third of what object databases are made of: Data, Code, Schema and Interconnected Networks of Objects. With any one of these characteristics missing you don’t have an “object” database. XML comes close to delivering but is just missing the “code”. Maybe their engineers will actually be smart and build a pure object database below the Relational SQL and XML worlds.
To infinity and beyond.
You’re right. I guess I could have worded that better…
Reading in a linear fashion is faster than jumping around but if there are very few jumps to be made (as with your searching example) that is going to be much faster, it’s only when there are many, many jumps required (i.e. iterating over the entire thing) that the difference will really show.
I’ll give a diferent example – I mentioned I had a rocketing of performance once, this is it – Lets say you want to move a chunk of data in RAM to elsewhere. it’s a block of 64K 16 bit samples.
One way is to read one word at a time and write it to the new address you will get the latency penalty for every read and every write you do – so thats 64K X 2 X (7+1) = 1,024,000 (memory) bus cycles. Actually it’ll be considerably less because of caching – 128,000 to 256,000 cycles at a guess.
You can get around this by working in blocks – IIRC there are 8 32 bit registers in an x86 so you fill them up (just assign 8 variables in order) then write them out using 32 bit pointers. On DDR RAM You can fill the 8 registers in just 2 bus cycles so this is now: (64K / 16) X 2 X (7+2) = 72,000 bus cycles, less by quite a margin.
However modern CPUs actually have many more “rename” registers and while you can’t program these directly you can program them indirectly.
I believe the K7 has something like 40 so you now assign 40 variables instead of 8 and the CPU *should* happliy fill up it’s rename registers. This gives:
(64K / 80) X 2 X (7+10) = 27,200 bus cycles, an improvement of 63%.
64 bit registers will improve this further to 21,600 an improvement of 20%.
In each case the amount of data trasferred in one go is increased and the number of “jumps” and thus latency hits is reduced.
The only problem here is if the data you want to move is not a multiple of the block transfer size – but that just means you just switch to a different routine for the last block. Although thinking now I think a smart way could be found to dynamically change the amount read/written.
BTW this also work for swapping and clearing chunks of memory.
—
>so would that count for multithreaded applications running on one cpu??
It’s a question of balance, the answer to this is both yes and no. I think that this will only make any significant difference at the algorithum level, but if you are running thousands of threads then I guess it could have an effect.
On the other hand this is taken advantage of in multithreaded CPUs such as the new Zeon. Since memory is slow one thread can run out of data to process so the second thread is switched in and it starts.
Intel didn’t take full advantage of this though and only got a 20% increase in preformance. The Alpha 21464 would have done the same thing but got a near 100% increase.
It’s all at the end of the day a question of getting the right balance.