BareMetal Node is an HPC platform based on BareMetal OS. The kernel binary on the nodes is 6 KiB and mainly contains the network and disk drivers. A C or Assembly program can make use of all available CPU cores via the Orchestrator program, which controls what the nodes are working on.
I hadn’t heard of this project before, so I’m glad it was mentioned. This OS handles the basics of getting into 64-bit mode, managing memory, and setting up multiple cores. It looks like it has potential, but from a brief overview it does not seem ready yet.
It does not support multiple processes, which is likely to be a no-go for many. It’s advertised as a feature, since the OS runs nothing but your program, but this means you can’t even FTP/telnet into it.
The lack of pre-emptive multi-threading isn’t that important; asynchronous IO with callbacks is generally much better. *nix evolved with an emphasis on spawning threads/processes to solve problems because asynchronous kernel interfaces didn’t exist, but that approach doesn’t scale.
This OS has the opportunity to use async IO from the get-go. It’d be a big plus.
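To sketch what I mean (hypothetical names, not the project’s actual API), a callback-based read in C could look something like this:

/* Hypothetical async read interface; bm_read_async and io_callback
   are invented here purely for illustration. */
typedef void (*io_callback)(void *buf, int bytes, void *ctx);

/* Start a read; the kernel invokes done(buf, bytes, ctx) when the
   data arrives, instead of blocking the caller. */
int bm_read_async(int device, void *buf, int len,
                  io_callback done, void *ctx);

void on_read(void *buf, int bytes, void *ctx)
{
    /* consume the data, then issue the next bm_read_async() */
}

One core can keep many transfers in flight this way without ever being pre-empted.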
It doesn’t have an IP stack, only raw packets, and then only with a 100 Mbps Realtek driver (chosen because many VMs emulate it). This is an immediate no-go, since the whole point of HPC is to run directly on high-performance hardware. TCP/IP is absolutely essential.
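To give a sense of what “raw packets only” means in practice: without a kernel stack, every program has to hand-assemble and checksum its own headers before handing frames to the driver. A minimal sketch of the standard IPv4 part (nothing here is project-specific):

#include <stdint.h>

struct ipv4_hdr {
    uint8_t  ver_ihl;    /* 0x45 = IPv4, 5-dword header */
    uint8_t  tos;
    uint16_t total_len;  /* header + payload, big-endian */
    uint16_t id;
    uint16_t frag_off;
    uint8_t  ttl;
    uint8_t  protocol;   /* 17 = UDP, 6 = TCP */
    uint16_t checksum;
    uint32_t src, dst;
};

/* Standard one's-complement checksum over the header. */
static uint16_t ip_checksum(const void *data, int len)
{
    const uint16_t *p = data;
    uint32_t sum = 0;
    while (len > 1) { sum += *p++; len -= 2; }
    if (len) sum += *(const uint8_t *)p;
    while (sum >> 16) sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

Doing that in every application, for every protocol, is exactly the wheel an IP stack exists to stop reinventing.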
HPC is the only domain where I’d expect homebrew operating systems to have a chance at widespread viability. Just about all other OS domains have a chicken-and-egg problem: an operating system needs drivers, applications, and compatibility before anyone will consider it useful. Since no one will write end-user drivers/software for an unpopular OS, it never achieves critical mass regardless of merit.
In HPC, however, the OS just needs to run well on hundreds or thousands of computers with the exact same spec.
Just my initial thoughts; it was a good find!
The node is meant to do calculations and nothing else, so adding other processes running at the same time would only make it slower and add nothing to its intended purpose. For a node as simple as that there is no reason whatsoever to SSH into it (what would you do if you logged in anyway?). I don’t know if it has a TCP/IP stack, but I don’t see the point of having any sort of advanced protocol if the only thing it needs to communicate is the program to be executed and the result of the operation on the node.
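Even that exchange only needs trivial framing; something like this would cover it (a made-up sketch, not their actual wire format):

#include <stdint.h>

/* Hypothetical header for the orchestrator/node exchange. */
enum msg_type { MSG_PROGRAM = 1, MSG_RESULT = 2 };

struct node_msg {
    uint8_t  type;         /* MSG_PROGRAM or MSG_RESULT */
    uint8_t  reserved;
    uint16_t sequence;     /* detect lost or duplicated frames */
    uint32_t payload_len;  /* bytes of code or result data following */
};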
As for the Ethernet drivers, I know they are working on getting more of them. They recently posted a request on OSDev asking people to help develop new ones.
“The node is meant to do calculations and nothing else, so adding other processes running at the same time would only make it slower and add nothing to its intended purpose. For a node as simple as that there is no reason whatsoever to SSH into it (what would you do if you logged in anyway?).”
You could tell a Linux admin that they don’t need SSH access. Even if it’s technically true, you are asking them to give up a lot.
I know and appreciate what the node is supposed to do.
However, it would be extremely nice, if not absolutely necessary, to have an administrative process to control the system (remote re-provisioning, maintenance, and monitoring). This is true regardless of how basic the node’s primary purpose is.
Separate processes might be desirable anyway, to divide the primary task into separate logical components with separate accounting, even if the OS does not implement security barriers between them.
Idle processes don’t take up any CPU, and a simple daemon shouldn’t need much RAM either. A non-pre-emptive kernel need never interrupt the primary process except when there’s IO for a background process.
If the OS will not provide a process primitive, then we already know where the metaphorical road leads: DOS + non-relocatable TSRs + monolithic apps.
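To be concrete about how cheap a process primitive could be, here is a minimal sketch of a cooperative model (all names are mine, not the project’s):

#include <stdint.h>

#define MAX_PROCS 4

/* Hypothetical cooperative process table. Nothing pre-empts the
   compute job; others run only when it yields or their IO arrives. */
struct proc {
    uint64_t saved_rsp;  /* stack pointer saved at last yield */
    uint8_t  runnable;   /* daemons stay 0 until they have work */
};

static struct proc procs[MAX_PROCS];
static int current;

void yield(void)
{
    for (int i = 1; i <= MAX_PROCS; i++) {
        int next = (current + i) % MAX_PROCS;
        if (procs[next].runnable) {
            /* the real switch would save RSP into procs[current]
               and restore it from procs[next] */
            current = next;
            return;
        }
    }
}

A table like that plus a stack switch is the whole cost: no users, no rings, no pre-emption.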
“I don’t know if it has a TCP/IP stack, but I don’t see the point of having any sort of advanced protocol if the only thing it needs to communicate is the program to be executed and the result of the operation on the node.”
HPC is usually synonymous with highly scalable clusters. Unless your cluster can fit on a single switch, it’s going to need a routable protocol (IP). Maybe some applications don’t need this, but in general an HPC OS is not really ready without it.
Also, I’m not sure why they’re using FAT16.
So, they have a lot of work to do before it’s truly ready. As I said, it’s got promise; everything needs to start somewhere.
It is not so simple: you still need some memory management, which is not trivial, and some security enforcement (multiple users), as well as checkpointing and error correction.
This sort of barebones microkernel is nothing new. A lot of large supercomputing systems used this sort of approach: a full-fledged OS on the front node for users to interact with the system, and lightweight, very specific microkernels running on the compute nodes, invisible to the user (although these kernels did help in abstracting the machine).
I would not be surprised if they end up reinventing a few wheels in the process. Some national labs used to roll their own lightweight system-specific kernels like the one described in this project.
“It is not so simple: you still need some memory management, which is not trivial, and some security enforcement (multiple users), as well as checkpointing and error correction.”
You are making it out to be something it’s not. I very much doubt they care about multi-user capabilities at all. It just needs to run what it’s told. As far as the OS is concerned, everything can be trusted. After all, their motto is that your code should run on the bare metal. It will all run in “ring 0”.
My point was that it might be helpful to add the concept of a “process” to the OS even without the concept of users or security.
“This sort of barebones microkernel is nothing new.”
Of course not; I’ve written one myself. Although I’d be very careful with the terminology, since “microkernel” usually refers to something different.
“I would not be surprised if they end up reinventing a few wheels in the process.”
Of course there’s tremendous overlap, especially at the early stages of the OS where they all essentially do the same things.
I wish there were a sane approach to eliminating the redundant effort going into things like device drivers. The problem is exacerbated by the fact that Linux (the biggest source of open-source drivers) is a monolithic kernel with very complicated dependencies. Drivers built for Linux frequently break between versions.
“Some national labs used to roll their own lightweight system-specific kernels like the one described in this project.”
I believe that’s the idea: to create a minimal standardized base kernel for building scalable HPC apps on top of.
You know, I would have thought a topic like this would have gotten more attention on this site. I’m a little dismayed that, even here, people are more interested in the latest mobile-device bling than in more hardcore topics.
Good luck running large numbers of jobs on large data sets on large numbers of shared nodes with no memory protection, checkpointing, or error correction.
Running everything in “ring 0” on a modern CPU is foolish, since the security levels come basically for free. It is like someone giving you a jackhammer and insisting that you break ground with a chisel instead.
Yes, an OS like this does not need as much complexity and overhead as a more general-purpose OS. But not using memory protection, or offering some kind of protection against badly behaving programs (a common occurrence among HPC code), or some kind of facility for debugging/statistics/metrics gathering… makes this OS useless for a large distributed environment. You need to have control over the processes running on the cluster, be able to schedule jobs properly (at least have some kind of configurable queuing policy), and be able to stop and restart said jobs cluster-wide, etc.
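Concretely, even minimal control means each node answering a handful of commands, something like this (my sketch, not anything this OS provides):

/* Hypothetical job-control commands a node would need to answer. */
enum job_cmd {
    JOB_SUBMIT,   /* queue a program and its input data */
    JOB_STOP,     /* halt a running job */
    JOB_RESTART,  /* restart, ideally from a checkpoint */
    JOB_STATUS    /* report progress and metrics */
};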
You might as well use a bootloader and run your code directly on the bare metal.
This may be an interesting OS for embedded systems or very app-specific nodes. But for large HPC clusters it is useless unless they add some of the required sophistication/support.
“Good luck running large numbers of jobs on large data sets on large numbers of shared nodes with no memory protection, checkpointing, or error correction.”
Why the adversarial attitude? I didn’t say large numbers of jobs. Have you read the details of the project on their website? Right now they support just one. I was suggesting that the ability to support more processes would be useful; this shouldn’t even be controversial!
As for the troubles with pointers, memory protection, and error correction, that’s a given when you’re programming in C. However, that’s not to say it’s unmanageable. I don’t want to defend the shortcomings of C, but many of us are able to work productively in C.
“Running everything in ‘ring 0’ on a modern CPU is foolish, since the security levels come basically for free.”
Then it sounds like you disagree with the philosophy of the BareMetal HPC OS.
“But not using memory protection, or offering some kind of protection against badly behaving programs (a common occurrence among HPC code)”
Citation?
“or some kind of facility for debugging/statistics/metrics gathering…”
I’d agree with that.
“You need to have control over the processes running on the cluster, be able to schedule jobs properly (at least have some kind of configurable queuing policy), and be able to stop and restart said jobs cluster-wide, etc.”
They have some control, but I get the impression it’s not mature. It’d be helpful to have a project member discuss it, though.
“You might as well use a bootloader and run your code directly on the bare metal.”
Sure. But in your own words:
“I would not be surprised if they end up reinventing a few wheels in the process.”
The whole point is to build a skeleton that saves other people from needing to reinvent the wheel.
Really neat. Definitely an eye-catching headline. A kernel in 6K? Awesome. People in the HPC world sure do some fascinating stuff. I find tiny OSes really cool; hence TinyCore being one of my favourite Linux distros. Keep up the good work!
“Really neat. Definitely an eye-catching headline. A kernel in 6K? Awesome. People in the HPC world sure do some fascinating stuff.”
Yeah, I know; this stuff is fascinating to me, and yet there were only two other posters! What’s that about?
“I find tiny OSes really cool; hence TinyCore being one of my favourite Linux distros. Keep up the good work!”
There’s just so much bloat these days. Nobody appreciates how much you can actually do with a few KB.
Of course, with gigs of RAM it doesn’t matter whether it’s a few KB or a few MB. It does matter that the code is efficient, though, and often less code equates to faster results.
For example, generally small operating systems boot instantly. Tiny ones can fit entirely in cache.
“we can achieve a runtime speed that is not possible with higher-level languages like C/C++, VB, and Java.”?
Can, yes. But not with the kind of code used.
The source code feels like 8086 code with 64-bit thrown in for fun. 8086-style code isn’t efficient on modern cores, but it can give small code size.
But not as used in this project.
In an initialization file (init_64.asm), 960 KiB of memory is cleared. It’s not on a critical path.
Their code:
mov rdi, os_SystemVariables
xor rcx, rcx ; 3 bytes (2 as xor ecx, ecx)
xor rax, rax ; 3 bytes (2 as xor eax, eax)
clearmem:
stosq ; 2 bytes
add rcx, 1 ; 4 bytes
cmp rcx, 122880 ; 7 bytes
jne clearmem ; 2 bytes
My code, being smaller, faster and IMHO cleaner:
mov rdi, os_SystemVariables
mov ecx, 122880 ; 5 bytes, qword count for rep stosq
xor eax, eax ; 2 bytes, zeroes all of rax
rep stosq ; 3 bytes
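For comparison, in C the whole thing is one call (the function name is mine; a decent compiler or libc will use rep stos or wider stores for this anyway):

#include <string.h>

extern unsigned char os_SystemVariables[];

static void clear_system_variables(void)
{
    /* 122880 qwords x 8 bytes = 983040 bytes = 960 KiB */
    memset(os_SystemVariables, 0, 122880 * 8);
}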
Another example in kernel64.asm, in the ap_clear routine.
; Get local ID of the core
mov rsi, [os_LocalAPICAddress]
add rsi, 0x20 ; 4 bytes
lodsd ; 1 byte
shr rax, 24 ; 4 bytes (shr eax, 24 does the same in 3)
My code:
mov rsi, [os_LocalAPICAddress]
movzx eax, byte [rsi+0x23] ; 4 bytes
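In C terms, both versions boil down to reading a single byte, since the APIC ID sits in bits 31:24 of the register at offset 0x20 (the pointer declaration is my guess at how os_LocalAPICAddress would map to C):

#include <stdint.h>

extern volatile uint8_t *os_LocalAPICAddress;

static uint8_t local_apic_id(void)
{
    /* bits 31:24 of the dword at offset 0x20 = the byte at
       offset 0x23 on little-endian x86 */
    return os_LocalAPICAddress[0x23];
}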