McKernel: a light-weight multi-kernel operating system

Thom Holwerda 2018-11-28 OS News 17 Comments

IHK/McKernel is a light-weight multi kernel operating system designed specifically for high performance computing. It runs Linux and McKernel, a lightweight kernel (LWK), side-by-side on compute nodes primarily aiming at the followings:

Provide scalable and consistent execution of large-scale parallel applications and at the same time rapidly adapt to exotic hardware and new programming models

Provide efficient memory and device management so that resource contention and data movement are minimized at the system level

Eliminate OS noise by isolating OS services in Linux and provide jitter free execution on the LWK

Support the full POSIX/Linux APIs by selectively offloading system calls to Linux

About The Author

Thom Holwerda

Follow me on Mastodon @[email protected]

17 Comments

2018-11-28 1:01 am

kwan_e
McKernel retains a binary compatible ABI with Linux

So it is possible…

2018-11-28 1:46 am

Alfman verbose=1
kwan_e,

So it is possible…

Haha, funny

This is interesting, thanks for posting it Thom.

There are some aspects of this project that I don’t quite follow from the article. What is it exactly that makes the McKernel kernel more efficient than the linux kernel? And which calls exactly are handled by IHK/McKernel stack versus passed along to linux? I feel a concrete example would have been illustrative here.

I’m a bit confused about the motivation for this because I’m inclined to think that whatever optimizations that McKernel incorporated to improve HPC efficiency could probably have been added to the linux kernel directly without the need for a separate McKernel stack.

Still, I find this design to be an intriguing answer to the driver & software compatibility catch-22 hurdles that independent operating systems will typically face early on in their existence. Building a new kernel that runs adjacent to linux that can make use of it’s widespread driver & software support enables indy kernel developers to focus more of their attention on the features that make their kernel unique. Putting my questions above aside, I think this is quite clever!

2018-11-28 10:29 am

DeepThought
kwan_e,

There are some aspects of this project that I don’t quite follow from the article. What is it exactly that makes the McKernel kernel more efficient than the linux kernel? And which calls exactly are handled by IHK/McKernel stack versus passed along to linux? I feel a concrete example would have been illustrative here.

One point might be that the scheduler is a lot simpler. No completely fair scheduling which changes priorities over time and such. I guess it is more deterministic and a lot quicker.

Another approach was this one:

https://github.com/ReturnInfinity

but w/o the Linux-API.

This Linux API might help porting existing HPC software to McKernel.

2018-11-28 11:17 am

Alfman verbose=1
Deep Thought,

One point might be that the scheduler is a lot simpler. No completely fair scheduling which changes priorities over time and such. I guess it is more deterministic and a lot quicker.

But I’m still confused about their motivation of using a whole other kernel rather than using a linux kernel with a simpler scheduler.

It looks like the design would add more IO latency because (I assume) that normal IO has to go through two kernels instead of just one. Presumably their target workload is CPU intensive rather than IO intensive, but was there a technical reason for using two kernels instead of optimizing linux in place or was it just a case of “we wanted to try something different and this is the result”? If it’s the later, then personally I get it, it’s a lot more creative and interesting to work on your own designs. But if it’s supposed to be the former, I wish they had explained their reasoning and perhaps added benchmarks to back their hypothesis that such a design would perform better for HPC than linux alone.

2018-11-28 11:27 am

DeepThought
Deep Thought,

One point might be that the scheduler is a lot simpler. No completely fair scheduling which changes priorities over time and such. I guess it is more deterministic and a lot quicker.

But I’m still confused about their motivation of using a whole other kernel rather than using a linux kernel with a simpler scheduler. [/q]

“Optimizing” the Linux scheduler means to maintain patches to it.

And all the non-HPC software running on the Linux side will be affected if you change the scheduler.

[q]

It looks like the design would add more IO latency because (I assume) that normal IO has to go through two kernels instead of just one. Presumably their target workload is CPU intensive rather than IO intensive, but was there a technical reason for using two kernels instead of optimizing linux in place or was it just a case of “we wanted to try something different and this is the result”? If it’s the later, then personally I get it, it’s a lot more creative and interesting to work on your own designs. But if it’s supposed to be the former, I wish they had explained their reasoning and perhaps added benchmarks to back their hypothesis that such a design would perform better for HPC than linux alone.

My understanding is, they leave on the Linux side all the “nasty” stuff like filesystem, graphics etc.

If the HPC-side needs to access for example a file, it hands of this job to the Linux.

My impression of a node in an HPC cluster is mainly computing and exchanging data with other nodes. This can be done easy in a limited kernel.

The overall idea is not new though. Hypervisors are used to have Linux in parallel with an RTOS.

New is, that the McKernel looks like a vanilla Linux on API level.

But yes, some real-world figures would be nice to see if this is just an academic idea or give a benefit.
2018-11-28 12:52 pm

Alfman verbose=1
DeepThought,

“Optimizing” the Linux scheduler means to maintain patches to it. [/q]

Ok, but I think this justification goes out the window when you consider that McKernel is running a custom linux kernel anyways to enable the two kernels to communicate and run alongside each other.

My understanding is, they leave on the Linux side all the “nasty” stuff like filesystem, graphics etc.

If the HPC-side needs to access for example a file, it hands of this job to the Linux.

My impression of a node in an HPC cluster is mainly computing and exchanging data with other nodes. This can be done easy in a limited kernel.

That’s my understanding as well, but I don’t necessarily see the technical benefit of a secondary kernel. With stock linux, you can use CPU affinity to peg a process to a CPU with realtime priority such that it gets 100% utilization and it never has to release the CPU except when A) it’s done doing work or B) it’s requesting IO.

In the case of A, who cares.

In the case of B, the McKernel IO proxy approach seems to be less efficient than a native linux stack.

My impression of a node in an HPC cluster is mainly computing and exchanging data with other nodes. This can be done easy in a limited kernel.

I’m just going to assume that they wanted to test their ideas on a much simpler kernel and didn’t feel like modifying linux to do the same thing, because in principal, I don’t see why it couldn’t.

[q]New is, that the McKernel looks like a vanilla Linux on API level.

But yes, some real-world figures would be nice to see if this is just an academic idea or give a benefit.

Well, I’m never one to say linux can’t be improved upon because I know that it can be. However it’s hard to see how a dual kernel approach would significantly improve the performance over a single optimized linux kernel. Benchmarks would really help.

Anyways, this is certainly an interesting alt-os. It’d be nice to discuss it with a member of the project

Edited 2018-11-28 12:53 UTC
2018-11-28 1:35 pm

ahferroin7
I think it’s more a case of maintainability than performance here. If they’re completely separate from the Linux kernel (and it sounds like they mostly are), they don’t have to worry about adapting to purely internal changes to the Linux kernel that tend to make carrying any kind of complicated patches for any extended period of time difficult (for example the ongoing change to a 64-bit time_t representation internally), or any of the linking related licensing issues.

There’s also the very distinct possibility that this all started long before the whole ‘isolcpus’ thing that Linux now supports started. Prior to that, you couldn’t completely keep the kernel from interfering with userspace execution on a given CPU core, because there were some types of kernel threads that you couldn’t force off of a core with CPU affinities.

The whole thing reminds me of IBM’s Watson systems though. Those use physically distinct I/O and compute nodes, with the I/O nodes running a modified version of Linux (with an RHEL derived userspace I believe), while the compute nodes run a minuscule kernel consisting of a few hundred lines of C++ that just provides IPC primitives for talking to the I/O nodes.
2018-11-28 7:26 pm

cb88
“while the compute nodes run a minuscule kernel consisting of a few hundred lines of C++ that just provides IPC primitives for talking to the I/O nodes.”

That’s the key right there… most of the compute nodes never end up calling into the Linux kernel itself and the code just runs on the LWK. It makes the improvement in performance automatic at runtime.

Linux and other full unix kernels tend to be way too bloated for compute node’s even if you chop out a ton of stuff, you still end up with a multi megabyte kernel… LKW is probably only kilobytes.

But, they can share the same programming environment instead of having a special kernel ABI…. as is done on Watson and others.
2018-11-28 9:21 pm

Alfman verbose=1
cb88,

That’s the key right there… most of the compute nodes never end up calling into the Linux kernel itself and the code just runs on the LWK. It makes the improvement in performance automatic at runtime. [/q]

But that’s just it, code that would not have been called anyways by the compute nodes would not incur overhead in the first place such that changing kernels “makes the improvement in performance automatic at runtime”.

On the other hand, I do understand the desire to simplify things and the linux code base might be considered too complex to work with. So when ahferroin7 said it could be more about maintainability than performance, that makes sense to me. It’s just not clear from the writeup that this was in fact their motivation.

Linux and other full unix kernels tend to be way too bloated for compute node’s even if you chop out a ton of stuff, you still end up with a multi megabyte kernel… LKW is probably only kilobytes.

Even though this is true, their solution doesn’t use any less ram because they’re still explicitly running alongside linux. Even in terms of CPU caching on local compute nodes, linux kernel code that doesn’t get called doesn’t occupy cache space, so it’s not obvious to me there’s any benefit here. Although benchmarks could prove interesting.

[q]But, they can share the same programming environment instead of having a special kernel ABI…. as is done on Watson and others.

You’re the second person to mention it, I should learn more about watson
2018-11-28 9:42 pm

kwan_e
But that’s just it, code that would not have been called anyways by the compute nodes would not incur overhead in the first place such that changing kernels “makes the improvement in performance automatic at runtime”.

I think the analogy here, in hardware terms, is that Linux is the CPU, and the McKernel is an FPGA or ASIC.
2018-11-28 9:47 pm

Alfman verbose=1
kwan_e,

I think the analogy here, in hardware terms, is that Linux is the CPU, and the McKernel is an FPGA or ASIC.

I don’t get this analogy. They’re both co-kernels running on ordinary SMP CPUs.
2018-11-29 1:04 am

kwan_e
Yes, but one kernel is doing general purpose things, while the other is specific to achieving the massive parallel scaling.

You can optimize Linux as much as you like, but things like preemptive scheduling will always make things a bit unpredictable. The McKernel uses cooperative scheduling to reduce jitter, which I take to mean processes aren’t interrupted at random intervals.

So the analogy to CPU and FPGA is that the CPU is highly interruptible and doing lots of context switching stuff while the FPGA is single minded about its tasks.

The parallelism is easier to scale if there aren’t as much interruptions to synchronize everything.

Edited 2018-11-29 01:08 UTC
2018-11-29 2:21 am

Alfman verbose=1
kwan_e,

Yes, but one kernel is doing general purpose things, while the other is specific to achieving the massive parallel scaling.

You can optimize Linux as much as you like, but things like preemptive scheduling will always make things a bit unpredictable. The McKernel uses cooperative scheduling to reduce jitter, which I take to mean processes aren’t interrupted at random intervals. [/q]

I believe this is already possible in linux if you tell the kernel not to schedule any threads on a CPU. This way the threads you designate can run uninterrupted. This is the feature ahferroin7 was talking about before.

https://codywu2010.wordpress.com/2015/09/27/isolcpus-numactl-and-tas…

I admit I’ve never actually tried it myself, but I don’t see why it wouldn’t work.

[q]So the analogy to CPU and FPGA is that the CPU is highly interruptible and doing lots of context switching stuff while the FPGA is single minded about its tasks.

Ok, I’m going to continue thinking in terms of CPUs though, haha
2018-11-29 3:37 pm

Iapx432
It might make more sense (and maybe this is what is intended) if the full Linux Kernel only runs on a master node and the LWK runs on the zillion diverse architecture slaves. However this has already been done. Supercomputers already use lightweight CNKs (Compute Node Kernels).

2018-11-28 5:39 am

tidux
The kernel-to-userspace ABI has been stable for a long, long time. It’s just that people confuse “an OS and a set of userspace libraries” for “an OS” and then whine about issues in dynamic libraries.

2018-11-28 10:06 am

missingxtension
https://en.m.wikipedia.org/wiki/K_computer

“K computer comprises 88,128 2.0^A GHz eight-core SPARC64 VIIIfx processors contained in 864 cabinets, for a total of 705,024 cores”
2018-11-29 9:43 am

cltang
This is essentially a hypervisor/microkernel with additional support to use Linux services by (1) system call forwarding to the Linux instance and (2) mirrored address spaces with a corresponding “proxy” process running on the Linux instance.

Probably more details in the papers, but that mostly summarizes what is introduced on the homepage.

Edited 2018-11-29 09:44 UTC