The just-released version 12.08 of the Genode OS Framework comes with the ability to run Genode-based systems on ARM hardware without an underlying kernel, vastly improves the support for the NOVA hypervisor, and adds device drivers for the OMAP4 SoC. Further functional additions are an FFAT-based file system service, the port of the lighttpd web server, and support for on-target debugging via GDB.
One of Genode’s most distinguishing characteristics compared to traditional operating systems is the idea of taking the term “component-based” to the next level: to create not an OS but an OS construction kit. In line with the Unix philosophy, it is a collection of small building blocks, out of which complex systems can be composed. But unlike Unix, those building blocks include not only applications but all classical OS functionality, including device drivers, protocol stacks, and in particular kernels. Naturally, the more diverse the landscape of components becomes, the more flexible and scalable Genode-based systems become.
The current release introduces a new option with respect to the kernel. By using the new base-hw base platform, typical microkernel functionality is integrated into Genode’s core, thereby eliminating the need for a separate kernel. At first glance, this seems contradictory because the core component runs in user mode whereas the kernel runs in kernel mode. But it turns out that integrating user-land code and kernel code into a single program not only works fine but also vastly reduces the complexity of the overall picture. Apparently, there are many problems that both the kernel and core have to address. By merging both programs into one, redundancies in terms of data structures and functionality can be drastically reduced. As of now, the new base-hw platform supports ARM Cortex-A9 hardware. Even though it is still tagged as experimental, it is able to successfully run almost all Genode components on top.
As a second line of work, the project has largely reworked the support for the NOVA hypervisor as a base platform. After two years of uncertainty about the future of this beautiful kernel, NOVA’s development moved over to Intel Labs, where the kernel is developed as an Open-Source project hosted at GitHub. For Genode, this is exciting news, which prompted the project to advance the support for NOVA in several ways. The NOVA support has been upgraded to cover both the x86_32 and x86_64 architectures. In addition, NOVA’s capability-based security features have been fully embraced. The latter point turns NOVA into one of the few base platforms that fully support Genode’s capability-based security concept at the kernel level.
The third focus of the current release is the addition of comprehensive device drivers for the OMAP4 SoC as used on the popular Pandaboard. The new drivers cover HDMI output, USB HID, SD-card access as well as networking. Among the further functional additions are a new FFAT-based file system service, the port of the lighttpd web server, and improved networking support for the Noux runtime for Unix applications. Those and the background information behind many more changes are covered in full detail by the release notes of version 12.08.
Thank you for the information about Genode.
I looked at their website briefly, but had difficulty finding the information I was looking for. For whom is Genode targeted? At what percentage would you say it has reached its goals?
I was most surprised to find out that it doesn’t have a Wikipedia article.
In the short term, Genode is targeted at developers of special-purpose operating systems with high security requirements. The longer-term vision is much more far-reaching. The following recently published interview sheds some light on this longer-term vision:
http://genode.org/about/interview_rel36
It is really encouraging to see continual progress being made on this project. It’s nice to see different OS concepts put into practice successfully.
The “base-hw” addition is very interesting. Are you planning on adding other CPU architectures also?
The NOVA hypervisor work is very intriguing as well. Especially this part:
Keep up the good work!
Edited 2012-08-23 20:04 UTC
Thanks for your encouraging words!
Regarding your question about supporting CPU architectures other than ARM for the base-hw platform, there is a rough plan to incorporate MicroBlaze support, thereby replacing the original base-mb platform. There was also a conversation on the mailing list about supporting OpenRISC via base-hw. But those lines of work are not settled yet. Even though the base-hw concept would in principle be applicable to x86, there is currently no plan to pursue this direction. The current focus is certainly ARM.
Which architecture have you had in mind?
I am concerned about the duplication of efforts to manage resources in the kernel and core when using the other platforms.
Is there any plan to address that problem on base platforms other than base-hw?
Another area of concern is the supported base platforms. Some of the supported kernels do not seem to be developed anymore or are no longer open source, e.g., Codezero, Pistachio, OKL4, and the old Fiasco. Codezero also seems to offer only a subset of the features of Fiasco.OC.
From what I understand, the wide variety of kernels helps to produce sound design decisions in Genode that are applicable across platforms. Is there any other merit in continuing to develop Genode on those platforms?
Congratulations on the new release and the detailed release notes. They are always interesting to read. Genode is an excellent project and I am looking forward to seeing it widely deployed in the future.
Regards.
Apparently, the redundancies between the microkernel and the first user-land component (usually called root task) have somehow been overlooked for years now, not just in the context of Genode but in multi-server OS projects in general. I guess the reason is that both kinds of programs used to be developed by distinct groups of people. Boldly generalizing, I think that kernel developers love to stay in kernel land. Their view is somehow narrowed to the kernel API and hardly extends to a holistic system. On the other hand, user-land developers do not challenge kernel APIs too much (similar to how most software developers rarely question the hardware interfaces underneath their software).
Personally, I find the result of the base-hw experiment quite fascinating. It shows that dissolving the barrier between thinking in categories of kernel land and user land bears the opportunity for simplifying the overall architecture.
I share your observation about several of the base platforms. The motivation for keeping them around differs slightly from kernel to kernel. For Codezero, we are still hoping for a relaunch of the kernel as an Open-Source project. Pistachio is actually still maintained. For all of those kernels, there are also common reasons to not abandon them.
First, it’s beneficial for Genode’s API design. Each kernel poses different challenges with regard to implementing the API. By accommodating a variety of kernel interfaces, we constantly reaffirm the portability of the framework and force ourselves to find clean solutions that work across all of the kernels.
Second, having an arsenal of kernels at our disposal is just great for cross-correlating the behaviour of the system during debugging. Many bugs can be tracked down by just looking at the differences of executing the same scenario on different platforms. In fact, at Genode Labs we are constantly switching between the kernels, including the ancient L4/Fiasco kernel. As a bonus, several of the kernels offer unique debugging features, which come in pretty handy from time to time.
Third, maintaining support for an already supported base platform is cheap. It comes down to maintaining approximately 2000-3000 lines of code per kernel. For a kernel that won’t move, the maintenance costs are almost zero (except for changes of the Genode API).
Thank you for clearing that up. Would I be right to say that the platforms that are better supported at the moment are NOVA and Fiasco.OC?
I read somewhere that NOVA is the kernel that most naturally fits with the design of Genode. Could you compare the level of support for these two kernels (Fiasco.OC and NOVA)?
Also, would it be feasible to fork a kernel with support for a wide range of architectures e.g. Fiasco.OC and modify it to be more deeply integrated with Genode in order to eliminate some of the duplication? (Promoting it to a first class citizen in the framework.)
Regarding base-hw: Since the kernel and core are now running in the same address space, doesn’t this go against one of Genode’s philosophies, i.e., the isolation of subsystems/components? Is running them in the same address space just an interim solution or is that the way forward?
I believe that since kernel and core are both critical to the system, it makes sense to integrate them if it reduces the complexity of the TCB.
“Would I be right to say that the platforms that are better supported at the moment are NOVA and Fiasco.OC?”
Yes, in particular when looking at the security model. Those two kernels support Genode’s way of delegating access rights throughout the system at the kernel level, which makes them naturally a good fit for the framework. Factoring out the security aspect, there are other kernels that cover the whole Genode functionality as well. For example, on OKL4 or L4/Fiasco, the whole Genode API is fully functional. Linux is also pretty well supported in terms of stability but, of course, it is inherently limited to the features offered by the Linux kernel. For example, because there is no way to manipulate the address space of a remote process, Genode’s managed-dataspace concept won’t work on Linux.
“Could you compare the level of support for these two kernels (Fiasco.OC and NOVA).”
With the current release, both base platforms are almost on par. Fiasco.OC has a slight edge with regard to the life-time management of kernel resources but we are working on getting the NOVA base platform there, too.
“Also, would it be feasible to fork a kernel with support for a wide range of architectures e.g. Fiasco.OC and modify it to be more deeply integrated with Genode in order to eliminate some of the duplication? (Promoting it to a first class citizen in the framework.)”
Personally, I don’t feel the slightest temptation to do so. The F/OSS microkernel community is small enough, and actually quite fragmented. A fork would not only be a step in the wrong direction, it would pose a big liability for the forking project. Instead, we should try to discuss our findings with the respective kernel developers and elaborate ways in which the redundancies can be reduced by changing the kernel API. For example, from Genode’s perspective, the in-kernel mapping database as featured by both Fiasco.OC and NOVA is more of a hindrance than a feature. So we would try to convince the kernel developers to remove it or to make it optional. We are actually in close contact with NOVA’s developers, discussing such ideas.
“Is running them in the same address space just an interim solution or is that the way forward?”
As both components are always forming the root of the process tree, there is no security benefit of separating them. Speaking from Genode’s perspective, putting them into a single component is clearly the way to go. From the perspective of kernel developers, this is not so easy because each kernel tries to accommodate user-land implementations other than Genode as well.
I think there’d be merit in having base-hw on x86, given the widespread availability of off-the-shelf hardware... but of course you gotta focus on what matters to you.
System programming jobs have become rare here, I’ve always thought it would be so much fun to land a job working on an alternative operating system instead of just doing it as a hobby.
So anyway, back on topic, I read this in your release notes:
“We complemented our C runtime with support for the pread, pwrite, readv, and writev functions. The pread and pwrite functions are shortcuts for randomly accessing different parts of a file. Under the hood, the functions are implemented via lseek and read/write. To provide the atomicity of the functions, a lock guard prevents the parallel execution of either or both functions if called concurrently by multiple threads.”
You are implementing these functions (pread/pwrite) with two system calls then? Is there one lock per process, per file descriptor, or something else? Is this lock held in the kernel or in user space? It seems to me like such locks could impose a major synchronization bottleneck on SMP architectures, is there a reason you wouldn’t just add new syscalls for pread/pwrite?
For running Genode on x86 in general, there is no urgent need to have this architecture covered by base-hw. There are several other kernels among Genode’s supported base platforms that support x86 just fine, e.g., NOVA.
Thank you for having taken the time to study the release notes in such detail.
The paragraph you cited refers to the libc. Before the change, the mentioned functions had been mere dummy stubs. Now, they do something meaningful. The lock is local to the process. The kernel neither knows anything about the lock nor is it directly involved in handling the actual read/write/lseek operation. Please remember that we are using a microkernel-based architecture where I/O is performed by user-level components rather than the kernel.
Is one lock for pread/pwrite per process a bottleneck? This is a good question, which is quite hard to answer without having a workload that heavily uses these functions from multiple threads. As long as many processes contend for I/O or the workload is generally bounded by I/O, this is not a scalability issue.
For multi-threaded POSIX applications that call those functions concurrently, however, I agree that the lock per process could be replaced by a lock per file descriptor to improve SMP scalability. I couldn’t name such an application off the top of my head, though. Do you have an example that would be worthwhile to investigate? We may change the locking once we see this becoming a real issue rather than a speculative one. Until then, it is just nice to have the functional gap in Genode’s libc closed without the risk of introducing race conditions.
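As a rough sketch of that alternative, a lock per file descriptor rather than per process could look like the following. All names here are invented for illustration and are not Genode’s actual libc internals; the “file system” is mocked by an in-memory buffer:

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

/* hypothetical libc-side state: one mutex per descriptor instead of
 * one per process, so threads using different files never contend */
struct fd_entry {
    pthread_mutex_t lock;
    long            offset;  /* libc-local seek offset */
    char           *data;    /* stand-in for the real file-system backend */
    long            size;
};

/* positional read against the in-memory "file" */
static long read_at(struct fd_entry *fd, long pos, char *buf, long len)
{
    if (pos < 0 || pos >= fd->size) return 0;
    if (len > fd->size - pos) len = fd->size - pos;
    memcpy(buf, fd->data + pos, (size_t)len);
    return len;
}

/* pread emulation: the critical section covers only this descriptor */
static long emulated_pread(struct fd_entry *fd, char *buf, long len, long pos)
{
    pthread_mutex_lock(&fd->lock);
    long saved = fd->offset;   /* remember the seek offset           */
    fd->offset = pos;          /* "lseek" to the requested position  */
    long n = read_at(fd, fd->offset, buf, len);
    fd->offset = saved;        /* restore it: pread must not move it */
    pthread_mutex_unlock(&fd->lock);
    return n;
}
```

With this scheme, two threads doing pread on different descriptors proceed fully in parallel; only concurrent accesses to the same descriptor are serialized.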
nfeske,
Like you, I’d have to research it more. But I think an excellent test would be a database engine that doesn’t use memory-mapped I/O. I think MySQL is such a database, particularly because 32-bit addressing is an unacceptable limitation. Not sure how it works in 64 bit though.
http://doc.51windows.net/mysql/?url=/MySQL/ch07s05.html
“Only compressed MyISAM tables are memory mapped. This is because the 32-bit memory space of 4GB is not large enough for most big tables. When systems with a 64-bit address space become more common, we may add general support for memory mapping.”
When you implement a pread in libc, does it look something like this?
(Apologies in advance for the spacing bugs…Thom get that fixed!!)
int pread(…) {
    acquire_process_mutex(…);
    long long pos = lseek(…);
    int ret = read(…);
    lseek(pos); // since pread isn’t supposed to have side effects
    free_mutex(…);
    return ret;
}
This makes 3 calls to the file system; do those functions have their own internal mutexes such that each pread/pwrite call will actually invoke 4 total mutex cycles (instead of the 1 needed by a native pread function)? That would be a lot of sync overhead on SMP systems (IMHO).
Also, I think the following example might be able to break the above atomicity:
void uncertainty() {
    char data;
    int handle = open(…, O_WRONLY|O_TRUNC);
    int pid = fork();
    if (pid == 0) {
        data = 1;
        pwrite(handle, &data, sizeof(data), 1);
    } else {
        data = 2;
        pwrite(handle, &data, sizeof(data), 1);
        waitpid(pid, NULL, 0);
    }
}
We would normally expect only 2 possible arbitrary outcomes:
0x00 0x01 # child overwrote parent
0x00 0x02 # parent overwrote child
However, due to race conditions on lseek, we might end up with these variants as well:
0x02 0x01
0x01 0x02
Granted this example is contrived. I don’t know if there are typical applications that share file descriptors between processes and use pread/pwrite on them?
I brought this up because I really enjoy technical analysis, not because of any particular concern. But if I’m bugging you too much feel free to tell me to sod off
You are welcome!
Indeed, the code looks similar to the snippet you posted. See here:
https://github.com/genodelabs/genode/blob/master/libports/src/lib/li…
Fortunately, your concerns do not apply for Genode. In Genode’s libc, the seek offset is not held at the file system but local to the process within the libc. The file-system interface is designed such that the seek offset is passed from the client to the file system with each individual file-system operation. The seek value as seen at libc API level is just a value stored alongside the file descriptor within the libc. Therefore, lseek is cheap. It is just a library call updating a variable without invoking a syscall.
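To illustrate the idea (with invented names, not the actual Genode code): the libc-level handle could look roughly like this, where lseek merely updates a client-side variable and every read passes an explicit position to a stateless backend, mocked here by an in-memory buffer:

```c
#include <assert.h>
#include <string.h>

/* hypothetical sketch of a libc-local seek offset; the backend takes
 * the position as an explicit argument, so it keeps no seek state */
struct local_fd {
    long        seek_offset;  /* lives entirely in the client */
    const char *backing;      /* stand-in for the file-system session */
    long        size;
};

/* positional read, analogous to a file-system RPC carrying the offset */
static long fs_read_at(const struct local_fd *fd, long pos, char *buf, long len)
{
    if (pos < 0 || pos >= fd->size) return 0;
    if (len > fd->size - pos) len = fd->size - pos;
    memcpy(buf, fd->backing + pos, (size_t)len);
    return len;
}

/* lseek is just a library call updating a variable, no syscall or RPC */
static long local_lseek(struct local_fd *fd, long off)
{
    fd->seek_offset = off;
    return off;
}

static long local_read(struct local_fd *fd, char *buf, long len)
{
    long n = fs_read_at(fd, fd->seek_offset, buf, len);
    fd->seek_offset += n;  /* advance the client-side offset */
    return n;
}
```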
Your example does indeed subvert the locking scheme. But as Genode does not provide fork(), it wouldn’t work anyway.
Btw, if programs are executed within the Noux runtime (see [1]), lseek is actually an RPC call to the Noux server, so the pread/pwrite implementation carries an overhead compared to having pread/pwrite as first-class operations. Hence, there is room for optimization in this case.
[1] http://genode.org/documentation/release-notes/11.11#OS-level_Virtua…
Given all the steps that are involved in a single read I/O operation, however, I am uncertain about the benefit of this specific optimization. To prevent falling into the premature-optimization trap, I’d first try to obtain the performance profile of a tangible workload. Another reason I’d be hesitant to introduce pread/pwrite as first-class operations into Noux is that, in general, we try to design interfaces to be as orthogonal as possible. Thanks to this guideline, the Noux server is a cute little component of less than 5000 LOC. Introducing pread/pwrite in addition to read/write would somewhat spoil this principle and increase complexity.
Thanks for the pointer to the database engine. This might be a good starting point for a workload to be taken as reference when optimizing for performance and scalability.
nfeske,
“Your example does indeed subvert the locking scheme. But as Genode does not provide fork(), it wouldn’t work anyway. ;-)”
Shows what I know
“The file-system interface is designed such that the seek offset is passed from the client to the file system with each individual file-system operation.”
Makes sense. How do you handle files with the append flag?
int f = open("xyz", O_WRONLY | O_APPEND | O_CREAT, S_IRUSR | S_IWUSR);
sleep(1);
write(f, "1", 1);
sleep(1);
write(f, "2", 1);
close(f);
Running two instances of this program simultaneously on linux produces “1122”. However if libc uses a process-local file offset, then it would probably output “12”. I imagine you just ignore the offset that gets passed for files opened in append mode?
“To prevent falling into the premature-optimization trap, I’d first try to obtain the performance profile of a tangible workload.”
A simple test here on an arbitrary Linux system:

char buffer[1000];
int i;
int f = open("xyz", O_RDWR | O_APPEND | O_CREAT, S_IRUSR | S_IWUSR);
for (i = 0; i < 1000000; i++) {
    /* TEST 1
    off_t old = lseek(f, 10, SEEK_CUR);
    lseek(f, 10, SEEK_SET);
    read(f, &buffer, sizeof(buffer));
    lseek(f, old, SEEK_SET);
    */
    /* TEST 2
    pread(f, &buffer, sizeof(buffer), 10);
    */
}
I recorded the fastest time of 3 runs…
buffer size=1
TEST 1 – seek + read = 1.072s
TEST 2 – pread = 0.663s
buffer size=1000
TEST 1 – seek + read = 1.254s
TEST 2 – pread = 0.882s
buffer size=10000
TEST 1 – seek + read = 3.636s
TEST 2 – pread = 3.183s
I’m a little surprised that even with a 10K buffer size, there’s still a very noticeable half-second difference with the lseek syscall approach on Linux. I suspect Genode-Noux would exhibit similar trends. But does it matter? That depends on who we ask. Sometimes design factors are worth some additional overhead. There are always trade-offs.
Your experiment is pretty convincing, especially when considering your original suggestion to use a database as a workload. For this application, requests for individual database records are certainly much smaller than 10 KiB.
But your experiment also shows another point quite clearly: the effectiveness of the Linux block cache. A throughput of 3 GiB/sec is quite nice for accessing a disk. I think that the addition of a block-cache component to Genode would be the most valuable performance improvement at the current stage. There is actually a topic in our issue tracker, but nobody is actively working on it at the moment:
https://github.com/genodelabs/genode/issues/113
“I imagine you just ignore the offset that gets passed for files opened in append mode?”
Almost. The file-system interface does not distinguish modes when opening a file, but there is an append operation that can be applied to an open file by specifying ~0 as the seek position. For reference, here is the interface:
https://github.com/genodelabs/genode/blob/master/os/include/file_sys…
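As a rough sketch of how such a reserved “append” seek value could work on the server side (all names here are invented for illustration, not taken from the actual Genode interface):

```c
#include <assert.h>
#include <string.h>

/* hypothetical reserved position meaning "append at the current end
 * of file"; the server resolves it where the file size is known */
#define SEEK_TAIL (~0UL)

struct file {
    char          data[64];
    unsigned long size;
};

static unsigned long fs_write_at(struct file *f, unsigned long pos,
                                 const char *buf, unsigned long len)
{
    if (pos == SEEK_TAIL)
        pos = f->size;  /* resolve "append" at the server, not the client */
    if (pos > sizeof(f->data) || len > sizeof(f->data) - pos)
        return 0;
    memcpy(f->data + pos, buf, len);
    if (pos + len > f->size)
        f->size = pos + len;
    return len;
}
```

Because the append position is resolved by the server rather than computed from a client-local offset, two clients appending concurrently cannot overwrite each other, matching the “1122” behaviour observed on Linux.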
“I’m a little surprised that even with a 10K buffer size, there’s still a very noticeable half-second difference with the lseek syscall approach on linux. I suspect Genode-Noux would exhibit similar trends.”
I agree. Thanks for investigating. I will keep your findings in the back of my head. Once we stumble over a pread/pwrite-heavy Noux application whose performance suffers, getting rid of superfluous lseek calls looks like a worthwhile consideration.
nfeske,
Thanks for answering my questions, it’s all I have for now. I’m looking forward to seeing more of Genode in the future!