Linux kernel 3.12 has been released. This release includes support for offline deduplication in Btrfs, automatic GPU switching in laptops with dual GPUs, a performance boost for AMD Radeon graphics, better RAID-5 multicore performance, improved handling of out-of-memory situations, improvements to the timerless multitasking mode, separate modesetting and rendering device nodes in the graphics DRM layer, improved locking performance for virtualized guests, XFS directory recursion scalability improvements, new drivers and many small improvements. Here’s the full list of changes.
Thom, you missed what I’d consider to be one of the bigger highlights: changing the out of memory killer!
https://lwn.net/Articles/562211/
Linux has never been particularly stable under out-of-memory conditions; I've always considered the out-of-memory killer to be a major hack for a fundamental problem.
About time! The previous behavior of returning successful mallocs only to have to invoke the Linux OOM killer later never made any sense. Killing processes heuristically without so much as asking the user first is ugly at best and outright reckless at worst. What will Linux kill? How about a document, your web browser, your X11 session, etc.? I've learned to significantly over-provision RAM to avoid the OOM killer, but I've never been pleased with its design, nor even its existence.
And that's how it should have always been: by returning errors to user space, we let processes handle these errors gracefully.
void *buffer = malloc(20 * 1024);
if (!buffer) {
    /* under overcommit, this branch effectively never runs */
}
http://opsmonkey.blogspot.com/2007/01/linux-memory-overcommit.html
Of course, this is a problem in the first place because Linux was intentionally designed to overcommit memory. I think in practice the harms outweigh the benefits. Hopefully the changes here in 3.12 will make Linux rock solid even under out-of-memory conditions with overcommit turned off (no more killing of innocent processes).
In my experience, it's almost always the last one (X11)… only seldom have I been lucky enough for it to be something harmless, like Firefox. If this update fixes the problem, I'll be very happy.
huh, it worked pretty reliably for me and killed firefox or its plugin container most of the time.
From experience it’s usually sshd, making it as hard as possible to actually get to the machine and clean it up.
This is excellent, I always considered the default behaviour dishonest. Telling a process it can have memory which isn’t available and then killing a process (seemingly at random) always struck me as a terrible idea.
O_o
How many other design problems does Linux have that haven't been fixed in the year 2013? This RAM overcommit OOM thing is horrendous, and if you read the article, Linux still has the OOM killing going on, just less often. No wonder people say Linux is unstable…
Kebabbert,
“No wonder people say Linux is unstable…”
Ugh, there's a difference between criticizing and trolling, and you just crossed that line.
“This is RAM overcommit OOM thing is horrendous, and if you read the article, Linux still has the OOM killing thing going on, but less seldom.”
Well this is a fair question, and to the best of my knowledge, these changes are supposed to completely clean up the kernel’s own handling of OOM conditions by allowing syscalls to return ‘no memory’ errors instead of blocking while the OOM killer runs.
Note this does not change user-space overcommitting, which can already be disabled separately (sysctl vm.overcommit_memory=2). As long as syscalls that couldn't return 'no memory' errors were still triggering the OOM killer inside the kernel, there was still a problem. Now that this is fixed, I would expect the OOM killer to never be invoked.
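For what it's worth, here's a minimal sketch of what strict accounting looks like from user space (my own illustration, not from the 3.12 patches): with vm.overcommit_memory=2, an absurdly large anonymous mapping is refused up front with ENOMEM instead of being granted and then reclaimed later by the OOM killer.

/* Minimal sketch, assuming vm.overcommit_memory=2 (strict accounting):
 * the kernel refuses the mapping up front instead of letting the
 * OOM killer sort it out later. */
#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t huge = (size_t)1 << 40;   /* 1 TiB, beyond the commit limit on any typical box */
    void *p = mmap(NULL, huge, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED && errno == ENOMEM)
        fprintf(stderr, "mmap refused cleanly: out of memory\n");
    else if (p != MAP_FAILED)
        munmap(p, huge);
    return 0;
}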
Also, to really have a full grasp of the situation, we need to understand the background behind overcommit to begin with, and that has a lot to do with one particular function: fork. Unix multi-process programs work by cloning a process into children. At the moment the child process is spawned, 100% of its memory pages are identical to the parent's. In terms of fork, it makes sense to share the memory pages instead of copying them. Depending on what the child and parent actually do, they may or may not update these shared pages as they execute. If and when they do, the OS performs a 'copy-on-write' operation, allocating a new page to hold the modified copy.
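A quick sketch of that sharing in action (my own toy example, with made-up sizes): the parent touches a large buffer, forks, and the child modifies only a tiny slice, so copy-on-write duplicates just those few pages rather than the whole buffer.

/* Toy example: copy-on-write after fork(). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    size_t big = 256 * 1024 * 1024;        /* ~256 MB touched by the parent */
    char *buf = malloc(big);
    if (!buf) { perror("malloc"); return 1; }
    memset(buf, 'A', big);                 /* force the pages into existence */

    pid_t pid = fork();                    /* child starts out sharing every page */
    if (pid == 0) {
        memset(buf, 'B', 4096);            /* COW duplicates roughly one page */
        _exit(0);
    }
    wait(NULL);
    free(buf);
    return 0;
}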
The OS can now use the pages that would have been used by child processes for other purposes. Using extra pages for caching is *good* since they can be dumped at any time should they be needed. Using them for spawning more processes or additional resources at run time is *bad* (IMHO) because now they cannot be recovered without taking back resources already in use elsewhere (aka OOM-killer).
Unfortunately, there are cases when overcommitment is unavoidable under fork semantics. Consider a large app (i.e. a database consuming 1GB of RAM) wanting to spawn a child process to perform a simple parallel task. This child process might only need <1MB of RAM of its own, but it nevertheless shares 999MB of memory with its parent for as long as it executes. NOT overcommitting implies the two processes combined need 2GB instead of 1.001GB. And if 2GB is not available, well, then the parent process will not be able to spawn any children for any reason.
So, overcommitting is good for fork, which is why Linux does it. This highlights the real underlying problem: nearly everybody thinks overcommit is bad, but the truth is many are still huge proponents of fork().
To the extent that fork still gets used and its cons aren't going away, I'd be very interested in seeing a simple compromise: making overcommit *explicit*.
Such as: no application will ever be overcommitted unless it explicitly requests it or its process is configured for it by the administrator. By default, when a process requests memory from the kernel, the kernel will keep that commitment (anything else is bad practice). But in cases where overcommitment is still desired or needed, it can be done explicitly.
So, for example, when a 1GB database tries to call fork, it could tell the kernel that overcommitting the 1GB child process would be preferable to being denied with a 'no memory' error by default. If the overcommitment turns into an OOM condition, then only those processes that elected overcommit would be killed, which seems pretty fair.
This could be implemented pretty easily in Linux by adding a new flag to the existing clone syscall. Anyone else have thoughts about this?
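Just to illustrate what I mean, something along these lines (CLONE_OVERCOMMIT is entirely made up here; it doesn't exist in any kernel). The point is only that the opt-in could ride on the existing clone() interface:

/* Hypothetical sketch only: CLONE_OVERCOMMIT is an imaginary flag, not
 * something the kernel actually defines. */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>

#define CLONE_OVERCOMMIT 0          /* placeholder; a real flag would need an unused bit */
#define STACK_SIZE (1024 * 1024)

static int worker(void *arg) {
    (void)arg;
    /* ...small parallel task that only touches a little memory... */
    return 0;
}

int main(void) {
    char *stack = malloc(STACK_SIZE);
    if (!stack) return 1;
    /* the child opts in to overcommit; without the flag the kernel would
       account the parent's full footprint and could refuse with ENOMEM */
    int pid = clone(worker, stack + STACK_SIZE,
                    SIGCHLD | CLONE_OVERCOMMIT, NULL);
    if (pid > 0) waitpid(pid, NULL, 0);
    free(stack);
    return 0;
}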
I'm interested in Btrfs's newly gained support for data deduplication, but after taking a look at this app called "bedup," which is supposed to do the actual heavy lifting, I'm left confused. The application only looks for identical files; it doesn't look for identical blocks of data. So does this mean Btrfs's dedup is also limited to file-level deduplication, or is it just a limitation of the app in question?
I think it's just the bedup app, but don't quote me on that. I think it's just a case of crawling before walking. Files are easy to examine for dupes, and it's fairly useful for a lot of people like myself. Although, I may wait a few kernel revisions for any bugs to shake out before using it on anything vital.
If you want deduping then you’re better off with ZFS – not only does it work against fragments, but it works live.
In fact on the whole I’ve been unimpressed with Btrfs during my recent testing.
Laurence,
“In fact on the whole I’ve been unimpressed with Btrfs during my recent testing.”
I’ve been meaning to do this myself, I’m very interested in what you tested and the results of your tests, if you don’t mind elaborating.
I’ve maintained nonstandard kernels in the past, but I’m trying to get away from that and stick to mainline as much as possible, so ZFS is less appealing for that reason even though it seems to be robust and mature.
I have a very strong desire to install a copy-on-write FS on production servers to replace an rsync --link-dest solution I have deployed for generational backups. It's pretty clever and quite efficient; however, one major problem with this approach is that files cannot be moved, otherwise they fail to link. It's possible to relink dups after the fact using 'hardlink' or 'fdupes', but it's not ideal and raises other concerns about unintentionally hardlinked files getting restored.
One question I do have is whether it's possible to copy a Btrfs FS in deduped form from one host to another without having to re-dedup it anywhere in the process. I'll research it eventually, but maybe someone here knows…?
I have been running ZFS on Linux boxes with standard kernels for over a year. Using the ZFS kernel module (or ZFS-FUSE) does not require a custom kernel.
jessesmith,
“I have been running ZFS on Linux boxes with standard kernels for over a year. Using the ZFS kernel module (or ZFS-FUSE) does not require a custom kernel.”
Thanks for the suggestion. The problem with fuse is that it doesn’t offer good performance. In some benchmarks, Fuse-ZFS barely registers on the chart.
http://www.phoronix.com/scan.php?page=article&item=zfs_fuse_performance
This link is somewhat dated, but it's nevertheless been my experience that ALL FUSE filesystems suffer from excessively high CPU utilization and low performance, particularly under high concurrency, and the benchmarks here bear that out. I find it unlikely that a recent benchmark would be any kinder to Fuse-ZFS. I may be convinced to go with ZFS, but if I do it will definitely be with a patched kernel. That's not to say Fuse-ZFS isn't useful for its features; it would just defeat the point of my having a high-performance RAID array.
Btrfs is also compared in the link above for anyone interested (I wouldn’t be surprised if it has improved since 2010).
He said to use the ZFSonLinux kernel modules, not FUSE modules. Several distros now include ZFSonLinux packages in their repos, and more are available on their website. No custom kernel required, and no loss of performance due to FUSE.
phoenix,
Thanks for pointing that out; somehow I focused only on ZFS-FUSE. I do not see ZFS kernel modules being supported natively by Debian, Mint, CentOS, and I presume Ubuntu. Do any distros carry it natively?
I did see that zfsonlinux.org has modules for many popular distros (i.e. deb for Debian, rpm for CentOS), but that's still technically the kind of 3rd-party code/binary modules that I was trying to avoid. For those who don't mind this, they can (and should) go this route. However, let me just mention my own reasons for hesitating with this solution:
1. I believe I may be in violation of the CDDL & GPL licenses if I redistributed a kernel with ZFS to my clients.
http://arstechnica.com/information-technology/2010/06/uptake-of-nat…..
It's nearly inconceivable that my clients would actually notice, much less care. I really have no idea if the copyright holders (i.e. Oracle or the Linux devs) would care in the slim chance they got wind of it. Maybe they wouldn't, but it'd still bug me if *my* business was in violation, you know?
2. My previous experience with 3rd-party modules (source & binaries) is that every single kernel update has the potential to break the 3rd-party kernel code. The result is that module developers must keep on top of each and every mainline release (if distributing source), and each and every distro release (if distributing binaries), in order to not leave a gap in supported kernels. It wouldn't be the first time I've had to get my hands into the code to fix a temporary incompatibility with new kernels.
In fairness to zfsonlinux, I'm NOT pointing a finger at them; it's just an inherent problem with the kernel lacking a stable API/ABI. I may very well go with ZFS to gain its functionality, but honestly, using it as a root file system would make me nervous every time I upgraded the kernel on a server. I don't have this nervousness with, say, EXT4, so I'd probably keep root as an ext4 partition until ZFS is in mainline or the ZFS kernel modifications are officially supported by the distro.
It is not a violation of the license if you distribute the ZFS module. The ZFS on Linux project has a nice FAQ section explaining this. If you merged the ZFS and kernel packages/source then it might be a violation, but distributing them separately is not.
As for breaking across updates, this is unlikely as the API does not change across security updates, it’ll only be an issue across major upgrades. You’re even safer if you use PPAs for Ubuntu-based distributions (including Mint) as the ZFS module is built specifically for your distribution and does not rely on the third-party upstream project.
jessesmith,
“It is not a violation of the license if you distribute the ZFS module. The ZFS on Linux project has a nice FAQ section explaining this. If you merged the ZFS and kernel packages/source then it might be a violation, but distributing them separately is not.”
That's my point exactly: unlike zfsonlinux.org, I'd be distributing them together. I guess I could always play semantic games like: Oh my, you need a good OS for your server, let me sell you one with Linux on it. And later… oh, remember that server I provided you with? Well, it could sure use a ZFS file system now, and luckily 90% of the disk space was never partitioned…
“As for breaking across updates, this is unlikely as the API does not change across security updates, it’ll only be an issue across major upgrades.”
I disagree; modules often need to be pegged to a specific kernel. When I was maintaining my own kernel with 3rd-party FS modules (AUFS), the AUFS modules would regularly need to be updated, even when AUFS itself had no changes! Often the latest AUFS version and the latest kernel version weren't compatible. These were frequently trivial changes in the kernel, which I was able to track down within 10 minutes, but it still got tiresome. Granted, the zfsonlinux team may do an awesome job of keeping up with kernel changes, and I would factor this in.
“the ZFS module is built specifically for your distribution and does not rely on the third-party upstream project.”
Right now I specifically looked for it on Debian and Mint; what's the package called?
I hope you understand I'm not trying to put down ZFS as a viable solution in many cases, but having conflicting open source licenses is troubling. I don't think it's just hypothetical either; it exhibits itself in unfortunate ways, like ZFS modules not being available for all the platforms supported by distros, for example an ARM NAS box. It isn't my intention at all to start debating against ZFS supporters; I am very much interested in ZFS and I think it's great technology! I was merely trying to point out my own dilemmas.
Live dedup is definitely not a perfect solution; it hurts performance, and for many people it's a bad choice. Btrfs will add live dedup in future releases, so users can choose which method suits their needs better.
It is actually very much implementation-dependent. Live dedup trades some expensive writes for a much larger number of less expensive checks, so the outcome depends heavily on the usage pattern and the algorithm.
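Roughly speaking (a toy sketch of the idea, not how Btrfs or ZFS actually implement it): every block write pays for a cheap hash lookup, and the expensive write is only skipped when an identical block is already stored. The write_block() helper and table sizes below are made up for illustration.

/* Toy in-line dedup: hash each block, skip the write on a match.
 * A real filesystem would also compare the data byte-for-byte on a
 * hash hit to guard against collisions; that's omitted here. */
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 4096
#define TABLE_SIZE 65536

struct dedup_entry { uint64_t hash; uint64_t block_no; int used; };
static struct dedup_entry table[TABLE_SIZE];

/* FNV-1a, standing in for the checksums the filesystem already keeps */
static uint64_t hash_block(const uint8_t *data) {
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < BLOCK_SIZE; i++) {
        h ^= data[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Returns the block number the new data ends up referencing. */
uint64_t write_block(const uint8_t *data, uint64_t next_free_block) {
    uint64_t h = hash_block(data);
    struct dedup_entry *e = &table[h % TABLE_SIZE];
    if (e->used && e->hash == h)
        return e->block_no;          /* duplicate: cheap check, expensive write avoided */
    /* ...the expensive path: actually write `data` to next_free_block... */
    e->hash = h;
    e->block_no = next_free_block;
    e->used = 1;
    return next_free_block;
}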
Does any modern distro even default to butterface or recommend it as a stable solution?
I remember the time everyone in Linux-land was talking about butterface being the next big thing, and that was YEARS ago.
Fedora provides it in the installer, though not as the default, AFAIR. But they enjoy thinking of themselves as bleeding edge, so their example may not be indicative. Arch still doesn't provide a boot-time fsck for Btrfs (there is a package in the AUR, but that is unofficial anyway). Given that it's a rolling release pretty much targeted at "power users", this should be indicative.
The openSUSE distribution has supported Btrfs for a while now. I do not think it'll be the default until the release after this coming one, but they have done a lot of work with Btrfs and integrating snapshots with the YaST admin tools. They seem to be the only Linux distribution taking Btrfs really seriously at the moment. Fedora and Ubuntu both have Btrfs as an option at install time, but it isn't really well implemented in either distribution.
Linux overcommits RAM, and when RAM is exceeded, Linux starts to kill processes randomly, which makes the system unstable and might lose your data or crash. Other OSes do not allow overcommitting of RAM, so RAM will never be exceeded; this means your system is stable even under low memory. Linux will cause problems when low on RAM, because it will kill some processes randomly.
https://lwn.net/Articles/553449/
This guy is not so clever. He complains that Solaris does not allow overcommitting RAM, which did not allow him to use Emacs under low memory conditions. Well, Solaris might not allow him to use Emacs under low memory conditions, but on Linux the system will start to kill random processes when he starts Emacs. Solaris will instead simply refuse to start Emacs.
This guy wishes that Solaris would behave like Linux: allow him to start Emacs, and at the same time kill another random process. How clever is that? The system might lose his data, or crash! I would not let him into a server hall.