The primary new feature of this latest release is this:
Block cloning is a facility that allows a file (or parts of a file) to be “cloned”, that is, a shallow copy made where the existing data blocks are referenced rather than copied. Later modifications to the data will cause a copy of the data block to be taken and that copy modified. This facility is used to implement “reflinks” or “file-level copy-on-write”. Many common file copying programs, including newer versions of /bin/cp on Linux, will try to create clones automatically.
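On a filesystem that supports it, you can also ask for a clone explicitly; a quick sketch with made-up file names:
# Clone if possible, silently fall back to a regular copy otherwise:
cp --reflink=auto bigfile.img bigfile-copy.img
# Or insist on a clone and fail if the filesystem can't do it:
cp --reflink=always bigfile.img bigfile-copy.img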
There are many more new features and fixes, of course, so head on over to the release page for more information.
This is one area where the open source community cannot come together. (Otherwise we don’t have any concerns like X/Wayland, GNOME/KDE, vim/….)
ZFS is generally considered a superior file system (albeit with some curious feature gaps, like the inability to shrink a filesystem, which is a concern for home or small business users, while enterprises are happy to just replace hard drives).
However, we can never have it in the Linux kernel. The carefully crafted CDDL was designed to fit the open source definition while being strictly incompatible with the GPL, and with Oracle now in charge, relicensing is even less likely to happen.
And that leaves me with BTRFS. Fortunately I have not had any issues so far, but they are really slow in catching up.
ZFS not being able to shrink a pool isn’t that odd. FS shrink is more of a recent FS feature. For instance, XFS doesn’t have a shrink feature, and it’s one of the more popular Linux filesystems.
Then again, ZFS isn’t the best filesystem for small systems. It’s much happier on servers with decent resources.
Flatland_Spider,
Technically that’s because it’s nearly trivial to tell a file system “oh BTW there’s more free space here” when you make a disk larger, but making a disk smaller means scanning the entire structure to make sure stuff gets moved out of sectors being removed before shrinking. Doing this while the FS is mounted makes it more difficult still.
btrfs’s approach to raid is extremely flexible, as you can add and remove volumes of arbitrary sizes like butter and everything gets relocated as necessary (I don’t think this expression is a thing, but hey, I’m sticking with it). This works because btrfs raid happens across smaller blocks rather than across whole block devices. I tested the dynamic re-balancing, and although it’s slow, it’s completely transparent. There’s a lot to like about btrfs (barring the limitations brought up in other comments).
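Roughly what that looks like in practice (device name and mount point are made up):
# Grow an existing btrfs filesystem with a disk of any size, then spread the data back out:
btrfs device add /dev/sdd /mnt/data
btrfs balance start /mnt/data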
I’m curious about whether it’s a solid replacement for mdraid. I hear it’s resource intensive, but I don’t have a good idea of what that actually means in practice. Is the performance impact of ZFS drastic? Is it just when features like deduplication are enabled, or is it just a heavy file system in general? I’ll probably make time to try ZFS in the next couple of months to get a hands-on feel for it. Maybe I should try to publish my findings as an article here, but very few of my submissions ever get posted, so I’m not sure.
> Technically that’s because it’s nearly trivial to tell a
> file system “oh BTW there’s more free space here”
> when you make a disk larger, but making a disk
> smaller means scanning the entire structure to
> make sure stuff gets moved out of sectors being
> removed before shrinking. Doing this while the FS
> is mounted makes it more difficult still.
I don’t consider the lack of shrink to be a missing feature. It’s a hard problem to solve, and one that I’d rather solve with soft partitions like subvolumes (btrfs) or datasets (ZFS).
> I’m curious about whether it’s a solid replacement for mdraid.
ZFS RAID-Z is solid. The Sun devs took the time to think about RAID and its problems.
ZFS can also export raw disk space via volumes, which can be formatted with a different FS, like LVM logical volumes. Btrfs doesn’t have an equivalent to this, which was annoying when I first started using btrfs.
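Something like this, with a hypothetical pool called tank:
# Carve out a 50G volume (zvol) and format it with whatever FS you want:
zfs create -V 50G tank/vmdisk
mkfs.ext4 /dev/zvol/tank/vmdisk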
> I hear it’s resource intensive, but I don’t have a good idea of
> what this actually means in practice? Is the performance
> impact of ZFS drastic?
ZFS likes RAM. I think it’s 1GB of RAM per TB of disk space.
ZFS being a CoW filesystem designed in the era of HDs also had something to do with that. Running an SSD as a cache (L2ARC) in front of HDs was recommended, but pure SSD arrays don’t require this, and it’s a better idea to add more RAM, if possible.
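Bolting an L2ARC onto an existing pool is a one-liner (pool and device names are placeholders):
# Add an SSD/NVMe device as a read cache in front of the spinning disks:
zpool add tank cache /dev/nvme0n1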
ZFS can be run on some low-end boxes (1GB of RAM), but that requires tuning. 4GB is a reasonable floor for an out-of-the-box install. Even then, using something else is probably a better idea for a low-RAM production box.
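The usual low-RAM tuning knob is capping the ARC, with something like this in /etc/modprobe.d/zfs.conf (the 1 GiB figure is just an example):
# Limit the ARC to 1 GiB so ZFS doesn't fight the rest of the system for memory:
options zfs zfs_arc_max=1073741824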
ZFS also likes JBODs rather than RAID arrays. It wants the disks dedicated to ZFS so it can manage them itself. It can be installed on HW RAID or MD RAID arrays, but it’s at its best with raw disks.
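In other words, hand it the bare drives and let it build the redundancy itself; a sketch with made-up device names:
# A 6-disk RAID-Z2 pool straight on the raw disks (by-id paths are usually preferred):
zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf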
> Is it just when features like deduplication are enabled or is it just a heavy file system in general?
“Don’t turn on dedup” is kind of a ZFS mantra. Dedup is very resource intensive, and I believe it also breaks ZFS’s native encryption.
It’s a cool idea, but it’s not worth the price in practice.
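If you’re curious what it would buy you, you can peek before committing (tank is a placeholder pool name):
# Simulate dedup against existing data and print the would-be dedup table:
zdb -S tank
# Dedup is a per-dataset property, which is part of the trap: it looks like a harmless toggle.
zfs set dedup=on tank/data
# On a pool that already has it enabled, check what you're actually getting:
zpool get dedupratio tank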
> I’ll probably make time to try ZFS in the next couple months to get a hands on feel for it.
You should; it’s a really cool FS, and the tooling experience is really nice. The division between the “zpool” and “zfs” tools is slightly annoying, but it’s better than the btrfs tooling experience.
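The split is roughly pool-level vs. dataset-level; for example:
# zpool manages the pool and its devices:
zpool status tank
zpool scrub tank
# zfs manages datasets, snapshots, and properties inside the pool:
zfs create tank/home
zfs set compression=lz4 tank/home
zfs snapshot tank/home@before-upgrade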
> Maybe I should try to publish my findings as
> an article here, but very few of my submissions
> ever get posted so I’m not sure.
“ZFS and Linux (2023)” would be interesting.
Mine are hit or miss, and I’ve been boring lately.
Flatland_Spider,
It is a missing check mark. A long time ago, this kind of one-directional resizing got us kids into a pickle with DoubleSpace on DOS, haha. One-way operations aren’t ideal, but I understand it may be less important for some.
Yes, that’s a good point. I store virtual machine images in LVM and it’s nice that ZFS offers the equivalent to this. btrfs would require me to store images as files.
Ouch, that is steep. Most of my NAS devices don’t satisfy that.
I hadn’t even thought of this in terms of ZFS. Years ago I tried flashcache and bcache. I was really interested in hybrid storage, but I was quite disappointed with their performance. I actually used a ramdisk as the cache drive to test for bottlenecks; they didn’t seem to be smart enough about what they were caching, and I wasn’t able to tune them. It’s been a long time and they may be more mature today.
https://medium.com/selectel/bcache-against-flashcache-for-ceph-object-storage-71d0151eb060
Combining flashcache or bcache with mdraid/lvm adds lots of complexity so I ended up not going down that path. ZFS supporting this internally is really cool to me.
I’ll have to learn this for myself, haha
Alfman, Flatland_Spider,
Interesting…
This puts the two use cases in disjoint sets.
If you need virtual machines and/or iscsi targets, yes, zfs’s volume support would be significantly better than using a file on btrfs.
On the other hand, if you are swapping out hard drives (to increase or decrease sizes), btrfs is still miles ahead in terms of usability. (You can, for example, replace a 4x6TB array with 2x18TB in place, freeing two slots and increasing capacity at the same time.)
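A rough sketch of that swap, with made-up device names:
# Add the two 18TB disks first:
btrfs device add /dev/sde /dev/sdf /mnt/data
# Then drop the 6TB disks; btrfs migrates their data as part of the removal:
btrfs device remove /dev/sda /dev/sdb /dev/sdc /dev/sdd /mnt/data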
> ZFS likes RAM. I think it’s 1GB of RAM per TB of disk space.
Yes, that is the recommendation when you use dedup. If you are not going to use dedup, then that much RAM is not needed. However, ZFS has a very good disk cache, so ZFS runs faster with lots of RAM. If you don’t have much RAM, then ZFS must access the hard disk all the time, which lowers performance significantly.
Many people run ZFS on 4GB RAM boxes. There is not a big disk cache, but it runs fine. I myself ran ZFS on a 1GB RAM Solaris PC for several years. If you google a bit, you will see someone has ported and run ZFS on a Raspberry Pi with 128MB of RAM. That is megabytes, not gigabytes.
The only reason to use lots of RAM with ZFS is to run dedup. If you don’t use dedup and don’t care about a fast disk cache – if you only care about ZFS’s unique data protection abilities (which no other filesystem offers; for instance, bugs in hard disk firmware are never detected or protected against – unless you use ZFS) – you can run ZFS on 128MB RAM computers just fine.
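That protection is exercised with a scrub; the pool name here is just an example:
# Read every block and verify it against its checksum:
zpool scrub tank
# Silent corruption (bad cables, firmware bugs, bit rot) shows up here and is repaired
# from a redundant copy when one exists:
zpool status -v tank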
sukru,
Yeah, the raid being set up statically is a con. Ideally you could reconfigure the array without rebuilding it from scratch (like btrfs).
I agree, the license incompatibility is a drawback. Which deserves the blame, CDDL or GPL? Many FOSS licenses allow mixing and matching, but unfortunately the GPL doesn’t play well with others. GPL 2 is even incompatible with its successor, GPL 3.
I had been looking forward to trying it, and I finally got around to running a barrage of tests on btrfs recently. Redundancy-wise and performance-wise btrfs performed as expected, and there were lots of things I liked. But reliability-wise it was a hard fail. It didn’t just fail in some niche cases either; every simulated raid failure involving a power cycle guaranteed administrative downtime. Unless you have another raid solution in place for the boot & root disks, btrfs will render the system unbootable. For many, raid is explicitly wanted to increase uptime & reliability, but btrfs doesn’t do that (switching to degraded requires administrative intervention). For people who have the time, expertise, and access to manually recover, btrfs may be ok. However, many of my computers are hundreds of miles away and are not immediately accessible. It’s a problem that I can’t even run a remote shell to troubleshoot remotely. So despite being intrigued by btrfs’s other features, the raid implementation leaves me deferring to mdraid + lvm for those who need failures to be handled without interruption. It doesn’t have all the bells and whistles of btrfs or zfs, but it is mature and works reliably.
I’d like to try ZFS next, but I’m more hesitant with out-of-tree drivers. That is a major con for me personally due to how regular Linux kernel ABI breakages affect my distro. Of course these affect every Linux distro, but for most users this happens upstream behind the scenes, so they don’t need to pay attention.
Alfman,
I wonder whether you mean RAID-5 by raid.
In my case I have used btrfs in one way or another for about 10 years, but I have always used RAID 1+0, and never RAID 5 or RAID 6. If I have a hard drive failure, btrfs can be booted in read-only mode with a hot spare (or a cold one), and the system can be used while it is recovering. (Or it can be booted read-write, if you are risk-seeking.)
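Roughly like this (device names and the devid are placeholders; `btrfs filesystem show` lists the real ones):
# Bring the array up without the dead disk (add ,ro if you only need read access):
mount -o degraded /dev/sdb /mnt/data
# To rebuild, with the array mounted read-write, point the missing devid at the spare:
btrfs replace start 1 /dev/sdd /mnt/data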
Or, the raid can be left to mdadm, and with some work it seems like even self healing would be supported: https://www.reddit.com/r/btrfs/comments/kpyodw/selfhealing_with_mdadm_raid5_working/ (But I prefer using btrfs native raid just to reduce complexity).
Anyway, for ZFS, I could recommend one of these distros with “root on ZFS” support, which would mean possibly better stability: https://openzfs.org/wiki/Distributions
sukru,
I meant mirrored raid (ie 1 or 10). I’ve heard that btrfs raid 5 and 6 are risky because filesystem integrity can be corrupted on power failure. In any case I didn’t test these raid levels myself.
The lack of hot spare support is another issue with btrfs, and one users have been asking about for a long time…
https://archive.kernel.org/oldwiki/btrfs.wiki.kernel.org/index.php/Project_ideas.html#Hot_spare_support
Of course, as an administrator you can provision a “cold spare” and then tell btrfs to add it later to a degraded array. But with other raid technologies, including mdraid, the system automatically regenerates the redundancy without having to be instructed to do it manually.
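For comparison, with mdraid the spare is part of the array definition and the rebuild kicks off on its own (device names are made up):
# A mirror plus a hot spare; md rebuilds onto the spare by itself when a member fails:
mdadm --create /dev/md0 --level=1 --raid-devices=2 --spare-devices=1 /dev/sda1 /dev/sdb1 /dev/sdc1
# Or add a spare to an existing array later:
mdadm /dev/md0 --add /dev/sdd1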
The issue with btrfs isn’t that raid recovery can’t be done properly – it can – but that it has to be told to do it. The lack of automation makes it unsuitable for deployments where immediate manual intervention is impractical and downtime is unacceptable.
https://linuxnatives.net/2015/using-raid-btrfs-recovering-broken-disks
I tested all of this in Ubuntu last month and it does work. However, btrfs is really not production-ready for high availability applications until this all gets automated and the system can continue running without being micromanaged. Incidentally, I’m building a NAS storage solution for my brother, who wants reliable storage but isn’t tech savvy enough to manage btrfs himself. I would have seriously considered btrfs if it had worked automatically! But making him dive into the command line is a no-go, and btrfs raid failed to automatically recover from all the disk failure scenarios I tested. This is “normal” for btrfs; the raid array is still recoverable, but unlike mdraid it puts the onus on an administrator to execute recovery actions.
That’s true, but then you’d lose the benefit of btrfs recovering from a mirror when silent read errors fail the FS integrity checks (although from what I’ve read this is rare). dm-integrity brings some of that benefit to mdraid. But I totally agree with you on the simplicity of using btrfs for everything. LVM2 has become complex and I want to replace it. It was with this in mind that I was hoping to be able to use btrfs.
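For the record, the dm-integrity route I mean looks roughly like this (partitions and names are made up):
# Give each member its own block checksums:
integritysetup format /dev/sda1
integritysetup open /dev/sda1 int-a
integritysetup format /dev/sdb1
integritysetup open /dev/sdb1 int-b
# Mirror the integrity-backed devices; a silently corrupted sector now returns a read
# error, which md then repairs from the other half of the mirror:
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/mapper/int-a /dev/mapper/int-b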
Well, I wasn’t really concerned about the big distros (they distribute kernels and fs modules that have already been built to be compatible with each other). It’s a much bigger problem for those of us installing our own mainline kernels, which makes ZFS a con for me because resolving the breakages quickly becomes tiring.
Alfman,
My recent experience was with Netgear’s ReadyNAS, which uses BTRFS as their underlying file system. (Before that I used to manually build and configure my NAS, again with BTRFS).
Anyway, two points:
1) It can do hot spares, even with multiple volumes.
2) It can do automated recovery of corrupted data, even using mdadm for RAID instead of the native BTRFS one.
Actually 3 points:
3) It can boot up in degraded mode, when one disk drive is missing/broken/corrupted/etc.
So, all of these are possible, albeit with a 3rd party ecosystem. As far as I know Synology does exactly the same.
And in the reddit link I shared, people discuss doing this with their own scripts. A hot spare, for example, should be trivial to implement for someone with your skillset. Checksum-based auto-correction would probably require mdraid (+ dm-integrity), unless those NAS manufacturers were so kind as to share their implementations. And mounting in degraded read-only mode has some caveats, but works.
Anyway, again good luck.
sukru,
Netgear products may be doing more than btrfs does on a normal linux distro. I did all of my testing on ubuntu. All of the discussions I found confirm the results I got were the “expected” results for btrfs and many users have the same concerns as I do.
I do see references in regard to both Netgear and Synology NAS devices; those are probably vendor-added features for their products. However, if you can find any information that suggests otherwise, I am interested in it!
Well, yes, but that’s cheating. Of course I can use mdraid, but the idea behind btrfs was to replace mdraid and take advantage of btrfs raid’s flexibility over traditional raid… The btrfs features are compelling, but the lack of automatic recovery is a huge con for me. I don’t want to have to micromanage btrfs or write scripts to handle btrfs failure modes under different operating systems (i.e. Ubuntu, CentOS, Debian) and circumstances (i.e. boot vs. a running system), etc.
I did this, but I saw many warnings not to attempt to set this as the default, because btrfs doesn’t kick the bad drive and corruption can result if the missing drive comes back. Again, maybe this too can be scripted, but there are a lot of edge cases that can go wrong in subtle ways when things don’t go as expected. IMHO btrfs should aim to become as mature at handling these as mdraid.
I don’t deny that features like automated recovery and hot spares could be provided with external tooling outside of btrfs, but 1) as far as I’m aware, no such tooling is included in normal distros, 2) it would need to work in various operating systems both at boot and while running, and 3) recovery and high availability should really be built into raid rather than needing more work and testing by the admin to get there.
Thanks. Maybe 3rd-party implementations of btrfs that fill in the gaps are fine, but given that I’m using the normal btrfs drivers in common distros, I think it’s fair to criticize the limitations of btrfs as it exists there.
> LVM2 has become complex and I want to replace it.
> It was with this in mind that I was hoping to be able
> to use btrfs.
ZFS works great as a LVM replacement. It’s so nice.
Btrfs not so much. It’s missing features, or the features have drawbacks and people discourage their usage. (Subvolume quotas, RAID, encryption.)
I would like btrfs to live up to all the promises it made, but that doesn’t look like it’s going to happen.
MD + LUKS + LVM + btrfs is the best way to approximate ZFS on Linux without using ZFS.
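For reference, that stack looks something like this (device names, sizes, and labels are placeholders):
# Redundancy:
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
# Encryption:
cryptsetup luksFormat /dev/md0
cryptsetup open /dev/md0 cryptpool
# Volume management and snapshots:
pvcreate /dev/mapper/cryptpool
vgcreate vg0 /dev/mapper/cryptpool
lvcreate -L 200G -n data vg0
# Checksums, compression, and subvolumes:
mkfs.btrfs /dev/vg0/data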
Flatland_Spider,
Thanks for your input. I’m already “approximating” such features using mdraid for raid and lvm for snapshots, with scripts to tie it all together. LVM2 provides essential features. It’s my go-to solution and gets the job done, but I’m not very pleased with the way it works and the evolutionary baggage. My interest was in finding a cleaner approach.
> Yeah, the raid being setup statically is a con. ideally you could
> reconfigure the array without rebuilding it from scratch (like btrfs).
I think this is being worked on. I’m not sure where it’s at though. It might have been released since this was several years ago, and it’s been a while since I’ve needed to know the status of ZFS ZRAID features.
> Redundancy-wise and performance wise btrfs performed
> as expected and there were lots of things I liked. But
> reliability-wise it was a hard fail.
The conventional wisdom is to run btrfs on top of an MD RAID array, if RAID is needed. That’s one of the many btrfs features which are half-baked.
As I’ve been told, btrfs is just a filesystem, and it’s better to think of it as a replacement for ext4 rather than a ZFS competitor.
My favorite place to run btrfs is on desktops/laptops and VMs. For physical servers, I end up with XFS and ZFS quite a bit, especially since I work in the RH ecosystem mostly.
> I’d like to try ZFS next, but I’m more hesitant with
> out of tree drivers. That is a major con for me
> personally due to how regular linux kernel ABI
> breakages affect my distro.
The ZFS drivers being out of tree isn’t bad, especially if a stable kernel is used. Using DKMS rather than the kmod helps as well.
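On Debian/Ubuntu that’s the zfs-dkms route, which rebuilds the module on every kernel update:
apt install zfs-dkms zfsutils-linux
# Check that the module was built for the running kernel:
dkms status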
Flatland_Spider,
Oh, really? I hadn’t heard that (well except for raid 5/6) but maybe that is why sukru suggested the specific combination of mdraid + btrfs.
To me though, I was interested in replacing mdraid with btrfs raid to take advantage of btrfs raid flexibility – never again would I have to worry about static raid configurations and matched disks. Adding/replacing disks arbitrarily and transparent re-balancing are btrfs pros compared to mdraid (and ZFS from what I gather).
I realize that Redhat are promoters of ZFS, but to your knowledge is there any technical reason to reconsider using it on non-RH distros? I would have presumed that “ZFS here is the same as ZFS there”, but I might be missing something.
I typically download and use the latest stable mainline kernels in my distro, these are often much newer than the debian/ubuntu kernels and often include new drivers and bug fixes too. I could peg my distro to a redhat or ubuntu release cycle, then I’d be more compatible with their ZFS builds, but it’s a trade off. For example my motherboard’s 2.5gbps NIC needs a newer driver that isn’t supported by the kernel that ships with debian. So do I use a kernel that supports this NIC or a kernel that’s compatible with specific out of tree ZFS drivers? Note that I expect this specific incompatibility to get fixed with time, but it’s just one example of the type of issues caused by kernel ABI instability. In the past I would personally patch the out of tree drivers to make them compatible, but after years of that I developed an aversion to out of tree drivers and this is the reason why.
> I realize that Redhat are promoters of ZFS,
> but to your knowledge is there any technical
> reason to reconsider using it on non-RH
> distros?
RH doesn’t promote ZFS. They like XFS.
There are scientific labs which run CentOS who really like ZFS for storage, and they are the big drivers of ZFS on Linux.
It’s a good storage FS. Proxmox, which is Debian based, uses it for their storage FS, and I’m going to find out how it works with OpenSUSE Tumbleweed and LEAP.
> I typically download and use the latest stable
> mainline kernels in my distro, these are often
> much newer than the debian/ubuntu kernels
> and often include new drivers and bug fixes too.
OpenZFS does a good job of keeping up with kernel releases. The latest release supports the 3.10-6.5 kernels. It might be a few weeks after new kernels are released before OpenZFS catches up, but it’s not too bad.
It’s not so much that OpenZFS targets the stable kernels; it’s more that they lag a little bit after new kernels are released, and a stable kernel allows more fire-and-forget updates because, as you’ve pointed out, the ABI doesn’t change. I’ve run Fedora, which pushes frequent kernel updates, with ZFS, and I did need to pay attention to kernel updates. The ZFS DKMS module worked fine, but when I did forget, my storage drives would go missing until ZFS released updates for the newer kernel. LOL (I don’t run ZFS on root unless it’s a FreeBSD box.)
I’m going to see how Tumbleweed is with ZFS. From what I’ve heard, it’s even more aggressive with updates than Fedora, so we’ll see how it goes.