There are few things more frustrating than paying for high-speed PC components and then leaving performance on the table because software slows your system down. Unfortunately, a default setting in Windows 11 Pro, which enables software BitLocker encryption during installation, can rob as much as 45 percent of the speed from your SSD as it forces your processor to encrypt and decrypt everything. According to our tests, random writes and reads, which affect the overall performance of your PC, get hurt the most, but even large sequential transfers are affected.
While many SSDs come with hardware-based encryption, which does all the processing directly on the drive, Windows 11 Pro force-enables the software version of BitLocker during installation, without providing a clear way to opt out. (You can circumvent this with tools like Rufus, if you want, though that’s obviously not an official solution as it allows users to bypass Microsoft’s intent.) If you bought a prebuilt PC with Windows 11 Pro, there’s a good chance software BitLocker is enabled on it right now. Windows 11 Home doesn’t support BitLocker, so you won’t have encryption enabled there.
Nothing like buying a brand new PC and realising you’re losing a ton of performance for something you might not even need on a home PC.
The article does a good job of highlighting facts, but doesn’t go into the “why” aspects. As a software developer, I am curious what the bottlenecks really are. BitLocker uses an AES-128 cipher that has been hardware accelerated for over a decade. CPU AES performance is very roughly 500MB/s – 3000MB/s per core, depending on CPU model, age, block chaining mode, and so on.
https://calomel.org/aesni_ssl_performance.html
This should NOT pose a CPU bottleneck *if* Windows performs the crypto multithreaded; however, tomshardware’s benchmark scores of around 600MB/s *might* be explained if the crypto implementation is single threaded. It would be a silly oversight, but the good news is that it should be fixable if this were the problem.
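For anyone who wants a quick sanity check of the raw cipher speed, here’s a minimal sketch using Python’s pyca/cryptography package (which calls into OpenSSL and uses AES-NI where available). It measures single-threaded AES-XTS throughput, since XTS is what BitLocker defaults to on recent Windows; it obviously says nothing about BitLocker’s actual I/O path, the buffer and round counts are arbitrary.

```python
# Rough single-threaded AES-128-XTS throughput check (not BitLocker itself).
# Requires the pyca/cryptography package; sizes are arbitrary and kept under
# the 16 MiB-per-XTS-data-unit limit that OpenSSL enforces.
import os
import time

from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key = os.urandom(32)                # AES-128-XTS takes a 256-bit (double) key
tweak = os.urandom(16)              # per-"sector" tweak; fixed here for simplicity
buf = os.urandom(8 * 1024 * 1024)   # 8 MiB of pseudo-random "sector data"

rounds = 128                        # 1 GiB processed in total
start = time.perf_counter()
for _ in range(rounds):
    # Fresh encryptor each pass, since XTS is not a streaming mode
    Cipher(algorithms.AES(key), modes.XTS(tweak)).encryptor().update(buf)
elapsed = time.perf_counter() - start

print(f"single-threaded AES-128-XTS: {rounds * len(buf) / elapsed / 1e6:.0f} MB/s")
```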
Another potential bottleneck could be that the encryption block size is larger than the sectors written. If this is the case, BitLocker might have to read a full block, modify it, and write the full block back. In other words, a single sector written ends up turning into larger operations that consume more bandwidth under the hood (similar to, but not exactly the same as, how raid5/6 writes turn into multiple read/write operations to compute parity). This could also explain the tomshardware benchmarks. This work multiplication is bad for performance, and might not be easy to fix without more significant changes to BitLocker; if the performance loss is caused by unavoidable work amplification, it will not be fixable.
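To put made-up numbers on that read-modify-write theory (the block and sector sizes below are hypothetical, purely for illustration):

```python
# Hypothetical read-modify-write amplification if the crypto unit were larger
# than the write. All sizes here are invented for illustration only.
crypto_block = 64 * 1024   # assumed encryption block size
write_size = 4 * 1024      # a single 4 KiB random write from the application

# Worst case: read the whole block, decrypt, patch in the 4 KiB, re-encrypt, write back.
bytes_moved = 2 * crypto_block   # one full-block read plus one full-block write
print(f"I/O amplification: {bytes_moved / write_size:.0f}x")   # -> 32x for these sizes
```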
This could probably be tested inside of a VM to see what’s going on at the hardware interface, but I don’t personally have copies of windows 11 to try this on. If anyone else does, we could get to the bottom of it by running some of our own tests.
PCIe 5.0 x4 M.2 SSDs have about 15.8GB/s of bandwidth in each direction; PCIe 4.0 is half that, and PCIe 3.0 half again, which is still roughly 4GB/s, so just a bit of overhead even on PCIe 3.0 systems. To not slow down the fastest 4-lane M.2 SSDs, and even then only a single one, you’d need around 16GB/s of encryption bandwidth.
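For reference, the per-direction numbers for an x4 M.2 link work out roughly like this (ignoring protocol overhead):

```python
# Approximate per-direction bandwidth of a 4-lane M.2 link, ignoring protocol overhead.
# PCIe 3.0 carries ~0.985 GB/s per lane and each later generation roughly doubles that.
per_lane_gbps = {"PCIe 3.0": 0.985, "PCIe 4.0": 1.969, "PCIe 5.0": 3.938}
for gen, per_lane in per_lane_gbps.items():
    print(f"{gen} x4: ~{4 * per_lane:.2f} GB/s per direction")
# -> ~3.94, ~7.88 and ~15.75 GB/s respectively, which is roughly the encryption
#    throughput you'd need to stay out of a top-end drive's way.
```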
cb88,
Tomshardware specifically tested this SSD:
https://www.samsung.com/us/computing/memory-storage/solid-state-drives/990-pro-w-heatsink-pcie-4-0-nvme-ssd-4tb-mz-v9p4t0cw/
Note that even the unencrypted tests didn’t come close to maxing out the PCIe link capacity. So either NTFS or the SSD itself is causing a bottleneck before PCIe 4.0 is saturated, much less PCIe 5.0.
When it comes to encryption, it makes sense to look at the CPU for any slowdowns relative to unencrypted, however that assumes the same IO write pattern, which as I mentioned may not be the case. While I am not equipped to measure BitLocker bottlenecks, what I can do is measure the AES bandwidth that Intel CPUs are capable of. Tomshardware used an i9-12900K CPU, which I don’t have, but I measured AES bandwidth on the slower i9-11900K and it was easily able to beat their benchmarks.
So while I don’t have the full picture of what BitLocker is doing, it seems rather unlikely to be a CPU bottleneck. Even single-threaded encryption should have performed faster than what they measured. Note that the bandwidths above are for single-threaded encryption! I confirmed that they scale linearly as you add more threads.
So I have to do some hand waving because I’m testing openssl on Linux rather than BitLocker on Windows. But still, their CPU should have been able to provide enough bandwidth several times over. So my guess is that the performance loss is caused by I/O amplification when BitLocker encryption is used.
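For anyone who wants to repeat the check on their own Linux box, this is roughly the idea: drive `openssl speed` with increasing process counts and watch the aggregate throughput. It assumes an openssl binary on PATH; aes-128-ctr is just a stand-in for BitLocker’s XTS mode, which is close enough for a scaling check.

```python
# Run `openssl speed` at several process counts and print the tail of each
# report (where the throughput table lives). -evp and -multi are standard flags.
import subprocess

for procs in (1, 2, 4, 8):
    result = subprocess.run(
        ["openssl", "speed", "-evp", "aes-128-ctr", "-multi", str(procs)],
        capture_output=True,
        text=True,
        check=True,
    )
    print(f"--- {procs} process(es) ---")
    print("\n".join(result.stdout.strip().splitlines()[-3:]))
```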
Alfman,
In addition to possibly being lazy wrt. single- vs. multi-threaded encryption, they probably want to avoid dedicating 100% of CPU cycles to disk I/O, since the CPU is generally expected to be doing “other things”. (You can’t simply make it run as “low priority” either, since this is actually a high-priority task; deprioritizing it would break parallel programs.)
There are possible solutions of course (adding a new special class in the scheduler could be a starting point).
All of these require more work, from a team that is probably tasked with many other things.
Side note: This will also break DirectStorage on recent graphics cards, since it will no longer be possible to transfer data directly from the NVMe drive to the GPU.
Actually, it will also break “zero-copy” network services.
(And at this point, I might be really upset with SSD chipset manufacturers giving us broken crypto, forcing Windows to move this to the CPU).
sukru,
They have a lot of options in selecting what to prioritize, but unless those cores are actually busy, there’s no reason not to use them if you need more encryption bandwidth. A new scheduler class could clearly do the trick, but I don’t even think it’s necessary. IMHO it would be effective to make the main IO thread inherit the priority of the process and spawn other worker threads with lower priority, up to some max, so that they’ll use idle cycles without interrupting other work. Except for low-end CPUs, it’s rare not to have many idle cores under most applications, and I suspect there were idle cores available during tomshardware’s benchmarks.
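Something along these lines is what I have in mind, sketched here with worker processes and Unix niceness since Python can’t portably lower a thread’s priority; encrypt_chunk is just a placeholder, and a real implementation would of course live in the kernel.

```python
# Sketch of "soak up idle cores at low priority": the main process keeps its
# normal priority while the workers are re-niced so they mostly consume
# otherwise-idle cycles. os.nice() is Unix-only; encrypt_chunk is a stand-in.
import os
from concurrent.futures import ProcessPoolExecutor

def deprioritize() -> None:
    os.nice(10)   # lower each worker's scheduling priority

def encrypt_chunk(chunk: bytes) -> bytes:
    return chunk  # placeholder for the per-chunk AES work

if __name__ == "__main__":
    chunks = [os.urandom(1 << 20) for _ in range(8)]
    with ProcessPoolExecutor(max_workers=os.cpu_count(),
                             initializer=deprioritize) as pool:
        encrypted = list(pool.map(encrypt_chunk, chunks))
    print(f"processed {len(encrypted)} chunks")
```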
Although, given how fast a single thread can do AES, I don’t think the CPU was responsible for their BitLocker bottlenecking in the first place. This is why I suggested the block size theory could be the root cause.
Yes, I wonder what the impact is there. In principle, the decryption could be GPU accelerated. I know that storage media with onboard crypto exists, and I would think that its decryption would be transparent to DirectStorage/the GPU even today, so this only applies to CPU decryption.
I am curious what kind of performance difference DirectStorage really makes in the first place. Copying through memory naturally has to add some latency, but it’s still very fast. Decryption too. It’s an interesting point.
I haven’t used encrypted media and am not sure what you mean; can you elaborate on what is broken with them?
sukru,
(Starting a new thread on purpose).
I found this information about direct storage.
https://www.tomshardware.com/news/directstorage-performance-amd-intel-nvidia
Previously I was under the impression that DirectStorage loaded contents directly into the GPU without copying through RAM, but the diagram shows this is not the case. It’s just that the decompression is moved onto the GPU. Given that this is the case, BitLocker shouldn’t have an impact on the DirectStorage model. Contents can be decrypted upon loading into RAM while DirectStorage does its normal thing.
Moving the decompression onto the GPU has a huge performance advantage, which makes sense to me. The chart makes clear that the decompression is bottlenecked by the CPU.
Alfman,
Interesting, I always thought they had “peer to peer” DMA support:
https://www.kernel.org/doc/html/next/driver-api/pci/p2pdma.html
Turns out Windows does not have it even for the network server APIs:
https://learn.microsoft.com/en-us/previous-versions/windows/desktop/cc904344(v=vs.85)
(They just skip additional copies and kernel context switches)
As for using this on the GPU, the gain is not latency. Or rather, not only latency, but a huge decrease in VRAM usage as well.
https://github.com/GameTechDev/SamplerFeedbackStreaming
This is a demo from Intel for “Sampler Feedback Streaming”. It basically showcases reducing texture RAM usage by over 99.5%! How does it work? Every frame, the GPU calculates exactly which textures (mipmaps) are necessary and asks for them to be loaded before the next frame comes in. If the throughput is fast enough (especially with native decompression), you basically no longer need to preload any textures at all.
(As far as I know, no game is using this technique, since it would mean it only runs on RDNA2 or Turing-and-later cards and the Xbox Series consoles, but nothing prior, including the PS5, which doesn’t have the necessary hardware.)
sukru,
I’ve always taken it to mean no userspace copying. Syscalls like sendfile work with files, but my expectation has always been that it still goes through the normal kernel disk reading channels rather than bypassing them; I suspect that bypassing the system caches would cause more of a bottleneck than an improvement. (There’s a minimal sendfile sketch after the man page links below.)
man7.org/linux/man-pages/man2/splice.2.html
man7.org/linux/man-pages/man2/vmsplice.2.html
man7.org/linux/man-pages/man2/sendfile.2.html
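Roughly, the sendfile path looks like this (a minimal Linux-only sketch; the file’s pages go from the page cache to the socket without a detour through userspace buffers, but the read itself still takes the normal kernel disk path):

```python
# Minimal sendfile() sketch: file pages flow page cache -> socket with no
# userspace copy. Caller is responsible for the connected socket.
import os
import socket

def serve_file(conn: socket.socket, path: str) -> None:
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            # os.sendfile returns how many bytes were handed to the socket
            offset += os.sendfile(conn.fileno(), f.fileno(), offset, size - offset)
```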
The VRAM still needs to hold the uncompressed copies though. Keeping compressed assets in VRAM is technically possible, but you’d have to sacrifice active texture & asset memory, which is already quite limited, so I don’t think that’s what they are doing. I think DirectStorage keeps compressed assets in system RAM and decompresses them into VRAM on demand. This enables rapid swapping of new assets while still offering full use of VRAM for the rendering pipeline.
A while back I wrote a CUDA benchmark. Here’s a 3080 Ti on a PCIe 4.0 system:
DirectStorage compression multiplies the effectiveness of PCIe bandwidth over decompression via the CPU. Obviously decompressed assets would require significantly more bandwidth to transfer to the GPU. I would think that DirectStorage not only improves load times, but also enables more in-game assets on GPUs with rather limited VRAM, thanks to the ability to request thousands of high-def compressed textures per second from system RAM.
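Back-of-the-envelope, with a made-up compression ratio:

```python
# If assets cross the bus compressed and are expanded on the GPU, the effective
# asset bandwidth is roughly the link bandwidth times the compression ratio.
# The 2:1 ratio here is invented purely for illustration.
link_gbps = 16 * 1.969        # PCIe 4.0 x16 GPU link, per direction, ~31.5 GB/s
compression_ratio = 2.0
print(f"effective asset bandwidth: ~{link_gbps * compression_ratio:.0f} GB/s")
```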
Well now I’m seeing RTX-IO with a diagram that bypasses memory…
https://www.nvidia.com/en-us/geforce/news/rtx-io-gpu-accelerated-storage-technology/
How perplexing! This made me wonder if the tomshardware diagram is factually wrong, but Microsoft’s docs document it the same way tomshardware does. I’m inclined to treat Microsoft’s documentation as authoritative.
https://github.com/microsoft/DirectStorage/blob/main/Docs/DeveloperGuidance.md
(lots of good details about direct storage, btw)
The Nvidia document does not specifically say it bypasses memory, so did Nvidia’s artist take the liberty of drawing it that way and get it wrong? Even the chart next to the diagram suggests the CPU is not totally bypassed (i.e. by depicting a CPU load > 0).
If we assume all the diagrams are technically correct, then it would seem to imply that “DirectStorage” as documented by Microsoft is not equal to “RTX-IO” as documented by Nvidia: one bypasses system memory whereas the other doesn’t.
This source explicitly declares that system memory is bypassed…
https://techreport.com/news/what-is-nvidia-rtx-io/
However, in attempting to fact-check the claims, the author links to the exact same Nvidia document, which says nothing of the sort. It makes me wonder whether those conclusions were drawn from the diagram alone.
It’s late and maybe I’m missing something. If you are able to find information that explains the different diagrams, please share! I don’t like this uncertainty at all, haha.
This AMD video answers so many of our questions! It clearly shows how the directstorage data pipeline works, and indeed it goes through memory.
https://youtu.be/LvYUmVtOMRU?t=884
This video also specifically addresses your point about BitLocker! DirectStorage bypasses all the standard Windows disk & filesystem kernel filters unless something like compression or encryption prevents it.
Lots of great information in this video!
I’m going to chalk up Nvidia’s contradictory diagram to an illustrator getting it wrong. Going by the information from all other authoritative sources, diagram 3 should look more like diagram 1.
https://www.nvidia.com/en-us/geforce/news/rtx-io-gpu-accelerated-storage-technology/
Alfman,
I think we have been talking in two separate threads at this point.
Yes, DirectStorage might not be skipping the CPU, but DMA based implementations are also a thing: https://en.wikipedia.org/wiki/Zero-copy
That being said, SFS is another technique, built on top of DirectStorage (or similar APIs) and special hardware support (hence sampler “feedback”), that could potentially remove any need to load textures manually:
https://github.com/GameTechDev/SamplerFeedbackStreaming
The idea is that, in regular engines, you need to estimate which textures are potentially on the screen. You almost always need to overshoot, and every now and then you undershoot and get “pop ins”.
SFS gives you the exact list. Meaning you don’t need to load anything that is not visible. In the Intel demo, they use “just ~200MB of a 1GB heap, despite over 350GB of total texture resources.” (That seems to be the total data streamed. The textures themselves are 13GB, making savings 1 – (0.2 / 13) = 98%).
Anyway, that is another discussion.
sukru,
Yes. I’m not sure how often this is used in practice though? The thing is bypassing the CPU might not be that beneficial if the CPU wasn’t a primary bottleneck to begin with. Take network cards capable of building packets themselves with (true) zero copy DMA operations. Operating systems may end up copying data anyways because it’s much easier to implement their userspace APIs this way (ie non-blocking writes). And sometimes the accelerators actually perform slower than the CPU operations they are replacing. Additionally hardware accelerators are very difficult to change/update.
https://techcommunity.microsoft.com/t5/core-infrastructure-and-security/why-are-we-deprecating-network-performance-features-kb4014193/ba-p/259053
So I don’t know what the ideal is. It’s one of those things where CPU bypass sounds like a good idea, but could prove to be complex (like you noted with BitLocker). You also lose things like caching.
SFS has been coming up a lot in relation to direct storage. The technique itself isn’t new, it’s just becoming standardized.
I thought I read that unless you were willing to render the frame twice, there was a frame delay (ie it tells you the resources needed by the previous frame).
Since the bandwidth is finite, very fast scene changes will still overload the pipeline. I don’t know if I can find it again, but IIRC one of the SFS testing demos incurred 10-40% framerate drops while decompressing new assets on the fly. Obviously it depends on the game/assets/etc, but the point being even if you only needed 200MB for the current frame, it’s probably best to keep more assets loaded in the GPU when they have a high probability of being needed in the future. IMHO the benefit of SFS is when you are forced to evict resources from VRAM because you are out of space, but not to do so prematurely simply because you rotated the camera and an asset is out of view.
Yes. Another completely different discussion is the effects on NAND storage itself. Normally we only think about longevity in terms of writes, but reads have an impact on flash cells too.
https://en.wikipedia.org/wiki/Flash_memory#Read_disturb
It’s virtually a non-problem when the OS caches reads in memory. But if you were to design something that quickly and regularly cycled through read/discard/read/discard operations without using a cache, then instances of this “read disturb” phenomenon could increase significantly as you played a game. Mature flash controllers should be mitigating this by rewriting frequently read areas before a read disturb happens, but IMHO it’s better to cache frequently used assets than to re-read them constantly.