When we store data, we typically assume that our storage system of choice later returns that data exactly as we put it in. But what guarantees do we have that this is actually the case? The concern addressed here is bitrot: the silent degradation of the physical charges that make up the data on today’s storage devices.

To counter this type of problem, one can employ data checksumming, as done by both btrfs and ZFS. In the long run btrfs might be the tool of choice for this, but it is fairly complex and not yet particularly mature. ZFS, the most prominent candidate for this type of feature, is not without hassle either: it must be recompiled for every kernel update (although automation exists).

In this blogpost, we’ll therefore take a look at a storage design that actually checks whether the returned data is valid and has not been silently corrupted inside our storage system, built entirely from components available in Linux itself, without the need to recompile and test your storage layer on every kernel upgrade. We find that this storage design, while fulfilling the same purpose as ZFS, not only yields comparable performance, but in some cases is even able to significantly outperform it, as the benchmarks at the end indicate.

The setup

The system we will test in the following will be constructed from four components (including the filesystem):

  • dm-integrity provides block-level checksumming for all incoming data and verifies said checksum for all data read through it (returning a read error if it doesn’t match).
  • mdraid provides redundancy if a disk misbehaves, avoiding data loss. If the dm-integrity layer detects invalid data from a disk, mdraid sees a read error and can immediately correct it by reading the data from elsewhere in the redundancy pool.
  • dm-crypt, on top of the RAID and right below the filesystem, encrypts the data1
  • ext4 as a filesystem to actually store files on

The order matters: dm-integrity must be placed between the hard disks and mdraid, to ensure that data corruption errors detected by dm-integrity can be corrected by mdraid. dm-crypt, on the other hand, has nothing to do with redundancy; it should therefore be placed on top of mdraid, so that data only has to pass through it once, and not multiple times as would be the case if it were placed alongside dm-integrity.2

This yields the following storage design:

Storage system architecture

Architecture of the storage system. Grey depicts physical hardware, green depicts device-mapper technologies, yellow indicates mdraid and blue indicates filesystems. The encryption layer will later be omitted when comparing performance to ZFS.


This script will assemble the above depicted system and mount it to /mnt.
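
As an illustration, here is a hedged sketch of what such an assembly could look like using integritysetup, mdadm, cryptsetup and mkfs.ext4; the device names are hypothetical placeholders and the linked script remains the authoritative version:

# Sketch only: adjust device names to your system.
DISKS="/dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1"

# 1. dm-integrity on each disk (crc32 checksums by default);
#    "format" initializes the checksums by wiping the device once.
i=0
for d in $DISKS; do
    integritysetup format "$d"
    integritysetup open "$d" "int$i"
    i=$((i + 1))
done

# 2. mdraid with double parity (RAID6) on top of the integrity devices.
mdadm --create /dev/md0 --level=6 --raid-devices=4 \
      /dev/mapper/int0 /dev/mapper/int1 /dev/mapper/int2 /dev/mapper/int3

# 3. dm-crypt (LUKS) on top of the RAID, right below the filesystem.
cryptsetup luksFormat /dev/md0
cryptsetup open /dev/md0 cryptstore

# 4. ext4 on top, mounted to /mnt.
mkfs.ext4 /dev/mapper/cryptstore
mount /dev/mapper/cryptstore /mnt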

Resulting performance

Now that we have an assembled system, we’d like to quantify the performance impact of the different layers. The test setup for this is a Kobol Helios4 (primarily because it’s the intended target platform for this storage system) with four disks3, running Debian 10 “Buster”. The Helios4 is powered by an energy-efficient dual-core ARM SoC optimized for storage systems. The setup was not built on the entire disks, but on partitions of 10GiB each, which allows multiple designs to exist in parallel and therefore eases the benchmarking procedure, as well as speeding up the experiments4.

Layer analysis of performance impact

Throughput

To benchmark throughput, the following commands were used:

# write:
echo 3 > /proc/sys/vm/drop_caches  # drop all caches
dd if=/dev/zero of=/path/to/testfile bs=1M count=8000 conv=fdatasync

# read
echo 3 > /proc/sys/vm/drop_caches  # drop all caches
dd if=/path/to/testfile of=/dev/zero conv=fdatasync

This procedure was repeated ten times for proper statistics, for both read (r) and write (w), in each of the following configurations (a sketch of such a benchmark loop follows the list):

  • ext4 filesystem only (f, one disk only)
  • encryption, topped by ext4 (cf, one disk only)
  • mdraid (RAID6/double parity), topped by the former (rcf, 4 disks)
  • the final setup, including integrity (ircf, 4 disks)
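
A loop along the following lines can drive the ten repetitions per configuration; this is a sketch, not the exact harness used (dd reports its throughput on stderr, which is collected into hypothetical log files here):

# Sketch only: /mnt is the mountpoint of the configuration under test.
for run in $(seq 1 10); do
    echo 3 > /proc/sys/vm/drop_caches
    dd if=/dev/zero of=/mnt/testfile bs=1M count=8000 conv=fdatasync 2>> write.log
    echo 3 > /proc/sys/vm/drop_caches
    dd if=/mnt/testfile of=/dev/zero conv=fdatasync 2>> read.log
done
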
Throughput for different arrangements of layers on the Kobol Helios4 system for both read (top row) and write (bottom row). Each dot indicates one measurement. f: filesystem, cf: crypto+f, rcf: raid+cf, ircf: integrity+rcf. Testplatform


We see several interesting results in the data: First of all, the encryption engine on the Helios4 is not completely free of performance impact, although the resulting performance is still more than sufficient for the designated uses of a Helios4.

Secondly, we see that adding the integrity layer has a noticeable impact on writing, but a negligible impact on reading, indicating that especially for systems primarily intended to be read from, adding the integrity layer comes at negligible cost.

For write-heavy systems, the performance impact is more considerable, but for certain workloads, such as the home-NAS case the Helios4 is designed for, the performance can still be considered fully sufficient, especially as the system would normally cache those writes to a certain extent, which was explicitly disabled for benchmarking purposes (see conv=fdatasync in the benchmark procedure).

The reason for the degradation in write (but not in read) performance is likely that the integrity layer and the mdraid layer are decoupled from one another: for a double-parity setup the RAID layer effectively writes the same information three times, and the integrity layer has to account for all three writes before the data is considered synced, as required by conv=fdatasync.

Latency

We have found that such a storage design yields useful throughput rates. The next question is about the latency of the system, which we will, for simplicity, only estimate for random 4K reads. Again, just as above, we will investigate the impact of the different layers of the system. To do so, after dropping all caches we read 100,000 sectors at random positions from an 18GB testfile filling the storage mountpoint, for each configuration.
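
One way to sample such random 4K reads is sketched below; this is an illustration rather than the exact script used, and the testfile path as well as the timing via date are assumptions:

# Sketch only: sample 100,000 random 4K reads and record their latency.
TESTFILE=/mnt/testfile                   # 18GB file filling the mountpoint
BLOCKS=$((18 * 1024 * 1024 / 4))         # number of 4KiB blocks in the file

echo 3 > /proc/sys/vm/drop_caches        # drop all caches once up front

for i in $(seq 1 100000); do
    off=$(( (RANDOM * 32768 + RANDOM) % BLOCKS ))    # pseudo-random block offset
    start=$(date +%s%N)
    dd if="$TESTFILE" of=/dev/null bs=4K count=1 skip="$off" 2>/dev/null
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))              # latency in milliseconds
done > latencies.txt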

Latency comparison of system layers

Distribution of latencies in milliseconds for different arrangements of layers on the Kobol Helios4 system. Green denotes the median, black denotes the average of the latencies of the respective setup; access times below 1ms were considered cache hits and therefore excluded from the computation of median and average. Testplatform


The figure above yields several interesting insights: First of all, we do see cache hits close to zero milliseconds. Furthermore, the latencies are spread fairly evenly over the available range. Finally, and most interestingly, the impact of the several layers on latency is measurable, but rather irrelevant for typical practical purposes.

Performance comparison with ZFS

So far we have tested the setup on its intended target platform, the Helios4. To compare the resulting system against the elephant in the room, ZFS, we will use a different test platform, based on an Intel i7-2600K CPU and 4x1TB disks, as zfs-dkms was not reliably buildable on Debian Buster on ARM, and when it actually built, it explicitly stated that 32-bit processors (such as the Helios4’s CPU) are not supported by upstream, although technically the system would run.

To allow for a cleaner comparison, the testbed was accordingly changed to accommodate ZFS’s preferences. As ZFS did not adhere to conv=fdatasync5, the main memory was restricted to 1GB, swap was turned off and the size of the testfile was chosen to be 18GB. This way, any caching would at least be significantly reduced, as there was little space left in main memory next to the OS for caching.

All tests were run on Debian 10 “Buster” with Linux 4.19; ZFS was used in the form of the zfs-dkms package in version 0.8.2 from backports. The storage layer for both setups was laid out with double parity (RAID6/raidz2) and, as the zfs-dkms package in Debian was not able to do encryption, the mdraid-based setup was also set up without the encryption layer.
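
For reference, the two storage layouts could be created roughly as follows; this is a hedged sketch with hypothetical device, pool and array names, not the exact commands used:

# ZFS: double-parity pool (raidz2) across the four disks.
zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd

# md-based counterpart without the encryption layer:
# dm-integrity per disk, RAID6 on top, ext4 directly on the array.
for d in a b c d; do
    integritysetup format "/dev/sd$d"
    integritysetup open "/dev/sd$d" "int_$d"
done
mdadm --create /dev/md0 --level=6 --raid-devices=4 \
      /dev/mapper/int_a /dev/mapper/int_b /dev/mapper/int_c /dev/mapper/int_d
mkfs.ext4 /dev/md0
mount /dev/md0 /mnt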

Throughput

The commands used for benchmarking throughput were conceptually the same as above:

# write:
echo 3 > /proc/sys/vm/drop_caches  # drop all caches
dd if=/dev/zero of=/path/to/testfile bs=1M count=18000 conv=fdatasync

# read
echo 3 > /proc/sys/vm/drop_caches  # drop all caches
dd if=/path/to/testfile of=/dev/zero conv=fdatasync

Although ZFS seemingly does not adhere to any cache-dropping or syncing instructions, they were still performed, and were adhered to by the mdraid-based setup.

Throughput comparison, zfs & md

Throughput for both ZFS and the md-based setup for both read and write. Each dot indicates one measurement, the green line indicates the median of all measurements. Testplatform


The results are very interesting: While the md-based setup performs less consistently in and of itself, it still consistently outperforms ZFS in read performance. When it comes to writes, though, ZFS performs noticeably better.6

To investigate the cause of this unexpected balance, we note that while ZFS combines RAID, integrity and filesystem in one component, in the md-based setup these are separate components. Of these components, not only the filesystem but also dm-integrity implements journalling to avoid inconsistencies in case of a power outage. This leads to additional work until a transaction has been fully flushed to disk, which can be seen in the next figure, where the md-based system (without the encryption layer) is tested with journalling both enabled and disabled in the integrity layer:

Throughput effect of integrity journaling

Throughput for the md-based setup, with journalling enabled and disabled in the integrity-layer for both read and write. Each dot indicates one measurement, the green line indicates the median of all measurements. Testplatform


We find that write performance increases significantly (and, taking the factor of increase into account, it actually surpasses the write performance of ZFS). We can therefore conclude that the Linux components are very capable of matching ZFS performance-wise, at least in low-memory environments, and especially outperform ZFS in read performance. In a purely throughput-optimized configuration (such as when an uninterruptible power supply is installed, so that the guarantees of journalling are not necessarily crucial) it even outperforms a standard ZFS configuration in both read and write7. Before you disable integrity-journalling in production, however, take into account the implications8.
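
For completeness, integritysetup can open the device without its journal (direct mode); the flag below exists in recent cryptsetup releases, but verify it against your version’s man page before relying on it, and mind footnote 8:

# Sketch only: open a per-disk integrity device without journalling.
# Weaker consistency guarantees after a power failure, see footnote 8.
integritysetup open --integrity-no-journal /dev/sda1 int0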

Disabling journalling on the filesystem layer of the md-based setup did not have any measurable impact on the throughput of the setup.

Latency

Analogous to the latency measurements on the Helios4, we can measure the latency profile of both the md-based storage and ZFS, reading 1 million random 4K samples from an 18GB testfile.

Latency comparison, zfs & md

Distribution of read latencies for both ZFS and the md-based setup. Each dot indicates one measurement. The latency median is depicted in lime-green, the latency average in black (which is hard to see for md, because both lie almost exactly on top of each other). Latencies below 1ms were considered cache hits and therefore excluded from the computation of average and median. Testplatform


Besides the expected cache hits close to the zero-millisecond mark we find some very interesting results with regard to latency: While the md-based system is very consistent in its latency profile, with the distribution as well as the average and median being almost identical, ZFS exhibits a slightly lower median latency, but contrasts that with an up to five-fold larger latency tail, which lifts the average latency noticeably above that of the md-based setup. ZFS therefore seems far less predictable with regard to read latency.

Remark on RAID checks

One thing to note is that checks and resyncs of the proposed setup take significantly longer (by a factor of 3 to 4) than for an mdraid without the integrity layer underneath. The investigation so far has not revealed the underlying cause. It is not CPU-bound, indicating that read performance is not held back by checksumming; the latency figures above do not imply an increase in latency large enough to cause a 3-4x longer check time; and disabling the journal did not change this either (as one would expect, since the journal is unrelated to reads, which should be the only relevant mode for a RAID check).

ZFS was roughly on par with mdraid without the integrity-layer underneath with regard to raid-check time.

If anyone has an idea of the root cause of this behavior, feel encouraged to contact me; I’d be intrigued to know.

Conclusion

First of all, the question of whether ZFS is required to guarantee data integrity in a storage layer can clearly be answered with a no. Not only does the combination of mdraid and dm-integrity yield more than sufficient data rates for a typical home storage setup, the data also indicates that, at least in low-memory environments, this kind of setup can actually outperform ZFS for read-heavy operations in throughput, while being at least comparable, if not more consistent, with regard to latency. Especially for long-term storage solutions primarily intended to simply provide tons of storage space, such as home NAS systems, this is typically the more relevant mode of operation, as long as the write performance is sufficient (which the data confirms).

Data integrity on the storage layer is therefore very much possible without going through the hassle of having to rebuild ZFS with each kernel upgrade, while preserving the option of an easy restore with any modern Linux, even if ZFS is not available, should the primary storage system fail.

The prolonged check and resync times still lack a reasonable explanation, which is unsatisfying. From a purely practical point of view, however, rebuilds and resyncs are typically a rather rare occurrence, so the prolonged check and resync time is a comparatively small (and rarely paid) price for the guarantee of data integrity.

Additional thoughts and outlook

In addition to the conclusions we can draw from the gathered data, the md-based setup has another advantage ZFS does not (yet) provide: for ZFS, all disks of the storage pool must be available upfront at first assembly9, or the redundancy will not be “any 2 disks may fail and the storage pool is still operational” for a system ultimately intended to have double parity. With mdraid this is not the case: a pool can be extended or grown10 and resynced, and for all (changing) configurations it fulfills “any two disks may fail” at any point in time (except during disk failure and rebuild), even as the disk count of an intended double-parity/RAID6 setup increases over time.

Of course, ZFS is not only block-level checksumming. ZFS provides additional tooling for different purposes which might still make its use worthwhile. If one were to extend the md-based system by some of these features, it might be of interest to use XFS as the filesystem on top, as with a sufficiently recent kernel XFS also supports features like filesystem snapshots and many more. Alternatively, one could introduce an LVM layer into the system (or, if desired, even replace md with LVM).

As usual, feedback is always appreciated.

Thanks to Hauro for proofreading and helpful comments.


  1. This not only has the advantage of protecting the data should the system be stolen or otherwise fall into the wrong hands, it also ensures that no disk further down in the stack will ever see plaintext data, so broken drives do not have to be wiped (especially if they are broken in a way that requires resorting to this kind of wiping) ↩︎

  2. Actually, the functionality of dm-crypt and dm-integrity can be combined within the same layer. However, this would then be authenticated encryption performed by dm-crypt, which is intended to prevent tampering and therefore uses cryptographic hashes. While standalone dm-integrity can do that as well, it uses a simple crc32 checksum by default, which is better suited to detecting unintentional degradation. Furthermore, this would mean not one encryption layer, but multiple ones, reducing performance. This kind of setup was briefly tested and indeed performed worse with regard to throughput. ↩︎

  3. HITACHI HDS721010CLA332, 1TB ↩︎

  4. As the dm-integrity layer does block-level checksumming, it overwrites the entire device once by default, ensuring that all checksums are initialized properly. This, however, takes quite some time on a full 1TB drive, significantly slowing down the creation of different testbeds. ↩︎

  5. This was evident from the fact that ZFS claimed to be able to persist to spinning disk at 1GB/s, which is roughly five times what the underlying disks can do in raw write performance. So either ZFS is a historically unmatched engineering feat, circumventing hardware limitations, or, more likely, ZFS was still caching despite being politely asked not to. ↩︎

  6. A certain part of ZFS’s performance will be due to remaining caching performed by ZFS, but that should be fairly negligible. ↩︎

  7. This could, of course, change if similar components in ZFS could be disabled to tweak throughput rate, which was not further investigated. ↩︎

  8. With journalling enabled, the worst case in a power outage is the loss of data that hasn’t been written to disk yet; the device itself, however, will always contain blocks with consistent checksums. Disabling journalling on the integrity layer implies that after a power outage the integrity device might end up holding blocks with incorrect checksums. Above this layer resides the RAID layer, which can in theory correct those inconsistent checksums as long as at least one of the three redundant copies of the affected block contains a valid checksum. Then again, in case of a power outage there is a) no guarantee that this will be the case, as the block in question was being modified in all three redundant copies, and b) the standard assembly of an md-raid will kick out a disk on the first read error; multiple inconsistent blocks spread over the different devices might therefore be recoverable in theory, but in practice the RAID will just kick out those disks at the first error encountered. It would have to be investigated whether this behavior of mdraid can be modified to better suit this use case (which was not investigated further and might actually be pretty straightforward). Until then I would not disable journalling on the integrity layer, given that the write performance is most likely more than sufficient for most small setups, such as the 4-disk setup presented here, especially taking into account that in real life Linux will also cache writes before flushing them to disk. ↩︎

  9. This can be somewhat circumvented by initially building a degraded pool, but until the missing disks are added, the degraded pool also has degraded redundancy, somewhat thwarting the entire point of such a system. ↩︎

  10. For example, mdraid allows growing from a 4-disk-RAID6 to a 5-disk-RAID6, or migrating from a 3-disk-RAID5 to a 4-disk-RAID6, from a 4-disk-RAID6 to a 4-disk-RAID5, etc. ↩︎