rapidtar: I reimplemented tar in Rust, because I can

🦀🎉📼

Submitted by kmeisthax on Sat, 02/02/2019 - 15:57 in News

Today I am releasing a Rust crate titled rapidtar, at version 0.1. You can install it from crates.io or pull the Git repository. It is a proof-of-concept file archival utility that uses parallel directory traversal and write buffering to speed up tar archive creation. It currently supports no other functions; those are to be implemented in future versions.

Why rapidtar?

For the past year I have been using some used LTO-4 tape drives to perform daily backups of work-related files. Typically, a daily backup from my backup server to the tape drive takes about 5 hours, with an unavoidable trip downstairs to swap tapes during every backup. (My work share is about 1TB and LTO-4 only stores 800GB per tape.) 5 hours works out to about 56MB/s on average, but in practice the drive spends most of its time at more like 10MB/s with occasional bursts of full-speed operation. That much shoe-shining isn't good for the drives or tapes long term, and it's also painfully slow for a drive rated for at least 120MB/s.

The reason for this is a thing called npm. This is a dependency retrieval utility for node.js projects which has also butted its way into front-end theme development. Every front-end build tool worth its salt nowadays is developed in JavaScript, runs in the node.js runtime, and pulls its dependencies from npm. And the thing about node.js developers is that they love pulling ridiculous numbers of small dependencies into their projects, which creates absurdly long paths (long enough that some replication software chokes on them) full of millions of tiny files. This kills the crab.

As a refresher, there are two ways to read data off of a modern storage device quickly:

  • Read a large file sequentially
  • Read a large number of files at once

Generally speaking, sequential I/O is monumentally faster than queued I/O, but both are far faster than just reading files one at a time. So my thought was simple: spawn hundreds of threads, have them all read the directories to be archived in parallel into a queue, and then serialize the queue into a tarball. The resulting archive's entries won't come out in sorted order, but it will still be an otherwise correct tar archive.
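To make that concrete, here is a minimal sketch of the idea. This is not rapidtar's actual implementation: the directory name, output file, one-thread-per-file spawning, and single-level traversal are all placeholder simplifications. Worker threads read file contents into an in-memory queue, and a single serializer drains the queue in whatever order results arrive.

    use std::fs;
    use std::io::Write;
    use std::path::PathBuf;
    use std::sync::mpsc;
    use std::thread;

    fn main() -> std::io::Result<()> {
        // Unbounded queue of (path, contents) pairs; rapidtar's real queue
        // is bounded (see --channel_queue_depth later in this post).
        let (tx, rx) = mpsc::channel::<(PathBuf, Vec<u8>)>();

        // One reader thread per file, purely for illustration; a real tool
        // would use a bounded pool and recurse into subdirectories.
        let mut workers = Vec::new();
        for entry in fs::read_dir("some_dir")? {
            let path = entry?.path();
            if path.is_file() {
                let tx = tx.clone();
                workers.push(thread::spawn(move || {
                    if let Ok(data) = fs::read(&path) {
                        let _ = tx.send((path, data));
                    }
                }));
            }
        }
        drop(tx); // close the channel once the workers' clones are dropped

        // Single serializer: entries arrive in nondeterministic order.
        // A real archiver would emit a tar header before each file's data.
        let mut out = fs::File::create("out.bin")?;
        for (path, data) in rx {
            eprintln!("archiving {}", path.display());
            out.write_all(&data)?;
        }
        for w in workers {
            let _ = w.join();
        }
        Ok(())
    }

The key point is that reads overlap with each other and with the serialized write, so slow small-file I/O no longer gates the tape.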

How do other people solve this problem?

Avoiding tape where possible. For example, you could write your slow tarballs (or other archives) to disk first, and then copy the resulting file (now one large sequential read) onto tape at full speed. This doesn't reduce the backup window at all, but it does avoid tying up your drives. This is commonly referred to as "disk-to-disk" backup, as opposed to "disk-to-tape". Some more extreme solutions drop tape entirely in favor of cloud storage, which I'm not entirely a fan of.

What can rapidtar do?

Not much. Currently, only the -c flag of tar is supported, and it only generates PAX-format archives. GNU tar can read rapidtar's output, and that is currently my primary interoperability test.

On Windows, you can write to tape by specifying the device path \\.\TAPEn. rapidtar recognizes Windows tape devices and will automatically write 10KB blocks for compatibility with GNU tar (though you can, and should, change that). On other platforms, no code currently exists to recognize and treat tape devices specially, which can cause interoperability failures if you attempt to write to a tape device.
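For the curious, recognizing such a device path is straightforward. Here is a hypothetical check for the \\.\TAPEn pattern (not rapidtar's actual code, just a sketch of the idea):

    // Hypothetical sketch: does a path name a Windows tape device (\\.\TAPEn)?
    fn is_windows_tape_path(path: &str) -> bool {
        let prefix = r"\\.\TAPE";
        path.len() > prefix.len()
            && path.starts_with(prefix)
            && path[prefix.len()..].chars().all(|c| c.is_ascii_digit())
    }

    fn main() {
        assert!(is_windows_tape_path(r"\\.\TAPE0"));
        assert!(!is_windows_tape_path(r"C:\backups\archive.tar"));
    }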

rapidtar also ships with a utility called rapidmt, which provides tape seek control on Windows. It is limited to seek commands; other functions of standard mt, such as software eject and partition control, are not yet implemented.

How fast does it do these things?

On my Windows machine, rapidtar can saturate an LTO-5 drive. The maximum performance I have observed is roughly 164MB/s, which exceeds the drive's native write speed (140MB/s for LTO-5); the extra throughput can be attributed to moderate compressibility in the data. However, at default settings you will probably see lower performance, as rapidtar ships with fairly conservative defaults intended to use a moderate amount of RAM and interoperate with GNU tar's defaults. In general, here is what you should consider changing, in descending order of performance benefit:

  • Increase the --blocking_factor so that rapidtar writes larger records onto the tape. The default of 20 (aka 10KB) is excessively small for modern tape drives and causes significant overhead; 2000 (1000KB) tends to work much better (see the sketch after this list for the arithmetic).
  • Increase the --serial_buffer_limit. This allows rapidtar to queue more records pending write to tape, at the expense of increased memory usage. The default limit is 1GB, though you can see some gains from allowing up to 10GB.
  • Increase the --channel_queue_depth. This controls how many small files rapidtar will read ahead and queue in RAM pending serialization. At small record sizes the channel queue depth is critical to maintaining performance in pathological cases, but large record sizes don't benefit from it at all.
  • Increase the --parallel_io_limit. The default of 32 read threads appears to be good enough, and more threads don't help. If your device supported - and could fill - deeper read queues, increasing this might provide a benefit.
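As a sanity check on the blocking factor numbers above: a tar record is blocking_factor × 512-byte blocks, so the default and suggested values work out as follows (a trivial sketch of the arithmetic, nothing rapidtar-specific):

    // A tar "record" is blocking_factor 512-byte blocks; larger records
    // mean fewer, larger writes to the tape device.
    fn record_size(blocking_factor: usize) -> usize {
        blocking_factor * 512
    }

    fn main() {
        assert_eq!(record_size(20), 10_240);      // GNU tar default: ~10KB
        assert_eq!(record_size(2000), 1_024_000); // suggested: ~1000KB
    }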

What will rapidtar do in the future?

I plan to make a 0.2 release with additional features. Critically, I need to support volume spanning; otherwise I can't actually use rapidtar to archive my data. I will also need to add Linux code for managing tape devices, since my backup server, which drives the tape, runs Linux.

My full wishlis-- er, "roadmap" is as follows:

  • Volume spanning (planned for 0.2)
  • Full Linux support (planned for 0.2)
  • Support for list & extract operations
  • File contents checksum (will require tar format extensions)
  • Index lists (e.g. storing the headers alone on disk with indexes for faster tape retrieval)
  • Support for update operation & incremental backup
  • Compression
  • Volume spanning with compression (not even GNU tar supports this, even though it's possible)
  • Support for other archive formats (e.g. ZIP, LTFS)

Hopefully I can implement even a few of these things...

Tags
software