Time for a change. Time for a change.

I propose a change in the way our operating systems let us handle file management. Most of us with more than a few files have been there: backing up some files, reorganizing all the digital baggage. You start a copy, but it’s about 3 GB, so while you wait you look around and find another 18 GB of ISO files to move, and then you start moving some 4 GB of MP3s. Pretty soon you have three or four copy operations running in parallel, and your hard drive doesn’t like it.

Photo by Michael Coté, cote on Flickr

If I select a dataset to copy, say 100 large files, the copy process either discovers each file and copies it as it goes, or first makes a list of all files and directories and then copies each file in the list. Either way it is a nice sequential operation. If the source and destination are physically different drives, this process is quite efficient: one drive reads and the other writes. Sure, there are interruptions, but the drives are mostly able to stay on track.

If I run three or four copy operations on the same pair of drives, or worse, on a single drive, the drive now spends a lot of time repositioning its read/write heads. Repositioning the heads is time consuming. Meanwhile the read caches empty quickly, the write caches (if there are any) fill up just as fast, and then we wait some more.

For a few small files this isn’t too noticeable, but when the dataset becomes large, like a media library, it shows up quickly. Solid state drives (SSDs) are better since they don’t have moving parts. I’d like to believe that laying down fragmented files is suboptimal even on an SSD, though. Furthermore, at least for a while, SSDs aren’t replacing spinning hard drives for bulk storage.

Caching is supposed to help, but it can’t compete with datasets many times larger than your memory. Even if you have 20 GB of memory and datasets in the 10 GB range, the OS won’t use a 10 GB buffer. I think larger buffers could help, but only as part of a bigger solution.

I wrote earlier that a single copy operation from one drive to another is efficient, so I’d like to see a copy queue implemented. With a queue, only one copy operation takes place at a time. Knowing that lets the OS use a larger buffer, because it won’t have twenty copy processes all needing their own buffers.
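To make the idea concrete, here is a minimal sketch of such a queue in Python. It is an illustration under my own assumptions, not how any operating system does it: the `CopyQueue` class, its method names, and the 64 MB buffer size are all made up for the example. One worker thread drains the queue, so only one copy runs at a time and it gets the whole buffer to itself.

```python
import queue
import shutil
import threading

# Hypothetical buffer size: one big buffer is affordable because
# only one copy is ever in flight.
BUFFER_SIZE = 64 * 1024 * 1024


class CopyQueue:
    """Runs submitted file copies strictly one after another."""

    def __init__(self, buffer_size=BUFFER_SIZE):
        self._jobs = queue.Queue()
        self._buffer_size = buffer_size
        self.errors = []  # (src, dst, exception) for copies that failed
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, src, dst):
        """Queue a copy and return immediately."""
        self._jobs.put((src, dst))

    def wait(self):
        """Block until every queued copy has been attempted."""
        self._jobs.join()

    def _run(self):
        while True:
            src, dst = self._jobs.get()
            try:
                with open(src, "rb") as fin, open(dst, "wb") as fout:
                    # Read and write in chunks of buffer_size bytes.
                    shutil.copyfileobj(fin, fout, self._buffer_size)
            except OSError as exc:
                # Record the failure and keep draining the queue.
                self.errors.append((src, dst, exc))
            finally:
                self._jobs.task_done()
```

Starting a second or third copy then just means calling `submit()` again. The new job waits its turn instead of fighting the running one for the heads.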

Bonus points for parallelizing copies when many spindles are involved. Maybe multiple reads don’t hurt performance too much, in which case the queue could run a few readers on one spindle when several spindles are being used for writes. And obviously, copies that are queued up on entirely different drives could safely run in parallel.
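The "different drives can run in parallel" rule could be sketched like this. The function names `device_of` and `pick_runnable` are hypothetical, and I am assuming that `os.stat(...).st_dev` is a good enough stand-in for "which drive is this". In reality it identifies a filesystem or volume, not a physical spindle, so a real implementation would need something smarter.

```python
import os


def device_of(path):
    """Device ID of path, or of its closest existing ancestor.

    Assumption: st_dev approximates "physical drive". Two partitions
    on one disk would wrongly look like two drives here.
    """
    path = os.path.abspath(path)
    while not os.path.exists(path):
        path = os.path.dirname(path)
    return os.stat(path).st_dev


def pick_runnable(pending, busy_devices=(), device_of=device_of):
    """Pick the queued copies that can start now without sharing a drive.

    pending is a list of (src, dst) pairs in queue order. A copy is
    runnable if neither of its drives is busy or already claimed by
    an earlier pick.
    """
    busy = set(busy_devices)
    runnable = []
    for src, dst in pending:
        devices = {device_of(src), device_of(dst)}
        if busy.isdisjoint(devices):
            runnable.append((src, dst))
            busy |= devices
    return runnable
```

Anything not picked simply stays queued until the drives it needs are free again.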

Now that we have a queue, why not let the user prioritize or reorder it? I think some built-in intelligence should automatically promote a 20 MB copy over a 20 GB copy, especially when the destination is a removable drive.
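A rough sketch of that ordering, using a heap keyed on size so the smallest job comes out first. The class name `PrioritizedCopyQueue` and the `rank` parameter are my own inventions for the example. `rank` is where a user's manual reordering would go: a lower rank runs earlier, and size only breaks ties within the same rank.

```python
import heapq
import itertools
import os


class PrioritizedCopyQueue:
    """Hands out copy jobs smallest-first, with an optional user rank."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # keeps equal jobs in submit order

    def submit(self, src, dst, size=None, rank=0):
        """Queue a copy. size defaults to the source file's size in bytes."""
        if size is None:
            size = os.path.getsize(src)
        heapq.heappush(self._heap, (rank, size, next(self._order), src, dst))

    def pop(self):
        """Return the (src, dst) pair that should be copied next."""
        _rank, _size, _order, src, dst = heapq.heappop(self._heap)
        return src, dst

    def __len__(self):
        return len(self._heap)
```

With this, a 20 MB job submitted after a 20 GB job still comes out of `pop()` first, unless the user gave the big one a lower `rank`.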

Photo by Erik Pitti, epitti on Flickr

Since the queue could know every file being copied, I think it makes sense to optimize for the problem of duplicating a dataset made up of many small files. Duplicating such a dataset onto the same physical drive is not very efficient if each file is read and then written on its own, because the heads move between source and destination for every file. Let the queue figure out when this is happening and read multiple files before writing them out.
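Here is a small sketch of the read-many-then-write-many idea. The function name `copy_small_files_batched` and the 256 MB memory budget are assumptions for illustration. It reads files into memory until the budget is reached, then writes the whole batch out, so the drive does a run of reads followed by a run of writes instead of alternating per file.

```python
import os


def copy_small_files_batched(pairs, budget=256 * 1024 * 1024):
    """Copy (src, dst) pairs, reading a batch of files before writing it.

    budget is the approximate number of bytes held in memory per batch.
    A single file larger than the budget is still read whole, so this
    is only meant for datasets of small files.
    """
    batch = []  # (dst, data) pairs waiting to be written
    used = 0    # bytes currently held in the batch

    def flush():
        for dst, data in batch:
            with open(dst, "wb") as fout:
                fout.write(data)
        batch.clear()

    for src, dst in pairs:
        size = os.path.getsize(src)
        if batch and used + size > budget:
            flush()
            used = 0
        with open(src, "rb") as fin:
            batch.append((dst, fin.read()))
        used += size

    flush()  # write whatever is left
```

How much this helps would depend on the drive and on what the OS cache is already doing, so treat the budget as a knob to experiment with.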

One Response to “Time for a change. Time for a change.”

  1. Michael V. Coppola writes:

    Fun read. I think the concept of a copy queue makes a lot of sense for noncontiguous operations, particularly those going to/from the same physical media. As far as I’m aware however, the case with SSDs is that they are inherently fragmented by design as a result of the internal (and transparent) wear-leveling mechanisms used in recent iterations.