What are hash-based file paths?

If you have a system that manages files via an associated hash, this hash can be used to determine the storage location for that file.

A common variant is to break a hash string in hexadecimal notation into blocks of four characters. Each block then corresponds to one directory level.

Example

Suppose a digital object has the MD5 checksum 30867de2a8196314bb1aa707e75fb7f0. You would then divide the hash value into blocks of four characters:

  • 3086
  • 7de2
  • a819
  • 6314
  • bb1a
  • a707
  • e75f
  • b7f0

Each of the blocks results in a directory component of the file path, which now reads like this: /3086/7de2/a819/6314/bb1a/a707/e75f/b7f0.
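
A minimal sketch of this mapping in Python (the function name hash_to_path and the block size of four are just illustrative choices, not part of any standard):

    def hash_to_path(digest: str, block_size: int = 4) -> str:
        # Split the hex digest into fixed-size blocks and join them as path components.
        blocks = [digest[i:i + block_size] for i in range(0, len(digest), block_size)]
        return "/" + "/".join(blocks)

    print(hash_to_path("30867de2a8196314bb1aa707e75fb7f0"))
    # -> /3086/7de2/a819/6314/bb1a/a707/e75f/b7f0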

Why you should avoid hash-based file paths

There are two major disadvantages of hash-based paths:

  1. The path structure is not easily understandable by humans, because it carries no recognizable context.
  2. Because of the hash, the directory names are evenly distributed. If you have to search through the directories, you generate a non-negligible I/O load due to the large number of syscalls (see the comparison of S3 (NFS) with direct NFS storage). The same applies to deletion, since a depth-first traversal has to visit every directory.

In addition, the underlying storage cannot easily be extended with additional mount points, since the first-level path components are themselves evenly distributed.

Why hash-based file paths can sometimes be a good idea

If you want to process data in parallel, the hash-based structure makes sense: with a good hash function, you can write and read data in parallel with virtually no collisions.
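
As an illustration (a sketch only, assuming MD5 as the hash function, a hypothetical storage root at /data, and the last block serving as the file name), several workers can ingest objects concurrently because distinct contents map to distinct paths:

    import hashlib
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    ROOT = Path("/data")  # hypothetical storage root

    def store(payload: bytes) -> Path:
        # Hash the payload and write it to the path derived from its digest.
        digest = hashlib.md5(payload).hexdigest()
        target = ROOT.joinpath(*[digest[i:i + 4] for i in range(0, len(digest), 4)])
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(payload)
        return target

    # Distinct payloads hash to distinct paths, so the workers never
    # compete for the same file and can write in parallel.
    payloads = [f"object {i}".encode() for i in range(100)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        paths = list(pool.map(store, payloads))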

Systems that want to ensure the integrity of their data can use the storage location itself as a checksum check, as sketched below. No extra bookkeeping is necessary.
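
A sketch of such a check (assuming MD5-derived paths of four-character blocks, as in the example above, with the last block serving as the file name):

    import hashlib
    from pathlib import Path

    def verify_location(root: Path, file_path: Path) -> bool:
        # Recompute the MD5 of the file and check that its path spells out the digest.
        digest = hashlib.md5(file_path.read_bytes()).hexdigest()
        expected = [digest[i:i + 4] for i in range(0, len(digest), 4)]
        return list(file_path.relative_to(root).parts) == expected

    # e.g. verify_location(Path("/data"), Path("/data/3086/7de2/a819/6314/bb1a/a707/e75f/b7f0"))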

Further sources and fun facts

  • the process is patented
  • it is the standard directory structure of ElasticSearch, Zotero and Archivematica’s storage server
  • found via Stackoverflow
  • related terms are ‘index’ and ‘shard’