What are hash-based file paths?

If you have a system that manages files via an associated hash, this hash can be used to determine the storage location for that file.

A common variant is to break a hash string in hexadecimal notation into blocks of four characters. Each block then corresponds to one directory level.

Example

Suppose a digital object has the MD5 checksum 30867de2a8196314bb1aa707e75fb7f0. You would then divide the hash value into blocks of four characters:

  • 3086
  • 7de2
  • a819
  • 6314
  • bb1a
  • a707
  • e75f
  • b7f0

Each of the blocks results in a directory component of the file path, which now reads like this: /3086/7de2/a819/6314/bb1a/a707/e75f/b7f0.
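
A minimal sketch of this mapping in Python (the function name hash_to_path and the block size of four are just illustrative choices, not part of any standard):

    def hash_to_path(digest: str, block_size: int = 4) -> str:
        # Split the hex digest into fixed-size blocks and join them as path components.
        blocks = [digest[i:i + block_size] for i in range(0, len(digest), block_size)]
        return "/" + "/".join(blocks)

    print(hash_to_path("30867de2a8196314bb1aa707e75fb7f0"))
    # -> /3086/7de2/a819/6314/bb1a/a707/e75f/b7f0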

Why you should avoid hash-based file paths

There are two major disadvantages of hash-based paths:

  1. The path structure is not easily understandable by humans, because it carries no recognizable context.
  2. Because of the hash, the directory names are evenly distributed. If you have to search through the directories, you generate a non-negligible I/O load due to the large number of syscalls (see the comparison of S3 (NFS) with direct NFS storage). The same applies to deletion, since a depth-first traversal has to visit every directory.

In addition, the underlying storage cannot easily be extended with additional mount points, since the first-level path components are themselves evenly distributed.

Why hash-based file paths can sometimes be a good idea

If you want to process data in parallel, the hash-based structure makes sense: with a good hash function, you can write and read data in parallel with virtually no collisions.
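
As an illustration (a sketch only, assuming MD5 as the hash function, a hypothetical storage root at /data, and the last block serving as the file name), several workers can ingest objects concurrently because distinct contents map to distinct paths:

    import hashlib
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    ROOT = Path("/data")  # hypothetical storage root

    def store(payload: bytes) -> Path:
        # Hash the payload and write it to the path derived from its digest.
        digest = hashlib.md5(payload).hexdigest()
        target = ROOT.joinpath(*[digest[i:i + 4] for i in range(0, len(digest), 4)])
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(payload)
        return target

    # Distinct payloads hash to distinct paths, so the workers never
    # compete for the same file and can write in parallel.
    payloads = [f"object {i}".encode() for i in range(100)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        paths = list(pool.map(store, payloads))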

Systems that want to ensure the integrity of their data can use the storage location itself as a checksum check, as sketched below. No extra bookkeeping is necessary.
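
A sketch of such a check (assuming MD5-derived paths of four-character blocks, as in the example above, with the last block serving as the file name):

    import hashlib
    from pathlib import Path

    def verify_location(root: Path, file_path: Path) -> bool:
        # Recompute the MD5 of the file and check that its path spells out the digest.
        digest = hashlib.md5(file_path.read_bytes()).hexdigest()
        expected = [digest[i:i + 4] for i in range(0, len(digest), 4)]
        return list(file_path.relative_to(root).parts) == expected

    # e.g. verify_location(Path("/data"), Path("/data/3086/7de2/a819/6314/bb1a/a707/e75f/b7f0"))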

Further sources and fun facts

  • the process is patented
  • it is the standard directory structure of ElasticSearch, Zotero and Archivematica’s storage server
  • found via Stackoverflow
  • related terms are ‘index’ and ‘shard’