Improving MD5 calculation -- an unexpected journey
In one project I needed to calculate the MD5 checksums of files as fast as possible. For Perl 5 there is the module Digest::MD5.
The suggested way of using this module is not the fastest. The reason is that the method addfile() does not use its read buffer optimally.
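To make the buffer argument concrete, here is a minimal sketch of reading a file in caller-chosen chunks and passing each chunk to add(). The helper name md5_chunked and its 1 MiB default are assumptions made for illustration; they are not part of Digest::MD5. Whether a larger buffer actually helps depends on the system and has to be measured.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: hash a file by reading it in chunks of $bufsize bytes.
# The 1 MiB default is an arbitrary assumption, not a recommended value.
sub md5_chunked {
    my ($fn, $bufsize) = @_;
    $bufsize ||= 1024 * 1024;
    open(my $fh, '<', $fn) or die "Can't open '$fn': $!";
    binmode($fh);
    my $md5obj = Digest::MD5->new;
    my $buffer;
    while (1) {
        my $n = read($fh, $buffer, $bufsize);
        die "Read error on '$fn': $!" unless defined $n;
        last if $n == 0;    # end of file
        $md5obj->add($buffer);
    }
    close($fh) or die "Could not close file '$fn': $!";
    return $md5obj->hexdigest;
}
```

Calling md5_chunked($file, 4096) and md5_chunked($file, 1024 * 1024) returns the same digest; only the number of read() calls differs.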
Below I benchmark four variants: the suggested addfile() approach, a read loop with an optimized buffer size, a memory-mapped variant based on File::Map, and a system call to 'md5sum':
#!/usr/bin/env perl
# Benchmark: how fast is memory-mapped access compared to the other variants?
use strict;
use warnings;
use utf8;
use Benchmark qw(:all) ;
use File::Map qw( map_file);
use Digest::MD5;
use File::Slurp;
sub md5offile_mapped {
    my $fn = shift;
    # Map the whole file read-only into $data; no explicit read loop is needed
    map_file my $data, $fn, '<';
    my $md5obj = Digest::MD5->new;
    $md5obj->add($data);
    return $md5obj->hexdigest;
}
sub md5offile_orig {
    my $fn = shift;
    open(my $fh, '<', $fn) or die "Can't open '$fn': $!";
    binmode($fh);
    # Use the preferred I/O block size reported by stat() as the read buffer size;
    # fall back to 16 KiB on systems that do not report one
    my $blksize = (stat $fh)[11] || 16384;
    my $buffer;
    my $md5obj = Digest::MD5->new;
    while (read($fh, $buffer, $blksize)) {
        $md5obj->add($buffer);
    }
    # 'close $fh || die' would never die, because '||' binds to $fh, not to close()
    close($fh) or die "Could not close file '$fn': $!";
    return $md5obj->hexdigest;
}
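sub md5offile_addfile {
    my $fn = shift;
    open(my $fh, '<', $fn) or die "Can't open '$fn': $!";
    binmode($fh);
    my $md5obj = Digest::MD5->new;
    $md5obj->addfile($fh);
    close($fh) or die "Could not close file '$fn': $!";
    return $md5obj->hexdigest;
}
sub md5offile_md5file {
    my $fn = shift;
    # Single-quote the file name for the shell so spaces and metacharacters are safe
    (my $quoted = $fn) =~ s/'/'\\''/g;
    # Only the run time matters here: this returns the exit status, not the digest
    return system("md5sum '$quoted' >/dev/null 2>&1");
}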
my $file = shift @ARGV;
die "Usage: $0 <file>\n" unless defined $file;
read_file($file); # read the file once to warm the cache
timethese(500, {
    'memory_mapped' => sub { md5offile_mapped($file); },
    'original'      => sub { md5offile_orig($file); },
    'add_file'      => sub { md5offile_addfile($file); },
    'system'        => sub { md5offile_md5file($file); },
});
The "memory_mapped" variant is about 10% faster than the two variants that read through a file handle, and roughly 47% faster than calling 'md5sum'. Here is a result for checksumming a 13 MB DNG file on an NVMe device:
Benchmark: timing 500 iterations of add_file, memory_mapped, original, system...
add_file: 12 wallclock secs (10.68 usr + 1.09 sys = 11.77 CPU) @ 42.48/s (n=500)
memory_mapped: 10 wallclock secs (10.43 usr + 0.23 sys = 10.66 CPU) @ 46.90/s (n=500)
original: 12 wallclock secs (10.98 usr + 0.95 sys = 11.93 CPU) @ 41.91/s (n=500)
system: 16 wallclock secs ( 0.13 usr 0.27 sys + 13.97 cusr 1.34 csys = 15.71 CPU) @ 31.83/s (n=500)
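The relative speedups can be recomputed from the rates in the output above. The following sketch does just that arithmetic; the helper name speedup_percent is made up for this example.

```perl
use strict;
use warnings;

# Iterations per second, copied from the benchmark output above
my %rate = (
    add_file      => 42.48,
    memory_mapped => 46.90,
    original      => 41.91,
    system        => 31.83,
);

# Hypothetical helper: how much faster memory_mapped is than $other, in percent
sub speedup_percent {
    my ($other) = @_;
    return ($rate{memory_mapped} / $rate{$other} - 1) * 100;
}

for my $name (qw(add_file original system)) {
    printf "memory_mapped vs %-8s: %+.1f%%\n", $name, speedup_percent($name);
}
```

For this run that gives about 10.4% over add_file, 11.9% over original and 47.3% over system.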
Unfortunately there is a problem with large files. Digest::MD5 apparently calculates wrong digests for scalars larger than 1 GB (see https://rt.cpan.org/Public/Bug/Display.html?id=123185). For such files, the memory-mapped approach should not be used.
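One way to act on this is to choose the variant by file size. The sketch below maps small files and falls back to addfile() for everything else. The function name md5_of_file and the constant MAP_LIMIT are assumptions for illustration, and the exact threshold of 1 GiB is a guess based on the ">1GB" in the bug report, not a verified boundary. Empty files also go through addfile(), to avoid relying on how File::Map treats zero-length files.

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Map qw(map_file);

# Assumed threshold: files of this size or larger are not memory-mapped
use constant MAP_LIMIT => 1024 * 1024 * 1024;

# Hypothetical wrapper: memory-map small files, use addfile() otherwise
sub md5_of_file {
    my $fn = shift;
    my $size = -s $fn;
    die "Can't stat '$fn': $!" unless defined $size;
    my $md5obj = Digest::MD5->new;
    if ($size > 0 && $size < MAP_LIMIT) {
        # Small enough: hash the mapped file content in a single add() call
        map_file my $data, $fn, '<';
        $md5obj->add($data);
    }
    else {
        # Empty or large file: let Digest::MD5 read from the file handle
        open(my $fh, '<', $fn) or die "Can't open '$fn': $!";
        binmode($fh);
        $md5obj->addfile($fh);
        close($fh) or die "Could not close file '$fn': $!";
    }
    return $md5obj->hexdigest;
}
```

With this wrapper the caller does not need to know about the size limit; md5_of_file($file) returns the hex digest in both cases.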