Why does Perl's tr/\n// get slower and slower as line lengths increase?
In perlfaq5, there's an answer for How do I count the number of lines in a file?. The current answer suggests a sysread
and a tr/n//
. I wanted to try a few other things to see how much faster tr/n//
would be, and also try it against files with different average line lengths. I created a benchmark to try various ways to do it. I'm running this on Mac OS X 10.5.8 and Perl 5.10.1 on a MacBook Air:
wc
(fastest except for short lines) tr/n//
(next fastest, except for long average line lengths) s/n//g
(usually speedy) while( <$fh> ) { $count++ }
(almost always a slow poke, except when tr///
bogs down) 1 while( <$fh> ); $.
(very fast) Let's ignore that wc
, which even with all the IPC stuff really turns in some attractive numbers.
On first blush, it looks like the tr/n//
is very good when the line lengths are small (say, 100 characters), but its performance drops off when they get large (1,000 characters in a line). The longer the lines get, the worse tr/n//
does. Is there something wrong with my benchmark, or is there something else going on in the internals that makes tr///
degrade? Why doesn't s///
degrade similarly?
First, the results.:
Rate very_long_lines-tr very_long_lines-$count very_long_lines-$. very_long_lines-s very_long_lines-wc
very_long_lines-tr 1.60/s -- -10% -12% -39% -72%
very_long_lines-$count 1.78/s 11% -- -2% -32% -69%
very_long_lines-$. 1.82/s 13% 2% -- -31% -68%
very_long_lines-s 2.64/s 64% 48% 45% -- -54%
very_long_lines-wc 5.67/s 253% 218% 212% 115% --
Rate long_lines-tr long_lines-$count long_lines-$. long_lines-s long_lines-wc
long_lines-tr 9.56/s -- -5% -7% -30% -63%
long_lines-$count 10.0/s 5% -- -2% -27% -61%
long_lines-$. 10.2/s 7% 2% -- -25% -60%
long_lines-s 13.6/s 43% 36% 33% -- -47%
long_lines-wc 25.6/s 168% 156% 150% 88% --
Rate short_lines-$count short_lines-s short_lines-$. short_lines-wc short_lines-tr
short_lines-$count 60.2/s -- -7% -11% -34% -42%
short_lines-s 64.5/s 7% -- -5% -30% -38%
short_lines-$. 67.6/s 12% 5% -- -26% -35%
short_lines-wc 91.7/s 52% 42% 36% -- -12%
short_lines-tr 104/s 73% 61% 54% 14% --
Rate varied_lines-$count varied_lines-s varied_lines-$. varied_lines-tr varied_lines-wc
varied_lines-$count 48.8/s -- -6% -8% -29% -36%
varied_lines-s 51.8/s 6% -- -2% -24% -32%
varied_lines-$. 52.9/s 8% 2% -- -23% -30%
varied_lines-tr 68.5/s 40% 32% 29% -- -10%
varied_lines-wc 75.8/s 55% 46% 43% 11% --
Here's the benchmark. I do have a control in there, but it's so fast I just don't bother with it. The first time you run it, the benchmark creates the test files and prints some stats about their line lengths:
use Benchmark qw(cmpthese);
use Statistics::Descriptive;
my @files = create_files();
open my( $outfh ), '>', 'bench-out';
foreach my $file ( @files )
{
cmpthese(
100, {
# "$file-io-control" => sub {
# open my( $fh ), '<', $file;
# print "Control found 99999 linesn";
# },
"$file-$count" => sub {
open my( $fh ), '<', $file;
my $count = 0;
while(<$fh>) { $count++ }
print $outfh "$count found $count linesn";
},
"$file-$." => sub {
open my( $fh ), '<', $file;
1 while(<$fh>);
print $outfh "$. found $. linesn";
},
"$file-tr" => sub {
open my( $fh ), '<', $file;
my $lines = 0;
my $buffer;
while (sysread $fh, $buffer, 4096) {
$lines += ($buffer =~ tr/n//);
}
print $outfh "tr found $lines lines n";
},
"$file-s" => sub {
open my( $fh ), '<', $file;
my $lines = 0;
my $buffer;
while (sysread $fh, $buffer, 4096) {
$lines += ($buffer =~ s/n//g);
}
print $outfh "s found $lines linen";
},
"$file-wc" => sub {
my $lines = `wc -l $file`;
chomp( $lines );
print $outfh "wc found $lines linen";
},
}
);
}
sub create_files
{
my @names;
my @files = (
[ qw( very_long_lines 10000 4000 5000 ) ],
[ qw( long_lines 10000 700 800 ) ],
[ qw( short_lines 10000 60 80 ) ],
[ qw( varied_lines 10000 10 200 ) ],
);
foreach my $tuple ( @files )
{
push @names, $tuple->[0];
next if -e $tuple->[0];
my $stats = create_file( @$tuple );
printf "%10s: %5.2f %5.f n", $tuple->[0], $stats->mean, sqrt( $stats->variance );
}
return @names;
}
sub create_file
{
my( $name, $lines, $min, $max ) = @_;
my $stats = Statistics::Descriptive::Full->new();
open my( $fh ), '>', $name or die "Could not open $name: $!n";
foreach ( 1 .. $lines )
{
my $line_length = $min + int rand( $max - $min );
$stats->add_data( $line_length );
print $fh 'a' x $line_length, "n";
}
return $stats;
}
I wondered whether the benchmarks we've been using have too many moving parts: we are crunching data files of different sizes, using different line lengths, and trying to gauge the speed of tr
relative to its competitors -- with an underlying (but untested) assumptions that tr
is the method whose performance is varying with line length.
Also, as brian has pointed out in a few comments, we are feeding tr
buffers of data that are always the same size (4096 bytes). If any of the methods should be insensitive to line size, it should be tr
.
And then it struck me: what if tr
were the stable reference point and the other methods were the ones varying with line size? When you look out your spaceship window, is it you or that Klingon bird-of-prey that's moving?
So I developed a benchmark that held the size of the data files constant: line length varies, but the total number of bytes stays the same. As the results show:
tr
is the approach least sensitive to variation in line length. Since the total N of bytes processed is constant for all three line lengths tested (short, medium, long), this means that tr
is quite efficient at editing the string it is given. Even though the short-line data file requires many more edits, the tr
approach is able to crunch the data file almost as fast as it handles the long-line file. <>
speed up as the lines become longer, although at a diminishing rate. This makes sense: since each call to <>
requires some work, it should be slower to process a given N of bytes using shorter lines (at least over the range tested). s///
approach is also sensitive to line length. Like tr
, this approach works by editing the string it is given. Again, shorter line length means more edits. Apparently, the ability of s///
to make such edits is much less efficient than that of tr
. Here are the results on Solaris with Perl 5.8.8:
# ln = $. <>, then check $.
# nn = $n <>, counting lines
# tr = tr/// using sysread
# ss = s/// using sysread
# S = short lines (50)
# M = medium lines (500)
# L = long lines (5000)
Rate nn-S
nn-S 1.66/s --
ln-S 1.81/s 9%
ss-S 2.45/s 48%
nn-M 4.02/s 142%
ln-M 4.07/s 145%
ln-L 4.65/s 180%
nn-L 4.65/s 180%
ss-M 5.85/s 252%
ss-L 7.04/s 324%
tr-S 7.30/s 339% # tr
tr-L 7.63/s 360% # tr
tr-M 7.69/s 363% # tr
The results on Windows ActiveState's Perl 5.10.0 were roughly comparable.
Finally, the code:
use strict;
use warnings;
use Set::CrossProduct;
use Benchmark qw(cmpthese);
# Args: file size (in million bytes)
# N of benchmark iterations
# true/false (whether to regenerate files)
#
# My results were run with 50 10 1
main(@ARGV);
sub main {
my ($file_size, $benchmark_n, $regenerate) = @_;
$file_size *= 1000000;
my @file_names = create_files($file_size, $regenerate);
my %methods = (
ln => &method_ln, # $.
nn => &method_nn, # $n
tr => &method_tr, # tr///
ss => &method_ss, # s///
);
my $combo_iter = Set::CrossProduct->new([ [keys %methods], @file_names ]);
open my $log_fh, '>', 'log.txt';
my %benchmark_args = map {
my ($m, $f) = @$_;
"$m-$f" => sub { $methods{$m}->($f, $log_fh) }
} $combo_iter->combinations;
cmpthese($benchmark_n, %benchmark_args);
close $log_fh;
}
sub create_files {
my ($file_size, $regenerate) = @_;
my %line_lengths = (
S => 50,
M => 500,
L => 5000,
);
for my $f (keys %line_lengths){
next if -f $f and not $regenerate;
create_file($f, $line_lengths{$f}, $file_size);
}
return keys %line_lengths;
}
sub create_file {
my ($file_name, $line_length, $file_size) = @_;
my $n_lines = int($file_size / $line_length);
warn "Generating $file_name with $n_lines linesn";
my $line = 'a' x ($line_length - 1);
chop $line if $^O eq 'MSWin32';
open(my $fh, '>', $file_name) or die $!;
print $fh $line, "n" for 1 .. $n_lines;
close $fh;
}
sub method_nn {
my ($data_file, $log_fh) = @_;
open my $data_fh, '<', $data_file;
my $n = 0;
$n ++ while <$data_fh>;
print $log_fh "$data_file $n $nn";
close $data_fh;
}
sub method_ln {
my ($data_file, $log_fh) = @_;
open my $data_fh, '<', $data_file;
1 while <$data_fh>;
print $log_fh "$data_file $. $.n";
close $data_fh;
}
sub method_tr {
my ($data_file, $log_fh) = @_;
open my $data_fh, '<', $data_file;
my $n = 0;
my $buffer;
while (sysread $data_fh, $buffer, 4096) {
$n += ($buffer =~ tr/n//);
}
print $log_fh "$data_file tr $nn";
close $data_fh;
}
sub method_ss {
my ($data_file, $log_fh) = @_;
open my $data_fh, '<', $data_file;
my $n = 0;
my $buffer;
while (sysread $data_fh, $buffer, 4096) {
$n += ($buffer =~ s/n//g);
}
print $log_fh "$data_file s/ $nn";
close $data_fh;
}
Update in response to Brad's comment. I tried all three variants, and they behaved roughly like s/n//g
-- slower for the data files with shorter lines (with the additional qualification that s/(n)/$1/
was even slower than the others). The interesting part was that m/n/g
was basically the same speed as s/n//g
, suggesting that the slowness of the regex approach (both s///
and m//
) does not hinge directly on the matter of editing the string.
I'm also seeing tr///
get relatively slower as the line lengths increase although the effect isn't as dramatic. These results are from ActivePerl 5.10.1 (32-bit) on Windows 7 x64. I also got "too few iterations for a reliable count" warnings at 100 so I bumped the iterations up to 500.
VL: 4501.06 288
LO: 749.25 29
SH: 69.38 6
VA: 104.66 55
Rate VL-$count VL-$. VL-tr VL-s VL-wc
VL-$count 2.82/s -- -0% -52% -56% -99%
VL-$. 2.83/s 0% -- -51% -56% -99%
VL-tr 5.83/s 107% 106% -- -10% -99%
VL-s 6.45/s 129% 128% 11% -- -99%
VL-wc 501/s 17655% 17602% 8490% 7656% --
Rate LO-$count LO-$. LO-s LO-tr LO-wc
LO-$count 16.5/s -- -1% -50% -51% -97%
LO-$. 16.8/s 1% -- -50% -51% -97%
LO-s 33.2/s 101% 98% -- -3% -94%
LO-tr 34.1/s 106% 103% 3% -- -94%
LO-wc 583/s 3424% 3374% 1655% 1609% --
Rate SH-$count SH-$. SH-s SH-tr SH-wc
SH-$count 120/s -- -7% -31% -67% -81%
SH-$. 129/s 7% -- -26% -65% -80%
SH-s 174/s 45% 35% -- -52% -73%
SH-tr 364/s 202% 182% 109% -- -43%
SH-wc 642/s 433% 397% 269% 76% --
Rate VA-$count VA-$. VA-s VA-tr VA-wc
VA-$count 92.6/s -- -5% -36% -63% -79%
VA-$. 97.4/s 5% -- -33% -61% -78%
VA-s 146/s 57% 50% -- -42% -67%
VA-tr 252/s 172% 159% 73% -- -43%
VA-wc 439/s 374% 351% 201% 74% --
Edit: I did a revised benchmark to compare the rates for different line lengths. It clearly shows that tr///
starts out with a big advantage for short lines that rapidly disappears as the lines grow longer. As for why this happens, I can only speculate that tr///
is optimized for short strings.
Line count rate comparison http://img69.imageshack.us/img69/6250/linecount.th.png
Long lines are about 65 times larger than short lines, and your numbers indicate that tr/n// runs exactly 65 times slower. This is as expected.
wc evidently scales better for long lines. I don't really know why; perhaps because it is tuned to just count newlines, especially when you use the -l
option.