Re: [dev] libzahl paper

From: Mattias Andrée <maandree_AT_kth.se>
Date: Tue, 14 Jun 2016 22:05:34 +0200

Well, there are of course still operations that
are cheaper. Such is the case for the code you
posted. Operations such as XOR, AND and OR
should be hard to beat. NEQ is however...

But branching is definitely cheaper than it
used to be. Perhaps “incredibly” is stretching it.

However,

   for (i = 0; i < 8; i++) {
      if (crc & 1) {
          crc >>= 1;
          crc ^= 0xedb88320ul;
      } else {
          crc >>= 1;
      }
  }

is pretty fast too.


On Tue, 14 Jun 2016 20:55:25 +0200
Markus Wichmann <nullplan_AT_gmx.net> wrote:

> On Mon, Jun 13, 2016 at 11:32:11PM +0200, Mattias Andrée
> wrote:
> > Proofreading and suggestions is greatly appreciated!
> >
>
> Page 5: "Branches are incredibly cheap on modern CPUs."
> Not so! I wrote a variety of CRC32 algorithms to
> benchmark (which will just measure my personal CPU, I
> know, but it's better than guesswork), and had two
> different bitwise versions:
>
> for (i = 0; i < 8; i++)
> {
> c = crc & 1;
> crc >>= 1;
> if (c)
> crc ^= 0xedb88320ul;
> }
>
> And:
>
> for (i = 0; i < 8; i++)
> {
> crc = (crc >> 1) ^ (-(crc & 1) & 0xedb88320ul);
> }
>
> The second version had twice the throughput of the first.
> (Of course that 4-bytes-at-a-time version had the highest
> throughput, after that the times got bigger again, but
> that's beside the point.) That remained even when writing
> the loop in assembly by hand (the compiler refused to
> just write this:)
>
> shr eax
> jnc 1f
> xor eax, edb88320ul
> 1:
>
> And that even though the second version is longer in
> assembly. But eliminating that branch really did increase
> the throughput immensly.
>
> Although, I don't know what would happen on a different
> architecture.
>
> Ciao,
> Markus
>


Received on Tue Jun 14 2016 - 22:05:34 CEST

This archive was generated by hypermail 2.3.0 : Tue Jun 14 2016 - 22:12:13 CEST