SHA1 using SIMD techniques

$Id: sha1.html,v 1.17 2004/12/21 01:50:05 dean Exp $

Overview

SHA1 is rather unique among message digests in that it has an excessive amount of parallelism available. The paper SHA: A Design for Parallel Architectures? shows that with a typical 32-bit RISC instruction set you would need seven single-cycle ALUs and 26 registers to achieve a minimum critical path length of 160 cycles (2.5 cycles per byte).

Upon analysis of the SHA1 specification it was discovered that a portion of the computation is very well suited to a SIMD architecture such as IA32 with SSE2. By taking advantage of this it is possible to come significantly closer to the minimum critical path described in the above paper.

Performance Data

Dec 18, 2004 NOTE: This data is out of date! For updated data, see this note.

Using a specialized timing harness which iterates over the same buffer continuously, the following cycles per byte results are achieved (lower is better) on four IA32 processors supporting SSE2:

message size	Xeon/P4	Pentium-M	Opteron	Efficēon
64	16.2	13.9	9.16	6.80
256	14.1	13.4	8.73	6.09
1024	13.6	13.4	8.65	5.95
8192	13.7	13.3	8.59	5.90

For reference, here is SHA1 data from OpenSSL 0.9.7c. Note that this is not exactly comparable to the above data due to API differences (especially in the 64-byte message case!), but is indicative of the general performance of integer-only implementations of SHA1:

message size	Xeon/P4	Pentium-M	Opteron	Efficēon
64	67.5	69.2	42.2	48.3
256	31.1	26.3	16.9	19.8
1024	22.1	15.6	10.6	12.7
8192	19.5	12.5	8.73	10.7

For a real world, in-system test, sha1sum from GNU coreutils 5.0 was modified with this new implementation of SHA1. The input is 256MB of /dev/zero, and the numbers are in cycles per byte (lower is better):

	Xeon/P4	Pentium-M	Opteron	Efficēon
baseline	32.2	17.7	15.6	18.7
SSE2 SHA1	17.0	14.6	11.5	11.9
improvement	1.9x	1.2x	1.4x	1.6x

Download

Barebones implementation and timing harness: sha1-sse2-v2.tar.gz
Patch to GNU coreutils-5.0: coreutils-5.0-sha1-sse2.patch.gz

License

This SHA1 implementation is public domain. I hope this enables its use in the myriad of projects which include SHA1 code and are under a myriad of other licenses.

Implementation Details

Using notation similar to FIPS 180-2, here is a subset of SHA1:

	0 <= t <= 15:	W[t] = M[t]	// the input message block
	16 <= t <= 79:  W[t] = ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1)

	For t = 0 to 79:
	{
		T = ROL(a, 5) + f_t(b,c,d) + e + K[t] + W[t]
		e = d
		d = c
		c = ROL(b, 30)
		b = a
		a = T
	}

Of particular interest is the K[t] + W[t] component of the calculation of T.

Take the W[t] recurrence for t > 15 and repeat it 4 times:

	W[t  ] = ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1)
	W[t+1] = ROL(W[t-2] ^ W[t-7] ^ W[t-13] ^ W[t-15], 1)
	W[t+2] = ROL(W[t-1] ^ W[t-6] ^ W[t-12] ^ W[t-14], 1)
	W[t+3] = ROL(W[t  ] ^ W[t-5] ^ W[t-11] ^ W[t-13], 1)

Note that if W is aligned to 16-bytes, and t is a multiple of 4 then this has a very clear mapping onto a 128-bit SIMD architecture. Each of the 5 columns of W above are 128-bits of data. The only glitch to calculating all 4 of W[t]..W[t+3] in parallel is the last line which itself depends on W[t].

There is a convenient identity: ROL(x^y,c) = ROL(x,c)^ROL(y,c)

Using that identity we have the following:

	W[t  ] = ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1)
	W[t+1] = ROL(W[t-2] ^ W[t-7] ^ W[t-13] ^ W[t-15], 1)
	W[t+2] = ROL(W[t-1] ^ W[t-6] ^ W[t-12] ^ W[t-14], 1)
	W[t+3] = ROL(   0   ^ W[t-5] ^ W[t-11] ^ W[t-13], 1)

	W[t+3] = W[t+3] ^ ROL(W[t], 1)

In the implementation, the first four steps are calculated in parallel using SIMD hardware. The fifth step is also calculated using SIMD hardware even though it represents very little parallelism.

Define prep[t] = K[t] + W[t], and since the K[t] are 32-bit constants it should be clear that we can calculate prep[t]..prep[t+3] in parallel after we have calculated W[t]..W[t+3].

The calculation of prep[t] is all that is performed in the SIMD hardware. The remainder of the calculation is performed in ALU hardware as it is in a traditional SHA1 implementation. The dataflow for a single block can be envisioned like so:

	message block -> SIMD -> local stack -> ALU -> update hash

All 80 steps are completely unrolled, and the SIMD computation precedes the ALU computation by 12 steps so that prep[t] is available (on the stack) when the integer computation requires it. This "software pipelining" is continued around the loop back edge -- when there are more blocks to be processed, the SIMD portion of the pipeline begins prior to the hash update of the previous block completing.

There are other peephole optimizations present in the code such as refactoring some f_t functions, and replacing some SSE2 operations with others to improve parallelism on various SSE2 implementations.

The implementation is in C using Intel's SSE2 intrinsics, and the Intel C compiler is used to compile the code. There are over 1200 x86 instructions in the resulting code, which would make hand scheduling a nightmare. In theory the code can be compiled by modern GCC, however the code trips an internal compiler error (bugid 11627).

The end result is very well suited to a wide-issue processor such as Transmeta's Efficēon. Efficēon has two 64-bit memory units, two 32-bit ALUs, and two 64-bit SIMD units (plus others not relevant here), and it can issue to all 6 in every cycle. On average 3.5 pipelines are fed an instruction in every cycle of this SHA1 implementation, and many cycles issue to all 6 pipelines. There are several unimplemented enhancements possible in Efficēon's dynamic translator (known as CMS) which would increase the average issue width beyond 4 on this workload.

Even after this SIMD "factoring" this SHA1 implementation remains ALU-bound. In the Efficēon translated code there are approximately 340 instructions issued to each of the ALUs, while only 250 are issued to each of the SIMD units, and 100 to each memory unit. However there hasn't been any further insight into portions of the computation which could be moved to the SIMD side of the machine.

There remain 160 integer rotation instructions, and of the 4 CPUs mentioned earlier, only Opteron has more than one integer shift unit. Xeon is penalized by an unpipelined integer rotation instruction -- the shift unit is in use for 4 cycles for each rotation which yields a 160*4 = 640 cycle minimum critical path for 64 bytes, which accounts for 10 of the ~14 cycles/byte result.

Xeon is further limited by a 3 μop per clock issue width, and issue port contention between the integer shift unit and SIMD operations. Pentium-M has similar issue width and port contention restrictions. This code is somewhat unlike normal code in that it has available both integer and SIMD operations in nearly every clock -- these processors favour balances which are heavily weighted to integer or to SIMD/FP but not both. Further details can be found in the IA-32 Intel Architecture Optimization Reference Manual.

Future Work

I'd like to make a patch versus OpenSSL, however I'm presently wrestling with three problems:

There are few compilers which can do the code justice. Intel C++ Compiler verions 7.0 and 7.1 do a fine job with the code -- v7.1 was used for the results presented above. Intel made a rather unfortunate decision in the v8.x to "optimise" away all FPU/SIMD->ALU data movement using MOVD/PSRLDQ instructions. This is an attempt to hide the long latency for such data movement on the P4 -- but their compiler does not take into account the fact that my code is already software pipelined to hide this latency. There is no way around this, and it affects performance by at least -30%.
GCC version 3.4 and later can compile this code, but do not generally produce code which is better than the integer-only code. There appear to be many reasons for this inefficiency, but so far only a couple PRs have been opened to address it -- I'm presently waiting for the latest -fnewra support to be merged forward onto the mainline.
In theory it should be possible to use a single code path for both endian inputs with essentially no penalty. (Note that the numbers quoted above include a byte-swap on the entire input -- which is required because SHA1 is big-endian, and IA32 is little-endian.) This theory is based on the observations earlier about how many ALU operations there are in relation to SSE2 operations. In practice I've managed to make a merged codebase like this but it is underperforming the always-byte-swap version. The problem appears to be the compiler not mixing code streams well enough for p4/p-m/k8 (efficēon doesn't mind so much because it reschedules regardless).
OpenSSL's API expects to pass the input to the digest function one block at a time. For larger messages it's desirable to begin the SSE2 processing of the next block prior to finishing the ALU processing of the current block.

dean gaudet -- you can probably guess my email address if you need it