$Id: sha1.html,v 1.17 2004/12/21 01:50:05 dean Exp $ OverviewSHA1 is rather unique among message digests in that it has an excessive amount of parallelism available. The paper SHA: A Design for Parallel Architectures? shows that with a typical 32-bit RISC instruction set you would need seven single-cycle ALUs and 26 registers to achieve a minimum critical path length of 160 cycles (2.5 cycles per byte). Upon analysis of the SHA1 specification it was discovered that a portion of the computation is very well suited to a SIMD architecture such as IA32 with SSE2. By taking advantage of this it is possible to come significantly closer to the minimum critical path described in the above paper. Performance DataDec 18, 2004 NOTE: This data is out of date! For updated data, see this note. Using a specialized timing harness which iterates over the same buffer continuously, the following cycles per byte results are achieved (lower is better) on four IA32 processors supporting SSE2:
For reference, here is SHA1 data from OpenSSL 0.9.7c. Note that this is not exactly comparable to the above data due to API differences (especially in the 64-byte message case!), but is indicative of the general performance of integer-only implementations of SHA1:
For a real world, in-system test,
Download
LicenseThis SHA1 implementation is public domain. I hope this enables its use in the myriad of projects which include SHA1 code and are under a myriad of other licenses. Implementation DetailsUsing notation similar to FIPS 180-2, here is a subset of SHA1: 0 <= t <= 15: W[t] = M[t] // the input message block 16 <= t <= 79: W[t] = ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1) For t = 0 to 79: { T = ROL(a, 5) + f_t(b,c,d) + e + K[t] + W[t] e = d d = c c = ROL(b, 30) b = a a = T } Of particular interest is the Take the W[t ] = ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1) W[t+1] = ROL(W[t-2] ^ W[t-7] ^ W[t-13] ^ W[t-15], 1) W[t+2] = ROL(W[t-1] ^ W[t-6] ^ W[t-12] ^ W[t-14], 1) W[t+3] = ROL(W[t ] ^ W[t-5] ^ W[t-11] ^ W[t-13], 1) Note that if There is a convenient identity: Using that identity we have the following: W[t ] = ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1) W[t+1] = ROL(W[t-2] ^ W[t-7] ^ W[t-13] ^ W[t-15], 1) W[t+2] = ROL(W[t-1] ^ W[t-6] ^ W[t-12] ^ W[t-14], 1) W[t+3] = ROL( 0 ^ W[t-5] ^ W[t-11] ^ W[t-13], 1) W[t+3] = W[t+3] ^ ROL(W[t], 1) In the implementation, the first four steps are calculated in parallel using SIMD hardware. The fifth step is also calculated using SIMD hardware even though it represents very little parallelism. Define The calculation of message block -> SIMD -> local stack -> ALU -> update hash All 80 steps are completely unrolled, and the SIMD computation
precedes the ALU computation by 12 steps so that There are other peephole optimizations present in the code such
as refactoring some The implementation is in C using Intel's SSE2 intrinsics, and the Intel C compiler is used to compile the code. There are over 1200 x86 instructions in the resulting code, which would make hand scheduling a nightmare. In theory the code can be compiled by modern GCC, however the code trips an internal compiler error (bugid 11627). The end result is very well suited to a wide-issue processor such as Transmeta's Efficēon. Efficēon has two 64-bit memory units, two 32-bit ALUs, and two 64-bit SIMD units (plus others not relevant here), and it can issue to all 6 in every cycle. On average 3.5 pipelines are fed an instruction in every cycle of this SHA1 implementation, and many cycles issue to all 6 pipelines. There are several unimplemented enhancements possible in Efficēon's dynamic translator (known as CMS) which would increase the average issue width beyond 4 on this workload. Even after this SIMD "factoring" this SHA1 implementation remains ALU-bound. In the Efficēon translated code there are approximately 340 instructions issued to each of the ALUs, while only 250 are issued to each of the SIMD units, and 100 to each memory unit. However there hasn't been any further insight into portions of the computation which could be moved to the SIMD side of the machine. There remain 160 integer rotation instructions, and of the 4 CPUs mentioned earlier, only Opteron has more than one integer shift unit. Xeon is penalized by an unpipelined integer rotation instruction -- the shift unit is in use for 4 cycles for each rotation which yields a 160*4 = 640 cycle minimum critical path for 64 bytes, which accounts for 10 of the ~14 cycles/byte result. Xeon is further limited by a 3 μop per clock issue width, and issue port contention between the integer shift unit and SIMD operations. Pentium-M has similar issue width and port contention restrictions. This code is somewhat unlike normal code in that it has available both integer and SIMD operations in nearly every clock -- these processors favour balances which are heavily weighted to integer or to SIMD/FP but not both. Further details can be found in the IA-32 Intel Architecture Optimization Reference Manual. Future WorkI'd like to make a patch versus OpenSSL, however I'm presently wrestling with three problems:
dean gaudet -- you can probably guess my email address if you need it |