SSE2 assisted RSA

$Id: rsa.html,v 1.3 2004/09/23 19:52:11 dean Exp $

The bn_mul_add_words routine of openssl-0.9.7c was modified to use the SSE2 pmuludq instruction, which performs a 32-bit x 32-bit -> 64-bit multiply.

The following table shows the improvements for signing 1024-bit RSA keys. The values are cycles per signature (lower is better). INT refers to the existing integer-only implementations in openssl-0.9.7c.

	Xeon/P4	Pentium-M	Opteron	Efficēon
INT (32-bit)	11.1M	6.64M	4.66M	7.19M
SSE2	8.20M	5.76M	4.50M	5.66M
32-bit improvement	1.4x	1.2x	1.0x	1.3x
INT (64-bit)			~1M

Note that on Opteron it makes much more sense to use a native 64-bit implementation when possible because the integer unit supports 128-bit multiplies and fast carry on 64-bit ints. However for the purposes of evaluating this patch for use in 32-bit libraries compiled for all SSE2 platforms it makes sense to compare the 32-bit number on Opteron. It can be seen that on all SSE2 platforms this new code yields improved results over the 32-bit integer code.

Downloads

A patch against openssl-0.9.7c which requires compile-time enabling of SSE2: openssl-0.9.7c-bn-sse2-v1.patch
A patch against openssl-0.9.8 CVS which supports run-time detection of SSE2: openssl-0.9.8-CVS-bn-sse2-v2.patch

Miscellaneous

Note that in the implementation, MMX registers are used rather than XMM registers. This is because there is never any need for more than 64-bits per register, and better parallelism may be achieved on Pentium-M, Opteron, and Efficēon which implement SSE2 using a pair of 64-bit MMX units (whereas P4 uses a single 128-bit unit). Unfortunately, pmuludq is available only in the SSE2 instruction set.

There still remains other opportunities for improvement -- this patch so far only modifies the innermost loop of bn_mul_add_words, which handles eight 32-bit words per iteration. It's possible the cleanup loop could benefit as well. An attempt was made to convert bn_add_words, but there's nothing to overlap with carry propagation to offset the longer latencies of instructions in the SSE2 units.

It's possible a carry-saves representation of intermediates within the bn library would yield much more improved throughputs. In a carry-saves representation, carry propagation is postponed until an operation needs it explicitly. An intermediate form uses twice the storage, the extra space "saves" the carry that must still be propagated. An implementation of the inner bn_mul_add_words like this shows lots of promise, but it's not clear how to spread this new form out into the rest of the library.

dean gaudet -- you can probably guess my email address if you need it