Maamoun TK <maamoun.tk@googlemail.com> writes:

> Wider multiplication would improve the performance for 64-bit general registers, but as is the case for the current SIMD implementation, the radix 2^26 fits well there.
If multiply throughput is the bottleneck, it makes sense to do as much work as possible per multiply. So I don't think I understand the benefits of interleaving; can you explain?
Let's consider the 64-bit case, since that's less writing. B = 2^64 as usual. Then the state is
H = h_2 B^2 + h_1 B + h_0
(with h_2 rather small, depending on how far we normalize for each block; let's assume at most 3 bits, or maybe even h_2 <= 4).
R = r_1 B + r_0
By the spec, the high 4 bits of both r_0 and r_1, and the low 2 bits of r_1, are zero, which makes the multiplication R H (mod p) particularly nice.
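For reference, a minimal sketch of how those zero bits come about from the clamping, viewing the first 16 key bytes as two little-endian 64-bit words (function and variable names are just for illustration, not from any particular implementation):

  #include <stdint.h>
  #include <string.h>

  /* Clamp the r half of the key. After this, r0 and r1 both have their
     high 4 bits zero, and r1 also has its low 2 bits zero (the clamp
     clears a few more bits, not needed for the argument below). */
  static void
  poly1305_clamp (const uint8_t key[16], uint64_t *r0, uint64_t *r1)
  {
    uint64_t lo, hi;
    memcpy (&lo, key, 8);     /* assumes a little-endian host, for brevity */
    memcpy (&hi, key + 8, 8);
    *r0 = lo & 0x0ffffffc0fffffffULL;
    *r1 = hi & 0x0ffffffc0ffffffcULL;
  }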
We get
R H = r_0 h_0 + B (r_1 h_0 + r_0 h_1) + B^2 (r_1 h_1 + r_0 h_2) + B^3 r_1 h_2
But then B^2 = 5/4 (mod p), and hence B^2 r_1 = 5 r_1 / 4 (mod p), where the "/ 4" is just shifting out the two low zero bits. So let r_1' = 5 r_1 / 4,
R H = r_0 h_0 + r_1' h_1 + B (r_1 h_0 + r_0 h_1 + r_1' h_2 + B r_0 h_2)
These are 4 long multiplications (64x64 --> 128) and two short (64x64 --> 64) for the products involving h_2. (The 32-bit version would be 16 long multiplications and 4 short.)
From the zero high bits, we also get bounds on these terms,
f_0 = r_0 h_0 + r_1' h_1 < 2^124 + 5*2^122 = 9*2^122
f_1 = r_1 h_0 + r_0 h_1 + r_1' h_2 + B r_0 h_2 < 2^125 + 5*2^61 + 2^127
So these two chains can be added together as 128-bit quantities with no overflow, in any order; there's plenty of parallelism. E.g., the POWER vmsumudm instruction might be useful.
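To make the dataflow concrete, here's a rough C sketch of that accumulation (just illustrating the arithmetic, not any implementation); it assumes uint64_t variables h0, h1, h2 for the state (h2 small) and r0, r1 for the clamped key, and uses gcc's unsigned __int128 in place of explicit 64x64 --> 128 multiply instructions:

  /* r_1' = 5 r_1 / 4; an exact shift, since the low 2 bits of r_1 are zero. */
  uint64_t r1p = 5 * (r1 >> 2);

  /* f_0 = r_0 h_0 + r_1' h_1 < 9*2^122 */
  unsigned __int128 f0 = (unsigned __int128) r0 * h0
    + (unsigned __int128) r1p * h1;

  /* f_1 = r_1 h_0 + r_0 h_1 + r_1' h_2 + B r_0 h_2 < 2^128.
     The products involving h2 are the two short multiplications;
     they fit in 64 bits since h2 is only a few bits. */
  unsigned __int128 f1 = (unsigned __int128) r1 * h0
    + (unsigned __int128) r0 * h1
    + r1p * h2
    + ((unsigned __int128) (r0 * h2) << 64);

Each of the two-product sums here is the shape of thing vmsumudm computes in one instruction (two 64x64 products added into a 128-bit accumulator), if I read the ISA right.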
For the final folding, we need to split f_1 into the top 62 and low 66 bits, multiply the top part by 5 (the product fits in 64 bits), and add it into f_0, which still fits in 128 bits.
And then take the top 64 bits of f_0 and add them into f_1 (the result fits in 67 bits).
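Continuing the same sketch, the folding might look like this:

  /* Bits 130 and up of the result are the top 62 bits of f1 (bit 66 of
     f1 is bit 130 of f0 + B f1); they wrap around to bit 0 multiplied
     by 5, since 2^130 = 5 (mod p). */
  uint64_t t = (uint64_t) (f1 >> 66);      /* at most about 5*2^59, so 5*t fits in 64 bits */
  f1 &= ((unsigned __int128) 1 << 66) - 1; /* keep the low 66 bits */
  f0 += 5 * t;                             /* still fits in 128 bits */

  /* Carry the top half of f0 into f1, and re-split into the state. */
  f1 += f0 >> 64;                          /* fits in 67 bits */
  h0 = (uint64_t) f0;
  h1 = (uint64_t) f1;
  h2 = (uint64_t) (f1 >> 64);              /* at most 3 bits, as assumed above */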
The current C implementation uses radix 26, and 25 multiplies (32x32 --> 64) per block. And quite a lot of shifts. A radix 32 variant analogous to the above would need 16 long multiplies and 4 short. I'd expect that to be faster on most machines, but I'd have to try that out.
In contrast, trying to use a similar scheme for multiplying by (r^2 (mod p)), as needed for an interleaved version, seems more expensive. There are several contributions to the cost:
* First, the accumulation of products by power of B needs to take carries into account, since the result can exceed 2^128, so one would need something closer to general schoolbook multiplication.
* Second, since r^2 (mod p) may exceed 2^128, we need three words rather than two, so three more short multiplications to add in.
* Third, we can't pre-divide key words by 4, since the low bits are no longer guaranteed to be zero. This gives a more expensive reduction, with more multiplies by 5.
The first two points make a smaller radix more attractive; if we need three words for both factors, we can distribute the bits to ensure some of the most significant bits are zero.
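For example (my numbers, nothing from an actual implementation): a value below 2^130, held in words y0, y1 and a few extra bits y2, could be spread over three words of 44, 44 and 42 bits, leaving roughly 20 free high bits per word to absorb accumulation without carries:

  uint64_t m44 = ((uint64_t) 1 << 44) - 1;
  uint64_t w0 = y0 & m44;                         /* bits 0..43 */
  uint64_t w1 = ((y0 >> 44) | (y1 << 20)) & m44;  /* bits 44..87 */
  uint64_t w2 = (y1 >> 24) | (y2 << 40);          /* bits 88..129, 42 bits */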
> Since the loop of block iteration is moved to inside the assembly implementation, computing one multiple of key at the function prologue should be ok.
For large messages, that's fine, but may add a significant cost for messages of just two blocks.
Regards,
/Niels