Hi, I've been thinking a bit more on the structure of polynomial evaluation, which at a high level is rather similar for ghash and gcm.
** Intro **
The function to be computed is
R_j = K (R_{j-1} + M_j)
where K is the secret key and M_j are the message blocks. Operations take place in some finite field. With n message blocks, M_0, ..., M_{n-1}, and initial R_{-1} = 0, we get
R_{n-1} = m_{n-1} K + m_{n-2} K^2 + ... m_0 K^n
I.e., a degree n polynomial with coefficients M_j, constant term = 0, evaluated at the point K.
To be concrete, consider a 64-bit architecture, and a finite field where elements are represented as two words (for poly1305, we need two words plus a few extra bits, but ignore that now, for simplicity. And for ghash, let's also ignore the complications from bit-reversal). Let B represent the bignum base. I'm going to make this a bit handwavy, but we'll have B = 2^64 or B = x^64 depending on type of field.
The finite field is represented by a mod P operation, where the structure of P is nice, with a leading one bit followed by more than 64 zeros, and then a few more non-zero bits at the end. This implies that we can define a multiply operation
Y_2 B^2 + Y_1 B + Y_0 = (X_1 B + X_0) (K_1 B + K_0) (mod P)
by four independent independent multiplication instructions (widening, 64x64 --> 128) involving X and some precomputed values depending on K, and accumulation only involving shifts/adds that have low latency. The result is not fully reduced mod P; it consists of three words.
We can then reduce this to two words by multiplying Y_2 by a suitable single-word constant, with final two-word result
R = C Y_2 + Y_1 B + Y_0
Applying this to the original recurrency, R_j = K (R_{j-1} + M_j), the X input corresponds to R_{j-1} + M_j, so the critical dependency path from R_{j-1} to R_j includes *two* multiply latencies. E.g, if multiply latency is 5 cycles, it's not possible to get this evaluation scheme to run faster than 10 cycles per block. (In practice, the accumulation will also contribute one or a few cycles to the critical path, but *much* less than those two multiplies).
So question is, how can we do better?
** Postponing reduction **
The approach taken in the new x86_64 poly1305 code I just pushed is to skip the final reduction, and let the state be one word larger (and this is particularly cheap for poly1305, because we don't quite increase state size by a word, but from two words + 3 bits to two words plus ~60 bits). The multiply operation becomes
Y_2 B^2 + Y_1 B + Y_0 = (X_2 B^2 + X_1 B + X_0) (K_1 B + K_0) (mod P)
This can be arranged with 6 independent multiply instructions + cheap accumulation. (I haven't worked out the details for the ghash case, but I do expect that it's rather practival there too).
Then the dependency chain from one block to the next is reduced to one multiply latency, 5 cycles in our example. In case all other needed instructions can be scheduled (manually, or by the processor's out-of-order machinery) to run in 5 cycles in parallel with the multiplies, we would get a running time of 5 cycles per block.
** Interleaving **
The other approach, used in the recent powerpc gcm code, is to interleave multiple blocks. For simplicity, only consider 2-way interleaving here. The key thing is that if we expand te recurrency once, we get
R_j = K (M_j + K (R_{j-2} + M_{j-1})) = K M_j + K^2 (R_{j-2} + M_{j-1})
We get two field multiplications, but one of them, K M_j, is completely independent of previous blocks (R_{j-2}), and can be computed in parallel. It may add a cycle or so to accumulation latency, but we can do essentially twice as much work without making the critical path longer. We get 8 independent multiply instructions, and one dependent for the final folding. Can be extended to more than two blocks if needed (depending on number of available registers).
Another variant could be to separate even and odd parts of the polynomial being evaluated, and evaluate both parts at K^2. We can then compute the two recurrencies
E_j = K^2 (E_{j-1} + M_{2j}) O_j = K^2 (O_{j-1} + M_{2j+1})
in parallel. It's unclear to me what the pros and cons are compared to previous variant. One may get some advantage from both multiplies using the same factor K^2. On the other hand, each recurrency has to be accumulated and folded separately, which costs instructions and registers. Maybe more useful for hardware implementation? This variant is currently not used in Nettle.
** Doing both **
It's possible to combine those two tricks. Processing of two blocks would then be an operation of the form
Z_2 B^2 + Z_1 B + Z_0 = (X_2 B^2 + X_1 B + X_0) K^2 + (Y_1 B + Y_0) K
Here, the Xs (three words) represent R_{j-2} + M_{j-1}, the Ys repreent M_{j-2}, and the Zs represents R_j, as three words (without final folding). We would need 10 independent multiples, one more than with plain interleaving, but critical path includes only one multiply latency.
I think this is a promising alternative, if one would otherwise need to interleaving a large number of blocks to get full utilization of the multipliers.
** How to choose **
When implementing one of those schemes, different processor resources may be the bottleneck. I'd expect it to be one of
o Multiply latency, i.e, latency of the dependency chain from one block to the next (including also a few additions, but multiply latency willb e the main part). If this is the bottleneck, it means all other instructions can be scheduled in parallel, and the processor will sit idle for some cycles, waiting for a multiply to complete. Typical latency for multiply is 5 times longer than for an addition (but ratio difers quite a bit between processors, of course)
o Multiply throughput, i.e., the maximum number of (independent) multiply instructions that can be run per cycle. Typical number is 0.5 -- 2. If this is the bottleneck, the processor will spend some cycles idle, waiting for a multiplier to be ready to accept a new input.
o A superscalar processor can issue several instructinos in the same cycle, but there's a fix small limit. Typical number is 2 -- 6. So, e.g., if the processor can issue maximum 4 instructions per cycle, the evaluation loop consists of 40 instructions, and the loop actually runs in close to 10 cycles per iteration, then instruction issue is the bottleneck.
The tricks discussed in this note are useful for finding an evaluation scheme where multiply latency isn't a bottleneck. But once a loop hits the limit on multiply throughput or instructions per cycle, other tricks are needed to optimize further. In particular, the postponed reduction has a cost in multiply throughput, since it needs some additional multiply instruction.
I think one should aim to hit the limit on multiply throughput; that one is hard to negotiate (it's possible to reduce the number of multiply instructions somewhat, by the Karatsuba trick, but due to the additional overhead, likely to be useful only on processors with particularly low multiply throughput).
Regards, /Niels