Hi, I've been thinking a bit more about the structure of polynomial
evaluation, which at a high level is rather similar for ghash and poly1305.
** Intro **
The function to be computed is
R_j = K (R_{j-1} + M_j)
where K is the secret key and M_j are the message blocks. Operations
take place in some finite field. With n message blocks, M_0, ...,
M_{n-1}, and initial R_{-1} = 0, we get
R_{n-1} = M_{n-1} K + M_{n-2} K^2 + ... + M_0 K^n
I.e., a degree n polynomial with coefficients M_j, constant term = 0,
evaluated at the point K.
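For example, with three blocks the recurrence unrolls as

  R_0 = K M_0
  R_1 = K (R_0 + M_1) = K M_1 + K^2 M_0
  R_2 = K (R_1 + M_2) = K M_2 + K^2 M_1 + K^3 M_0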
To be concrete, consider a 64-bit architecture, and a finite field where
elements are represented as two words (for poly1305 we actually need two
words plus a few extra bits, but let's ignore that for simplicity, and
for ghash, let's also ignore the complications from bit reversal). Let B
represent the bignum base. I'm going to be a bit handwavy here: we'll
have B = 2^64 or B = x^64, depending on the type of field.
The finite field arithmetic is defined by reduction mod P, where the
structure of P is nice: a leading one bit followed by more than 64
zeros, and then a few more non-zero bits at the end. This means that we
can define a multiply operation
Y_2 B^2 + Y_1 B + Y_0 = (X_1 B + X_0) (K_1 B + K_0) (mod P)
by four independent multiplication instructions (widening,
64x64 --> 128) involving X and some precomputed values depending on K,
with accumulation involving only shifts and adds, which have low
latency. The result is not fully reduced mod P; it consists of three
words.
We can then reduce this to two words by multiplying Y_2 by a suitable
single-word constant, with final two-word result
R = C Y_2 + Y_1 B + Y_0
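To make the structure concrete, here is a rough C sketch of one way to
get exactly this shape in the carry-less (ghash-like) case, assuming
P = B^2 + C with deg(C) < 64 and still ignoring ghash's real bit order:
take the precomputed values to be K itself and K' = K B mod P, so that
X K = X_1 K' + X_0 K (mod P). All names (clmul64, gf_key, and so on) are
made up for illustration; this is not how any actual Nettle code is
written.

  #include <stdint.h>

  /* Portable carry-less 64x64 -> 128 multiply; real code would use a
     pclmulqdq/vpmsumd-style instruction instead. */
  static void
  clmul64 (uint64_t r[2], uint64_t a, uint64_t b)
  {
    uint64_t lo = 0, hi = 0;
    unsigned i;
    for (i = 0; i < 64; i++)
      if ((b >> i) & 1)
        {
          lo ^= a << i;
          if (i > 0)
            hi ^= a >> (64 - i);
        }
    r[0] = lo;
    r[1] = hi;
  }

  struct gf_key
  {
    uint64_t k[2];   /* K */
    uint64_t kb[2];  /* K B mod P, precomputed */
  };

  static void
  gf_key_setup (struct gf_key *key, const uint64_t k[2], uint64_t c)
  {
    uint64_t t[2];
    key->k[0] = k[0];
    key->k[1] = k[1];
    /* K B = K_1 B^2 + K_0 B = K_1 C + K_0 B (mod P) */
    clmul64 (t, k[1], c);
    key->kb[0] = t[0];
    key->kb[1] = k[0] ^ t[1];
  }

  /* Y = X K mod P as three words: four independent multiplies,
     accumulation is xor only. */
  static void
  gf_mul_key (uint64_t y[3], const uint64_t x[2], const struct gf_key *key)
  {
    uint64_t p0[2], p1[2], p2[2], p3[2];
    clmul64 (p0, x[0], key->k[0]);
    clmul64 (p1, x[0], key->k[1]);
    clmul64 (p2, x[1], key->kb[0]);
    clmul64 (p3, x[1], key->kb[1]);

    y[0] = p0[0] ^ p2[0];
    y[1] = p0[1] ^ p2[1] ^ p1[0] ^ p3[0];
    y[2] = p1[1] ^ p3[1];
  }

  /* Reduce three words to two: R = C Y_2 + Y_1 B + Y_0. */
  static void
  gf_fold (uint64_t r[2], const uint64_t y[3], uint64_t c)
  {
    uint64_t t[2];
    clmul64 (t, y[2], c);
    r[0] = y[0] ^ t[0];
    r[1] = y[1] ^ t[1];
  }

The integer (poly1305-like) version has the same structure, but the
accumulation needs carry propagation and the extra key bits mentioned
above, which the sketch doesn't try to capture.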
Applying this to the original recurrence, R_j = K (R_{j-1} + M_j), the X
input corresponds to R_{j-1} + M_j, so the critical dependency path from
R_{j-1} to R_j includes *two* multiply latencies. E.g., if multiply latency
is 5 cycles, it's not possible to get this evaluation scheme to run
faster than 10 cycles per block. (In practice, the accumulation will
also contribute one or a few cycles to the critical path, but *much*
less than those two multiplies).
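With those helpers, the straightforward per-block loop would look
something like the sketch below (same made-up names); both gf_mul_key
and gf_fold sit on the path from R_{j-1} to R_j, which is where the two
multiply latencies come from.

  #include <stddef.h>

  /* r is the running R_j, assumed initialized to zero by the caller;
     m holds n two-word message blocks. */
  static void
  gf_hash (uint64_t r[2], const uint64_t *m, size_t n,
           const struct gf_key *key, uint64_t c)
  {
    size_t i;
    for (i = 0; i < n; i++)
      {
        uint64_t x[2], y[3];
        /* X = R + M_i; addition is xor in the carry-less case */
        x[0] = r[0] ^ m[2*i];
        x[1] = r[1] ^ m[2*i + 1];
        gf_mul_key (y, x, key);  /* first multiply latency */
        gf_fold (r, y, c);       /* second multiply latency */
      }
  }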
So the question is: how can we do better?
** Postponing reduction **
The approach taken in the new x86_64 poly1305 code I just pushed is to
skip the final reduction, and let the state be one word larger (this is
particularly cheap for poly1305, because we don't quite increase the
state size by a full word, only from two words + 3 bits to two words +
~60 bits). The multiply operation becomes
Y_2 B^2 + Y_1 B + Y_0 = (X_2 B^2 + X_1 B + X_0) (K_1 B + K_0) (mod P)
This can be arranged with 6 independent multiply instructions + cheap
accumulation. (I haven't worked out the details for the ghash case, but
I do expect that it's rather practical there too).
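Sticking to the simplified carry-less sketch from above (so not the real
ghash case), one way to arrange the six multiplies is to also precompute
K B^2 mod P and compute X K as X_2 (K B^2) + X_1 (K B) + X_0 K; again,
all names are made up.

  struct gf_key3
  {
    uint64_t k[2];    /* K */
    uint64_t kb[2];   /* K B   mod P */
    uint64_t kb2[2];  /* K B^2 mod P, computed from K B the same way
                         K B was computed from K */
  };

  /* Y = X K mod P, with a three-word (unreduced) X: six independent
     multiplies, xor-only accumulation, three-word result. */
  static void
  gf_mul_key3 (uint64_t y[3], const uint64_t x[3],
               const struct gf_key3 *key)
  {
    const uint64_t *kp[3] = { key->k, key->kb, key->kb2 };
    uint64_t p[2];
    unsigned i;

    y[0] = y[1] = y[2] = 0;
    for (i = 0; i < 3; i++)
      {
        /* x_i B^i K = x_i (B^i K mod P) (mod P) */
        clmul64 (p, x[i], kp[i][0]);
        y[0] ^= p[0];
        y[1] ^= p[1];
        clmul64 (p, x[i], kp[i][1]);
        y[1] ^= p[0];
        y[2] ^= p[1];
      }
  }

The per-block loop then consists of X = R + M_j (only the low two words
of R are affected by M_j) and a single gf_mul_key3 call, with one fold
after the last block to get back to two words.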
Then the dependency chain from one block to the next is reduced to one
multiply latency, 5 cycles in our example. If all other needed
instructions can be scheduled (manually, or by the processor's
out-of-order machinery) to run in 5 cycles in parallel with the
multiplies, we would get a running time of 5 cycles per block.
** Interleaving **
The other approach, used in the recent powerpc gcm code, is to
interleave multiple blocks. For simplicity, only consider 2-way
interleaving here. The key thing is that if we expand the recurrence
once, we get
R_j = K (M_j + K (R_{j-2} + M_{j-1}))
= K M_j + K^2 (R_{j-2} + M_{j-1})
We get two field multiplications, but one of them, K M_j, is completely
independent of previous blocks (R_{j-2}), and can be computed in
parallel. It may add a cycle or so to accumulation latency, but we can
do essentially twice as much work without making the critical path
longer. We get 8 independent multiply instructions, and one dependent
multiply for the final folding. This can be extended to more than two
blocks if needed (depending on the number of available registers).
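As a sketch, reusing the made-up helpers from above (key2 is a struct
gf_key set up for K^2, the number of blocks is assumed even, and the
tail is ignored), the 2-way interleaved loop could look like this, with
8 independent multiplies and one dependent fold per iteration:

  static void
  gf_hash_2way (uint64_t r[2], const uint64_t *m, size_t n, /* n even */
                const struct gf_key *key, const struct gf_key *key2,
                uint64_t c)
  {
    size_t b;
    for (b = 0; b + 1 < n; b += 2)
      {
        uint64_t x[2], u[3], v[3], y[3];
        x[0] = r[0] ^ m[2*b];
        x[1] = r[1] ^ m[2*b + 1];
        gf_mul_key (u, x, key2);           /* K^2 (R + M_b), on the critical path */
        gf_mul_key (v, &m[2*b + 2], key);  /* K M_{b+1}, independent of R */

        y[0] = u[0] ^ v[0];
        y[1] = u[1] ^ v[1];
        y[2] = u[2] ^ v[2];
        gf_fold (r, y, c);                 /* the one dependent multiply */
      }
  }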
Another variant could be to separate even and odd parts of the
polynomial being evaluated, and evaluate both parts at K^2. We can then
compute the two recurrences
E_j = K^2 (E_{j-1} + M_{2j})
O_j = K^2 (O_{j-1} + M_{2j+1})
in parallel. It's unclear to me what the pros and cons are compared to
the previous variant. One may get some advantage from both multiplies
using the same factor K^2. On the other hand, each recurrence has to be
accumulated and folded separately, which costs instructions and
registers. Maybe more useful for hardware implementation? This variant
is currently not used in Nettle.
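For completeness, a sketch of the even/odd variant with the same made-up
helpers (the final combination of E and O into the hash value is omitted
here, as it isn't spelled out above):

  /* E and O are two independent accumulators, both stepped by K^2;
     e[] and o[] are assumed initialized to zero by the caller. */
  static void
  gf_hash_even_odd (uint64_t e[2], uint64_t o[2],
                    const uint64_t *m, size_t n, /* n even */
                    const struct gf_key *key2, uint64_t c)
  {
    size_t b;
    for (b = 0; b + 1 < n; b += 2)
      {
        uint64_t x[2], y[3];

        x[0] = e[0] ^ m[2*b];
        x[1] = e[1] ^ m[2*b + 1];
        gf_mul_key (y, x, key2);
        gf_fold (e, y, c);        /* E_j = K^2 (E_{j-1} + M_{2j}) */

        x[0] = o[0] ^ m[2*b + 2];
        x[1] = o[1] ^ m[2*b + 3];
        gf_mul_key (y, x, key2);
        gf_fold (o, y, c);        /* O_j = K^2 (O_{j-1} + M_{2j+1}) */
      }
  }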
** Doing both **
It's possible to combine those two tricks. Processing of two blocks
would then be an operation of the form
Z_2 B^2 + Z_1 B + Z_0
= (X_2 B^2 + X_1 B + X_0) K^2 + (Y_1 B + Y_0) K
Here, the Xs (three words) represent R_{j-2} + M_{j-1}, the Ys represent
M_j, and the Zs represent R_j, as three words (without final folding).
We would need 10 independent multiplies, one more than with plain
interleaving, but the critical path includes only one multiply latency.
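A sketch of the combined loop, again with the made-up helpers (key2_3 is
a struct gf_key3 set up for K^2; the single fold after the loop is not
shown):

  static void
  gf_hash_2way_lazy (uint64_t r[3], const uint64_t *m, size_t n, /* n even */
                     const struct gf_key *key, const struct gf_key3 *key2_3)
  {
    size_t b;
    for (b = 0; b + 1 < n; b += 2)
      {
        uint64_t x[3], u[3], v[3];
        x[0] = r[0] ^ m[2*b];
        x[1] = r[1] ^ m[2*b + 1];
        x[2] = r[2];
        gf_mul_key3 (u, x, key2_3);        /* (R + M_b) K^2: 6 multiplies,
                                              on the critical path */
        gf_mul_key (v, &m[2*b + 2], key);  /* M_{b+1} K: 4 multiplies,
                                              independent */
        r[0] = u[0] ^ v[0];
        r[1] = u[1] ^ v[1];
        r[2] = u[2] ^ v[2];
      }
  }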
I think this is a promising alternative, if one would otherwise need to
interleave a large number of blocks to get full utilization of the
multipliers.
** How to choose **
When implementing one of those schemes, different processor resources
may be the bottleneck. I'd expect it to be one of
o Multiply latency, i.e., latency of the dependency chain from one block
to the next (including also a few additions, but multiply latency
will be the main part). If this is the bottleneck, it means all
other instructions can be scheduled in parallel, and the processor
will sit idle for some cycles, waiting for a multiply to complete.
Typical multiply latency is about 5 times that of an addition
(but the ratio differs quite a bit between processors, of course).
o Multiply throughput, i.e., the maximum number of (independent) multiply
instructions that can be run per cycle. Typical number is 0.5 -- 2.
If this is the bottleneck, the processor will spend some cycles idle,
waiting for a multiplier to be ready to accept a new input.
o A superscalar processor can issue several instructions in the same
cycle, but there's a fixed small limit. Typical number is 2 -- 6. So,
e.g., if the processor can issue maximum 4 instructions per cycle,
the evaluation loop consists of 40 instructions, and the loop
actually runs in close to 10 cycles per iteration, then instruction
issue is the bottleneck.
The tricks discussed in this note are useful for finding an evaluation
scheme where multiply latency isn't a bottleneck. But once a loop hits
the limit on multiply throughput or instructions per cycle, other tricks
are needed to optimize further. In particular, postponed reduction
has a cost in multiply throughput, since it needs some additional
multiply instructions.
I think one should aim to hit the limit on multiply throughput; that
limit is hard to get around (it's possible to reduce the number of
multiply instructions somewhat, with the Karatsuba trick, but due to the
additional overhead, that's likely to be useful only on processors with
particularly low multiply throughput).
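For reference, the Karatsuba trick applied to the plain 2x2-word
carry-less product looks something like this (three multiplies instead
of four, at the cost of a few extra xors; it doesn't combine directly
with the key-dependent folding used in the earlier sketches):

  /* Full four-word product Z = X K, schoolbook replaced by Karatsuba. */
  static void
  gf_mul_karatsuba (uint64_t z[4], const uint64_t x[2], const uint64_t k[2])
  {
    uint64_t p0[2], p1[2], pm[2];
    clmul64 (p0, x[0], k[0]);
    clmul64 (p1, x[1], k[1]);
    clmul64 (pm, x[0] ^ x[1], k[0] ^ k[1]);

    /* middle term = pm + p0 + p1, added at position B */
    z[0] = p0[0];
    z[1] = p0[1] ^ pm[0] ^ p0[0] ^ p1[0];
    z[2] = p1[0] ^ pm[1] ^ p0[1] ^ p1[1];
    z[3] = p1[1];
  }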
Regards,
/Niels
--
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.