Hi,
a while ago I was asked to explain the 64-bit C versions of ecc_secp256r1_modp and ecc_secp256r1_modq (in ecc-secp256r1.c), and I found that a bit difficult.
I've rewritten them, on branch https://git.lysator.liu.se/nettle/nettle/-/blob/secp256r1-mod/ecc-secp256r1..... The main difference is the handling of the case where the next quotient is close to 2^{64}: the old code allowed the quotient to overflow 64 bits, using an additional carry variable q2. The new code ensures that the next quotient is always at most 2^{64} - 1.
For the new implementation, the modp function is a special case of the 2/1 division in https://gmplib.org/~tege/division-paper.pdf (one would usually need 3/2 division to get sufficient accuracy, but it reduces to 2/1 since the next most significant word of p is 0), and the modq function is a special case of divappr2, described in https://www.lysator.liu.se/~nisse/misc/schoolbook-divappr.pdf.
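For anyone who doesn't have the division paper at hand, here is a minimal sketch of the generic 2/1 division with a precomputed reciprocal, as I understand it from that paper. It is not the code on the branch, which is specialized for this curve. The names reciprocal_2by1, div_2by1_preinv and div_2by1_matches_reference are made up for illustration, and the sketch assumes a compiler providing unsigned __int128 (a GCC/Clang extension).

```c
#include <assert.h>
#include <stdint.h>

/* unsigned __int128 is a GCC/Clang extension, assumed available here. */
typedef unsigned __int128 u128;

/* Reciprocal v = floor((2^128 - 1) / d) - 2^64, for a normalized d
   (top bit set). The full quotient lies in [2^64, 2^65), so
   truncating it to 64 bits subtracts exactly 2^64. */
uint64_t
reciprocal_2by1 (uint64_t d)
{
  assert (d >> 63);
  return (uint64_t) (~(u128) 0 / d);
}

/* Divides the two-word number (u1, u0) by d, given v as above.
   Requires u1 < d, so that the quotient fits in one word. Returns
   the quotient and stores the remainder in *r. (Hypothetical helper,
   not a nettle function.) */
uint64_t
div_2by1_preinv (uint64_t *r, uint64_t u1, uint64_t u0,
                 uint64_t d, uint64_t v)
{
  u128 q;
  uint64_t q1, q0, t;

  assert (d >> 63);
  assert (u1 < d);

  /* Candidate quotient: high word of v * u1 + (u1, u0), plus one.
     The sum fits in 128 bits because u1 < d. */
  q = (u128) v * u1 + (((u128) u1 << 64) | u0);
  q1 = (uint64_t) (q >> 64) + 1;
  q0 = (uint64_t) q;

  /* Candidate remainder, computed mod 2^64. */
  t = u0 - q1 * d;

  /* First adjustment: the candidate quotient was one too large. */
  if (t > q0)
    {
      q1--;
      t += d;
    }
  /* Second, rarely taken, adjustment: one too small. */
  if (t >= d)
    {
      q1++;
      t -= d;
    }
  *r = t;
  return q1;
}

/* Compares the result with plain 128-bit division, for testing only. */
int
div_2by1_matches_reference (uint64_t u1, uint64_t u0, uint64_t d)
{
  u128 u = ((u128) u1 << 64) | u0;
  uint64_t r;
  uint64_t q = div_2by1_preinv (&r, u1, u0, d, reciprocal_2by1 (d));
  return q == (uint64_t) (u / d) && r == (uint64_t) (u % d);
}
```

As an illustrative divisor one can take d = 0xffffffff00000001, the most significant 64-bit word of the secp256r1 prime p, e.g., div_2by1_matches_reference (0x123456789abcdef0, 0x0fedcba987654321, 0xffffffff00000001) should return 1.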
I've not been able to measure any significant difference in speed (I get somewhat noisy measurements from the examples/ecc-benchmark tool), although I would expect the new code to be very slightly faster. These functions are not that performance critical, since the bulk of the reductions for this curve is done using redc, not mod.
Any additional testing, benchmarking, or code staring, is appreciated. I will likely merge the new code to the master branch in a few days.
Regards, /Niels