Hi Niels,
I tried to apply your method but can't get it to work. While applying it, one question came to mind.
First, compute b_0(x) / x^64 (mod P(x)), which expands it from 64 bits to 128,
c_1(x) x^64 + c_0(x) = b_0(x) / x^64 (mod P(x))
Here you are trying to get the partially reduced product by computing b_0(x) / x^64 (mod P(x)). But since the degree of the input is 127, we can use the polynomial defining the finite field with 2^64 elements. In this case P(x) = x^64 + x^4 + x^3 + x + 1 and P' = P^-1 (mod x^64) = x^63 + x^61 + x^60 + 1, which is the same constant 0xB0, and the function now becomes:
c_1(x) x^64 + c_0(x) = ((b_0 mod x^64) * P') mod x^64
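
For reference, here is a minimal Python sketch of the quoted step b_0(x) / x^64 (mod P(x)), so the two variants can be compared on concrete values. It assumes natural bit order (bit i of an integer is the coefficient of x^i) and the GCM polynomial P(x) = x^128 + x^7 + x^2 + x + 1; neither is stated above, and the helper names clmul, polymod and div_x64 are made up for illustration. It is a slow bit-by-bit check, not the optimized code.

```python
# Polynomials over GF(2) are stored as Python ints: bit i = coefficient of x^i.
# Assumption: natural (non-reflected) bit order and the GCM polynomial below.
P = (1 << 128) | 0x87  # x^128 + x^7 + x^2 + x + 1


def clmul(a, b):
    """Carry-less (GF(2)[x]) product of a and b."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def polymod(a, p):
    """Remainder of a(x) divided by p(x) over GF(2)."""
    dp = p.bit_length() - 1
    while a.bit_length() - 1 >= dp:
        a ^= p << (a.bit_length() - 1 - dp)
    return a


def div_x64(b0, p=P):
    """Return b0(x) / x^64 (mod p(x)) for deg b0 < 64.

    Montgomery-style: add multiples of p (whose constant term is 1) until the
    low 64 coefficients are zero, then shift right by 64.
    """
    t = b0
    for i in range(64):
        if (t >> i) & 1:
            t ^= p << i  # clears bit i, leaves bits below i untouched
    return t >> 64


b0 = 0x0123456789ABCDEF
c = div_x64(b0)
c1, c0 = c >> 64, c & ((1 << 64) - 1)  # c_1(x) x^64 + c_0(x)
# Multiplying back by x^64 and reducing mod P must give b0 again.
assert polymod(clmul(c, 1 << 64), P) == b0
```

The result fits in 128 bits because b_0 plus a multiple of P of degree at most 63 + 128 is shifted right by 64. Passing a different p to div_x64 lets the same check be run against another modulus, as long as its constant term is 1.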