Martin Storsjö <martin@martin.st> writes:
> Hmm, yes, I think that might have been the case. So since we can't rely on that being aligned anyway, we could just as well skip the 8-byte offset.
If it works now, I don't think we should touch this code further before release.
For later optimization (I don't know whether using aligned rather than unaligned loads and stores really makes a difference to performance here), one could keep the 8-byte extra allocation and then do something like
	lea	8(%rsp), %r10
	and	$-16, %r10
(%r10 should always be free for scratch use at both entry and exit, right?) Since %rsp is at least 8-byte aligned, %r10 will then be 16-byte aligned and hold either %rsp or %rsp + 8, and we can do fully aligned loads and stores of the xmm registers via offsets from %r10.
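To double-check the arithmetic, here is a small C sketch that models the two instructions on plain integers. It is only an illustration: the names aligned_base and check_aligned_base are made up for this example, and it assumes the incoming %rsp value is a multiple of 8.

```c
#include <assert.h>
#include <stdint.h>

/* Models "lea 8(%rsp), %r10" followed by "and $-16, %r10":
   add 8, then clear the low four bits. Hypothetical helper name. */
static uint64_t aligned_base(uint64_t rsp)
{
    return (rsp + 8) & ~(uint64_t)15;
}

/* Checks the two claimed properties for one stack pointer value:
   the result is 16-byte aligned, and it equals rsp or rsp + 8.
   Assumes rsp is a multiple of 8. Returns 1 if both hold. */
static int check_aligned_base(uint64_t rsp)
{
    uint64_t r10 = aligned_base(rsp);

    assert(rsp % 8 == 0);
    assert(r10 % 16 == 0);
    assert(r10 == rsp || r10 == rsp + 8);
    return 1;
}
```

If %rsp is already a multiple of 16, the result is %rsp itself; if it is 8 mod 16, the result is %rsp + 8. In both cases the 8 extra bytes of allocation cover the shift, so the aligned area stays inside the allocated space.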
Regards,
/Niels