On Tue, 23 Apr 2013, Niels Möller wrote:
Martin Storsjö <martin@martin.st> writes:
Hmm, yes, I think that might have been the case. So since we can't rely on that being aligned anyway, we could just as well skip the 8 byte offset.
If it works now, I don't think we should touch this code further before release.
Yes, that's probably wisest.
For later optimization (if it really makes a difference to performance if we use aligned or unaligned loads and stores here? I don't know), one could keep the 8 byte extra allocation, then do something like
  lea 8(%rsp), %r10
  and $-16, %r10
(%r10 should always be free for scratch use at both entry and exit, right?). Then %r10 will be 16 byte aligned, and hold either %rsp or %rsp + 8. And we can then do fully aligned loads and stores of the xmm registers via offsets from %r10.
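
To sanity-check the arithmetic of the lea/and pair above, here is a minimal C sketch of the same computation. It is only an illustration, not code from the patch: the function name align_scratch is made up, and it assumes that %rsp is already 8 byte aligned when the two instructions run.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper mirroring the two instructions:
 *   lea 8(%rsp), %r10   -> add 8 to the stack pointer value
 *   and $-16, %r10      -> clear the low four bits
 * Assumption: rsp is a multiple of 8 on entry. */
static uint64_t align_scratch(uint64_t rsp)
{
    assert(rsp % 8 == 0);            /* stated assumption, not checked in the asm */
    return (rsp + 8) & ~(uint64_t)15; /* -16 has the same bit pattern as ~15 */
}
```

With a stack pointer that is 0 mod 16 the result is the stack pointer itself, and with one that is 8 mod 16 the result is the stack pointer plus 8. In both cases the result is 16 byte aligned and never below %rsp, which is why the 8 bytes of extra allocation are enough to keep the save area inside the allocated space.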
That would probably work. I don't know these things well enough to say whether there's any serious performance to be gained by doing this, compared to the inconvenience of wasting one register.
// Martin