James Cloos <cloos@jhcloos.com> writes:
,----< §10.4, p169 of 47414_15h_sw_opt_guide.pdf¹ >
| Optimization
|
| When moving data from a GPR to an XMM register, use separate store and
| load instructions to move the data first from the source register to a
| temporary location in memory and then from memory into the destination
| register, taking the memory latency into account when scheduling both
| stages of the load-store sequence.
Thanks for the hint. Maybe I can try that; it sounds like a fairly easy fix. If I can get the code to run at three instructions per cycle, that would be a pretty nice speedup on AMD processors.
| Whenever possible, use loads and stores of the same data length. (See
| 6.3, “Store-to-Load Forwarding Restrictions” on page 98 for more
| information.)
Not sure how to interpret this. The interesting cases here are:
1. Writing the low 64 bits of an xmm register (movq with memory destination), and reading them back into a GPR.
2. Writing a full 128-bit xmm register (movaps), and reading it back into two GPRs.
And then the opposite direction.
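To make the cases concrete, here is a minimal C sketch using SSE2 intrinsics from <emmintrin.h>. The function and type names are made up for illustration, and this is only an approximation of what the hand-written assembly would do: a compiler is free to replace the memory round trip with a direct movq, so for real measurements the store and load would have to be written in assembly. Note that case 2 and its reverse mix a 128-bit access with two 64-bit accesses, which is exactly the situation where the "same data length" advice is unclear.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Temporary memory location, viewable as one xmm value or two 64-bit words. */
union xmm_words { __m128i v; uint64_t w[2]; };

/* Case 1: 64-bit store from xmm (movq to memory), 64-bit load into a GPR. */
uint64_t xmm_low64_to_gpr(__m128i x)
{
    uint64_t tmp;
    _mm_storel_epi64((__m128i *) &tmp, x);
    return tmp;
}

/* Case 2: one aligned 128-bit store, then two 64-bit loads into GPRs. */
void xmm_to_gpr_pair(__m128i x, uint64_t *lo, uint64_t *hi)
{
    union xmm_words tmp;
    _mm_store_si128(&tmp.v, x);
    *lo = tmp.w[0];
    *hi = tmp.w[1];
}

/* Opposite of case 1: 64-bit store from a GPR, 64-bit load into xmm
   (movq from memory; the upper 64 bits of the result are zero). */
__m128i gpr_to_xmm_low64(uint64_t g)
{
    uint64_t tmp = g;
    return _mm_loadl_epi64((const __m128i *) &tmp);
}

/* Opposite of case 2: two 64-bit stores, then one aligned 128-bit load. */
__m128i gpr_pair_to_xmm(uint64_t lo, uint64_t hi)
{
    union xmm_words tmp;
    tmp.w[0] = lo;
    tmp.w[1] = hi;
    return _mm_load_si128(&tmp.v);
}
```

Only the two case-1 variants use a store and a load of the same length; the other two are the ones where the forwarding restrictions in section 6.3 would need to be checked.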
Regards,
/Niels