"NM" == Niels Möller nisse@lysator.liu.se writes:
NM> Thanks for the hint. Maybe I can try that, it sounds like a fairly easy
NM> fix. If I can get the code to run at three instructions per cycle, that
NM> would be a pretty nice speedup on AMD processors.
Indeed.
| Whenever possible, use loads and stores of the same data length. (See
| 6.3, "Store-to-Load Forwarding Restrictions" on page 98 for more
| information.)
NM> Not sure how to interpret this. The interesting cases here are:
In the context of saving a 128-bit xmm register and reading the halves into two 64-bit integer registers, I think it means make sure you use the instruction form that includes the 0x66 prefix octet (which specifies that the 128 bits are two 64-bit values rather than four 32-bit values).
I don't see a 4x32 version of MOVDQA in the original xmm book, just the 2x64, so it shouldn't be an issue for this application. If there were one, you'd want to be sure to use the '66 0F 6F /r' version and not the putative '0F 6F /r' version.
It is more of an issue when dealing with packed floats vs packed doubles. E.g., XORPS and XORPD both do a 128-bit bit-for-bit XOR, but if you use the XORPS version in code otherwise dealing with packed doubles, or vice versa, the pipeline will stall.
There is a similar issue when mixing float or double instructions with non-floating-point loads and stores.
I think that, internally, they use different register files for packed doubles and packed singles. Or, more generally, packed 64-bit-at-a-time vs packed-32-bit-at-a-time. But that is conjecture.
-JimC