"NM" == Niels Möller <nisse@lysator.liu.se> writes:
NM> my best guess is that it's the
NM> moves of data between regular registers and xmm registers that
NM> somehow stall.
IIRC, the advice I've seen is to always move data between the integer registers and the xmm registers via the stack.
All of the relevant gcc- and llvm-produced code I've looked at (at least over the last few months; I can't remember too far back) follows that pattern.
Yes, the 47414_15h_sw_opt_guide.pdf, in §10.4, says:
,----< §10.4, p169 of 47414_15h_sw_opt_guide.pdf¹ >
| Optimization
|
| When moving data from a GPR to an XMM register, use separate store and
| load instructions to move the data first from the source register to a
| temporary location in memory and then from memory into the destination
| register, taking the memory latency into account when scheduling both
| stages of the load-store sequence.
|
| When moving data from an XMM register to a general-purpose register,
| use the VMOVD instruction.
|
| Whenever possible, use loads and stores of the same data length. (See
| 6.3, “Store-to-Load Forwarding Restrictions” on page 98 for more
| information.)
`----
VMOVD, obviously, doesn’t apply for fam10 and earlier; I didn’t look through my archive to find the sw_opt_guide for earlier processors, though.
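For illustration only, here is a minimal C sketch of the two directions described in the quoted advice. It assumes x86-64 with SSE2 and a compiler that ships <emmintrin.h>. The function names are made up for this example. The compiler remains free to pick the actual instruction sequence (it may even optimize the stack temporary away), so treat this as a way to experiment and check the disassembly rather than as a guarantee of what gets emitted.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* GPR -> XMM through memory: store the value to a stack temporary,
   then do a 64-bit load from it, so the store and the load use the
   same data length.  The upper 64 bits of the result are zero. */
static __m128i gpr_to_xmm_via_stack(uint64_t x)
{
    uint64_t slot = x;  /* hypothetical stack temporary */
    return _mm_loadl_epi64((const __m128i *)&slot);
}

/* XMM -> GPR directly: copy the low 64 bits into an integer register.
   Compilers typically emit a register-to-register move here (the VEX
   form when AVX code generation is enabled), but that is up to them. */
static uint64_t xmm_to_gpr(__m128i v)
{
    return (uint64_t)_mm_cvtsi128_si64(v);
}

/* Round trip GPR -> XMM -> GPR; the value should come back unchanged. */
static uint64_t roundtrip(uint64_t x)
{
    uint64_t y = xmm_to_gpr(gpr_to_xmm_via_stack(x));
    assert(y == x);
    return y;
}
```

Calling roundtrip() on a few values and comparing the generated assembly against a variant that uses _mm_cvtsi64_si128 (the direct GPR-to-XMM move) is one way to see which pattern a given compiler prefers for a given -march setting.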
¹ http://support.amd.com/us/Processor_TechDocs/47414_15h_sw_opt_guide.pdf
-JimC