Michael Weiser <michael.weiser@gmx.de> writes:
> Attached is the new patch that unconditionally switches loads from vldm to vld1.32 but, for stores on little-endian, keeps vstm rather than vst1.8.
Thanks! Applied now.
> From that point of view, taking the slight performance hit for vld1.32 but keeping vstm on LE seems the best compromise, at least for the benchmarked set of machines.
I agree. One could consider having several variants and selecting between them depending on processor flavor. But I don't think that's worth the effort if the difference is just a percent or so.
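For illustration only, selecting a variant by processor flavor could look roughly like the C sketch below. Everything in it is hypothetical: the names cpu_flavor, store_block_a, store_block_b and select_store_block are made up for this example, and the two plain-C functions merely stand in for a vstm-based and a vst1.8-based assembly routine. It is not a description of how any existing code does this.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical processor flavors; a real build would detect the
   flavor once at startup. */
enum cpu_flavor { CPU_GENERIC, CPU_FAVORS_VST1 };

/* Common signature shared by all variants of the routine. */
typedef void store_block_func(uint8_t *dst, const uint32_t *src, size_t n);

/* Stand-in for a vstm-based variant: copies n words in one go. */
static void
store_block_a(uint8_t *dst, const uint32_t *src, size_t n)
{
  memcpy(dst, src, n * sizeof(*src));
}

/* Stand-in for a vst1.8-based variant: copies the same data
   byte by byte. */
static void
store_block_b(uint8_t *dst, const uint32_t *src, size_t n)
{
  const uint8_t *p = (const uint8_t *) src;
  size_t i;
  for (i = 0; i < n * sizeof(*src); i++)
    dst[i] = p[i];
}

/* Pick the variant assumed to be faster on the given flavor. */
static store_block_func *
select_store_block(enum cpu_flavor flavor)
{
  return flavor == CPU_FAVORS_VST1 ? store_block_b : store_block_a;
}
```

Both variants must produce identical output, so a caller only ever sees the function pointer returned by the selector. The cost is maintaining and testing every variant, which is why a gain of a percent or so hardly justifies it.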
> Do you have any idea why the wandboard, tinkerboard and rpi4 show speedups with vst1.8 for one algorithm but slowdowns for the other, and even contradict each other in that? Does it make sense to dig into that some more, or should we leave it be for now?
I'd guess the algorithms differ in the details of how vst1.8 is scheduled, and that's why vst1.8 is more efficient in one and less efficient in the other.
Regards,
/Niels