"NM" == Niels Möller nisse@lysator.liu.se writes:
NM> Thanks for the hint. Maybe I can try that, it sounds like a fairly easy
NM> fix. If I can get the code to run at three instructions per cycle, that
NM> would be a pretty nice speedup on AMD processors.
Indeed.
| Whenever possible, use loads and stores of the same data length. (See
| 6.3, "Store-to-Load Forwarding Restrictions" on page 98 for more
| information.)
NM> Not sure how to interpret this. The interesting cases here are:
In the context of saving a 128-bit xmm register and reading the halves into two 64-bit integer registers, I think it means make sure you use the instruction form that includes the 0x66 prefix octet (which specifies that the 128 bits are two 64-bit values rather than four 32-bit values).
I don't see a 4x32 version of MOVDQA in the original xmm book, just the 2x64, so it shouldn't be an issue for this application. If there were one, you'd want to be sure to use the '66 0F 6F /r' version and not the putative '0F 6F /r' version.
It is more of an issue when dealing with packed floats vs packed doubles. E.g., XORPS and XORPD both do a 128-bit bit-for-bit XOR, but if you use the XORPS version in code otherwise dealing with packed doubles, or vice versa, the pipeline will stall.
There is a similar issue when mixing float or double instructions with non-floating-point loads and stores.
I think that, internally, they use different register files for packed doubles and packed singles. Or, more generally, packed 64-bit-at-a-time vs packed-32-bit-at-a-time. But that is conjecture.
-JimC