Re: Getting closer to nettle-2.6

3 Jan 2013


James Cloos <cloos@jhcloos.com> writes:
...
,----< §10.4, p169 of 47414_15h_sw_opt_guide.pdf¹ >
| Optimization
| 
| When moving data from a GPR to an XMM register, use separate store and
| load instructions to move the data first from the source register to a
| temporary location in memory and then from memory into the destination
| register, taking the memory latency into account when scheduling both
| stages of the load-store sequence.

Thanks for the hint. Maybe I can try that; it sounds like a fairly easy
fix. If I can get the code to run at three instructions per cycle, that
would be a pretty nice speedup on AMD processors.
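
A minimal sketch of the store-then-load route the guide describes, written
in C with SSE2 intrinsics. The helper names are hypothetical and this is not
nettle code; it only illustrates the shape of the sequence (a 64-bit store
from a GPR followed by a 64-bit load into an XMM register). On targets
without SSE2 a plain copy stands in.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper (not from nettle): pass a 64-bit value from a
   general-purpose register to an XMM register through a stack slot,
   then return the low 64 bits of the XMM register so the result is
   observable. */
static uint64_t gpr_to_xmm_via_memory(uint64_t x)
{
    uint64_t slot = x;   /* stage 1: 64-bit store from the GPR */
    uint64_t out;
#if defined(__SSE2__)
    /* stage 2: 64-bit load from the slot into an XMM register */
    __m128i v = _mm_loadl_epi64((const __m128i *)&slot);
    _mm_storel_epi64((__m128i *)&out, v);
#else
    out = slot;          /* portable stand-in when SSE2 is unavailable */
#endif
    return out;
}

/* Sanity check that the value survives the trip. */
static void check_gpr_to_xmm(void)
{
    assert(gpr_to_xmm_via_memory(0x0123456789abcdefULL)
           == 0x0123456789abcdefULL);
}
```

An optimizing compiler is free to replace the stack slot with a direct
register move, so for timing purposes the sequence would have to be written
in assembly, with the store and the load scheduled far enough apart to hide
the memory latency.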
...
| Whenever possible, use loads and stores of the same data length. (See
| 6.3, “Store-to-Load Forwarding Restrictions” on page 98 for more
| information.)

Not sure how to interpret this. The interesting cases here are:

1. Writing the low 64 bits of an xmm register (movq with a memory
   destination) and reading them back into a gpr.
2. Writing a full 128-bit xmm register (movaps) and reading it back into
   two gprs.

And then the opposite direction.
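
To make the two cases concrete, here is a rough sketch in C with SSE2
intrinsics. The function names are hypothetical, and whether the compiler
emits exactly movq or movaps for these intrinsics is an assumption that
would need checking in the generated assembly. Case 1 uses a store and a
load of the same length (64 bits each); case 2 uses one 128-bit store
followed by two 64-bit loads, which is the mismatched-length pattern the
guide's advice seems to be about.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Case 1: 64-bit store of the low half of an XMM register, then one
   64-bit load into a GPR. */
static uint64_t low_half_via_memory(uint64_t lo, uint64_t hi)
{
    uint64_t slot;
#if defined(__SSE2__)
    __m128i v = _mm_set_epi64x((long long)hi, (long long)lo);
    _mm_storel_epi64((__m128i *)&slot, v);   /* 64-bit store */
#else
    (void)hi;
    slot = lo;                               /* portable stand-in */
#endif
    return slot;                             /* 64-bit load */
}

/* Case 2: one aligned 128-bit store of the whole XMM register, then two
   64-bit loads into GPRs. */
static void both_halves_via_memory(uint64_t lo, uint64_t hi, uint64_t out[2])
{
#if defined(__SSE2__)
    union { __m128i v; uint64_t q[2]; } slot;  /* aligned by __m128i */
    _mm_store_si128(&slot.v,
                    _mm_set_epi64x((long long)hi, (long long)lo));
    out[0] = slot.q[0];                        /* low 64 bits */
    out[1] = slot.q[1];                        /* high 64 bits */
#else
    out[0] = lo;                               /* portable stand-in */
    out[1] = hi;
#endif
}
```

The opposite direction would mirror this: either one 64-bit store and one
64-bit load, or two 64-bit stores followed by a single 128-bit load.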

Regards,
/Niels
-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
