Nettle

5 Feb 2004

I used Apple's Shark tool, which displays load stalls, cycle counts,
loop alignment and more. It warned about loading from and storing to
the same memory address within a single PPC970 (G5) "bundle", and that
caused me to reconsider the memory I/O in the loop.

This was with standard compiler flags, so I didn't get any loop
unrolling etc., which means there's probably plenty of headroom for
further tuning. Another thing that may be beneficial is to special-case
aligned memory buffers and read and write 32-bit chunks at a time
instead of XORing individual bytes. That also applies to functions like
memxor().
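
As a rough illustration of the aligned special case, here is a minimal
sketch in C. The name xor_bytes() is hypothetical and this is not
Nettle's actual memxor(); it only assumes that both buffers being 4-byte
aligned is the case worth optimizing, and falls back to byte-wise XOR
otherwise.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not Nettle's memxor): XOR n bytes of src into dst. */
static void xor_bytes(uint8_t *dst, const uint8_t *src, size_t n)
{
    /* Fast path: taken only when both pointers are 32-bit aligned. */
    if ((((uintptr_t)dst | (uintptr_t)src) & 3) == 0) {
        for (; n >= 4; n -= 4, dst += 4, src += 4) {
            uint32_t d, s;
            /* memcpy avoids aliasing problems; compilers normally
               turn these into single 32-bit loads and stores. */
            memcpy(&d, dst, sizeof d);
            memcpy(&s, src, sizeof s);
            d ^= s;
            memcpy(dst, &d, sizeof d);
        }
    }
    /* Tail bytes, or the whole buffer when unaligned. */
    for (; n > 0; n--)
        *dst++ ^= *src++;
}
```

Whether this actually beats the byte loop depends on the CPU and the
compiler flags, so it would need to be measured.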

Ideally, since there are trade-offs for different CPUs even within the
same family (in the case of PPC: 604, G3, G4 (AltiVec), G5 (64-bit)
etc.), the library should pick the best implementation at run-time and
not at compile time. For Pike it's not feasible to require a G4 or G5,
but we still want to use vectorized code when possible.
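
One way run-time selection could look is sketched below in C, using a
function pointer that is resolved on first use. All names here are
hypothetical, and cpu_has_vector_unit() is a stub: real detection is
platform-specific and is deliberately left out, so the sketch always
picks the generic routine.

```c
#include <stddef.h>
#include <stdint.h>

typedef void xor_func(uint8_t *dst, const uint8_t *src, size_t n);

/* Portable fallback that works on every CPU in the family. */
static void xor_generic(uint8_t *dst, const uint8_t *src, size_t n)
{
    while (n-- > 0)
        *dst++ ^= *src++;
}

/* Stand-in for a tuned variant (e.g. AltiVec). It just forwards to
   the generic code so that the sketch runs anywhere. */
static void xor_vector(uint8_t *dst, const uint8_t *src, size_t n)
{
    xor_generic(dst, src, n);
}

/* Hypothetical probe; assumed stub that reports no vector unit. */
static int cpu_has_vector_unit(void)
{
    return 0;
}

static xor_func *select_xor(void)
{
    return cpu_has_vector_unit() ? xor_vector : xor_generic;
}

/* Entry point: picks an implementation on the first call and caches
   it. Not thread-safe as written. */
void xor_dispatch(uint8_t *dst, const uint8_t *src, size_t n)
{
    static xor_func *impl = NULL;
    if (impl == NULL)
        impl = select_xor();
    impl(dst, src, n);
}
```

The cost is one indirect call per invocation, which should be small
compared to the work done on any reasonably sized buffer.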
/ Jonas Walldén
Previous text:
...
2004-02-05 13:05:
Subject: Nettle

Nice, I'll try that. Have you looked at the assembler output? It would
be interesting to know if there's any room for improvement by
handtuning the assembler code.
/ Niels Möller (sharpening the red pen)
