On Fri, Sep 25, 2020 at 7:43 AM Maamoun TK <maamoun.tk@googlemail.com> wrote:
...
I'm not sure where it fits under powerpc64. The code doesn't need any cryptographic extensions, but it depends on vector instructions as well as VSX registers (for the unaligned load and store instructions). So I'd need advice both on the directory hierarchy and compile time configuration, and appropriate runtime tests for fat builds.
The VSX instructions were introduced in Power ISA v2.06, so since you have used the VSX instructions lxvw4x/stxvw4x, the minimum processor you are targeting is POWER7. We can add a new config option like "--enable-power-vsx" that enables this optimization.
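For the runtime test in fat builds, something along these lines might work on Linux (just a sketch; have_vsx/have_power8 are made-up names, AT_HWCAP/PPC_FEATURE_HAS_VSX/PPC_FEATURE2_ARCH_2_07 come from glibc's <sys/auxv.h> headers, and other systems would need their own probing):

#include <sys/auxv.h>

/* Sketch only: VSX (ISA 2.06 / POWER7) is needed for the unaligned
   lxvw4x/stxvw4x loads and stores. */
static int
have_vsx(void)
{
  return (getauxval(AT_HWCAP) & PPC_FEATURE_HAS_VSX) != 0;
}

/* Sketch only: ISA 2.07 (POWER8) is needed for vaddudm/vsubudm. */
static int
have_power8(void)
{
  return (getauxval(AT_HWCAP2) & PPC_FEATURE2_ARCH_2_07) != 0;
}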
I believe the 64-bit adds (vaddudm) and subtracts (vsubudm) require POWER8. POWER7 provides vector unsigned long long (and friends) and the 64-bit loads, but you need POWER8 to do something useful with them.
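For completeness, on POWER8 the 64-bit add collapses to a one-liner. A sketch assuming GCC or Clang with -mcpu=power8 and <altivec.h>; uint64x2_p is just an illustrative typedef, not something Nettle defines:

#include <altivec.h>

typedef vector unsigned long long uint64x2_p;

/* Compiles to a single vaddudm on POWER8 and later. */
inline uint64x2_p
VecAdd64_pwr8(const uint64x2_p vec1, const uint64x2_p vec2)
{
    return vec_add(vec1, vec2);
}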
Or, the 64-bit adds can be performed manually using vector unsigned int with code to manage the carry or borrow. That lets you drop back to POWER4, and ChaCha20 is still profitable:
typedef vector unsigned int uint32x4_p;
inline uint32x4_p VecAdd64(const uint32x4_p vec1, const uint32x4_p vec2)
{
    // The carry mask selects carries for elements 1 and 3 and sets
    // remaining elements to 0. The result is then shifted so the
    // carried values are added to elements 0 and 2.
#if defined(NETTLE_BIG_ENDIAN)
    const uint32x4_p zero = {0, 0, 0, 0};
    const uint32x4_p mask = {0, 1, 0, 1};
#else
    const uint32x4_p zero = {0, 0, 0, 0};
    const uint32x4_p mask = {1, 0, 1, 0};
#endif

    uint32x4_p cy = vec_addc(vec1, vec2);
    uint32x4_p res = vec_add(vec1, vec2);
    cy = vec_and(mask, cy);
    cy = vec_sld(cy, zero, 4);
    return vec_add(res, cy);
}
Here's the core of a subtract:
    uint32x4_p bw = vec_subc(vec1, vec2);
    uint32x4_p res = vec_sub(vec1, vec2);
    bw = vec_andc(mask, bw);
    bw = vec_sld(bw, zero, 4);
    return vec_sub(res, bw);
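Wrapped up the same way as VecAdd64, the whole thing would look something like this (untested sketch, with the same endian-dependent mask/zero setup as above):

inline uint32x4_p VecSub64(const uint32x4_p vec1, const uint32x4_p vec2)
{
    // The borrow mask selects borrows for elements 1 and 3 and sets
    // remaining elements to 0. The result is then shifted so the
    // borrows are subtracted from elements 0 and 2.
#if defined(NETTLE_BIG_ENDIAN)
    const uint32x4_p zero = {0, 0, 0, 0};
    const uint32x4_p mask = {0, 1, 0, 1};
#else
    const uint32x4_p zero = {0, 0, 0, 0};
    const uint32x4_p mask = {1, 0, 1, 0};
#endif

    // vec_subc produces 1 where no borrow occurred, so mask & ~bw
    // (vec_andc) keeps only the lanes that did borrow.
    uint32x4_p bw = vec_subc(vec1, vec2);
    uint32x4_p res = vec_sub(vec1, vec2);
    bw = vec_andc(mask, bw);
    bw = vec_sld(bw, zero, 4);
    return vec_sub(res, bw);
}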
Jeff