Originally, the 16 input words were loaded with 16 individual vector load
instructions. This has the side effect that the last three loads read 1, 2,
or 3 words past the end of the input data.
Fix the overread by replacing the unnecessary overlapped reads with shifts.
As a consequence, the constant registers for offsets 4, 8, and 12 can be
removed, which also gains about 1-2% in performance.
Signed-off-by: Eric Richter <erichte(a)linux.ibm.com>
---
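For readers less familiar with the vector instructions involved, here is a
rough, self-contained C model of the data movement described in the commit
message above (all names are hypothetical; this is not the Nettle code). The
old code issued four 16-byte loads per block at p+0, p+4, p+8 and p+12, so on
the final block the last three loads read 4, 8 or 12 bytes past the end of the
input. The new code issues a single in-bounds load and derives the other three
vectors with vsldoi, which with identical source operands is a left rotation
by the given number of bytes. The sketch only models where the bytes come
from; which lanes of the rotated vectors actually matter is left to the
surrounding round code.

#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[16]; } vec16;   /* toy stand-in for a vector reg */

/* One in-bounds 16-byte load (the lxvw4x at offset 0). */
static vec16 load16(const uint8_t *p)
{
  vec16 v;
  memcpy(v.b, p, 16);
  return v;
}

/* vsldoi v,v,v,n with identical operands: rotate the 16 bytes left by n. */
static vec16 rot_left_bytes(vec16 v, unsigned n)
{
  vec16 r;
  for (unsigned i = 0; i < 16; i++)
    r.b[i] = v.b[(i + n) % 16];
  return r;
}

/* Old: three further loads at p+4, p+8, p+12, overreading on the last block.
   New: derive those three vectors from the single in-bounds load instead. */
static void load_block(const uint8_t *p, vec16 iv[4])
{
  iv[0] = load16(p);
  iv[1] = rot_left_bytes(iv[0], 4);
  iv[2] = rot_left_bytes(iv[0], 8);
  iv[3] = rot_left_bytes(iv[0], 12);
}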
So this was supposed to be a simple update, removing those eight unnecessary
instructions. However, removing the GPR store instructions introduced a
performance regression of roughly -5% -- more performance lost than gained in
the previous version.
After a lot of trial and error, I've landed on this version, which includes
yet another align, this time before the register storage section. I tried
placing it in other locations and changing the align value; this is the most
performant version I could find through brute force. Moving the load-immediate
for TC16 to before the register save section marginally improved performance
(<1%) and, less technically, looks cleaner than doing:
ALIGN(16)
li TC16, 16
ALIGN(16)
lxvw4x
lxvw4x
When I have more time, I still intend to do a deeper dive and try to find
out exactly why these aligns help or hurt performance. At minimum I'd like to
send a patch with better comments documenting the logic, so future
implementations can learn from it.
v2:
- removed store/restore of now unused r14/r15
- moved the initial load of TC16 to be with other load immediates
- changed .align to use the ALIGN macro
- added an ALIGN before the register saving stores
- added comments to ALIGN
powerpc64/p8/sha256-compress-n.asm | 44 +++++++++++-------------------
1 file changed, 16 insertions(+), 28 deletions(-)
diff --git a/powerpc64/p8/sha256-compress-n.asm b/powerpc64/p8/sha256-compress-n.asm
index e08ae132..75666deb 100644
--- a/powerpc64/p8/sha256-compress-n.asm
+++ b/powerpc64/p8/sha256-compress-n.asm
@@ -44,10 +44,7 @@ define(`T1', `r8')
define(`TK', `r9')
define(`COUNT', `r10')
define(`TC0', `0') C Index instructions allow literal 0 instead of a GPR
-define(`TC4', `r11')
-define(`TC8', `r12')
-define(`TC12', `r14')
-define(`TC16', `r15')
+define(`TC16', `r11')
C State registers
define(`VSA', `v0')
@@ -187,24 +184,24 @@ define(`LOAD', `
define(`DOLOADS', `
IF_LE(`DATA_LOAD_VEC(VT0, .load_swap, T1)')
LOAD(0, TC0)
- LOAD(1, TC4)
- LOAD(2, TC8)
- LOAD(3, TC12)
+ vsldoi IV(1), IV(0), IV(0), 4
+ vsldoi IV(2), IV(0), IV(0), 8
+ vsldoi IV(3), IV(0), IV(0), 12
addi INPUT, INPUT, 16
LOAD(4, TC0)
- LOAD(5, TC4)
- LOAD(6, TC8)
- LOAD(7, TC12)
+ vsldoi IV(5), IV(4), IV(4), 4
+ vsldoi IV(6), IV(4), IV(4), 8
+ vsldoi IV(7), IV(4), IV(4), 12
addi INPUT, INPUT, 16
LOAD(8, TC0)
- LOAD(9, TC4)
- LOAD(10, TC8)
- LOAD(11, TC12)
+ vsldoi IV(9), IV(8), IV(8), 4
+ vsldoi IV(10), IV(8), IV(8), 8
+ vsldoi IV(11), IV(8), IV(8), 12
addi INPUT, INPUT, 16
LOAD(12, TC0)
- LOAD(13, TC4)
- LOAD(14, TC8)
- LOAD(15, TC12)
+ vsldoi IV(13), IV(12), IV(12), 4
+ vsldoi IV(14), IV(12), IV(12), 8
+ vsldoi IV(15), IV(12), IV(12), 12
addi INPUT, INPUT, 16
')
@@ -216,6 +213,8 @@ PROLOGUE(_nettle_sha256_compress_n)
C Store non-volatile registers
+ ALIGN(16) C Appears necessary for optimal stores
+ li TC16, 16
li T0, -16
li T1, -32
stvx v20, T0, SP
@@ -240,15 +239,8 @@ PROLOGUE(_nettle_sha256_compress_n)
subi T1, T1, 32
stvx v30, T0, SP
stvx v31, T1, SP
- subi T0, T0, 32
- subi T1, T1, 32
- stdx r14, T0, SP
- stdx r15, T1, SP
- li TC4, 4
- li TC8, 8
- li TC12, 12
- li TC16, 16
+ ALIGN(16) C Appears necessary for optimal loads
C Load state values
lxvw4x VSR(VSA), 0, STATE C VSA contains A,B,C,D
@@ -345,10 +337,6 @@ PROLOGUE(_nettle_sha256_compress_n)
subi T1, T1, 32
lvx v30, T0, SP
lvx v31, T1, SP
- subi T0, T0, 32
- subi T1, T1, 32
- ldx r14, T0, SP
- ldx r15, T1, SP
.done:
mr r3, INPUT
--
2.46.0
Originally, the 16 input words were loaded with 16 individual vector load
instructions. This has the side effect that the last three loads read 1, 2,
or 3 words past the end of the input data.
Fix the overread by replacing the unnecessary overlapped reads with shifts.
As a consequence, the constant registers for offsets 4, 8, and 12 can be
removed, which also gains about 1-2% in performance.
Signed-off-by: Eric Richter <erichte(a)linux.ibm.com>
---
Some additional comments, since I agree that the loads should be fixed, but
a bit of a performance anomaly appeared:
I thought I had tested this exact method in a previous iteration of the
patches, and saw either no change or a minor performance hit. This time
around I am seeing a slight performance improvement:
(This is output from a script that compares 10-iteration benchmark runs; the
values in parentheses are the difference from just changing the loads to
shifts.)
alg/mode                    min            avg            max
====================== ============== ============== ==============
sha256 update          460.64 (+2.45) 467.99 (+6.63) 468.89 (+7.13)
hmac-sha256 64 bytes   126.39 (+0.94) 126.86 (+0.14) 127.42 (+0.08)
hmac-sha256 256 bytes  272.14 (+2.97) 273.07 (+2.07) 273.60 (+1.44)
hmac-sha256 1024 bytes 391.52 (+2.00) 393.17 (+2.03) 395.58 (+3.13)
hmac-sha256 4096 bytes 440.61 (+3.22) 442.46 (+4.50) 444.15 (+5.73)
hmac-sha256 single msg 454.75 (+1.23) 457.51 (+3.85) 459.70 (+5.91)
However, removing the load-immediates for the now unused TC registers hurts
performance by about the amount gained from using the shifts:
Adding this change:
- li TC4, 4
- li TC8, 8
- li TC12, 12
li TC16, 16
yields the following performance values (parens are difference against the
code that did not remove the li's but did use shifts):
alg/mode                    min            avg            max
====================== ============== ============== ==============
sha256 update          462.47 (+1.83) 462.75 (-5.24) 462.87 (-6.02)
hmac-sha256 64 bytes   126.01 (-0.38) 126.69 (-0.17) 127.00 (-0.42)
hmac-sha256 256 bytes  270.13 (-2.01) 271.62 (-1.44) 272.99 (-0.61)
hmac-sha256 1024 bytes 390.79 (-0.73) 391.84 (-1.33) 393.04 (-2.54)
hmac-sha256 4096 bytes 438.60 (-2.01) 438.83 (-3.63) 439.30 (-4.85)
hmac-sha256 single msg 454.52 (-0.23) 454.63 (-2.88) 454.76 (-4.94)
If I add an align after the TC16 load:
li TC16, 16
+ .align 4
C Load state values
lxvw4x VSR(VSA), 0, STATE C VSA contains A,B,C,D
lxvw4x VSR(VSE), TC16, STATE C VSE contains E,F,G,H
performance comes back to about where it was before removing the TC
load-immediates (parens show the difference against the previous run):
alg/mode                    min            avg            max
====================== ============== ============== ==============
sha256 update          468.29 (+5.82) 468.74 (+5.99) 468.86 (+5.99)
hmac-sha256 64 bytes   126.51 (+0.50) 127.26 (+0.57) 127.78 (+0.78)
hmac-sha256 256 bytes  273.17 (+3.04) 274.51 (+2.89) 276.24 (+3.25)
hmac-sha256 1024 bytes 393.27 (+2.48) 395.18 (+3.34) 397.87 (+4.83)
hmac-sha256 4096 bytes 441.08 (+2.48) 443.21 (+4.38) 445.09 (+5.79)
hmac-sha256 single msg 454.41 (-0.11) 458.30 (+3.67) 460.82 (+6.06)
I suspect that with four li instructions, those are issued 4x in parallel, and
the subsequent (slower) lxvw4x instructions are then queued 2x. With the other
three li instructions removed, the remaining li is queued with the first
lxvw4x but not the second, causing a stall: the second lxvw4x has to wait for
the li + lxvw4x pair ahead of it, since it depends on the li (which sets TC16)
completing first.
This is only my running theory, but unfortunately I don't currently have
the bandwidth to do a deep dive for a <2% performance mystery (that said,
I do intend to return to this when I get the chance, I suspect other areas
of the code could also use a similar level of scrutiny). Do you have any
thoughts or advice on this, or is the proposed approach acceptable for now?
Additional note: I also tried rearranging the LOAD macros with the shifts, as
well as moving around the requisite byte-swap vperms, but did not see any
performance benefit. Doing the load, vperm, shift, addi in that order appears
to be the fastest.
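To make that ordering concrete, here is a minimal self-contained C sketch of
the per-block sequence (load, byte swap, shifts, pointer advance). All names
are hypothetical and the vperm permutation is not reproduced; the byte swap is
assumed here to be a per-32-bit-word reversal applied only on little-endian,
which is what the .load_swap constant feeds in the real code.

#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[16]; } vec16;   /* toy stand-in for a vector reg */

/* vsldoi v,v,v,n analogue: rotate the 16 bytes left by n. */
static vec16 rot_left_bytes(vec16 v, unsigned n)
{
  vec16 r;
  for (unsigned i = 0; i < 16; i++)
    r.b[i] = v.b[(i + n) % 16];
  return r;
}

/* Assumed model of the byte-swap vperm: reverse the bytes of each word. */
static vec16 bswap_words(vec16 v)
{
  vec16 r;
  for (unsigned w = 0; w < 4; w++)
    for (unsigned i = 0; i < 4; i++)
      r.b[4*w + i] = v.b[4*w + 3 - i];
  return r;
}

/* Per-block order reported fastest: load, vperm, shifts, advance INPUT. */
static void do_loads(const uint8_t *input, vec16 iv[16], int little_endian)
{
  for (unsigned blk = 0; blk < 4; blk++) {
    vec16 v;
    memcpy(v.b, input, 16);                  /* lxvw4x (in bounds)      */
    if (little_endian)
      v = bswap_words(v);                    /* vperm byte swap         */
    iv[4*blk + 0] = v;
    iv[4*blk + 1] = rot_left_bytes(v, 4);    /* vsldoi by 4, 8, 12      */
    iv[4*blk + 2] = rot_left_bytes(v, 8);
    iv[4*blk + 3] = rot_left_bytes(v, 12);
    input += 16;                             /* addi INPUT, INPUT, 16   */
  }
}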
powerpc64/p8/sha256-compress-n.asm | 33 +++++++++++++-----------------
1 file changed, 14 insertions(+), 19 deletions(-)
diff --git a/powerpc64/p8/sha256-compress-n.asm b/powerpc64/p8/sha256-compress-n.asm
index e08ae132..d3416e2a 100644
--- a/powerpc64/p8/sha256-compress-n.asm
+++ b/powerpc64/p8/sha256-compress-n.asm
@@ -44,10 +44,7 @@ define(`T1', `r8')
define(`TK', `r9')
define(`COUNT', `r10')
define(`TC0', `0') C Index instructions allow literal 0 instead of a GPR
-define(`TC4', `r11')
-define(`TC8', `r12')
-define(`TC12', `r14')
-define(`TC16', `r15')
+define(`TC16', `r11')
C State registers
define(`VSA', `v0')
@@ -187,24 +184,24 @@ define(`LOAD', `
define(`DOLOADS', `
IF_LE(`DATA_LOAD_VEC(VT0, .load_swap, T1)')
LOAD(0, TC0)
- LOAD(1, TC4)
- LOAD(2, TC8)
- LOAD(3, TC12)
+ vsldoi IV(1), IV(0), IV(0), 4
+ vsldoi IV(2), IV(0), IV(0), 8
+ vsldoi IV(3), IV(0), IV(0), 12
addi INPUT, INPUT, 16
LOAD(4, TC0)
- LOAD(5, TC4)
- LOAD(6, TC8)
- LOAD(7, TC12)
+ vsldoi IV(5), IV(4), IV(4), 4
+ vsldoi IV(6), IV(4), IV(4), 8
+ vsldoi IV(7), IV(4), IV(4), 12
addi INPUT, INPUT, 16
LOAD(8, TC0)
- LOAD(9, TC4)
- LOAD(10, TC8)
- LOAD(11, TC12)
+ vsldoi IV(9), IV(8), IV(8), 4
+ vsldoi IV(10), IV(8), IV(8), 8
+ vsldoi IV(11), IV(8), IV(8), 12
addi INPUT, INPUT, 16
LOAD(12, TC0)
- LOAD(13, TC4)
- LOAD(14, TC8)
- LOAD(15, TC12)
+ vsldoi IV(13), IV(12), IV(12), 4
+ vsldoi IV(14), IV(12), IV(12), 8
+ vsldoi IV(15), IV(12), IV(12), 12
addi INPUT, INPUT, 16
')
@@ -245,10 +242,8 @@ PROLOGUE(_nettle_sha256_compress_n)
stdx r14, T0, SP
stdx r15, T1, SP
- li TC4, 4
- li TC8, 8
- li TC12, 12
li TC16, 16
+ .align 4
C Load state values
lxvw4x VSR(VSA), 0, STATE C VSA contains A,B,C,D
--
2.46.0