Hi,
I've been trying out the sha_ni instructions available on some newer x86_64 processors.
The replacement for sha1-compress.asm below seems to run at roughly 2 cycles/byte when I benchmark it on an "AMD Ryzen 7 1700X" CPU in the gcc compile farm. That is still slightly slower than openssl. To squeeze out a few more cycles, it might help to change the sha1_compress interface to let it process more than one 64-byte block at a time.
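As a rough aid for reading the cycles/byte figure: 2 cycles/byte corresponds to about 128 cycles per 64-byte block, and throughput then scales with the clock frequency. The sketch below only does that arithmetic. The function names and the 3.4 GHz clock are assumptions picked for illustration, not values measured in this thread.

```python
BLOCK_SIZE = 64  # bytes consumed by one sha1_compress call


def cycles_per_block(cycles_per_byte):
    """Cycles spent on one 64-byte block at a given cycles/byte rate."""
    return cycles_per_byte * BLOCK_SIZE


def throughput_bytes_per_second(cycles_per_byte, clock_hz):
    """Bytes hashed per second, assuming a fixed (hypothetical) clock."""
    return clock_hz / cycles_per_byte


if __name__ == "__main__":
    # 3.4e9 Hz is an assumed clock, used only to make the example concrete.
    print(cycles_per_block(2.0))
    print(throughput_bytes_per_second(2.0, 3.4e9))
```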
I hope to be able to wire it up via fat-x86_64.c reasonably soon. In the mean time, if anyone wants to try it out, just change the sha1-compress.asm symlink to point to this file.
Regards, /Niels
-----8<--------
C x86_64/sha_ni/sha1-compress.asm
ifelse(< Copyright (C) 2018 Niels Möller
This file is part of GNU Nettle.
GNU Nettle is free software: you can redistribute it and/or modify it under the terms of either:
* the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.
or
* the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
or both in parallel, as here.
GNU Nettle is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received copies of the GNU General Public License and the GNU Lesser General Public License along with this program. If not, see http://www.gnu.org/licenses/.
>)
C Register usage.
C Arguments
define(<STATE>,<%rdi>)dnl
define(<INPUT>,<%rsi>)dnl

define(<MSG0>,<%xmm0>)
define(<MSG1>,<%xmm1>)
define(<MSG2>,<%xmm2>)
define(<MSG3>,<%xmm3>)
define(<ABCD>,<%xmm4>)
define(<E0>,<%xmm5>)
define(<E1>,<%xmm6>)
define(<ABCD_ORIG>, <%xmm7>)
define(<E_ORIG>, <%xmm8>)
define(<SWAP_MASK>,<%xmm9>)

C QROUND(M0, M1, M2, M3, E0, E1, TYPE)
define(<QROUND>, <
        sha1nexte $1, $5
        movdqa ABCD, $6
        sha1msg2 $1, $2
        sha1rnds4 <$>$7, $5, ABCD
        sha1msg1 $1, $4
        pxor $1, $3
>)

        .file "sha1-compress.asm"

        C _nettle_sha1_compress(uint32_t *state, uint8_t *input)

        .text
        ALIGN(16)
.Lswap_mask:
        .byte 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0
PROLOGUE(_nettle_sha1_compress)
        C save all registers that need to be saved
        W64_ENTRY(2, 10)
        movups (STATE), ABCD
        movd 16(STATE), E0
        movups (INPUT), MSG0
        movdqa .Lswap_mask(%rip), SWAP_MASK
        pshufd $0x1b, ABCD, ABCD
        pshufd $0x1b, E0, E0
        movdqa ABCD, ABCD_ORIG
        movdqa E0, E_ORIG
        pshufb SWAP_MASK, MSG0

        paddd MSG0, E0
        movdqa ABCD, E1
        sha1rnds4 $0, E0, ABCD  C Rounds 0-3

        movups 16(INPUT), MSG1
        pshufb SWAP_MASK, MSG1

        sha1nexte MSG1, E1
        movdqa ABCD, E0
        sha1rnds4 $0, E1, ABCD  C Rounds 4-7
        sha1msg1 MSG1, MSG0

        movups 32(INPUT), MSG2
        pshufb SWAP_MASK, MSG2

        sha1nexte MSG2, E0
        movdqa ABCD, E1
        sha1rnds4 $0, E0, ABCD  C Rounds 8-11
        sha1msg1 MSG2, MSG1
        pxor MSG2, MSG0

        movups 48(INPUT), MSG3
        pshufb SWAP_MASK, MSG3

        QROUND(MSG3, MSG0, MSG1, MSG2, E1, E0, 0) C Rounds 12-15
        QROUND(MSG0, MSG1, MSG2, MSG3, E0, E1, 0) C Rounds 16-19

        QROUND(MSG1, MSG2, MSG3, MSG0, E1, E0, 1) C Rounds 20-23
        QROUND(MSG2, MSG3, MSG0, MSG1, E0, E1, 1) C Rounds 24-27
        QROUND(MSG3, MSG0, MSG1, MSG2, E1, E0, 1) C Rounds 28-31
        QROUND(MSG0, MSG1, MSG2, MSG3, E0, E1, 1) C Rounds 32-35
        QROUND(MSG1, MSG2, MSG3, MSG0, E1, E0, 1) C Rounds 36-39

        QROUND(MSG2, MSG3, MSG0, MSG1, E0, E1, 2) C Rounds 40-43
        QROUND(MSG3, MSG0, MSG1, MSG2, E1, E0, 2) C Rounds 44-47
        QROUND(MSG0, MSG1, MSG2, MSG3, E0, E1, 2) C Rounds 48-51
        QROUND(MSG1, MSG2, MSG3, MSG0, E1, E0, 2) C Rounds 52-55
        QROUND(MSG2, MSG3, MSG0, MSG1, E0, E1, 2) C Rounds 56-59

        QROUND(MSG3, MSG0, MSG1, MSG2, E1, E0, 3) C Rounds 60-63
        QROUND(MSG0, MSG1, MSG2, MSG3, E0, E1, 3) C Rounds 64-67

        sha1nexte MSG1, E1
        movdqa ABCD, E0
        sha1msg2 MSG1, MSG2
        sha1rnds4 $3, E1, ABCD  C Rounds 68-71
        pxor MSG1, MSG3

        sha1nexte MSG2, E0
        movdqa ABCD, E1
        sha1msg2 MSG2, MSG3
        sha1rnds4 $3, E0, ABCD  C Rounds 72-75

        sha1nexte MSG3, E1
        movdqa ABCD, E0
        sha1rnds4 $3, E1, ABCD  C Rounds 76-79

        sha1nexte E_ORIG, E0
        paddd ABCD_ORIG, ABCD

        pshufd $0x1b, ABCD, ABCD
        movups ABCD, (STATE)
        pshufd $0x1b, E0, E0
        movd E0, 16(STATE)

        W64_EXIT(2, 10)
        ret
EPILOGUE(_nettle_sha1_compress)
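
For anyone who wants to cross-check the assembly, a plain reference for what one sha1_compress call computes may help. The sketch below is a straightforward Python rendering of the SHA-1 block function as specified in FIPS 180-4, not a translation of the sha_ni code. The names sha1_compress, sha1_single_block and SHA1_IV are chosen here for illustration. The four branches in the round loop correspond to the four 20-round groups that the assembly selects with the sha1rnds4 immediate (the TYPE argument 0 to 3 of QROUND), and the big-endian unpack plays the role of the pshufb byte swap.

```python
import hashlib
import struct

SHA1_IV = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0]


def rotl32(x, n):
    """Rotate a 32-bit value left by n bits."""
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF


def sha1_compress(state, block):
    """Return the new 5-word state after processing one 64-byte block."""
    assert len(state) == 5 and len(block) == 64
    # Message words are read big-endian, then expanded to 80 words.
    w = list(struct.unpack(">16I", block))
    for t in range(16, 80):
        w.append(rotl32(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
    a, b, c, d, e = state
    for t in range(80):
        if t < 20:
            f, k = (b & c) | (~b & d), 0x5A827999
        elif t < 40:
            f, k = b ^ c ^ d, 0x6ED9EBA1
        elif t < 60:
            f, k = (b & c) | (b & d) | (c & d), 0x8F1BBCDC
        else:
            f, k = b ^ c ^ d, 0xCA62C1D6
        temp = (rotl32(a, 5) + f + e + k + w[t]) & 0xFFFFFFFF
        a, b, c, d, e = temp, a, rotl32(b, 30), c, d
    # Feed-forward: add the working variables back into the old state.
    return [(s + v) & 0xFFFFFFFF for s, v in zip(state, (a, b, c, d, e))]


def sha1_single_block(msg):
    """SHA-1 of a message short enough (<= 55 bytes) to fit one block."""
    assert len(msg) <= 55
    padded = (msg + b"\x80" + b"\x00" * (55 - len(msg))
              + struct.pack(">Q", 8 * len(msg)))
    return struct.pack(">5I", *sha1_compress(SHA1_IV, padded))


if __name__ == "__main__":
    print(sha1_single_block(b"abc") == hashlib.sha1(b"abc").digest())
```

Comparing against hashlib for short messages gives a simple oracle for single-block inputs; longer inputs would need the usual multi-block padding loop around sha1_compress.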
nisse@lysator.liu.se (Niels Möller) writes:
The replacement for sha1-compress.asm below seems to run at roughly 2 cycles/byte when I benchmark it on an "AMD Ryzen 7 1700X" CPU in the gcc compile farm. That is still slightly slower than openssl. To squeeze out a few more cycles, it might help to change the sha1_compress interface to let it process more than one 64-byte block at a time.
I hope to be able to wire it up via fat-x86_64.c reasonably soon. In the mean time, if anyone wants to try it out, just change the sha1-compress.asm symlink to point to this file.
Enabled via fat-x86_64 now, and pushed to a branch named x86_64-sha_ni-sha1.
I intend to merge to master soon.
Testing and benchmarking appreciated.
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
I've been trying out the sha_ni instructions available on some newer x86_64 processors.
And now that the gcc67 machine is up again, I got my sha256 implementation working too. Pushed to branch x86_64-sha_ni-sha256.
Not yet wired up in fat builds, but can be tested with --enable-x86-sha-ni to configure.
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
nisse@lysator.liu.se (Niels Möller) writes:
I've been trying out the sha_ni instructions available on some newer x86_64 processors.
And now that the gcc67 machine is up again, I got my sha256 implementation working too. Pushed to branch x86_64-sha_ni-sha256.
Not yet wired up in fat builds, but can be tested with --enable-x86-sha-ni to configure.
Now wired up for fat builds, changes pushed to the same branch.
Regards, /Niels
On 13/03/18 07:40, Niels Möller wrote:
nisse (Niels Möller) writes:
nisse (Niels Möller) writes:
I've been trying out the sha_ni instructions available on some newer x86_64 processors.
And now that the gcc67 machine is up again, I got my sha256 implementation working too. Pushed to branch x86_64-sha_ni-sha256.
Not yet wired up in fat builds, but can be tested with --enable-x86-sha-ni to configure.
Now wired up for fat builds, changes pushed to the same branch.
Regards, /Niels
I have a new machine with Intel KabyLake CPU + GPU which apparently has AES and related crypto support available. Running Debian sid with GCC-6, 7, and 8 all available.
Is there anything you would like in the way of tests or benchmarking done with this hardware and environment? Just let me know what build and/or test commands you want run, and on which git branch.
AYJ
Amos Jeffries squid3@treenet.co.nz writes:
Is there anything you would like in the way of tests or benchmarking done with this hardware and environment? Just let me know what build and/or test commands you want run, and on which git branch.
It would be nice if you could verify the code on branch x86_64-sha_ni-sha256. Build with and without --enable-fat (and if you don't want to mess with setting LD_LIBRARY_PATH=.lib, I'd recommend also using --disable-shared).
Run make check and
NETTLE_FAT_VERBOSE=1 ./examples/nettle-benchmark
and see if the results look right (NETTLE_FAT_VERBOSE naturally has an effect only in fat builds).
If you like, also compare the performance with the nettle-3.4 release.
Regards, /Niels
On 13/03/18 08:44, Jeffrey Walton wrote:
Check /proc/cpuinfo for the sha_ni flag. If present, then you can test the SHA extensions.
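The check can be scripted. The following is a minimal sketch that assumes a Linux system exposing /proc/cpuinfo with a "flags" line, as described above; the helper names cpu_flags and has_sha_ni are made up for this example.

```python
def cpu_flags(cpuinfo_text):
    """Collect the flags listed on 'flags' lines of /proc/cpuinfo text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def has_sha_ni(cpuinfo_text):
    """True if the sha_ni flag is listed among the CPU flags."""
    return "sha_ni" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("sha_ni available:", has_sha_ni(f.read()))
    except OSError:
        print("/proc/cpuinfo not readable on this system")
```

Only the "flags" lines are inspected, so a stray "sha_ni" in another field such as the model name would not count.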
SHA extensions made their debut in Goldmont. They are also available in Goldmont+. They were scheduled for one of the lakes but they did not make it in.
I have a Goldmont machine for testing SHA but it is a turd. It is a Celeron J3455 (https://www.amazon.com/dp/B01LYCDG4H).
Jeff
Ah, okay. That Goldmont info matches my /proc/cpuinfo. No "sha_ni" listed :-(, just the aes-ni instruction set.
AYJ
On Mon, Mar 12, 2018 at 4:23 PM, Amos Jeffries squid3@treenet.co.nz wrote:
On 13/03/18 08:44, Jeffrey Walton wrote:
Check /proc/cpuinfo for the sha_ni flag. If present, then you can test the SHA extensions.
SHA extensions made their debut in Goldmont. They are also available in Goldmont+. They were scheduled for one of the lakes but they did not make it in.
I have a Goldmont machine for testing SHA but it is a turd. It is a Celeron J3455 (https://www.amazon.com/dp/B01LYCDG4H).
Ah, okay. That Goldmont info matches my /proc/cpuinfo. No "sha_ni" listed :-(, just the aes-ni instruction set.
Yeah, if I recall correctly, SHA was supposed to be in Kaby Lake. It looks like it slipped, and SHA was added to the non-turd machines at Cannon Lake. Also see https://en.wikipedia.org/wiki/Cannon_Lake_(microarchitecture)
Jeff
On Mon, Mar 12, 2018 at 2:40 PM, Niels Möller nisse@lysator.liu.se wrote:
nisse@lysator.liu.se (Niels Möller) writes: ...
Now wired up for fat builds, changes pushed to the same branch.
Looks good on a Celeron J3455 (https://www.amazon.com/dp/B01LYCDG4H):
Without --enable-fat

md2            update     6.88
md4            update   570.47
md5            update   383.59
openssl md5    update   444.94
sha1           update   238.53
openssl sha1   update  1323.53
sha224         update   110.07
sha256         update   110.25
sha384         update   173.90
sha512         update   174.35
sha512-224     update   174.30
sha512-256     update   174.08

With --enable-fat

md2            update     6.89
md4            update   569.68
md5            update   382.82
openssl md5    update   444.76
sha1           update  1192.25
openssl sha1   update  1324.47
sha224         update   494.33
sha256         update   495.22
sha384         update   173.87
sha512         update   174.33
Jeff
Jeffrey Walton noloader@gmail.com writes:
On Mon, Mar 12, 2018 at 2:40 PM, Niels Möller nisse@lysator.liu.se wrote:
nisse@lysator.liu.se (Niels Möller) writes: ...
Now wired up for fat builds, changes pushed to the same branch.
Looks good on a Celeron J3455 (https://www.amazon.com/dp/B01LYCDG4H):
Without --enable-fat

md2            update     6.88
md4            update   570.47
md5            update   383.59
openssl md5    update   444.94
sha1           update   238.53
openssl sha1   update  1323.53
sha224         update   110.07
sha256         update   110.25
sha384         update   173.90
sha512         update   174.35
sha512-224     update   174.30
sha512-256     update   174.08

With --enable-fat

md2            update     6.89
md4            update   569.68
md5            update   382.82
openssl md5    update   444.76
sha1           update  1192.25
openssl sha1   update  1324.47
sha224         update   494.33
sha256         update   495.22
sha384         update   173.87
sha512         update   174.33
So you get a speedup of about 5 times for sha1 and 4.5 times for sha256. Nice!
On gcc67 (AMD Ryzen 5 2400G), I measure 3 times and 4.8 times speedup, respectively.
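The Celeron J3455 ratios can be recomputed directly from the benchmark lines quoted above. This is a small sketch: the parser assumes every entry has the shape "name update value" as in that output, and parse_benchmark and speedup are names invented for the example. Only the sha1 and sha256 rows from the thread are included.

```python
def parse_benchmark(text):
    """Parse 'name update value' entries from benchmark output text."""
    results, name = {}, []
    tokens = iter(text.split())
    for tok in tokens:
        if tok == "update":
            # The token after 'update' is the measured speed.
            results[" ".join(name)] = float(next(tokens))
            name = []
        else:
            name.append(tok)
    return results


def speedup(before, after, algorithm):
    """Ratio of the 'after' speed to the 'before' speed for one algorithm."""
    return after[algorithm] / before[algorithm]


# Rows copied from the Celeron J3455 results in this thread.
plain = parse_benchmark(
    "sha1 update 238.53 openssl sha1 update 1323.53 sha256 update 110.25")
fat = parse_benchmark(
    "sha1 update 1192.25 openssl sha1 update 1324.47 sha256 update 495.22")

if __name__ == "__main__":
    for alg in ("sha1", "sha256"):
        print(alg, round(speedup(plain, fat, alg), 1))
```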
Now, I think there are also opportunities to improve sha1 and sha256 without sha_ni, but that's a more difficult project: it requires carefully taking data dependencies into account and dealing with hard-to-predict x86 scheduling.
Regards, /Niels
nettle-bugs@lists.lysator.liu.se