I made a merge request in the main repo that enables optimized GHASH on the AArch64 architecture. The implementation is based on Niels Möller's enhanced algorithm, which yields a larger speedup on AArch64 than the Intel algorithm. Combining the Karatsuba algorithm with the Intel algorithm added overhead, so I dropped its benchmark result. I'll attach the Intel-algorithm implementation here since it's not included in the MR.
Here is the benchmark result on AArch64:
*---------------------------------------------------------------------*
| C version   | Intel algorithm | Niels Möller's enhanced algorithm   |
| 208 Mbyte/s | 2781 Mbyte/s    | 3255 Mbyte/s                        |
*---------------------------------------------------------------------*
This is a +17% performance boost for the enhanced algorithm over the Intel algorithm. It's not as impressive as the PowerPC benchmark result, but it does a great job on AArch64 considering that the PMULL instruction doesn't offer the assistance that vpmsumd does by multiplying four polynomials and summing the products.
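To illustrate that difference, here is a small sketch of my own using compiler intrinsics (not code from the MR, and the exact folding in the patch differs): on POWER8 one vpmsumd already multiplies both doubleword pairs and XORs the two products, while on AArch64 the same step needs a pmull, a pmull2 and an eor.

/* Sketch only: one vpmsumd on POWER8 vs. the pmull/pmull2/eor sequence
   needed on AArch64 for "multiply two doubleword pairs and XOR the
   products", the basic folding step in GHASH-style code. */
#if defined(__aarch64__) && defined(__ARM_FEATURE_CRYPTO)
#include <arm_neon.h>          /* build with -march=armv8-a+crypto */

static inline uint64x2_t
clmul_fold(poly64x2_t a, poly64x2_t b)
{
  poly128_t lo = vmull_p64(vgetq_lane_p64(a, 0), vgetq_lane_p64(b, 0)); /* pmull  */
  poly128_t hi = vmull_high_p64(a, b);                                  /* pmull2 */
  return veorq_u64(vreinterpretq_u64_p128(lo),
                   vreinterpretq_u64_p128(hi));                         /* eor    */
}

#elif defined(__PPC64__) && defined(__POWER8_VECTOR__)
#include <altivec.h>           /* build with -mcpu=power8 */

static inline vector unsigned long long
clmul_fold(vector unsigned long long a, vector unsigned long long b)
{
  /* One instruction: both 64x64 carry-less products, already XORed. */
  return __builtin_crypto_vpmsumd(a, b);
}
#endif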
I tried to avoid using the stack in this implementation, so I wrote a procedure that handles leftovers using only registers; let me know if there's room for improvement here.
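For readers who don't want to trace the Lmod path in the attached assembly, here is my reading of that leftover handling expressed as a C sketch (the names are mine and the rev64 byte swapping done in the assembly is left out):

#include <stdint.h>
#include <string.h>

/* Pack the trailing (length % 16) bytes into a zero-padded 16-byte block
   without going through a stack buffer: an optional full 8-byte word is
   loaded directly, and the remaining 1-7 bytes are shifted into a single
   64-bit register, most significant byte first, as in Lmod_8_loop. */
static void
pack_leftover(uint64_t block[2], const uint8_t *data, size_t length)
{
  size_t left = length & 15;
  uint64_t w = 0;
  unsigned shift = 64;

  block[0] = block[1] = 0;

  if (left & 8)
    {
      memcpy(&block[0], data, 8);   /* the asm uses a single ldr for this */
      data += 8;
    }

  for (size_t i = 0; i < (left & 7); i++)
    {
      shift -= 8;
      w |= (uint64_t) data[i] << shift;
    }

  if (left & 8)
    block[1] = w;                   /* partial word goes in the high half */
  else
    block[0] = w;
}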
regards, Mamone
C arm/v8/gcm-hash.asm
ifelse(`
   Copyright (C) 2020 Niels Möller and Mamone Tarsha

   This file is part of GNU Nettle.

   GNU Nettle is free software: you can redistribute it and/or
   modify it under the terms of either:

     * the GNU Lesser General Public License as published by the Free
       Software Foundation; either version 3 of the License, or (at your
       option) any later version.

   or

     * the GNU General Public License as published by the Free
       Software Foundation; either version 2 of the License, or (at your
       option) any later version.

   or both in parallel, as here.

   GNU Nettle is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   General Public License for more details.

   You should have received copies of the GNU General Public License and
   the GNU Lesser General Public License along with this program.  If
   not, see http://www.gnu.org/licenses/.
')
C gcm_set_key() assigns H value in the middle element of the table
define(`H_Idx', `128')
.file "gcm-hash.asm"
.text
C void gcm_init_key (union gcm_block *table)
C This function populates the gcm table as the following layout
C *******************************************************************************
C | H1M = (H1 div x⁶⁴)||((H1 mod x⁶⁴) × (x⁶⁴+x⁶³+x⁶²+x⁵⁷)) div x⁶⁴               |
C | H1L = (H1 mod x⁶⁴)||(((H1 mod x⁶⁴) × (x⁶³+x⁶²+x⁵⁷)) mod x⁶⁴) + (H1 div x⁶⁴)  |
C |                                                                              |
C | H2M = (H2 div x⁶⁴)||((H2 mod x⁶⁴) × (x⁶⁴+x⁶³+x⁶²+x⁵⁷)) div x⁶⁴               |
C | H2L = (H2 mod x⁶⁴)||(((H2 mod x⁶⁴) × (x⁶³+x⁶²+x⁵⁷)) mod x⁶⁴) + (H2 div x⁶⁴)  |
C |                                                                              |
C | H3M = (H3 div x⁶⁴)||((H3 mod x⁶⁴) × (x⁶⁴+x⁶³+x⁶²+x⁵⁷)) div x⁶⁴               |
C | H3L = (H3 mod x⁶⁴)||(((H3 mod x⁶⁴) × (x⁶³+x⁶²+x⁵⁷)) mod x⁶⁴) + (H3 div x⁶⁴)  |
C |                                                                              |
C | H4M = (H4 div x⁶⁴)||((H4 mod x⁶⁴) × (x⁶⁴+x⁶³+x⁶²+x⁵⁷)) div x⁶⁴               |
C | H4L = (H4 mod x⁶⁴)||(((H4 mod x⁶⁴) × (x⁶³+x⁶²+x⁵⁷)) mod x⁶⁴) + (H4 div x⁶⁴)  |
C *******************************************************************************
define(`TABLE', `x0')
define(`ZERO', `v0')
define(`EMSB', `v1')
define(`POLY', `v2')
define(`B', `v3')

define(`H', `v4')
define(`HQ', `q4')
define(`H_t', `v5')
define(`H2', `v6')
define(`H2_t', `v7')
define(`H3', `v16')
define(`H3_t', `v17')
define(`H4', `v18')
define(`H4_t', `v19')
define(`H_m', `v20')
define(`H_m1', `v21')
define(`H_h', `v22')
define(`H_l', `v23')
define(`RP', `v24')
define(`Ml', `v25')
define(`Mh', `v26')
PROLOGUE(_nettle_gcm_init_key)
    ldr        HQ,[TABLE,#16*H_Idx]
    dup        EMSB.16b,H.b[0]
    rev64      H.16b,H.16b
    mov        x9,#0xC200000000000000
    mov        x10,#1
    mov        POLY.d[0],x9
    mov        POLY.d[1],x10
    sshr       EMSB.16b,EMSB.16b,#7
    and        EMSB.16b,EMSB.16b,POLY.16b
    ushr       B.2d,H.2d,#63
    and        B.16b,B.16b,POLY.16b
    ext        B.16b,B.16b,B.16b,#8
    shl        H.2d,H.2d,#1
    orr        H.16b,H.16b,B.16b
    eor        H.16b,H.16b,EMSB.16b

    eor        ZERO.16b,ZERO.16b,ZERO.16b
    dup        POLY.2d,POLY.d[0]
    ext        H_t.16b,H.16b,H.16b,#8

    pmull      H_m.1q,H.1d,H_t.1d
    pmull2     H_m1.1q,H.2d,H_t.2d
    pmull      H_h.1q,H.1d,H.1d
    pmull2     H_l.1q,H.2d,H.2d

    eor        H_m.16b,H_m.16b,H_m1.16b
    pmull      RP.1q,H_l.1d,POLY.1d
    ext        Ml.16b,ZERO.16b,H_m.16b,#8
    ext        Mh.16b,H_m.16b,ZERO.16b,#8
    ext        RP.16b,RP.16b,RP.16b,#8
    eor        H_l.16b,H_l.16b,Ml.16b
    eor        H_h.16b,H_h.16b,Mh.16b
    eor        H_l.16b,H_l.16b,RP.16b

    pmull2     RP.1q,H_l.2d,POLY.2d
    eor        H_h.16b,H_h.16b,H_l.16b
    eor        H2_t.16b,H_h.16b,RP.16b
    ext        H2.16b,H2_t.16b,H2_t.16b,#8

    st1        {H.16b,H_t.16b,H2.16b,H2_t.16b},[TABLE],#64

    pmull      H_m.1q,H.1d,H2_t.1d
    pmull2     H_m1.1q,H.2d,H2_t.2d
    pmull      H_h.1q,H.1d,H2.1d
    pmull2     H_l.1q,H.2d,H2.2d

    eor        H_m.16b,H_m.16b,H_m1.16b
    pmull      RP.1q,H_l.1d,POLY.1d
    ext        Ml.16b,ZERO.16b,H_m.16b,#8
    ext        Mh.16b,H_m.16b,ZERO.16b,#8
    ext        RP.16b,RP.16b,RP.16b,#8
    eor        H_l.16b,H_l.16b,Ml.16b
    eor        H_h.16b,H_h.16b,Mh.16b
    eor        H_l.16b,H_l.16b,RP.16b

    pmull2     RP.1q,H_l.2d,POLY.2d
    eor        H_h.16b,H_h.16b,H_l.16b
    eor        H3_t.16b,H_h.16b,RP.16b
    ext        H3.16b,H3_t.16b,H3_t.16b,#8

    pmull      H_m.1q,H2.1d,H2_t.1d
    pmull2     H_m1.1q,H2.2d,H2_t.2d
    pmull      H_h.1q,H2.1d,H2.1d
    pmull2     H_l.1q,H2.2d,H2.2d

    eor        H_m.16b,H_m.16b,H_m1.16b
    pmull      RP.1q,H_l.1d,POLY.1d
    ext        Ml.16b,ZERO.16b,H_m.16b,#8
    ext        Mh.16b,H_m.16b,ZERO.16b,#8
    ext        RP.16b,RP.16b,RP.16b,#8
    eor        H_l.16b,H_l.16b,Ml.16b
    eor        H_h.16b,H_h.16b,Mh.16b
    eor        H_l.16b,H_l.16b,RP.16b

    pmull2     RP.1q,H_l.2d,POLY.2d
    eor        H_h.16b,H_h.16b,H_l.16b
    eor        H4_t.16b,H_h.16b,RP.16b
    ext        H4.16b,H4_t.16b,H4_t.16b,#8

    st1        {H3.16b,H3_t.16b,H4.16b,H4_t.16b},[TABLE]

    ret
EPILOGUE(_nettle_gcm_init_key)
define(`TABLE', `x0')
define(`X', `x1')
define(`LENGTH', `x2')
define(`DATA', `x3')

define(`POLY', `v0')
define(`ZERO', `v1')

define(`D', `v2')
define(`C0', `v3')
define(`C0D', `d3')
define(`C1', `v4')
define(`C2', `v5')
define(`C3', `v6')
define(`RP', `v7')
define(`H', `v16')
define(`H_t', `v17')
define(`H2', `v18')
define(`H2_t', `v19')
define(`H3', `v20')
define(`H3_t', `v21')
define(`H4', `v22')
define(`H4_t', `v23')
define(`H_m', `v24')
define(`H_m1', `v25')
define(`H_h', `v26')
define(`H_l', `v27')
define(`H_m2', `v28')
define(`H_m3', `v29')
define(`H_h2', `v30')
define(`H_l2', `v31')
define(`Ml', `v4')
define(`Mh', `v5')

C void gcm_hash (const struct gcm_key *key, union gcm_block *x,
C                size_t length, const uint8_t *data)
PROLOGUE(_nettle_gcm_hash)
    mov        x10,#0xC200000000000000
    mov        POLY.d[0],x10
    dup        POLY.2d,POLY.d[0]
    eor        ZERO.16b,ZERO.16b,ZERO.16b

    ld1        {D.16b},[X]
    rev64      D.16b,D.16b

    ands       x10,LENGTH,#-64
    b.eq       L2x

    add        x9,TABLE,64
    ld1        {H.16b,H_t.16b,H2.16b,H2_t.16b},[TABLE]
    ld1        {H3.16b,H3_t.16b,H4.16b,H4_t.16b},[x9]

L4x_loop:
    ld1        {C0.16b,C1.16b,C2.16b,C3.16b},[DATA],#64
    rev64      C0.16b,C0.16b
    rev64      C1.16b,C1.16b
    rev64      C2.16b,C2.16b
    rev64      C3.16b,C3.16b

    eor        C0.16b,C0.16b,D.16b

    pmull      H_m.1q,C1.1d,H3_t.1d
    pmull2     H_m1.1q,C1.2d,H3_t.2d
    pmull      H_h.1q,C1.1d,H3.1d
    pmull2     H_l.1q,C1.2d,H3.2d

    pmull      H_m2.1q,C2.1d,H2_t.1d
    pmull2     H_m3.1q,C2.2d,H2_t.2d
    pmull      H_h2.1q,C2.1d,H2.1d
    pmull2     H_l2.1q,C2.2d,H2.2d

    eor        H_m.16b,H_m.16b,H_m2.16b
    eor        H_m1.16b,H_m1.16b,H_m3.16b
    eor        H_h.16b,H_h.16b,H_h2.16b
    eor        H_l.16b,H_l.16b,H_l2.16b

    pmull      H_m2.1q,C3.1d,H_t.1d
    pmull2     H_m3.1q,C3.2d,H_t.2d
    pmull      H_h2.1q,C3.1d,H.1d
    pmull2     H_l2.1q,C3.2d,H.2d

    eor        H_m.16b,H_m.16b,H_m2.16b
    eor        H_m1.16b,H_m1.16b,H_m3.16b
    eor        H_h.16b,H_h.16b,H_h2.16b
    eor        H_l.16b,H_l.16b,H_l2.16b

    pmull      H_m2.1q,C0.1d,H4_t.1d
    pmull2     H_m3.1q,C0.2d,H4_t.2d
    pmull      H_h2.1q,C0.1d,H4.1d
    pmull2     H_l2.1q,C0.2d,H4.2d

    eor        H_m.16b,H_m.16b,H_m2.16b
    eor        H_m1.16b,H_m1.16b,H_m3.16b
    eor        H_h.16b,H_h.16b,H_h2.16b
    eor        H_l.16b,H_l.16b,H_l2.16b

    eor        H_m.16b,H_m.16b,H_m1.16b
    pmull      RP.1q,H_l.1d,POLY.1d
    ext        Ml.16b,ZERO.16b,H_m.16b,#8
    ext        Mh.16b,H_m.16b,ZERO.16b,#8
    ext        RP.16b,RP.16b,RP.16b,#8
    eor        H_l.16b,H_l.16b,Ml.16b
    eor        H_h.16b,H_h.16b,Mh.16b
    eor        H_l.16b,H_l.16b,RP.16b

    pmull2     RP.1q,H_l.2d,POLY.2d
    eor        H_h.16b,H_h.16b,H_l.16b
    eor        D.16b,H_h.16b,RP.16b
    ext        D.16b,D.16b,D.16b,#8

    subs       x10,x10,64
    b.ne       L4x_loop

    and        LENGTH,LENGTH,#63

L2x:
    tst        LENGTH,#-32
    b.eq       L1x

    ld1        {H.16b,H_t.16b,H2.16b,H2_t.16b},[TABLE]

    ld1        {C0.16b,C1.16b},[DATA],#32
    rev64      C0.16b,C0.16b
    rev64      C1.16b,C1.16b

    eor        C0.16b,C0.16b,D.16b

    pmull      H_m.1q,C1.1d,H_t.1d
    pmull2     H_m1.1q,C1.2d,H_t.2d
    pmull      H_h.1q,C1.1d,H.1d
    pmull2     H_l.1q,C1.2d,H.2d

    pmull      H_m2.1q,C0.1d,H2_t.1d
    pmull2     H_m3.1q,C0.2d,H2_t.2d
    pmull      H_h2.1q,C0.1d,H2.1d
    pmull2     H_l2.1q,C0.2d,H2.2d

    eor        H_m.16b,H_m.16b,H_m2.16b
    eor        H_m1.16b,H_m1.16b,H_m3.16b
    eor        H_h.16b,H_h.16b,H_h2.16b
    eor        H_l.16b,H_l.16b,H_l2.16b

    eor        H_m.16b,H_m.16b,H_m1.16b
    pmull      RP.1q,H_l.1d,POLY.1d
    ext        Ml.16b,ZERO.16b,H_m.16b,#8
    ext        Mh.16b,H_m.16b,ZERO.16b,#8
    ext        RP.16b,RP.16b,RP.16b,#8
    eor        H_l.16b,H_l.16b,Ml.16b
    eor        H_h.16b,H_h.16b,Mh.16b
    eor        H_l.16b,H_l.16b,RP.16b

    pmull2     RP.1q,H_l.2d,POLY.2d
    eor        H_h.16b,H_h.16b,H_l.16b
    eor        D.16b,H_h.16b,RP.16b
    ext        D.16b,D.16b,D.16b,#8

    and        LENGTH,LENGTH,#31

L1x:
    tst        LENGTH,#-16
    b.eq       Lmod

    ld1        {H.16b,H_t.16b},[TABLE]

    ld1        {C0.16b},[DATA],#16
    rev64      C0.16b,C0.16b

    eor        C0.16b,C0.16b,D.16b

    pmull      H_m.1q,C0.1d,H_t.1d
    pmull2     H_m1.1q,C0.2d,H_t.2d
    pmull      H_h.1q,C0.1d,H.1d
    pmull2     H_l.1q,C0.2d,H.2d

    eor        H_m.16b,H_m.16b,H_m1.16b
    pmull      RP.1q,H_l.1d,POLY.1d
    ext        Ml.16b,ZERO.16b,H_m.16b,#8
    ext        Mh.16b,H_m.16b,ZERO.16b,#8
    ext        RP.16b,RP.16b,RP.16b,#8
    eor        H_l.16b,H_l.16b,Ml.16b
    eor        H_h.16b,H_h.16b,Mh.16b
    eor        H_l.16b,H_l.16b,RP.16b

    pmull2     RP.1q,H_l.2d,POLY.2d
    eor        H_h.16b,H_h.16b,H_l.16b
    eor        D.16b,H_h.16b,RP.16b
    ext        D.16b,D.16b,D.16b,#8

Lmod:
    tst        LENGTH,#15
    b.eq       Ldone

    ld1        {H.16b,H_t.16b},[TABLE]

    tbz        LENGTH,3,Lmod_8
    ldr        C0D,[DATA],#8
    rev64      C0.16b,C0.16b
    mov        x10,#0
    mov        C0.d[1],x10

Lmod_8:
    tst        LENGTH,#7
    b.eq       Lmod_8_done
    mov        x9,#0
    mov        x8,#64
    and        x7,LENGTH,#7

Lmod_8_loop:
    mov        x10,#0
    ldrb       w10,[DATA],#1
    sub        x8,x8,#8
    lsl        x10,x10,x8
    orr        x9,x9,x10
    subs       x7,x7,#1
    b.ne       Lmod_8_loop

    tbz        LENGTH,3,Lmod_8_load
    mov        C0.d[1],x9
    b          Lmod_8_done

Lmod_8_load:
    mov        x10,#0
    mov        C0.d[0],x9
    mov        C0.d[1],x10

Lmod_8_done:
    eor        C0.16b,C0.16b,D.16b

    pmull      H_m.1q,C0.1d,H_t.1d
    pmull2     H_m1.1q,C0.2d,H_t.2d
    pmull      H_h.1q,C0.1d,H.1d
    pmull2     H_l.1q,C0.2d,H.2d

    eor        H_m.16b,H_m.16b,H_m1.16b
    pmull      RP.1q,H_l.1d,POLY.1d
    ext        Ml.16b,ZERO.16b,H_m.16b,#8
    ext        Mh.16b,H_m.16b,ZERO.16b,#8
    ext        RP.16b,RP.16b,RP.16b,#8
    eor        H_l.16b,H_l.16b,Ml.16b
    eor        H_h.16b,H_h.16b,Mh.16b
    eor        H_l.16b,H_l.16b,RP.16b

    pmull2     RP.1q,H_l.2d,POLY.2d
    eor        H_h.16b,H_h.16b,H_l.16b
    eor        D.16b,H_h.16b,RP.16b
    ext        D.16b,D.16b,D.16b,#8

Ldone:
    rev64      D.16b,D.16b
    st1        {D.16b},[X]
    ret
EPILOGUE(_nettle_gcm_hash)
I forgot to mention that I ran the benchmark on gcc17 in the GCC Farm.
regards, Mamone
Maamoun TK maamoun.tk@googlemail.com writes:
I made a merge request in the main repo that enables optimized GHASH on AArch64 architecture.
Nice! I've had a quick first look. For the organization, I think aarch64 assembly should go in its own directory, arm64/, like it's done for x86 and sparc.

I wonder which assembly files we should use if the target host is aarch64, but ABI=32? I guess the arm/v6/ code can be used unconditionally. Can we also use arm/neon/ code unconditionally?

Do you agree with aiming for a release pretty soon, including the new powerpc64 code, but no aarch64 code?
Regards, /Niels
I wonder which assembly files we should use if target host is aarch64, but ABI=32? I guess the arm/v6/ code can be used unconditionally. Can we also use arm/neon/ code unconditionally?
It seems gcc for aarch64 doesn't support building 32-bit binaries, maybe we should remove the check of ABI since 64-bit is the only option. I tried adding arm/v6 and arm/neon unconditionally, both yield a bunch of errors such as the integer register is r4 instead of w4 or x4 plus getting a few unknown mnemonics.
Do you agree with aiming for a release pretty soon, including the new powerpc64 code, but no aarch64 code?
Isn't starting a new version with both the powerpc64 and aarch64 changes more reasonable? I'm not sure here; if there are a few commits before the powerpc64 patches, then it makes sense to wrap up the current version with the powerpc64 code. It's up to you to decide. You could also consider the AES modes optimizations for the S390x arch, whose patch I'll drop in the next few days.
regards, Mamone
I created a couple of merge requests in the repo. With those MRs merged, I think the powerpc code is stable enough to be included in the upcoming version of Nettle.
regards, Mamone
Maamoun TK maamoun.tk@googlemail.com writes:
I created a couple of merge requests in the repo, with those MRs merged I think the powerpc code is stable to be included in the upcoming version of nettle.
Thanks. I've merged the "Use 32-bit offset to load data".
For the other one, https://git.lysator.liu.se/nettle/nettle/-/merge_requests/15 "Use signal to detect CPU features when getauxval() isn't available", can you explain for which systems that is needed? In the current code, you handle gnu/linux (depends on glibc, I guess), freebsd and aix.

I hesitate to add signal code, because it seems a bit dangerous and brittle for a library to modify signal handlers. In particular, I worry about what happens to other threads, since sigaction modifies the process-global signal handler.

The fat setup code is otherwise threadsafe, under the assumption that writes to a function pointer variable are atomic on the relevant architecture. In the unlikely case that we get concurrent calls to fat_init, both threads will come to the same conclusion and store identical values in the target variables, so it shouldn't matter in which order (and how late) writes propagate to other cores.
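For readers unfamiliar with the fat setup, here is a minimal sketch of the pattern being described (not Nettle's actual fat-*.c code; all names are made up): the resolved routine is reached through a function pointer, and because every racing initializer stores the same value, an atomic pointer-sized store is the only requirement.

#include <stdio.h>

/* Stand-ins for the portable C implementation and the optimized
   assembly implementation selected at runtime. */
static int impl_generic(void)   { return 0; }
static int impl_optimized(void) { return 1; }

/* Placeholder for the real CPU feature probe (getauxval() etc.). */
static int cpu_has_feature(void) { return 1; }

/* The function pointer through which callers reach the routine. */
static int (*selected_impl)(void) = impl_generic;

static void
fat_init(void)
{
  /* If two threads race here, both compute the same answer and store
     the same pointer, so the only assumption is that a pointer-sized
     store is atomic on the platform. */
  selected_impl = cpu_has_feature() ? impl_optimized : impl_generic;
}

int
main(void)
{
  fat_init();
  printf("using the %s implementation\n",
         selected_impl() ? "optimized" : "generic");
  return 0;
}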
If there's some way to setup (and restore) a thread-local signal handler for SIGILL, that would be safer, but I don't know if that's at all possible.
Regards, /Niels
On Sat, Dec 19, 2020 at 11:27 AM Niels Möller nisse@lysator.liu.se wrote:
For the other one, https://git.lysator.liu.se/nettle/nettle/-/merge_requests/15 "Use signal to detect CPU features when getauxval() isn't available", can you explain for which systems is that needed? In the current code, you handle gnu/linux (depends on glibc, I guess), freebsd and aix.
I hesitate adding signal code, because it seems a bit dangerous and brittle for a library to modify signal handlers. In particular, I worry about what happens to other threads, since sigaction modifies the process-global signal handler.
The fat setup code is otherwise threadsafe, under the assumption that writes to a function pointer variable is atomic on the relevant architecture. In the unlikely case that we get concurrent calls to fat_init, both threads will get to the same conclusion and store identical values in the target variables, so then it shouldn't matter in which order (and how late) writes propagate to other cores.
If there's some way to setup (and restore) a thread-local signal handler for SIGILL, that would be safer, but I don't know if that's at all possible.
fat-ppc.c uses the getauxval() function to detect CPU features on Linux systems. The problem is that getauxval was introduced in glibc 2.16, which was released in 2012, so with the fat option enabled the build will fail for older glibc versions. To get around that, I implemented CPU feature detection using a signal when an old glibc version is used, but as you mentioned, signals are process-wide on UNIX, which could be problematic here under certain circumstances. However, I'm not aware of any approach that achieves thread-safe signal handling, and even if such an approach exists, I don't think it's worth complicating the procedure this much just to detect CPU features. Do you have any suggestions, or do we have to look for alternative solutions?
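For context, the kind of probe under discussion looks roughly like this (a sketch of the general SIGILL technique, not the code in merge request 15; try_insn is a placeholder for executing one optional instruction):

#include <setjmp.h>
#include <signal.h>

static sigjmp_buf ill_jmp;

static void
ill_handler(int sig)
{
  (void) sig;
  siglongjmp(ill_jmp, 1);
}

/* Returns 1 if try_insn() ran without raising SIGILL, 0 otherwise.
   Note the process-global sigaction() calls: this is exactly the part
   that is problematic for a library, as discussed above. */
static int
probe_insn(void (*try_insn)(void))
{
  struct sigaction sa, old;
  int present = 0;

  sa.sa_handler = ill_handler;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = 0;
  sigaction(SIGILL, &sa, &old);

  if (sigsetjmp(ill_jmp, 1) == 0)
    {
      try_insn();    /* e.g. one POWER8 crypto instruction */
      present = 1;   /* only reached if it didn't trap */
    }

  sigaction(SIGILL, &old, NULL);   /* restore the previous handler */
  return present;
}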
regards, Mamone
Maamoun TK maamoun.tk@googlemail.com writes:
fat-ppc.c uses getauxval() function to detect cpu features for Linux systems, the problem is that getauxval was introduced in glibc v2.16 which released in 2012 so in case fat option enabled, the build will fail for older glibc versions.
I agree it's not so nice that the build fails on old systems. Do you have any idea how common such old systems might be?
Maybe add a configure check for getauxval, and either fail at configure time if --enable-fat is specified but we can't support it, or fall back to assuming that none of the optional features are present at runtime?
Some preprocessor check of glibc version in fat-ppc.c could work too, if that's simpler.
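The preprocessor variant could look something like this (an illustrative sketch, not the patch that was eventually merged; the AT_HWCAP2 use mirrors what fat-ppc.c needs):

#include <stdlib.h>   /* pulls in the glibc feature-test macros, if any */

#if defined(__GLIBC__) && defined(__GLIBC_PREREQ)
# if __GLIBC_PREREQ(2, 16)
#  define USE_GETAUXVAL 1
# endif
#endif

#ifdef USE_GETAUXVAL
# include <sys/auxv.h>
static unsigned long
get_hwcap2(void)
{
  return getauxval(AT_HWCAP2);
}
#else
static unsigned long
get_hwcap2(void)
{
  /* Old glibc (or non-glibc): assume no optional features. */
  return 0;
}
#endif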
Regards, /Niels
On Sat, Dec 19, 2020 at 9:05 PM Niels Möller nisse@lysator.liu.se wrote:
Do you have any idea how common such old systems might be?
I don't have a specific number, but I think using such old versions of glibc is uncommon, especially for POWER8 and later processors, considering those versions are more than 8 years old.
Maybe add a configure check for getauxval, and either fail at configure time if --enable-fat is specified but we can't support it, or fall back to assuming that none of the optional features are present at runtime?
Some preprocessor check of glibc version in fat-ppc.c could work too, if that's simpler.
That's what I ended up with; I made a new merge request for these changes and closed the old one.
regards, Mamone
On Sun, Dec 20, 2020 at 12:14 PM Maamoun TK maamoun.tk@googlemail.com wrote:
On Sat, Dec 19, 2020 at 9:05 PM Niels Möller nisse@lysator.liu.se wrote:
Do you have any idea how common such old systems might be?
I don't have a specific number but I think using that old versions of glibc is uncommon specially for POWER8 and above processors considering those versions are more than 8 years old.
PPC64LE Linux is the primary focus of Linux on Power. The PPC64LE ABI specifies Power8 as the minimum ISA. GLIBC 2.16 or higher will be available on all such PPC64LE Linux systems. I doubt that an older, PPC64 Linux big endian system would install the latest libgcrypt and that configuration only is in maintenance mode for customers.
Again, it's your choice, but I would not invest a lot of effort to support such a rare, old configuration that is unlikely to use a new release of libgcrypt.
Thanks, David
Maamoun TK maamoun.tk@googlemail.com writes:
Some preprocessor check of glibc version in fat-ppc.c could work too, if that's simpler.
That's what I ended up with, I made a new merge request for these changes and closed the old one.
Thanks, looks pretty good. I added a few minor comments on the mr (https://git.lysator.liu.se/nettle/nettle/-/merge_requests/16 for reference).
Regards, /Niels
On Mon, Dec 21, 2020 at 9:29 AM Niels Möller nisse@lysator.liu.se wrote:
Thanks, looks pretty good. I added a few minor comments on the mr (https://git.lysator.liu.se/nettle/nettle/-/merge_requests/16 for reference).
Thank you, I made a commit with the changes.
regards, Mamone
Maamoun TK maamoun.tk@googlemail.com writes:
Thank you, I made a commit with the changes.
Thanks! Merged now.
Regards, /Niels
Maamoun TK maamoun.tk@googlemail.com writes:
It seems gcc for aarch64 doesn't support building 32-bit binaries, maybe we should remove the check of ABI since 64-bit is the only option.
Ok, that's a bit confusing. There's a command line flag for it, not -m32 but -mabi=ilp32, but that doesn't work out of the box with my (debian-packaged) cross compiler. Searching turns up this old (2015) email saying that gcc support is work-in-progress: https://gcc.gnu.org/legacy-ml/gcc-help/2015-02/msg00034.html
I would suggest keeping the ABI check, but leave asm_path empty (or maybe use asm_path=arm), until we have figured out how to build and test for that configuration.
Regards, /Niels
On Fri, Dec 18, 2020 at 11:31 AM Niels Möller nisse@lysator.liu.se wrote:
Maamoun TK maamoun.tk@googlemail.com writes:
It seems gcc for aarch64 doesn't support building 32-bit binaries, maybe we should remove the check of ABI since 64-bit is the only option.
Ok, that's a bit confusing. There's a command line flag for it, not -m32 but -mabi=ilp32, but that doesn't work out of the box with my (debian-packaged) cross compiler. Searching turns up this old (2015) email saying that gcc support is work-in-progress: https://gcc.gnu.org/legacy-ml/gcc-help/2015-02/msg00034.html
Also see https://gcc.gnu.org/legacy-ml/gcc-help/2016-06/msg00097.html
Jeff
nisse@lysator.liu.se (Niels Möller) writes:
Maamoun TK maamoun.tk@googlemail.com writes:
I made a merge request in the main repo that enables optimized GHASH on AArch64 architecture.
Nice! I've had a quick first look. For the organization, I think aarch64 assembly should go in it's own directory, arm64/, like it's done for x86 and sparc.
I've made a new branch "arm64" with the configure changes. If you think that looks ok, can you add your new ghash code on top of that?
(I'd like to make a similar branch for S390x. It would be good to also get S390x into the ci system, before adding s390x-specific assembly. I hope that should be easy to do with the same cross setup as for arm, arm64, mips, etc).
I wonder which assembly files we should use if target host is aarch64, but ABI=32? I guess the arm/v6/ code can be used unconditionally. Can we also use arm/neon/ code unconditionally?
The reference manual says
Armv8 can support the following levels of support for Advanced SIMD and floating-point instructions:
* Full SIMD and floating-point support without exception trapping.
* Full SIMD and floating-point support with exception trapping.
* No floating-point or SIMD support. This option is licensed only for implementations targeting specialized markets.
As far as I understand, that means Neon should be always available, in both 32-bit and 64-bit mode.
Regards, /Niels
On Tue, Jan 5, 2021 at 8:23 AM Niels Möller nisse@lysator.liu.se wrote:
nisse@lysator.liu.se (Niels Möller) writes:
...
The reference manual says
Armv8 can support the following levels of support for Advanced SIMD and floating-point instructions:
Full SIMD and floating-point support without exception trapping.
Full SIMD and floating-point support with exception trapping.
No floating-point or SIMD support. This option is licensed only for implementations targeting specialized markets.
As far as I understand, that means Neon should be always available, in both 32-bit and 64-bit mode.
NEON is called ASIMD under ARMv8. It is part of the base machine, like SSE2 is part of x86_64.
Jeff
On Tue, Jan 5, 2021 at 3:23 PM Niels Möller nisse@lysator.liu.se wrote:
I've made a new branch "arm64" with the configure changes. If you think that looks ok, can you add your new ghash code on top of that?
Great. I'll add the ghash code to the branch once I finish the big-endian support.
(It would be good to also get S390x into the ci system, before adding s390x-specific assembly. I hope that should be easy to do with the same cross setup as for arm, arm64, mips, etc).
This is not possible, since qemu doesn't support the cipher functions; it implements subcode 0 (query) without the actual encipher/decipher operations. Take a look here: https://git.qemu.org/?p=qemu.git;a=commit;h=be2b567018d987591647935a7c9648e9...

I had a talk with David Edelsohn about this issue and concluded that there is no support for the cipher functions in qemu, and it's unlikely to happen anytime soon. However, I updated the testutils to cover the s390x-specific assembly, so the patch can easily be tested manually by executing 'make check'. I have also tested every aspect of this patch to make sure everything will go well once it's merged.
I wonder which assembly files we should use if target host is aarch64, but ABI=32? I guess the arm/v6/ code can be used unconditionally. Can we also use arm/neon/ code unconditionally?
The reference manual says
Armv8 can support the following levels of support for Advanced SIMD and floating-point instructions:
Full SIMD and floating-point support without exception trapping.
Full SIMD and floating-point support with exception trapping.
No floating-point or SIMD support. This option is licensed only for implementations targeting specialized markets.
As far as I understand, that means Neon should be always available, in both 32-bit and 64-bit mode.
I'll investigate how we can build the existing NEON implementations on 64-bit systems.
regards, Mamone
Hello Maamoun,
On Tue, Jan 05, 2021 at 05:52:35PM +0200, Maamoun TK wrote:
I've made a new branch "arm64" with the configure changes. If you think that looks ok, can you add your new ghash code on top of that?
Great. I'll add the ghash code to the branch once I finish the big-endian support.
I've dusted off the pine64s I mentioned before. Both are running Gentoo, one little-endian, the other big-endian. I'd be happy to give anything you throw my way a whirl on real hardware.
# uname -a
Linux v 4.16.0-rc5-00012-g7cfbc0d114ca #1 SMP Tue Mar 13 18:55:14 CET 2018 aarch64_be GNU/Linux
(A newer kernel is coming.)
# file /usr/lib64/libnettle.so.8.0
/usr/lib64/libnettle.so.8.0: ELF 64-bit MSB shared object, ARM aarch64, version 1 (SYSV), dynamically linked, stripped
Regarding CI: I've recently updated my buildroot-based armv[567]b container images.[1] Something similar should be doable for aarch64_be.
[1] https://hub.docker.com/r/michaelweisernettleci/buildroot
Thank you, I will keep you updated about the progress of big-endian support for GHASH on the arm64 arch so we can test the patch on a real device before sending it to Niels.
regards, Mamone
Hello Maamoun,
On Tue, Jan 05, 2021 at 09:04:59PM +0200, Maamoun TK wrote:
Thank you, I will keep you updated about progress of big-endian support for GHASH on arm64 arch so we can test the patch on real device before sending it to Niels.
I've added aarch64_be buildroot toolchain container images to https://hub.docker.com/r/michaelweisernettleci/buildroot. Tags are michaelweisernettleci/buildroot:2020.11.1-aarch64_be-glibc and michaelweisernettleci/buildroot:2020.11.1-aarch64_be-uclibc.
I've also updated the arm CI branch[1] with an aarch64_be build[2] that runs the testsuite through qemu-user.
[1] https://gitlab.com/michaelweiser/nettle/-/tree/arm-ci-fat [2] https://gitlab.com/michaelweiser/nettle/-/blob/arm-ci-fat/.gitlab-ci.yml#L17...
The BE pine64 board is also all updated now and standing by.
I have tuned the ghash patch to support big-endian mode, but I'm really having difficulties testing it through emulation. I'll attach the patch here so you can test it, but I'm not sure how I can fix any bugs on a big-endian system. Feel free to send debugging info or set up a remote ssh connection so we can get it working properly.
The patch is built on top of the master branch.
regards, Mamone
Hello Mamone,
On Mon, Jan 11, 2021 at 11:39:43PM +0200, Maamoun TK wrote:
I have tuned the ghash patch to support big-endian mode but I'm really having difficulties testing it out through emulating, I'll attach the patch here so you can test it but I'm not sure how I can fix the bugs on big-endian system if any, you can feel free to send debugging info or setup a remote ssh connection so we can get it work properly.
Out of curiosity as I can't seem to find the beginning of the discussion: Is there anyone but me with an actual use-case for big-endian arm64 here? If not, I'd hate to cause a lot of effort for you and would certainly put in the effort to get this going myself.
The patch is built on top of the master branch.
First it failed to compile gcm-hash.o with error "No rule to make target" which turned out to be caused by a missing arm64/machine.m4. After I added an empty file there it compiled fine on aarch64 and the testsuite succeeded on the actual hardware as well as under qemu-aarch64 user mode emulation (both LE).
On aarch64_be it fails to compile with the following error message:
gcm-hash.s:113: Error: unknown mnemonic `zip' -- `zip v23.2d,v2.2d,v22.2d'
gcm-hash.s:119: Error: unknown mnemonic `zip' -- `zip v25.2d,v3.2d,v22.2d'
gcm-hash.s:129: Error: unknown mnemonic `zip' -- `zip v27.2d,v4.2d,v22.2d'
gcm-hash.s:137: Error: unknown mnemonic `zip' -- `zip v29.2d,v5.2d,v22.2d'
This happens with gcc 10.2.0 on my hardware board as well as cross gcc 9.3.0 of Buildroot 2020.11.1 in a container.
I did a search of the aarch64 instruction set and saw that there's zip1 and zip2 instructions. So as a first test I just changed zip to zip1 which made it compile. As was to be expected, the testsuite failed though.
Before you try and get me up to speed on what the routine is supposed to be doing there's also an option for you to get a cross toolchain and emulator for your own tests without too much effort. Here's how I cross-compile nettle and run the testsuite using rootless podman (docker should do just as well) on my x86_64 box:
cd ~/Downloads
mkdir nettle
cd nettle
git clone https://git.lysator.liu.se/nettle/nettle
cd nettle
git apply ~/arm64_ghash.patch
./.bootstrap
podman run -it -v ~/Downloads/nettle:/nettle michaelweisernettleci/buildroot:2020.11.1-aarch64_be-glibc-gdb
cd /nettle/
mkdir build-aarch64_be
cd build-aarch64_be/
../nettle/configure --host=$(cat /buildroot/triple) --enable-armv8-a-crypto
make -j4
make -j4 check EMULATOR=/buildroot/qemu
Unfortunately, because in this case qemu-aarch64_be is running the testsuite binaries under emulation and doesn't support the ptrace syscall (and containers usually don't either), you can't just run it under an aarch64_be native gdb to see what it's executing.
One option would be to boot a full BE system image with kernel in qemu-system-aarch64 including a native gdb. But that's a bit of a hassle (building a rootfs and kernel e.g. using buildroot, getting it to boot in qemu, accessing it via console or network, ...)
qemu-user can however serve as a gdb server similar to qemu-system[1]. [1] https://qemu.readthedocs.io/en/latest/system/gdb.html
As luck would have it, above container image contains an x86_64-native gdb targeting aarch64_be. So you can start the testsuite test under qemu with the -g option and a port to listen on for the gdb remote debugging connection and then fire up gdb and connect there. After that you can debug as usual, single-step and look at register values:
root@6c85515d3939:/nettle/build-aarch64_be/testsuite# /buildroot/qemu -E LD_LIBRARY_PATH=../.lib -g 9000 ./gcm-test &
[1] 4205
root@6c85515d3939:/nettle/build-aarch64_be/testsuite# aarch64_be-buildroot-linux-gnu-gdb ./gcm-test
GNU gdb (GDB) 8.3.1
[...]
Reading symbols from ./gcm-test...
(gdb) break main
Breakpoint 1 at 0x4037b0: file ../../nettle/testsuite/testutils.c, line 123.
(gdb) target remote localhost:9000
Remote debugging using localhost:9000
warning: remote target does not support file transfer, attempting to access files from local filesystem.
warning: Unable to find dynamic linker breakpoint function.
GDB will be unable to debug shared library initializers and track explicitly loaded dynamic code.
0x0000004000802040 in ?? ()
(gdb) c
Continuing.
warning: Could not load shared library symbols for 3 libraries, e.g. /usr/lib64/libgmp.so.10.
Use the "info sharedlibrary" command to see the complete listing.
Do you need "set solib-search-path" or "set sysroot"?

Breakpoint 1, main (argc=1, argv=0x4000800d58) at ../../nettle/testsuite/testutils.c:123
123       if (argc > 1)
(gdb) b _nettle_gcm_init_key
Breakpoint 2 at 0x40008b69f4: file gcm-hash.s, line 93.
(gdb) c
Continuing.

Breakpoint 2, _nettle_gcm_init_key () at gcm-hash.s:93
93        ldr q2,[x0,#16*128]
(gdb) s
94        dup v0.16b,v2.b[0]
(gdb)
96        mov x1,#0xC200000000000000
(gdb)
97        mov x2,#1
(gdb)
98        mov v6.d[0],x1
(gdb)
99        mov v6.d[1],x2
(gdb)
100       sshr v0.16b,v0.16b,#7
(gdb)
101       and v0.16b,v0.16b,v6.16b
(gdb)
102       ushr v1.2d,v2.2d,#63
(gdb)
103       and v1.16b,v1.16b,v6.16b
(gdb)
104       ext v1.16b,v1.16b,v1.16b,#8
(gdb)
105       shl v2.2d,v2.2d,#1
(gdb)
106       orr v2.16b,v2.16b,v1.16b
(gdb)
107       eor v2.16b,v2.16b,v0.16b
(gdb)
109       dup v6.2d,v6.d[0]
(gdb)
113       PMUL_PARAM v2,v23,v24
          ^--- doesn't seem to expand the macro here
(gdb)
115       PMUL v2,v23,v24
(gdb)
117       REDUCTION v3
(gdb) i r
x0             0x423390            4338576
x1             0xc200000000000000  -4467570830351532032
[...]
x30            0x406c44            4222020
sp             0x4000800ad0        0x4000800ad0
pc             0x40008b6a5c        0x40008b6a5c <_nettle_gcm_init_key+104>
cpsr           0x80000000          -2147483648
fpsr           0x0                 0
fpcr           0x0                 0
The trick to see and single-step the individual instructions of the macro seems to be disp/i $pc combined with stepi:
(gdb) disp/i $pc
1: x/i $pc
=> 0x40008b6a30 <_nettle_gcm_init_key+60>:   pmull2  v20.1q, v2.2d, v6.2d
(gdb) stepi
0x00000040008b6a34   113   PMUL_PARAM v2,v23,v24
1: x/i $pc
=> 0x40008b6a34 <_nettle_gcm_init_key+64>:   ext     v22.16b, v2.16b, v2.16b, #8
(gdb)
0x00000040008b6a38   113   PMUL_PARAM v2,v23,v24
1: x/i $pc
=> 0x40008b6a38 <_nettle_gcm_init_key+68>:   eor     v22.16b, v22.16b, v20.16b
(gdb)
0x00000040008b6a3c   113   PMUL_PARAM v2,v23,v24
1: x/i $pc
=> 0x40008b6a3c <_nettle_gcm_init_key+72>:   zip1    v23.2d, v2.2d, v22.2d
From here I would now continue to compare register contents after each instruction on LE and BE to see where it's going wrong.
How would you like to proceed? Shall I dig into it or do you want to? :)
BTW: In case you want to build the image yourself, the diff to the Dockerfile.aarch64[3] is this:
diff --git a/Dockerfile.aarch64 b/Dockerfile.aarch64
index 36af2c5..5b51c17 100644
--- a/Dockerfile.aarch64
+++ b/Dockerfile.aarch64
@@ -41,6 +41,7 @@ RUN br_libc="${BR_LIBC}" ; \
          echo "BR2_TOOLCHAIN_BUILDROOT_${libcopt}=y" ; \
          echo 'BR2_KERNEL_HEADERS_4_19=y' ; \
          echo 'BR2_PACKAGE_GMP=y' ; \
+         echo 'BR2_PACKAGE_HOST_GDB=y' ; \
          echo 'BR2_PER_PACKAGE_DIRECTORIES=y' ; \
        ) > .config && \
        make olddefconfig && \
@@ -75,7 +76,7 @@ MAINTAINER Michael Weiser michael.weiser@gmx.de
 RUN apt-get update -qq -y && \
        apt-get dist-upgrade -y && \
        apt-get autoremove -y && \
-       apt-get install -y autoconf dash g++ make qemu-user && \
+       apt-get install -y autoconf dash g++ libncurses6 libexpat1 make qemu-user && \
        apt-get clean all && \
        rm -rf /var/lib/apt/lists/*
[3] https://github.com/michaelweiser-nettle-ci/docker-buildroot/blob/master/Dock...
The command to build the image is:
podman build -f Dockerfile.aarch64 --build-arg BR_LIBC=glibc -t buildroot:2020.11.1-aarch64_be-glibc-gdb .
Hi Michael,
On Wed, Jan 13, 2021 at 8:00 PM Michael Weiser michael.weiser@gmx.de wrote:
Out of curiosity as I can't seem to find the beginning of the discussion: Is there anyone but me with an actual use-case for big-endian arm64 here? If not, I'd hate to cause a lot of effort for you and would certainly put in the effort to get this going myself.
It would be nice to get the implementation of the enhanced algorithm working for both endian modes, as it yields a good performance boost. Also, there is not much effort involved here; the only thing I'm struggling with is getting the binary built for aarch64_be. I'm using Ubuntu on x86_64 as the host, and it seems there is no official package to cross-compile for aarch64_be.
The patch is built on top of the master branch.
First it failed to compile gcm-hash.o with error "No rule to make target" which turned out to be caused by a missing arm64/machine.m4. After I added an empty file there it compiled fine on aarch64 and the testsuite succeeded on the actual hardware as well as under qemu-aarch64 user mode emulation (both LE).
On aarch64_be it fails to compile with the following error message:
gcm-hash.s:113: Error: unknown mnemonic `zip' -- `zip v23.2d,v2.2d,v22.2d' gcm-hash.s:119: Error: unknown mnemonic `zip' -- `zip v25.2d,v3.2d,v22.2d' gcm-hash.s:129: Error: unknown mnemonic `zip' -- `zip v27.2d,v4.2d,v22.2d' gcm-hash.s:137: Error: unknown mnemonic `zip' -- `zip v29.2d,v5.2d,v22.2d'
This happens with gcc 10.2.0 on my hardware board as well as cross gcc 9.3.0 of Buildroot 2020.11.1 in a container.
I did a search of the aarch64 instruction set and saw that there's zip1 and zip2 instructions. So as a first test I just changed zip to zip1 which made it compile. As was to be expected, the testsuite failed though.
You are on the right track so far.
Before you try and get me up to speed on what the routine is supposed to be doing there's also an option for you to get a cross toolchain and emulator for your own tests without too much effort. Here's how I cross-compile nettle and run the testsuite using rootless podman (docker should do just as well) on my x86_64 box:
cd ~/Downloads
mkdir nettle
cd nettle
git clone https://git.lysator.liu.se/nettle/nettle
cd nettle
git apply ~/arm64_ghash.patch
./.bootstrap
podman run -it -v ~/Downloads/nettle:/nettle michaelweisernettleci/buildroot:2020.11.1-aarch64_be-glibc-gdb
cd /nettle/
mkdir build-aarch64_be
cd build-aarch64_be/
../nettle/configure --host=$(cat /buildroot/triple) --enable-armv8-a-crypto
make -j4
make -j4 check EMULATOR=/buildroot/qemu
I tried that, but I'm having difficulty getting it to work. It seems there is a problem in my system configuration that prevents podman from establishing a socket for the connection, and I spent some time looking for alternative solutions with no luck. Do you have any other solutions? All I can think of is to either set up an ssh connection or work together to get it working, if you are into it!
regards, Mamone
Hello Mamone,
On Mon, Jan 18, 2021 at 06:27:40PM +0200, Maamoun TK wrote:
It would be nice to get the implementation of the enhanced algorithm working for both endian modes as it yields a good performance boost. Also, there is no much effort here, the only thing I'm struggling with is to get the binary built for Aarch64_be, I'm using Ubuntu on x86_64 as host and it seems there is no official package to cross compile for Aarch64_be.
Yes, there are no packages for aarch64_be in any mainstream distribution I'm aware of. Buildroot and Gentoo are the ones I know that can target it, Yocto likely as well. All are compile-yourself distributions and not for the faint of heart. Also, I've just learned that Buildroot has made a conscious decision not to produce native toolchains for the target. So you can only ever cross-compile nettle to it, run it on an actual board or under qemu and then go back to the cross-compiler on the host.
I did a search of the aarch64 instruction set and saw that there's zip1 and zip2 instructions. So as a first test I just changed zip to zip1 which made it compile. As was to be expected, the testsuite failed though.
You are on the right track so far.
I've poked at the code a bit more and seemingly made the key init function work by eliminating all the BE-specific macros and instead adjusting the load from memory to produce the same register content. At least register values and the final output to memory look the same in an x/64xb $x0-64 and x/64xb $x0 for the first test cases in gcm-test (which they did not before).
137       PMUL_PARAM v5,v29,v30
(gdb)
139       st1 {v27.16b,v28.16b,v29.16b,v30.16b},[x0]
(gdb)
141       ret
(gdb) x/64xb $x0-64
0xaaaaaaac5390: 0x77 0x58 0x14 0xdf 0xa9 0x97 0xd2 0xcd
[.. all the same on BE and LE ...]
0xaaaaaaac53c8: 0x0d 0x12 0x63 0x69 0x37 0x20 0xd3 0xfe
(gdb) x/64xb $x0
0xaaaaaaac53d0: 0xf9 0xfa 0x22 0xc3 0x02 0xe7 0x95 0x86
[.. all the same on BE and LE ...]
0xaaaaaaac5408: 0x45 0x91 0xbd 0x48 0x73 0xd9 0x8b 0x5c
(gdb)
The problem here once more seems to be that after a 128-bit LE load which is later used as two 64-bit operands, not only are the bytes of the operands reversed (which you already counter by rev64'ing them, I gather), but the operands (doublewords) also end up transposed in the register. This is something the rest of the routine expects, but it is only true on LE. So I adjusted for it on BE in a very pedestrian way:
diff --git a/arm64/v8/gcm-hash.asm b/arm64/v8/gcm-hash.asm
index 1c14db54..74cd656a 100644
--- a/arm64/v8/gcm-hash.asm
+++ b/arm64/v8/gcm-hash.asm
@@ -55,17 +55,10 @@ C common macros:
 .endm

 .macro REDUCTION out
-IF_BE(`
-    pmull          T.1q,F.1d,POLY.1d
-    ext            \out().16b,F.16b,F.16b,#8
-    eor            R.16b,R.16b,T.16b
-    eor            \out().16b,\out().16b,R.16b
-',`
     pmull          T.1q,F.1d,POLY.1d
     eor            R.16b,R.16b,T.16b
     ext            R.16b,R.16b,R.16b,#8
     eor            \out().16b,F.16b,R.16b
-')
 .endm

 C void gcm_init_key (union gcm_block *table)
@@ -108,19 +101,11 @@ define(`H4M', `v29')
 define(`H4L', `v30')

 .macro PMUL_PARAM in, param1, param2
-IF_BE(`
-    pmull2         Hp.1q,\in().2d,POLY.2d
-    ext            Hm.16b,\in().16b,\in().16b,#8
-    eor            Hm.16b,Hm.16b,Hp.16b
-    zip            \param1().2d,\in().2d,Hm.2d
-    zip2           \param2().2d,\in().2d,Hm.2d
-',`
     pmull2         Hp.1q,\in().2d,POLY.2d
     eor            Hm.16b,\in().16b,Hp.16b
     ext            \param1().16b,Hm.16b,\in().16b,#8
     ext            \param2().16b,\in().16b,Hm.16b,#8
     ext            \param1().16b,\param1().16b,\param1().16b,#8
-')
 .endm

 PROLOGUE(_nettle_gcm_init_key)
@@ -128,6 +113,10 @@ PROLOGUE(_nettle_gcm_init_key)
     dup            EMSB.16b,H.b[0]
 IF_LE(`
     rev64          H.16b,H.16b
+',`
+    mov            x1,H.d[0]
+    mov            H.d[0],H.d[1]
+    mov            H.d[1],x1
 ')
     mov            x1,#0xC200000000000000
     mov            x2,#1
If my understanding is correct, we could avoid the doubleword swap for both LE and BE if we were to load using ld1 into {H.16b} instead (with a precalculation of the offset, because ld1 won't take an immediate offset that high, correct?). But then the rest of the routine would need to change its expectation of what H.d[0] and H.d[1] contain, respectively, because they would no longer be transposed by either the load on LE or an explicit swap on BE.
Somehow I have a feeling I'm terribly missing the actual point here, though. Are the zip instructions likely to give even further speedup beyond the LE version? Could this be exploited for LE as well by adjusting the loading scheme even more?
Also, it's not fully working yet. Before digging deeper I wanted to give a bit of an update and get guidance as to how to proceed.
podman run -it -v ~/Downloads/nettle:/nettle
I tried that but I'm having difficulty getting it work, it seems there is a problem in my system configuration that prevents podman establishing a socket for connection, I spend some time looking for alternative solutions with no chance. Do you have any other solutions? all what I can think of is either setup ssh connection or work together to get it work if you are into it!
I mulled this over from all directions. Access to the actual board is somewhat complicated by the limits of my available Internet connections (CGNAT being one, missing DMZ functionality on the routers another). It can certainly be done, I just would need some time to set it up.
But I have made the cross-compiling and -debugging setup of the container available on a vserver on the Net. Send me a mail directly with an SSH ID public key if you'd like to try this out and I'll send you instructions for login and use. We could meet up there in a tmux/screen session and work on it together as well.
I have also tried to extract the buildroot toolchain from the image and run it on my Gentoo box as well as Debian. It even seems relocatable, so you can just put it anywhere and add it to PATH and it'll work. If you want, I can put a tarball with the toolchain and qemu wrappers up on a web server somewhere for you to grab. (I just thought, a container image would be the easier delivery method nowadays. :)
Otherwise, what's your error message from podman? It's got no deamon, so it shouldn't need a socket to connect to it like docker does. Out to the Internet for image download it's also a standard client and respects environment variables for proxies as usual.
rootless podman (running as your standard user instead of root) can take a bit of tweaking before it stops throwing error messages but once that's done it works nicely. I've never actually run podman as root by luck of late birth with regards to containers.
Here's my command sequence on a Ubuntu 20.04 VM that's never seen rootless podman before as per https://www.vultr.com/docs/how-to-install-and-use-podman-on-ubuntu-20-04 (literally the first hit on search, can't vouch for the packages from opensuse though):
michael@demo:~$ podman
Command 'podman' not found, did you mean:
command 'pod2man' from deb perl (5.30.0-9ubuntu0.2)
Try: sudo apt install <deb name>
michael@demo:~$ source /etc/os-release
michael@demo:~$ sudo sh -c "echo 'deb http://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stabl... /' > /etc/apt/sources.list.d/devel:kubic:libcontainers:stable.list"
michael@demo:~$ wget -nv https://download.opensuse.org/repositories/devel:kubic:libcontainers:stable/... -O- | sudo apt-key add -
2021-01-19 21:13:19 URL:https://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stab... [1093/1093] -> "-" [1] OK
michael@demo:~$ sudo apt-get update -qq
michael@demo:~$ sudo apt-get -qq --yes install podman fuse-overlayfs slirp4netns
[...]
michael@demo:~$ podman run -it michaelweisernettleci/buildroot:2020.11.1-aarch64_be-glibc-gdb
Completed short name "michaelweisernettleci/buildroot" with unqualified-search registries (origin: /etc/containers/registries.conf)
Trying to pull docker.io/michaelweisernettleci/buildroot:2020.11.1-aarch64_be-glibc-gdb...
Getting image source signatures
Copying blob 6c33745f49b4 done
Copying blob ff35d554f2d5 done
Copying blob 3927b287d6b9 done
Copying blob 6bbc022f227c done
Copying config 21663e44fe done
Writing manifest to image destination
Storing signatures
root@06e70f1e12e4:/# aarch64_be-buildroot-linux-gnu-gcc -v
Using built-in specs.
COLLECT_GCC=/buildroot/output/host/bin/aarch64_be-buildroot-linux-gnu-gcc.br_real
COLLECT_LTO_WRAPPER=/buildroot/output/host/bin/../libexec/gcc/aarch64_be-buildroot-linux-gnu/9.3.0/lto-wrapper
Target: aarch64_be-buildroot-linux-gnu
Configured with: ./configure --prefix=/buildroot/output/per-package/host-gcc-final/host [...] --enable-shared --disable-libgomp --silent
Thread model: posix
gcc version 9.3.0 (Buildroot 2020.11.1)
root@06e70f1e12e4:/# git clone https://git.lysator.liu.se/nettle/nettle
bash: git: command not found
root@06e70f1e12e4:/# apt-get update
Get:1 http://security.debian.org/debian-security buster/updates InRelease [65.4 kB]
Get:2 http://deb.debian.org/debian buster InRelease [121 kB]
[...]
root@06e70f1e12e4:/# apt-get install git
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
  ca-certificates git-man krb5-locales less libbsd0 libcurl3-gnutls
[...]
root@06e70f1e12e4:/# git clone https://git.lysator.liu.se/nettle/nettle
Cloning into 'nettle'...
warning: redirecting to https://git.lysator.liu.se/nettle/nettle.git/
remote: Enumerating objects: 721, done.
remote: Counting objects: 100% (721/721), done.
remote: Compressing objects: 100% (349/349), done.
remote: Total 21095 (delta 479), reused 593 (delta 372), pack-reused 20374
Receiving objects: 100% (21095/21095), 5.90 MiB | 3.47 MiB/s, done.
Resolving deltas: 100% (15748/15748), done.
root@06e70f1e12e4:/#
That was a lot easier than even I expected. Necessary stuff like entries in /etc/subuid are automatically added by useradd as standard nowadays without podman even being installed:
michael@demo:~$ cat /etc/subuid
michael:100000:65536
Hope that helps.
If all else fails and it's not too trying for your patience I'm up for making it work iteratively by trial, error and discussion as above. ;)
Hello Michael,
On Tue, Jan 19, 2021 at 11:45 PM Michael Weiser michael.weiser@gmx.de wrote:
Yes, there are no packages for aarch64_be in any mainstream distribution I'm aware of. Buildroot and Gentoo are the ones I know that can target it; Yocto likely as well. All are compile-it-yourself distributions and not for the faint of heart. Also, I've just learned that Buildroot has made a conscious decision not to produce native toolchains for the target. So you can only ever cross-compile nettle to it, run it on an actual board or under qemu, and then go back to the cross-compiler on the host.
I'm trying to install Gentoo on VMware by walking through this recipe https://medium.com/@steensply/vmware-installation-of-gentoo-linux-from-scrat... I'm in the middle of it now; there are a lot of instructions, but I'm going to get the OS working in the end.
I did a search of the aarch64 instruction set and saw that there are zip1 and zip2 instructions. So as a first test I just changed zip to zip1, which made it compile. As was to be expected, the testsuite failed though.
You are on the right track so far.
I've poked at the code a bit more and seemingly made the key init function work by eliminating all the BE-specific macros and instead adjusting the load from memory to produce the same register content. At least the register values and the final output to memory look the same in an x/64xb $x0-64 and x/64xb $x0 for the first test cases in gcm-test (which they did not before).
137         PMUL_PARAM v5,v29,v30
(gdb)
139         st1 {v27.16b,v28.16b,v29.16b,v30.16b},[x0]
(gdb)
141         ret
(gdb) x/64xb $x0-64
0xaaaaaaac5390: 0x77 0x58 0x14 0xdf 0xa9 0x97 0xd2 0xcd
[.. all the same on BE and LE ...]
0xaaaaaaac53c8: 0x0d 0x12 0x63 0x69 0x37 0x20 0xd3 0xfe
(gdb) x/64xb $x0
0xaaaaaaac53d0: 0xf9 0xfa 0x22 0xc3 0x02 0xe7 0x95 0x86
[.. all the same on BE and LE ...]
0xaaaaaaac5408: 0x45 0x91 0xbd 0x48 0x73 0xd9 0x8b 0x5c
(gdb)
Here is how I get the vector instructions to operate on registers in LE mode; I'll take this instruction as an example: pmull v0.1q,v1.1d,v2.1d
Input represented as indexes:
v1: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
v2: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
The instruction byte-reverses each 64-bit part of the register, so it considers the registers as follows:
v1: 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8
v2: 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8
So what I did in LE mode is reverse the 64-bit parts with rev64 before executing the doubleword operation; accordingly, the pmull output will be 128-bit byte-reversed.
Output represented as indexes:
v0: 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
What I'm assuming in BE mode is that operations are performed the normal way on the register side, so there's no need to reverse the inputs, and we additionally get normal output; hence the macros "REDUCTION" and "PMUL_PARAM" differ in their structure. It's not a matter of the zip instruction performing better, but of how to handle the weird situation in LE mode.
The problem here once more seems to be that after a 128bit LE load which is later used as two 64bit operands, not only the bytes of the operands are reversed (which you already counter by rev64'ing them, I gather) but the operands (doublewords) also end up transposed in the register. This is something the rest of the routine expects but is only true on LE. So I adjusted for it on BE in a very pedestrian way:
diff --git a/arm64/v8/gcm-hash.asm b/arm64/v8/gcm-hash.asm
index 1c14db54..74cd656a 100644
--- a/arm64/v8/gcm-hash.asm
+++ b/arm64/v8/gcm-hash.asm
@@ -55,17 +55,10 @@ C common macros:
 .endm

 .macro REDUCTION out
-IF_BE(`
-    pmull    T.1q,F.1d,POLY.1d
-    ext      \out().16b,F.16b,F.16b,#8
-    eor      R.16b,R.16b,T.16b
-    eor      \out().16b,\out().16b,R.16b
-',`
     pmull    T.1q,F.1d,POLY.1d
     eor      R.16b,R.16b,T.16b
     ext      R.16b,R.16b,R.16b,#8
     eor      \out().16b,F.16b,R.16b
-')
 .endm

 C void gcm_init_key (union gcm_block *table)
@@ -108,19 +101,11 @@ define(`H4M', `v29')
 define(`H4L', `v30')

 .macro PMUL_PARAM in, param1, param2
-IF_BE(`
-    pmull2   Hp.1q,\in().2d,POLY.2d
-    ext      Hm.16b,\in().16b,\in().16b,#8
-    eor      Hm.16b,Hm.16b,Hp.16b
-    zip      \param1().2d,\in().2d,Hm.2d
-    zip2     \param2().2d,\in().2d,Hm.2d
-',`
     pmull2   Hp.1q,\in().2d,POLY.2d
     eor      Hm.16b,\in().16b,Hp.16b
     ext      \param1().16b,Hm.16b,\in().16b,#8
     ext      \param2().16b,\in().16b,Hm.16b,#8
     ext      \param1().16b,\param1().16b,\param1().16b,#8
-')
 .endm

 PROLOGUE(_nettle_gcm_init_key)
@@ -128,6 +113,10 @@ PROLOGUE(_nettle_gcm_init_key)
     dup      EMSB.16b,H.b[0]
 IF_LE(`
     rev64    H.16b,H.16b
+',`
+    mov      x1,H.d[0]
+    mov      H.d[0],H.d[1]
+    mov      H.d[1],x1
 ')
     mov      x1,#0xC200000000000000
     mov      x2,#1
If my understanding is correct, we could avoid the doubleword swap for both LE and BE if we were to load using ld1 into {H.16b} instead (with a precalculation of the offset, because ld1 won't take an immediate offset that high, correct?). But then the rest of the routine would need to change its expectation of what H.d[0] and H.d[1] contain, respectively, because they will no longer be transposed by either the load on LE or an explicit swap on BE.
Somehow I have a feeling, I'm terribly missing the actual point here, though. Are the zip instructions likely to give even further speedup beyond the LE version? Could this be exploited for LE as well by adjusting the loading scheme even more?
If my assumption about how the instruction operates in BE mode is right, then yes, this is not the actual point.
But I have made the cross-compiling and -debugging setup of the container available on a vserver on the Net. Send me a mail directly with an SSH ID public key if you'd like to try this out and I'll send you instructions for login and use. We could meet up there in a tmux/screen session and work on it together as well.
Let's try the second solution before we get to this.
I have also tried to extract the buildroot toolchain from the image and run it on my Gentoo box as well as Debian. It even seems relocatable, so you can just put it anywhere and add it to PATH and it'll work. If you want, I can put a tarball with the toolchain and qemu wrappers up on a web server somewhere for you to grab. (I just thought, a container image would be the easier delivery method nowadays. :)
I would like to try this method in case my Gentoo installation fails, or if it's simply easier to extract your uploaded packages and add them to PATH. Update: while writing this message I got: no space left on device. It seems I set low numbers while partitioning the device. Let's try the above method before I start over and install Gentoo again.
Otherwise, what's your error message from podman? It's got no daemon, so it shouldn't need a socket to connect to it like docker does. For image downloads out to the Internet it's also a standard client and respects the usual proxy environment variables.
I got "Error: error creating network namespace for container". I think I can fix it by tracing the problem, but let's try the other methods first as I think they're going to be simpler for me.
regards, Mamone
Hello Mamone,
On Wed, Jan 20, 2021 at 10:25:19PM +0200, Maamoun TK wrote:
I'm trying to install Gentoo on VMware by walking through this receip https://medium.com/@steensply/vmware-installation-of-gentoo-linux-from-scrat... I'm in the middle of receip now but there a lot of instruction there so I'm gonna get the os working in the end.
As far as I can tell that recipe only encompasses basic installation. You'd additionally need to run crossdev to create a cross-toolchain and then install qemu as well. Gentoo has a very steep learning curve. There's no benefit compared to buildroot for our use-case here, IMO.
Here how I get the vector instruction operate on registers in LE mode, i'll take this instruction as example: pmull v0.1q,v1.1d,v2.1d Input represented as indexes v1: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 v2: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 the instruction byte-reverse each of 64-bit parts of register so the instruction consider the register as follow v1: 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8 v2: 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8 so what I did in LE mode is reverse the 64-bit parts before execute the doublework operation using rev64 instruction, according to that the pmull output will be 128-bit byte-reversed Output represented as indexes v0: 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
What I'm assuming in BE mode is operations are performed in normal way in registers side so no need to reverse the inputs in addition to get normal output hence the macros "REDUCTION" and "PMUL_PARAM" have differences in their structure, it's not matter of zip instruction perform better but how to handle the weird situation in LE mode.
I've tried for a number of hours to make this work today. Always when I added correct handling of the transposed doublewords to one macro, another broke down. To me the problem comes down to this: ldr HQ,[TABLE...] and st1.16b are fighting each other and can't be brought together without a lot of additional instructions (at least not by me).
Longer story: ldr does a 128bit load. This loads bytes in exactly reverse order into the register on LE and BE. As you describe above, the macros for LE expect a state which is neither of those: the bytes transposed as if BE, but the doublewords as loaded on LE. For BE this poses the opposite problem: it natively loads bytes in the order LE has to reproduce using rev64, but then needs to reproduce the doubleword order of LE for the LE routines to work, or basically have native BE routines.
That's what my last pedestrian change did. After today I'd perhaps write it like this (untested):
@@ -125,10 +135,12 @@ IF_BE(`
 PROLOGUE(_nettle_gcm_init_key)
     ldr      HQ,[TABLE,#16*H_Idx]
-    dup      EMSB.16b,H.b[0]
 IF_LE(`
     rev64    H.16b,H.16b
+',`
+    ext      H.16b,H.16b,H.16b,#8
 ')
+    dup      EMSB.16b,H.b[7]
     mov      x1,#0xC200000000000000
     mov      x2,#1
     mov      POLY.d[0],x1
When trying to cater to the current layout on LE, all the other vectors which are later used in conjunction with H need to be reversed as well. That leads to this diff on top of your initial patch:
@@ -125,14 +135,21 @@ IF_BE(`
 PROLOGUE(_nettle_gcm_init_key)
     ldr      HQ,[TABLE,#16*H_Idx]
-    dup      EMSB.16b,H.b[0]
 IF_LE(`
+    dup      EMSB.16b,H.b[0]
     rev64    H.16b,H.16b
+',`
+    dup      EMSB.16b,H.b[15]
 ')
     mov      x1,#0xC200000000000000
     mov      x2,#1
+IF_LE(`
     mov      POLY.d[0],x1
     mov      POLY.d[1],x2
+',`
+    mov      POLY.d[1],x1
+    mov      POLY.d[0],x2
+')
     sshr     EMSB.16b,EMSB.16b,#7
     and      EMSB.16b,EMSB.16b,POLY.16b
     ushr     B.2d,H.2d,#63
@@ -142,7 +159,11 @@ IF_LE(`
     orr      H.16b,H.16b,B.16b
     eor      H.16b,H.16b,EMSB.16b
+IF_LE(`
     dup      POLY.2d,POLY.d[0]
+',`
+    dup      POLY.2d,POLY.d[1]
+')
C --- calculate H^2 = H*H ---
The difference in index in dup EMSB nicely shows the doubleword transposition compared to LE. If on LE the dup was done after the rev64, it'd be H.b[7] vs. H.b[15].
With this layout PMUL_PARAM can work on H and POLY but then needs to use pmull instead of pmull2 because the relevant data is in the other doublewords compared to LE. On the other hand, since the output of PMUL_PARAM is to be stored using st1.16b it must not have the doublewords transposed ("load-inverted" I termed it in the comments ;). That leads to the following adjustment and makes the first 16bytes of TABLE identical to LE:
@@ -109,11 +118,12 @@ define(`H4L', `v30')
 .macro PMUL_PARAM in, param1, param2
 IF_BE(`
-    pmull2   Hp.1q,\in().2d,POLY.2d
+    pmull    Hp.1q,\in().1d,POLY.1d
     ext      Hm.16b,\in().16b,\in().16b,#8
     eor      Hm.16b,Hm.16b,Hp.16b
-    zip      \param1().2d,\in().2d,Hm.2d
-    zip2     \param2().2d,\in().2d,Hm.2d
+    C output must be in native register order (not load-inverted) for st1.16b to work
+    zip2     \param1().2d,\in().2d,Hm.2d
+    zip1     \param2().2d,\in().2d,Hm.2d
 ',`
     pmull2   Hp.1q,\in().2d,POLY.2d
     eor      Hm.16b,\in().16b,Hp.16b
PMUL is where it breaks down, at least for my brain: its first call is handed H (which has doublewords "transposed" from the initial ldr) and H1M and H1L (which must not have doublewords transposed, so that st1.16b writes them to memory in the correct order). It wants to pmull/pmull2 them, which requires corresponding doublewords at the same index. So we'd need to temporarily transpose \in for that:
@@ -46,25 +46,34 @@ define(`R1', `v19')
 C common macros:
 .macro PMUL in, param1, param2
-    pmull    F.1q,\param2().1d,\in().1d
-    pmull2   F1.1q,\param2().2d,\in().2d
-    pmull    R.1q,\param1().1d,\in().1d
-    pmull2   R1.1q,\param1().2d,\in().2d
+    C PMUL_PARAM left us with \param1 and \param2 in native register order but
+    C \in is load-inverted from initial load of H using ldr, something must give
+IF_BE(`
+    ext      T.16b,\in().16b,\in().16b,#8
+',`
+    mov      T.16b,\in().16b
+')
+    pmull    F.1q,\param2().1d,T.1d
+    pmull2   F1.1q,\param2().2d,T.2d
+    pmull    R.1q,\param1().1d,T.1d
+    pmull2   R1.1q,\param1().2d,T.2d
     eor      F.16b,F.16b,F1.16b
     eor      R.16b,R.16b,R1.16b
 .endm
If we finally artificially restore the doubleword transposition in REDUCTION for H2 and H3, we're all set for the next calls:
 .macro REDUCTION out
 IF_BE(`
-    pmull    T.1q,F.1d,POLY.1d
     ext      \out().16b,F.16b,F.16b,#8
-    eor      R.16b,R.16b,T.16b
-    eor      \out().16b,\out().16b,R.16b
+    pmull2   T.1q,\out().2d,POLY.2d
 ',`
     pmull    T.1q,F.1d,POLY.1d
+')
     eor      R.16b,R.16b,T.16b
     ext      R.16b,R.16b,R.16b,#8
     eor      \out().16b,F.16b,R.16b
+C artificially restore load inversion for PMUL_PARAM :-(
+IF_BE(`
+    ext      \out().16b,\out().16b,\out().16b,#8
 ')
 .endm
So all we're doing is catering to the quirk of the very first ldr operation. The easiest solution seems to me to align all types of load and store operations with each other or counteract their quirks right after or before executing them. That way we end up with identical register contents on LE and BE and don't have to maintain separate implementations.
That'd be in line with what we ended up with on arm32 NEON as well. memxor3.asm does do the dance of working with different register content but there it's only bitwise operations and the load and store operations have identical behaviour.
The advantage of the current implementation with transposed doublewords and only the LE routines seems to me that overhead on LE and BE would be about the same.
Do you think it makes sense to try and adjust the code to work with the BE layout natively and have a full 128-bit reverse after ldr-like loads on LE instead (considering that 99.999% of aarch64 users will run LE)?
Otherwise, what's your error message from podman? It's got no daemon, so it shouldn't need a socket to connect to it like docker does. For image downloads out to the Internet it's also a standard client and respects the usual proxy environment variables.
I got "Error: error creating network namespace for container". I think I can fix it by tracing the problem, but let's try the other methods first as I think they're going to be simpler for me.
I found this error on the Net in conjunction with a Debian/Ubuntu security-related custom kernel knob for disabling unprivileged user namespaces that was enabled by default once. I tested that with Ubuntu 18.04, 20.04 and 20.10 yesterday and it's disabled (i.e. namespaces for unprivileged users enabled) on all of them. You can still have a look at your setting in /proc/sys/kernel/unprivileged_userns_clone or with sysctl kernel.unprivileged_userns_clone. It needs to be set to 1 for rootless podman to work.
You're not by any chance running the Windows Subsystem for Linux (WSL)? https://github.com/containers/podman/issues/3288#issuecomment-501356136 :)
Or inside another container at a hosting service? https://github.com/containers/podman/issues/4056
Otherwise I have no idea what could be causing that and have never seen that error.
On Fri, Jan 22, 2021 at 1:45 AM Michael Weiser michael.weiser@gmx.de wrote:
Longer story: ldr does a 128bit load. This loads bytes in exactly reverse order into the register on LE and BE. As you describe above, the macros for LE expect a state which is neither of those: The bytes transposed as if BE but the doublewords as loaded on LE. For BE this poses the opposite problem: It natively loads bytes in the order LE has to reproduce using rev64 but then needs to reproduce the doubleword order of LE for the LE routines to work or basically have native BE routines.
That's what my last pedestrian change did. After today I'd perhaps write it like this (untested):
@@ -125,10 +135,12 @@ IF_BE(`
PROLOGUE(_nettle_gcm_init_key) ldr HQ,[TABLE,#16*H_Idx]
- dup EMSB.16b,H.b[0]
IF_LE(` rev64 H.16b,H.16b +',`
- ext H.16b,H.16b,H.16b,#8
')
- dup EMSB.16b,H.b[7] mov x1,#0xC200000000000000 mov x2,#1 mov POLY.d[0],x1
When trying to cater to the current layout on LE, all the other vectors which are later used in conjunction with H to be reversed as well. That leads to this diff to your initial patch:
@@ -125,14 +135,21 @@ IF_BE(`
PROLOGUE(_nettle_gcm_init_key) ldr HQ,[TABLE,#16*H_Idx]
- dup EMSB.16b,H.b[0]
IF_LE(`
- dup EMSB.16b,H.b[0] rev64 H.16b,H.16b
+',`
- dup EMSB.16b,H.b[15]
') mov x1,#0xC200000000000000 mov x2,#1 +IF_LE(` mov POLY.d[0],x1 mov POLY.d[1],x2 +',`
- mov POLY.d[1],x1
- mov POLY.d[0],x2
+') sshr EMSB.16b,EMSB.16b,#7 and EMSB.16b,EMSB.16b,POLY.16b ushr B.2d,H.2d,#63 @@ -142,7 +159,11 @@ IF_LE(` orr H.16b,H.16b,B.16b eor H.16b,H.16b,EMSB.16b
+IF_LE(` dup POLY.2d,POLY.d[0] +',`
- dup POLY.2d,POLY.d[1]
+')
C --- calculate H^2 = H*H ---
The difference in index in dup EMSB nicely shows the doubleword transposition compared to LE. If on LE the dup was done after the rev64, it'd be H.b[7] vs. H.b[15].
I see what you did here, but I'm confused about the ld1 and st1 instructions, so let me clarify one thing before going on: how do ld1 and st1 load and store from/into memory in BE mode? If they behave in the normal way, then there is no point in using ldr at all; I just used it because it handles an immediate offset. So to replace the line "ldr HQ,[TABLE,#16*H_Idx]" we can simply add the offset to the register that holds the address, "add x1,TABLE,#16*H_Idx", and then load the H value using ld1, "ld1 {H.16b},[x1]". This way we can still deal with LE as transposed doublewords and with BE in the normal way (neither transposed doublewords nor a transposed quadword).
regards, Mamone
Hello Mamone,
On Fri, Jan 22, 2021 at 10:14:36PM +0200, Maamoun TK wrote:
The difference in index in dup EMSB nicely shows the doubleword transposition compared to LE. If on LE the dup was done after the rev64, it'd be H.b[7] vs. H.b[15].
I see what you did here, but I'm confused about ld1 and st1 instructions so let me clarify one thing before going on, how do ld1 and st1 load and store from/into memory in BE mode? If they perform in a normal way then there is no point of using ldr at all, I just used it because it handles imm offset. so to replace this line "ldr HQ,[TABLE,#16*H_Idx]" we can just add the offset to the register that hold the address "add x1,TABLE,#16*H_Idx" then
I've just retested and reread some ARM documents. Here's a patch that uses ld1.16b and thus eliminates almost all special BE treatment but subsequently has to leave in all the rev64s as well. This has the testsuite passing on BE and (still) LE. My take at an explanation below.
diff --git a/arm64/v8/gcm-hash.asm b/arm64/v8/gcm-hash.asm index 1c14db54..8c8a370e 100644 --- a/arm64/v8/gcm-hash.asm +++ b/arm64/v8/gcm-hash.asm @@ -55,17 +55,10 @@ C common macros: .endm
.macro REDUCTION out -IF_BE(` - pmull T.1q,F.1d,POLY.1d - ext \out().16b,F.16b,F.16b,#8 - eor R.16b,R.16b,T.16b - eor \out().16b,\out().16b,R.16b -',` pmull T.1q,F.1d,POLY.1d eor R.16b,R.16b,T.16b ext R.16b,R.16b,R.16b,#8 eor \out().16b,F.16b,R.16b -') .endm
C void gcm_init_key (union gcm_block *table) @@ -108,27 +101,20 @@ define(`H4M', `v29') define(`H4L', `v30')
.macro PMUL_PARAM in, param1, param2 -IF_BE(` - pmull2 Hp.1q,\in().2d,POLY.2d - ext Hm.16b,\in().16b,\in().16b,#8 - eor Hm.16b,Hm.16b,Hp.16b - zip \param1().2d,\in().2d,Hm.2d - zip2 \param2().2d,\in().2d,Hm.2d -',` pmull2 Hp.1q,\in().2d,POLY.2d eor Hm.16b,\in().16b,Hp.16b ext \param1().16b,Hm.16b,\in().16b,#8 ext \param2().16b,\in().16b,Hm.16b,#8 ext \param1().16b,\param1().16b,\param1().16b,#8 -') .endm
PROLOGUE(_nettle_gcm_init_key) - ldr HQ,[TABLE,#16*H_Idx] + C LSB vector load: x1+0 into H.b[0] and x1+15 into H.b[15] + add x1,TABLE,#16*H_Idx + ld1 {H.16b},[x1] dup EMSB.16b,H.b[0] -IF_LE(` + C treat H as two MSB doublewords rev64 H.16b,H.16b -') mov x1,#0xC200000000000000 mov x2,#1 mov POLY.d[0],x1 @@ -221,9 +207,7 @@ PROLOGUE(_nettle_gcm_hash) mov POLY.d[0],x4
ld1 {D.16b},[X] -IF_LE(` rev64 D.16b,D.16b -')
ands x4,LENGTH,#-64 b.eq L2x @@ -234,12 +218,10 @@ IF_LE(`
L4x_loop: ld1 {C0.16b,C1.16b,C2.16b,C3.16b},[DATA],#64 -IF_LE(` rev64 C0.16b,C0.16b rev64 C1.16b,C1.16b rev64 C2.16b,C2.16b rev64 C3.16b,C3.16b -')
eor C0.16b,C0.16b,D.16b
@@ -262,10 +244,8 @@ L2x: ld1 {H1M.16b,H1L.16b,H2M.16b,H2L.16b},[TABLE]
ld1 {C0.16b,C1.16b},[DATA],#32 -IF_LE(` rev64 C0.16b,C0.16b rev64 C1.16b,C1.16b -')
eor C0.16b,C0.16b,D.16b
@@ -283,9 +263,7 @@ L1x: ld1 {H1M.16b,H1L.16b},[TABLE]
ld1 {C0.16b},[DATA],#16 -IF_LE(` rev64 C0.16b,C0.16b -')
eor C0.16b,C0.16b,D.16b
@@ -335,9 +313,7 @@ Lmod_8_done: REDUCTION D
Ldone: -IF_LE(` rev64 D.16b,D.16b -') st1 {D.16b},[X] ret EPILOGUE(_nettle_gcm_hash)
My understanding is that ld1 and st1 are "single-element structure" operations. (Identical to vld1 in arm32 NEON we discussed recently for the chacha and salsa20 asm.) That means they load a number of elements of a given type from consecutive memory locations into the corresponding vector register indices.
ld1 {v0.4s},[x0] would load four 32bit words from consecutive memory locations and put them into v0.s[0] through v0.s[3]. So x0+0..3 (bytes) would go into v0.s[0], x0+4..7 would go into v0.s[1] and so on. Endianness would apply to the internal byte order of the elements, so each word would be loaded MSB-first in BE-mode and LSB-first in LE-mode.
So, given memory content such as:
x0 +    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
byte    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
We should get on BE:
         MSB        LSB
v0.s[0]:  0  1  2  3
v0.s[1]:  4  5  6  7
v0.s[2]:  8  9 10 11
v0.s[3]: 12 13 14 15
Or looked at as byte-vectors:
        |v0.s[0]|v0.s[1]|v0.s[2]    |v0.s[3]    |
        v0.b[0]                        v0.b[15]
v0.16b:  3 2 1 0 7 6 5 4 11 10 9 8 15 14 13 12
On LE we should get:
         MSB        LSB
v0.s[0]:  3  2  1  0
v0.s[1]:  7  6  5  4
v0.s[2]: 11 10  9  8
v0.s[3]: 15 14 13 12
        v0.b[0]                       v0.b[15]
v0.16b:  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
This was just meant as intro. I've not actually tested this. I hope I got it right and not just added to everyone's confusion (mine included). :/
Back to ld1.16b: This now loads a vector of 16 bytes consecutively. Since bytes have no endianness there will be no change in order in either LE or BE mode. The register content will look the same on both:
x0 +    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
byte:   0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

        v0.b[0]                       v0.b[15]
v0.16b:  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
So larger datatypes loaded that way should be stored little-endian in memory to make sense as e.g. .d[0] after such a load. Or we need to rev64 them.
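To make the rev64 part concrete, here is a minimal C sketch (an illustration only, not code from the patches; rev64_16b is a made-up helper name): it models what rev64 v.16b does, namely byte-swapping each 64-bit lane of a 16-byte vector without exchanging the lanes, which is exactly the fix-up needed when a byte sequence loaded LSB-first is to be treated as two big-endian doublewords.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Model of rev64 v.16b: byte-swap each 64-bit lane of a 16-byte vector,
   leaving the order of the two lanes untouched. */
static void rev64_16b(uint8_t v[16])
{
    for (int lane = 0; lane < 16; lane += 8)
        for (int i = 0; i < 4; i++) {
            uint8_t t = v[lane + i];
            v[lane + i] = v[lane + 7 - i];
            v[lane + 7 - i] = t;
        }
}

int main(void)
{
    /* Byte sequence as ld1.16b would deliver it: v.b[0] = 0, ..., v.b[15] = 15. */
    uint8_t v[16] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
    uint64_t d0, d1;

    rev64_16b(v);
    /* On a little-endian host, d0/d1 now read as the big-endian interpretation
       of bytes 0..7 and 8..15, i.e. 0x0001020304050607 and 0x08090a0b0c0d0e0f. */
    memcpy(&d0, v, 8);
    memcpy(&d1, v + 8, 8);
    printf("d[0] = %016llx, d[1] = %016llx\n",
           (unsigned long long) d0, (unsigned long long) d1);
    return 0;
}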
load the H value by using ld1 "ld1 {H.16b},[x1]" in this way we can still have to deal with LE as transposed doublewords and with BE in normal way (not transposed doublewords or transposed quadword).
After sending my last email I realised that the doublewords aren't actually transposed with BE as such. They're just transposed compared to the original LE routine because the ldr instruction loads in completely reversed order in each mode and the LE routine does convert the internal byte order of the doublewords to BE but not the overall order of the 128bit quadword because it doesn't need to and regards them as a vector of two doublewords anyway.
ld1.16b doesn't change that at all. It just behaves the same on LE and BE. So we'll always load vectors of bytes. And it'll always be an LSB load. And if we want to treat them as big-endian doublewords we have to adjust them accordingly. That's why we now also need all the rev64s on BE above.
That opens another topic: As you may have noticed I haven't got the slightest idea of what the code is actually doing. Assembly also isn't my first language either. I'm only mechanically trying to get BE mode to produce the same results as LE.
This made me realise that I haven't the faintest idea what we're getting as input and producing as output either. :/ So are we working on blocks of bytes and producing blocks of bytes and just treating them as big-endian 64bit doublewords internally to exploit availability of instructions that can work on these types or could we actually declare the elements of TABLE to be quadwords in host endianness? Then we could actually throw ld1.2d at them and eliminate all the rev64s.
Duh, I think we can regardless, at least for BE:
diff --git a/arm64/v8/gcm-hash.asm b/arm64/v8/gcm-hash.asm index 1c14db54..642e3840 100644 --- a/arm64/v8/gcm-hash.asm +++ b/arm64/v8/gcm-hash.asm @@ -55,17 +55,10 @@ C common macros: .endm
.macro REDUCTION out -IF_BE(` - pmull T.1q,F.1d,POLY.1d - ext \out().16b,F.16b,F.16b,#8 - eor R.16b,R.16b,T.16b - eor \out().16b,\out().16b,R.16b -',` pmull T.1q,F.1d,POLY.1d eor R.16b,R.16b,T.16b ext R.16b,R.16b,R.16b,#8 eor \out().16b,F.16b,R.16b -') .endm
C void gcm_init_key (union gcm_block *table) @@ -108,27 +101,20 @@ define(`H4M', `v29') define(`H4L', `v30')
.macro PMUL_PARAM in, param1, param2 -IF_BE(` - pmull2 Hp.1q,\in().2d,POLY.2d - ext Hm.16b,\in().16b,\in().16b,#8 - eor Hm.16b,Hm.16b,Hp.16b - zip \param1().2d,\in().2d,Hm.2d - zip2 \param2().2d,\in().2d,Hm.2d -',` pmull2 Hp.1q,\in().2d,POLY.2d eor Hm.16b,\in().16b,Hp.16b ext \param1().16b,Hm.16b,\in().16b,#8 ext \param2().16b,\in().16b,Hm.16b,#8 ext \param1().16b,\param1().16b,\param1().16b,#8 -') .endm
PROLOGUE(_nettle_gcm_init_key) - ldr HQ,[TABLE,#16*H_Idx] - dup EMSB.16b,H.b[0] + add x1,TABLE,#16*H_Idx + ld1 {H.2d},[x1] IF_LE(` rev64 H.16b,H.16b ') + dup EMSB.16b,H.b[7] mov x1,#0xC200000000000000 mov x2,#1 mov POLY.d[0],x1 @@ -220,7 +206,7 @@ PROLOGUE(_nettle_gcm_hash) mov x4,#0xC200000000000000 mov POLY.d[0],x4
- ld1 {D.16b},[X] + ld1 {D.2d},[X] IF_LE(` rev64 D.16b,D.16b ') @@ -233,7 +219,7 @@ IF_LE(` ld1 {H3M.16b,H3L.16b,H4M.16b,H4L.16b},[x5]
L4x_loop: - ld1 {C0.16b,C1.16b,C2.16b,C3.16b},[DATA],#64 + ld1 {C0.2d,C1.2d,C2.2d,C3.2d},[DATA],#64 IF_LE(` rev64 C0.16b,C0.16b rev64 C1.16b,C1.16b @@ -261,7 +247,7 @@ L2x:
ld1 {H1M.16b,H1L.16b,H2M.16b,H2L.16b},[TABLE]
- ld1 {C0.16b,C1.16b},[DATA],#32 + ld1 {C0.2d,C1.2d},[DATA],#32 IF_LE(` rev64 C0.16b,C0.16b rev64 C1.16b,C1.16b @@ -282,7 +268,7 @@ L1x:
ld1 {H1M.16b,H1L.16b},[TABLE]
- ld1 {C0.16b},[DATA],#16 + ld1 {C0.2d},[DATA],#16 IF_LE(` rev64 C0.16b,C0.16b ') @@ -335,9 +321,7 @@ Lmod_8_done: REDUCTION D
Ldone: -IF_LE(` rev64 D.16b,D.16b -') st1 {D.16b},[X] ret EPILOGUE(_nettle_gcm_hash)
Please excuse my laboured and longwinded thinking. ;) I really have to start thinking in vectors also.
This also works for the whole TABLE and gives host-endianness storage there (where ld1.16b should have caused it to be little-endian before, if that's at all relevant):
diff --git a/arm64/v8/gcm-hash.asm b/arm64/v8/gcm-hash.asm index 1c14db54..bd6820b3 100644 --- a/arm64/v8/gcm-hash.asm +++ b/arm64/v8/gcm-hash.asm @@ -55,17 +55,10 @@ C common macros: .endm
.macro REDUCTION out -IF_BE(` - pmull T.1q,F.1d,POLY.1d - ext \out().16b,F.16b,F.16b,#8 - eor R.16b,R.16b,T.16b - eor \out().16b,\out().16b,R.16b -',` pmull T.1q,F.1d,POLY.1d eor R.16b,R.16b,T.16b ext R.16b,R.16b,R.16b,#8 eor \out().16b,F.16b,R.16b -') .endm
C void gcm_init_key (union gcm_block *table) @@ -108,27 +101,20 @@ define(`H4M', `v29') define(`H4L', `v30')
.macro PMUL_PARAM in, param1, param2 -IF_BE(` - pmull2 Hp.1q,\in().2d,POLY.2d - ext Hm.16b,\in().16b,\in().16b,#8 - eor Hm.16b,Hm.16b,Hp.16b - zip \param1().2d,\in().2d,Hm.2d - zip2 \param2().2d,\in().2d,Hm.2d -',` pmull2 Hp.1q,\in().2d,POLY.2d eor Hm.16b,\in().16b,Hp.16b ext \param1().16b,Hm.16b,\in().16b,#8 ext \param2().16b,\in().16b,Hm.16b,#8 ext \param1().16b,\param1().16b,\param1().16b,#8 -') .endm
PROLOGUE(_nettle_gcm_init_key) - ldr HQ,[TABLE,#16*H_Idx] - dup EMSB.16b,H.b[0] + add x1,TABLE,#16*H_Idx + ld1 {H.2d},[x1] IF_LE(` rev64 H.16b,H.16b ') + dup EMSB.16b,H.b[7] mov x1,#0xC200000000000000 mov x2,#1 mov POLY.d[0],x1 @@ -154,7 +140,7 @@ IF_LE(`
PMUL_PARAM H2,H2M,H2L
- st1 {H1M.16b,H1L.16b,H2M.16b,H2L.16b},[TABLE],#64 + st1 {H1M.2d,H1L.2d,H2M.2d,H2L.2d},[TABLE],#64
C --- calculate H^3 = H^1*H^2 ---
@@ -172,7 +158,7 @@ IF_LE(`
PMUL_PARAM H4,H4M,H4L
- st1 {H3M.16b,H3L.16b,H4M.16b,H4L.16b},[TABLE] + st1 {H3M.2d,H3L.2d,H4M.2d,H4L.2d},[TABLE]
ret EPILOGUE(_nettle_gcm_init_key) @@ -220,7 +206,7 @@ PROLOGUE(_nettle_gcm_hash) mov x4,#0xC200000000000000 mov POLY.d[0],x4
- ld1 {D.16b},[X] + ld1 {D.2d},[X] IF_LE(` rev64 D.16b,D.16b ') @@ -229,11 +215,11 @@ IF_LE(` b.eq L2x
add x5,TABLE,#64 - ld1 {H1M.16b,H1L.16b,H2M.16b,H2L.16b},[TABLE] - ld1 {H3M.16b,H3L.16b,H4M.16b,H4L.16b},[x5] + ld1 {H1M.2d,H1L.2d,H2M.2d,H2L.2d},[TABLE] + ld1 {H3M.2d,H3L.2d,H4M.2d,H4L.2d},[x5]
L4x_loop: - ld1 {C0.16b,C1.16b,C2.16b,C3.16b},[DATA],#64 + ld1 {C0.2d,C1.2d,C2.2d,C3.2d},[DATA],#64 IF_LE(` rev64 C0.16b,C0.16b rev64 C1.16b,C1.16b @@ -259,9 +245,9 @@ L2x: tst LENGTH,#-32 b.eq L1x
- ld1 {H1M.16b,H1L.16b,H2M.16b,H2L.16b},[TABLE] + ld1 {H1M.2d,H1L.2d,H2M.2d,H2L.2d},[TABLE]
- ld1 {C0.16b,C1.16b},[DATA],#32 + ld1 {C0.2d,C1.2d},[DATA],#32 IF_LE(` rev64 C0.16b,C0.16b rev64 C1.16b,C1.16b @@ -280,9 +266,9 @@ L1x: tst LENGTH,#-16 b.eq Lmod
- ld1 {H1M.16b,H1L.16b},[TABLE] + ld1 {H1M.2d,H1L.2d},[TABLE]
- ld1 {C0.16b},[DATA],#16 + ld1 {C0.2d},[DATA],#16 IF_LE(` rev64 C0.16b,C0.16b ') @@ -297,7 +283,7 @@ Lmod: tst LENGTH,#15 b.eq Ldone
- ld1 {H1M.16b,H1L.16b},[TABLE] + ld1 {H1M.2d,H1L.2d},[TABLE]
tbz LENGTH,3,Lmod_8 ldr C0D,[DATA],#8 @@ -338,6 +324,6 @@ Ldone: IF_LE(` rev64 D.16b,D.16b ') - st1 {D.16b},[X] + st1 {D.2d},[X] ret EPILOGUE(_nettle_gcm_hash)
And as always after all this guesswork I have found a likely very relevant comment in gcm.c:
/* Shift uses big-endian representation. */
#if WORDS_BIGENDIAN
  reduce = shift_table[x->u64[1] & 0xff];
Is that it? Or is TABLE just internal to the routine and we can store there however we please? (Apart from H at TABLE[128] initialised for us by gcm_set_key and stored BE?)
Hello Michael,
On Sat, Jan 23, 2021 at 2:45 AM Michael Weiser michael.weiser@gmx.de wrote:
I've just retested and reread some ARM documents. Here's a patch that uses ld1.16b and thus eliminates almost all special BE treatment but subsequently has to leave in all the rev64s as well. This has the testsuite passing on BE and (still) LE. My take at an explanation below.
diff --git a/arm64/v8/gcm-hash.asm b/arm64/v8/gcm-hash.asm index 1c14db54..8c8a370e 100644 --- a/arm64/v8/gcm-hash.asm +++ b/arm64/v8/gcm-hash.asm @@ -55,17 +55,10 @@ C common macros: .endm
.macro REDUCTION out -IF_BE(`
- pmull T.1q,F.1d,POLY.1d
- ext \out().16b,F.16b,F.16b,#8
- eor R.16b,R.16b,T.16b
- eor \out().16b,\out().16b,R.16b
-',` pmull T.1q,F.1d,POLY.1d eor R.16b,R.16b,T.16b ext R.16b,R.16b,R.16b,#8 eor \out().16b,F.16b,R.16b -') .endm
C void gcm_init_key (union gcm_block *table)
@@ -108,27 +101,20 @@ define(`H4M', `v29') define(`H4L', `v30')
.macro PMUL_PARAM in, param1, param2 -IF_BE(`
- pmull2 Hp.1q,\in().2d,POLY.2d
- ext Hm.16b,\in().16b,\in().16b,#8
- eor Hm.16b,Hm.16b,Hp.16b
- zip \param1().2d,\in().2d,Hm.2d
- zip2 \param2().2d,\in().2d,Hm.2d
-',` pmull2 Hp.1q,\in().2d,POLY.2d eor Hm.16b,\in().16b,Hp.16b ext \param1().16b,Hm.16b,\in().16b,#8 ext \param2().16b,\in().16b,Hm.16b,#8 ext \param1().16b,\param1().16b,\param1().16b,#8 -') .endm
PROLOGUE(_nettle_gcm_init_key)
- ldr HQ,[TABLE,#16*H_Idx]
- C LSB vector load: x1+0 into H.b[0] and x1+15 into H.b[15]
- add x1,TABLE,#16*H_Idx
- ld1 {H.16b},[x1] dup EMSB.16b,H.b[0]
-IF_LE(`
- C treat H as two MSB doublewords rev64 H.16b,H.16b
-') mov x1,#0xC200000000000000 mov x2,#1 mov POLY.d[0],x1 @@ -221,9 +207,7 @@ PROLOGUE(_nettle_gcm_hash) mov POLY.d[0],x4
ld1 {D.16b},[X]
-IF_LE(` rev64 D.16b,D.16b -')
ands x4,LENGTH,#-64 b.eq L2x
@@ -234,12 +218,10 @@ IF_LE(`
L4x_loop: ld1 {C0.16b,C1.16b,C2.16b,C3.16b},[DATA],#64 -IF_LE(` rev64 C0.16b,C0.16b rev64 C1.16b,C1.16b rev64 C2.16b,C2.16b rev64 C3.16b,C3.16b -')
eor C0.16b,C0.16b,D.16b
@@ -262,10 +244,8 @@ L2x: ld1 {H1M.16b,H1L.16b,H2M.16b,H2L.16b},[TABLE]
ld1 {C0.16b,C1.16b},[DATA],#32
-IF_LE(` rev64 C0.16b,C0.16b rev64 C1.16b,C1.16b -')
eor C0.16b,C0.16b,D.16b
@@ -283,9 +263,7 @@ L1x: ld1 {H1M.16b,H1L.16b},[TABLE]
ld1 {C0.16b},[DATA],#16
-IF_LE(` rev64 C0.16b,C0.16b -')
eor C0.16b,C0.16b,D.16b
@@ -335,9 +313,7 @@ Lmod_8_done: REDUCTION D
Ldone: -IF_LE(` rev64 D.16b,D.16b -') st1 {D.16b},[X] ret EPILOGUE(_nettle_gcm_hash)
I have one question here: do operations on doublewords transpose both doubleword parts in BE mode? For example, the pmull instruction transposes doublewords in LE mode when it operates; in BE I don't expect the same behavior, hence we can't get this patch working in BE mode. The core of the pmull instruction is shift and xor operations, so we can't perform pmull on byte-reversed doublewords as it's going to produce wrong results.
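For reference, the operation pmull itself computes is just a carry-less (polynomial) multiplication, i.e. shift and xor over GF(2), independent of memory endianness. A rough C sketch of the 64x64 to 128 bit operation (an illustration of the math only, not the Nettle code and not how the hardware implements it; clmul64 is a made-up name):

#include <stdio.h>
#include <stdint.h>

/* Carry-less (polynomial) 64x64 -> 128 bit multiply over GF(2),
   the operation pmull performs on one doubleword lane of each operand. */
static void clmul64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t h = 0, l = 0;
    for (int i = 0; i < 64; i++)
        if ((b >> i) & 1) {
            /* xor in (a << i) across the 128-bit result, no carries */
            l ^= a << i;
            if (i)
                h ^= a >> (64 - i);
        }
    *hi = h;
    *lo = l;
}

int main(void)
{
    uint64_t hi, lo;
    clmul64(3, 3, &hi, &lo);  /* (x+1)*(x+1) = x^2+1: 3 clmul 3 = 5, not 9 */
    printf("hi=%llx lo=%llx\n", (unsigned long long) hi, (unsigned long long) lo);
    return 0;
}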
My understanding is that ld1 and st1 are "single-element structure"
operations. (Identical to vld1 in arm32 NEON we discussed recently for chacha and salsa2 asm.) That means they load a number of elements of a given type from consecutive memory locations into the corresponding vector register indices.
ld1 {v0.4s},[x0] would load four 32bit words from consecutive memory locations and put them into v0.s[0] through v0.s[3]. So x0+0..3 (bytes) would go into v0.s[0], x0+4..7 would to into v0.s[1] and so on. Endianness would apply to the internal byte order of the elements, so each word would be loaded MSB-first in BE-mode and LSB-first in LE-mode.
So, given memory content such as:
x0 + 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 byte 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
We should get on BE:
MSB LSB
v0.s[0]: 0 1 2 3 v0.s[1]: 4 5 6 7 v0.s[2]: 8 9 10 11 v0.s[3]: 12 13 14 15
Or looked at as byte-vectors:
|v0.s[0]|v0.s[1]| v0.s[2] | v0.s[3] | v0.b[0] v0.b[15]
v0.16b: 3 2 1 0 7 6 5 4 11 10 9 8 15 14 13 12
On LE we should get:
MSB LSB
v0.d[0]: 3 2 1 0 v0.d[1]: 7 6 5 4 v0.d[2]: 11 10 9 8 v0.d[3]: 15 14 13 12
v0.b[0] v0.b[15]
v0.16b: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
This was just meant as intro. I've not actually tested this. I hope I got it right and not just added to everyone's confusion (mine included). :/
Back to ld1.16b: This now loads a vector of 16 bytes consecutively. Since bytes have no endianness there will be no changes in order on either LE and BE modes. The register content will look the same on both:
x0 + 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 byte: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 v0.b[0] v0.b[15] v0.16b: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
So larger datatypes loaded that way should be stored little-endian in memory to make sense as e.g. .d[0] after such a load. Or we need to rev64 them.
load the H value by using ld1 "ld1 {H.16b},[x1]" in this way we can still have to deal with LE as transposed doublewords and with BE in normal way (not transposed doublewords or transposed quadword).
After sending my last email I realised that the doublewords aren't actually transposed with BE as such. They're just transposed compared to the original LE routine because the ldr instruction loads in completely reversed order in each mode and the LE routine does convert the internal byte order of the doublewords to BE but not the overall order of the 128bit quadword because it doesn't need to and regards them as a vector of two doublewords anyway.
ld1.16b doesn't change that at all. It just behaves the same on LE and BE. So we'll always load vectors of bytes. And it'll always be an LSB load. And if we want to treat them as big-endian doublewords we have to adjust them accordingly. That's why we now also need all the rev64s on BE above.
That opens another topic: As you may have noticed I haven't got the slightest idea of what the code is actually doing. Assembly also isn't my first language either. I'm only mechanically trying to get BE mode to produce the same results as LE.
This made me realise that I haven't the faintest idea what we're getting as input and producing as output either. :/ So are we working on blocks of bytes and producing blocks of bytes and just treating them as big-endian 64bit doublewords internally to exploit availability of instructions that can work on these types or could we actually declare the elements of TABLE to be quadwords in host endianness? Then we could actually throw ld1.2d at them and eliminate all the rev64s.
Duh, I think we can regardless, at least for BE:
diff --git a/arm64/v8/gcm-hash.asm b/arm64/v8/gcm-hash.asm index 1c14db54..642e3840 100644 --- a/arm64/v8/gcm-hash.asm +++ b/arm64/v8/gcm-hash.asm @@ -55,17 +55,10 @@ C common macros: .endm
.macro REDUCTION out -IF_BE(`
- pmull T.1q,F.1d,POLY.1d
- ext \out().16b,F.16b,F.16b,#8
- eor R.16b,R.16b,T.16b
- eor \out().16b,\out().16b,R.16b
-',` pmull T.1q,F.1d,POLY.1d eor R.16b,R.16b,T.16b ext R.16b,R.16b,R.16b,#8 eor \out().16b,F.16b,R.16b -') .endm
C void gcm_init_key (union gcm_block *table)
@@ -108,27 +101,20 @@ define(`H4M', `v29') define(`H4L', `v30')
.macro PMUL_PARAM in, param1, param2 -IF_BE(`
- pmull2 Hp.1q,\in().2d,POLY.2d
- ext Hm.16b,\in().16b,\in().16b,#8
- eor Hm.16b,Hm.16b,Hp.16b
- zip \param1().2d,\in().2d,Hm.2d
- zip2 \param2().2d,\in().2d,Hm.2d
-',` pmull2 Hp.1q,\in().2d,POLY.2d eor Hm.16b,\in().16b,Hp.16b ext \param1().16b,Hm.16b,\in().16b,#8 ext \param2().16b,\in().16b,Hm.16b,#8 ext \param1().16b,\param1().16b,\param1().16b,#8 -') .endm
PROLOGUE(_nettle_gcm_init_key)
- ldr HQ,[TABLE,#16*H_Idx]
- dup EMSB.16b,H.b[0]
- add x1,TABLE,#16*H_Idx
- ld1 {H.2d},[x1]
IF_LE(` rev64 H.16b,H.16b ')
- dup EMSB.16b,H.b[7] mov x1,#0xC200000000000000 mov x2,#1 mov POLY.d[0],x1
@@ -220,7 +206,7 @@ PROLOGUE(_nettle_gcm_hash) mov x4,#0xC200000000000000 mov POLY.d[0],x4
- ld1 {D.16b},[X]
- ld1 {D.2d},[X]
IF_LE(` rev64 D.16b,D.16b ') @@ -233,7 +219,7 @@ IF_LE(` ld1 {H3M.16b,H3L.16b,H4M.16b,H4L.16b},[x5]
L4x_loop:
- ld1 {C0.16b,C1.16b,C2.16b,C3.16b},[DATA],#64
- ld1 {C0.2d,C1.2d,C2.2d,C3.2d},[DATA],#64
IF_LE(` rev64 C0.16b,C0.16b rev64 C1.16b,C1.16b @@ -261,7 +247,7 @@ L2x:
ld1 {H1M.16b,H1L.16b,H2M.16b,H2L.16b},[TABLE]
- ld1 {C0.16b,C1.16b},[DATA],#32
- ld1 {C0.2d,C1.2d},[DATA],#32
IF_LE(` rev64 C0.16b,C0.16b rev64 C1.16b,C1.16b @@ -282,7 +268,7 @@ L1x:
ld1 {H1M.16b,H1L.16b},[TABLE]
- ld1 {C0.16b},[DATA],#16
- ld1 {C0.2d},[DATA],#16
IF_LE(` rev64 C0.16b,C0.16b ') @@ -335,9 +321,7 @@ Lmod_8_done: REDUCTION D
Ldone: -IF_LE(` rev64 D.16b,D.16b -') st1 {D.16b},[X] ret EPILOGUE(_nettle_gcm_hash)
I like your ideas so far, as you're shrinking the gap between the code for the two endiannesses, but if my previous concern is right we still can't get this patch to work either.
Please excuse my laboured and longwinded thinking. ;) I really have to start thinking in vectors also.
Actually, I'm impressed by how you get and handle all these ideas in your mind and turn around quickly once you get a new one. Dealing with vector registers on aarch64 is really challenging; both x86_64 and PowerPC don't drag the endianness issues into the vector registers, they only apply to memory, and once the data is loaded from memory into a vector register all endianness concerns end. Although PowerPC supports both endianness modes, AltiVec instructions operate the same on vector registers in both modes. It's a weird decision made by the Arm side.
And as always after all this guesswork I have found a likely very relevant comment in gcm.c:
/* Shift uses big-endian representation. */ #if WORDS_BIGENDIAN reduce = shift_table[x->u64[1] & 0xff];
Is that it? Or is TABLE just internal to the routine and we can store there however we please? (Apart from H at TABLE[128] initialised for us by gcm_set_key and stored BE?)
The assembly implementation of GHASH has a whole different scheme from the C table-lookup implementation; you don't have to worry about any of that.
regards, Mamone
Hello Mamone,
On Sat, Jan 23, 2021 at 08:52:30PM +0200, Maamoun TK wrote:
@@ -280,9 +266,9 @@ L1x: tst LENGTH,#-16 b.eq Lmod
- ld1 {H1M.16b,H1L.16b},[TABLE]
- ld1 {H1M.2d,H1L.2d},[TABLE]
- ld1 {C0.16b},[DATA],#16
- ld1 {C0.2d},[DATA],#16
IF_LE(` rev64 C0.16b,C0.16b ')
behavior hence we can't get this patch working on BE mode. The core of
First off: All three patches from my previous mail had the test gcm-hash passing on LE and BE. I just reconfirmed the last patch with the whole testsuite on LE and BE. So they should be working and cause no regression.
I have one question here, do operations on doublewords transpose both doubleword parts in BE mode? for example pmull instruction transpose doublewords on LE mode when operated, in BE I don't expect the same behavior hence we can't get this patch working on BE mode. The core of pmull instruction is shift and xor operations so we can't perform pmull instruction on byte-reversed doublewords as it's gonna produce wrong results.
I think this directly corresponds to your next question:
Dealing with vector registers in aarch64 is really challenging, both x86_64 and PowerPC don't drag the endianness issues to vector registers, it's only applied to memory and once the data loaded from memory into vector register all endianness concerns are ended. Although PowerPC supports both endianness modes, AltiVec instructions operate the same on vector registers on both modes. It's a weird decision made by the Arm side.
I think there might be a misunderstanding here (possibly caused by my attempts at explaining what ldr does, sorry):
On arm(32) and aarch64, endianness is also exclusively handled on load and store operations. Register layout and operation behaviour is identical in both modes. I think ARM also speaks of "memory endianness" for just that reason. There is no adjustable "CPU endianness". It's always "CPU-native".
So pmull will behave exactly the same in BE and LE mode. We just have to make sure our load operations put the operands in the correct (i.e. CPU-native) representation into the correct vector register indices upon load.
So as an example:
pmull2 v0.1q,v1.2d,v2.2d
will always work on d[1] (the upper doubleword) of v1 and v2 and put the result into all of v0. And it expects its operands there in exactly one format, i.e. the least significant bit at one end and the most-significant bit at the other (and it's the same ends/bits in both memory-endianness modes :). And it will also store to v0 in exactly the same representation in LE and BE mode. Nothing changes with an endianness mode switch.
That's where load and store come in:
ld1 {v1.2d,v2.2d},[x0]
will load v1 and v2 with one-dimensional vectors from memory. So v1.d[0] will be read from x0+0, v1.d[1] from x0+8 (bytes) and v2.d[0] from x0+16 and v2.d[1] from x0+24. That'll also be the same in LE and BE mode because that's the structure of the vector prescribed by the load operation we choose. Endianness will be applied to the individual doublewords but the order in which they're loaded from memory and in which they're put into d[0] and d[1] won't change, because they're vectors.
So if you've actually stored a vector from CPU registers using st1 {v1.2d, v2.2d},[x0] and then load them back using ld1 {v1.2d, v2.2d},[x0] there's nothing else that needs to be done. The individual bytes of the doublewords will be stored LE in memory in LE mode and BE in BE mode but you won't notice. And the order of the doublewords in memory will be the same in both modes.
If you're loading something that isn't stored LE or has no endianness at all, e.g. just a sequence of data bytes (as in DATA in our code) or something that was explicitly stored BE even on an LE CPU (as in TABLE[128] in our code, I gather) but want to treat it as a larger datatype, then you have to define endianness and need to apply correction. That's why we need to rev64 in one mode (e.g. LE) to get the same register-content in both endianness modes if what's loaded isn't actually stored in that endianness in memory.
Another way is to explicitly load a vector of bytes using ld1 {v1.16b, v2.16b},[x0]. Then you can be sure what you get as register content, no matter what memory endianness the CPU is using. If it's really just a sequence of data bytes stored in their correct and necessary order in memory and we only want to apply shifts and logical operations to each of them, we'd be all set.
Here we can also exploit the different views on the register, but need to be careful to understand them: the fact that b[0] through b[7] are mapped to d[0], and that b[0] will be the least significant byte in d[0] and b[7] will be the MSB. This layout is cpu-native, i.e. also the same in both endianness modes. It's just that an ld1 {v1.16b} will always load consecutive bytes from memory into b[0] through b[15], so it'll always be an LSB-first load when interpreted as a larger data type. If we then look at that data through d[0] it will appear reversed if it isn't really a doubleword that was stored little-endian.
That's why an ld1 {v1.16b,v2.16b},[x0] will produce incorrect results with a pmull2 v0.1q,v1.2d,v2.2d in at least one endianness, because we're telling one operation that it's dealing with a byte-vector while the other expects us to provide a vector of doublewords. If what we're loading is actually something that was stored as doublewords in current memory endianness, then ld1 {v1.2d,v2.2d} is the correct load operation. If it's data bytes we want to *treat* as big-endian doublewords, we can use either ld1 {v1.16b,v2.16b} or {v1.2d,v2.2d}, but in both cases need to rev64 the register content if memory endianness is LE.
Now what *ldr* does is load a single 128bit quadword. And this will indeed transpose the doublewords in BE mode when looked at through d[0] and d[1]. Because as a big-endian load it will of course load the byte at x0 into the most significant byte of e.g. v2, i.e. v2.d[1], i.e. v2.b[15] and not v2.d[0], i.e. v2.b[7] (as with ld1.2d) or v2.b[0] (as with ld1.16b). Similarly, x0+15 will go into v2.b[0] in BE and v2.b[15] in LE mode. So this will only make sense if what we're loading was actually stored using str as a 128bit quadword in current memory endianness. If it's a sequence of bytes (st1.16b) we want to treat as a vector of doublewords, not only will the bytes appear inverted when looked at through d[0] and d[1] but also what's in d[0] will be in d[1] in the other endianness mode and vice-versa. If it's a vector of doublewords in memory endianness (st1.2d), byte order in the register will be correct in both modes (because it's different in memory) but d[0] and d[1] will still be transposed.
That's where all my rambling about doubleword transposition came from. Does that make sense?
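To keep the three load flavours straight, here is a small C model (an illustration only, built from the behaviour described above rather than from the architecture manual; le64/be64 and the load functions are made-up helper names) of what would end up in d[0] and d[1] after loading the same 16 memory bytes:

#include <stdio.h>
#include <stdint.h>

/* Model of what lands in d[0]/d[1] of a NEON register after loading the
   16 memory bytes m[0..15]. "be" selects big-endian memory mode. The
   register layout itself (b[0] = LSB of d[0], b[15] = MSB of d[1]) never
   changes; only the load operation decides which memory byte goes where. */

static uint64_t le64(const uint8_t *p) {            /* LSB-first interpretation */
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--) v = (v << 8) | p[i];
    return v;
}
static uint64_t be64(const uint8_t *p) {            /* MSB-first interpretation */
    uint64_t v = 0;
    for (int i = 0; i < 8; i++) v = (v << 8) | p[i];
    return v;
}

static void ld1_16b(const uint8_t *m, int be, uint64_t d[2]) {
    (void) be;                          /* byte vectors: same in both modes */
    d[0] = le64(m); d[1] = le64(m + 8);
}
static void ld1_2d(const uint8_t *m, int be, uint64_t d[2]) {
    d[0] = be ? be64(m) : le64(m);          /* endianness applies per element, */
    d[1] = be ? be64(m + 8) : le64(m + 8);  /* the element order stays fixed   */
}
static void ldr_q(const uint8_t *m, int be, uint64_t d[2]) {
    if (be) { d[1] = be64(m); d[0] = be64(m + 8); } /* m[0] -> MSB of d[1] */
    else    { d[0] = le64(m); d[1] = le64(m + 8); } /* m[0] -> LSB of d[0] */
}

int main(void) {
    uint8_t m[16] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
    uint64_t d[2];
    ldr_q(m, 1, d);   /* BE ldr: doublewords come out "transposed" ...      */
    printf("ldr/BE:    d0=%016llx d1=%016llx\n",
           (unsigned long long) d[0], (unsigned long long) d[1]);
    ld1_2d(m, 1, d);  /* ... while ld1.2d keeps them in vector order        */
    printf("ld1.2d/BE: d0=%016llx d1=%016llx\n",
           (unsigned long long) d[0], (unsigned long long) d[1]);
    return 0;
}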
I just found this document from the LLVM guys with pictures! :) https://llvm.org/docs/BigEndianNEON.html
BTW: ARM even goes as far as always storing *instructions* themselves, so the actual opcodes the CPU decodes and executes, little-endian, even in BE binaries. So the instruction fetch and decode stage always operates little-endian. When an instruction is executed, it's then just an additional flag that tells load and store instructions how to behave when accessing memory. (I'm actually extrapolating from what I know to be true for classic arm32, but it makes sense for that to be true for aarch64 as well.)
Please excuse my laboured and longwinded thinking. ;) I really have to start thinking in vectors also.
Actually, I'm impressed how you get and handle all these ideas in your mind and turn around quickly once you get a new one.
Uh, thanks, FWIW. :)
I gather you (same as me) prefer to think in big-endian representation. As for arm and aarch64, little-endian is the default; do you think the routine could be changed to move the special endianness treatment using rev64 to BE mode, i.e. avoid it in the standard LE case? It's certainly beyond me, but it might give some additional speedup.
Or would it be irrelevant compared to the speedup already given by using pmull in the first place?
@@ -335,9 +321,7 @@ Lmod_8_done: REDUCTION D
Ldone: -IF_LE(` rev64 D.16b,D.16b -') st1 {D.16b},[X] ret EPILOGUE(_nettle_gcm_hash)
I like your ideas so far as you're shrinking the gap between both endianness code but if my previous concern is right we still can't get this patch works too.
As said, the testsuite is passing with all three diffs from my previous mail.
[...]
PASS: symbols
PASS: dlopen
====================
All 110 tests passed
====================
make[1]: Leaving directory '/home/michael/build-aarch64_be/testsuite'
Making check in examples
make[1]: Entering directory '/home/michael/build-aarch64_be/examples'
TEST_SHLIB_DIR="/home/michael/build-aarch64_be/.lib" \
  srcdir="../../nettle/examples" EMULATOR="" EXEEXT="" \
  "../../nettle"/run-tests rsa-sign-test rsa-verify-test rsa-encrypt-test
xxxxxx xxxxxx
PASS: rsa-sign
PASS: rsa-verify
PASS: rsa-encrypt
==================
All 3 tests passed
==================
make[1]: Leaving directory '/home/michael/build-aarch64_be/examples'
[michael@aarch64-be:~/build-aarch64_be]
[...]
PASS: symbols
PASS: dlopen
====================
All 110 tests passed
====================
make[1]: Leaving directory '/home/michael/build-aarch64/testsuite'
Making check in examples
make[1]: Entering directory '/home/michael/build-aarch64/examples'
TEST_SHLIB_DIR="/home/michael/build-aarch64/.lib" \
  srcdir="../../nettle/examples" EMULATOR="" EXEEXT="" \
  "../../nettle"/run-tests rsa-sign-test rsa-verify-test rsa-encrypt-test
xxxxxx xxxxxx ee
PASS: rsa-sign
PASS: rsa-verify
PASS: rsa-encrypt
==================
All 3 tests passed
==================
make[1]: Leaving directory '/home/michael/build-aarch64/examples'
[michael@aarch64:~/build-aarch64]
And as always after all this guesswork I have found a likely very relevant comment in gcm.c:
/* Shift uses big-endian representation. */ #if WORDS_BIGENDIAN reduce = shift_table[x->u64[1] & 0xff];
Is that it? Or is TABLE just internal to the routine and we can store there however we please? (Apart from H at TABLE[128] initialised for us by gcm_set_key and stored BE?)
The assembly implementation of GHASH has a whole different scheme from C table-lookup implementation, you don't have to worry about any of that.
Perfect. So whether we use ld1/st1.16b or ld1/st1.2d for TABLE doesn't matter. I wouldn't expect it but we could benchmark whether one is faster than the other though!?
For clarification: How is H, i.e. TABLE[128], defined as an interface to gcm_set_key? I see that gcm_set_key calls a cipher function to fill it. So I guess it provides the routine with a sequence of bytes (similar to DATA), i.e. the key, which will be the same on LE and BE, and we *treat* it as a big-endian doubleword for the sake of using pmull on it. Correct?
Hello Michael,
On Sun, Jan 24, 2021 at 3:15 PM Michael Weiser michael.weiser@gmx.de wrote:
I think there might be a misunderstanding here (possibly caused by my attemps at explaining what ldr does, sorry):
On arm(32) and aarch64, endianness is also exclusively handled on load and store operations. Register layout and operation behaviour is identical in both modes. I think ARM also speaks of "memory endianness" for just that reason. There is no adjustable "CPU endianness". It's always "CPU-native".
So pmull will behave exactly the same in BE and LE mode. We just have to make sure our load operations put the operands in the correct (i.e. CPU-native) representation into the correct vector register indices upon load.
So as an example:
pmull2 v0.1q,v1.2d,v2.2d
will always work on d[1] (the upper doubleword) of v1 and v2 and put the result into all of v0. And it expects its operands there in exactly one format, i.e. the least significant bit at one end and the most-significant bit at the other (and it's the same ends/bits in both memory-endianness modes :). And it will also store to v0 in exactly the same representation in LE and BE mode. Nothing changes with an endianness mode switch.
That's where load and store come in:
ld1 {v1.2d,v2.2d},[x0]
will load v1 and v2 with one-dimensional vectors from memory. So v1.d[0] will be read from x0+0, v1.d[1] from x0+8 (bytes) and v2.d[0] from x0+16 and v2.d[1] from x0+24. That'll also be the same in LE and BE mode because that's the structure of the vector prescribed by the load operation we choose. Endianness will be applied to the individual doublewords but the order in which they're loaded from memory and in which they're put into d[0] and d[1] won't change, because they're vectors.
So if you've actually stored a vector from CPU registers using st1 {v1.2d, v2.2d},[x0] and then load them back using ld1 {v1.2d, v2.2d},[x0] there's nothing else that needs to be done. The individual bytes of the doublewords will be stored LE in memory in LE mode and BE in BE mode but you won't notice. And the order of the doublewords in memory will be the same in both modes.
If you're loading something that isn't stored LE or has no endianness at all, e.g. just a sequence of data bytes (as in DATA in our code) or something that was explicitly stored BE even on an LE CPU (as in TABLE[128] in our code, I gather) but want to treat it as a larger datatype, then you have to define endianness and need to apply correction. That's why we need to rev64 in one mode (e.g. LE) to get the same register-content in both endianness modes if what's loaded isn't actually stored in that endianness in memory.
Another way is to explicitly load a vector of bytes using ld1 {v1.16b, v2.16b},[x0]. Then you can be sure what you get as register content, no matter what memory endianness the CPU is using. If it's really just a sequence of data bytes stored in their correct and necessary order in memory and we only want to apply shifts and logical operations to each of them, we'd be all set.
Here we can also exploit but need to be careful to understand the different views on the register, so the fact that b[0] through b[7] is mapped to d[0] and that b[0] will be the least significant byte in d[0] and b[7] will be MSB. This layout is cpu-native, i.e. also the same in both endianness modes. It's just that an ld1 {v1.16b} will always load a vector of bytes with eight elements as consecutive bytes from memory into b[0] through b[7], so it'll always be an LSB-first load when interpreted as a larger data type. If we then look at that data trough d[0] it will appear reversed if it isn't really a doubleword that was stored little-endian.
That's why an ld1 {v1.b16,v2.b16},[x0] will produce incorrect results with a pmull2 v0.1q,v1.2d,v2.2d in at least one endianness because we're telling one operation that it's dealing with a byte-vector and the other expects us to provide a vector of doublewords. If what we're loading is actually something that was stored as doublewords in current memory endianness, then ld1 {v1.2d,v2.2d} is the correct load operation. If it's data bytes we want to *treat* as a big-endian doubleword, we can use either ld1 {v1.16b,v2.16b} or {v1.2d,v2.2d} but in both cases need to rev64 the register content if memory endianness is LE.
Now what *ldr* does is load a single 128bit quadword. And this will indeed transpose the doublewords in BE mode when looked at through d[0] and d[1]. Because as a big-endian load it will of course load the byte at x0 into the most significant byte of e.g. v2, i.e. v2.d[1], i.e. v2.b[15] and not v2.d[0], i.e. v2.b[7] (as with ld1.2d) or v2.b[0] (as with ld1.16b). Similarly, x0+15 will go into v2.b[0] in BE and v2.b[15] in LE mode. So this will only make sense if what we're loading was actually stored using str as a 128bit quadword in current memory endianness. If it's a sequence of bytes (st1.16b) we want to treat as a vector of doublewords, not only will the bytes appear inverted when looked at through d[0] and d[1] but also what's in d[0] will be in d[1] in the other endianness mode and vice-versa. If it's a vector of doublewords in memory endianness (st1.2d), byte order in the register will be correct in both modes (because it's different in memory) but d[0] and d[1] will still be transposed.
That's where all my rambling about doubleword transposition came from. Does that make sense?
I just found this document from the LLVM guys with pictures! :) https://llvm.org/docs/BigEndianNEON.html
BTW: ARM even goes as far as always storing *instructions* themselves, so the actual opcodes the CPU decodes and executes, little-endian, even in BE binaries. So the instruction fetch and decode stage always operates little-endian. When the instruction is executed it's then just an additional flag that tells load and store instructions how to behave when executed and accessing memory. (I'm actually extrapolation from what I know to be true for classic arm32 but it makes sense for that to be true for aarch64 as well.)
That explains everything. It also explains why the ld1 instruction reverses the byte order according to the load type on BE and always maintains the same order on LE. The non-memory-related instructions behave the same no matter what endianness mode they run in, as they should. Thanks for the detailed explanation. This scheme has a couple of advantages:
- Taking advantage of the performance benefit of the LE data layout on both the memory and the register side.
- Eliminating the overhead caused by transposing the data order for every potential load/store operation on LE, since it's the more popular mode.
I gather that you (same as me) prefer to think in big-endian
representation. As arm and aarch64 default to little-endian, do you think the routine could be changed to move the special endianness treatment using rev64 to BE mode, i.e. avoid the rev64s in the standard LE case? It's certainly beyond me but it might give some additional speedup.
Or would it be irrelevant compared to the speedup already given by using pmull in the first place?
I don't know how it's going to affect the performance, but it's an irrelevant margin indeed. TBH I liked the patch with the special endianness treatment, but it's up to you to decide!
And as always after all this guesswork I have found a likely very relevant comment in gcm.c:
/* Shift uses big-endian representation. */
#if WORDS_BIGENDIAN
  reduce = shift_table[x->u64[1] & 0xff];
Is that it? Or is TABLE just internal to the routine and we can store there however we please? (Apart from H at TABLE[128] initialised for us by gcm_set_key and stored BE?)
The assembly implementation of GHASH has a whole different scheme from the C table-lookup implementation, so you don't have to worry about any of that.
Perfect. So whether we use ld1/st1.16b or ld1/st1.2d for TABLE doesn't matter. I wouldn't expect it but we could benchmark whether one is faster than the other though!?
Yeah, it doesn't matter, since gcm_init_key() and gcm_hash() are the only functions that use the table. Keeping it ld1/st1.16b is fine; either way, there is a table layout at the header of the file that gives a sense of the table structure used by the assembly implementation.
For clarification: How is H, i.e. TABLE[128], defined as an interface to gcm_set_key? I see that gcm_set_key calls a cipher function to fill it. So I guess it provides the routine with a sequence of bytes (similar to DATA), i.e. the key, which will be the same on LE and BE, and we *treat* it as a big-endian doubleword for the sake of using pmull on it. Correct?
The subkey 'H' value is calculated by enciphering (usually using AES) an all-zero block of data, then gcm_set_key() assigns the calculated value (subkey 'H') to the middle of the TABLE array, that is TABLE[0x80]. The remaining fields of the array are meant to be filled by the C gcm_init_key() routine to serve as assistance subkeys for the C table-lookup implementation. Since the assembly implementation uses a different scheme, we don't need those assistance subkeys, so we grab the main subkey (H) value from the middle of the table and hook the assistance values we need onto this table so they can be used by gcm_hash(). Hope it makes sense to you; let me know if you want further explanation.
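A rough C sketch of that flow (an illustration only, not the actual Nettle code or API; encrypt_block and the table shape are made up for the example):

  #include <stdint.h>
  #include <string.h>

  /* Hypothetical 16-byte block cipher call standing in for the cipher
     configured by gcm_set_key (usually AES). */
  void encrypt_block(const void *cipher_ctx, uint8_t dst[16], const uint8_t src[16]);

  void example_set_gcm_subkey(const void *cipher_ctx, uint8_t table[256][16])
  {
    static const uint8_t zero_block[16] = {0};
    uint8_t H[16];

    encrypt_block(cipher_ctx, H, zero_block);  /* H = E_K(0^128) */
    memcpy(table[0x80], H, 16);                /* H lands in the middle of the table */
  }

gcm_init_key() (or the assembly replacement for it) then reads H back from that slot and fills in whatever helper values its own scheme needs.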
regards, Mamone
Maamoun TK maamoun.tk@googlemail.com writes:
The subkey 'H' value is calculated by enciphering (usually using AES) an all-zero block of data, then gcm_set_key() assigns the calculated value (subkey 'H') to the middle of the TABLE array, that is TABLE[0x80],
And the reason for it being stored in the *middle* is the "unnatural" gcm bitorder. The C implementation uses the table for the gcm multiplication, using 8 bits at a time from one of the inputs as the table index. Conceptually, the H value belongs at index 1 in the table, 0000 0001 in binary, but in gcm's opposite bitorder world, that corresponds to 1000 0000. If I remember correctly, the implementation using 8 bit indexing, including the table layout, closely follows the original gcm papers.
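A quick way to see that index (an illustrative sketch, not Nettle code): reversing the bits of the byte 0000 0001 gives 1000 0000, i.e. 0x80 = 128, which is why H ends up in the middle of the 256-entry table.

  #include <stdint.h>

  /* Reverse the bit order of a byte, mirroring gcm's reversed bit order. */
  static uint8_t reverse_bits(uint8_t b)
  {
    uint8_t r = 0;
    for (int i = 0; i < 8; i++)
      r |= (uint8_t)(((b >> i) & 1) << (7 - i));
    return r;
  }

  /* reverse_bits(0x01) == 0x80, i.e. table index 128. */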
Regards, /Niels
Hello Mamone,
On Sun, Jan 24, 2021 at 06:44:33PM +0200, Maamoun TK wrote:
representation. As arm and aarch64 default to little-endian, do you think the routine could be changed to move the special endianness treatment using rev64 to BE mode, i.e. avoid the rev64s in the standard LE case? It's certainly beyond me but it might give some additional speedup.
Or would it be irrelevant compared to the speedup already given by using pmull in the first place?
I don't know how it's going to affect the performance, but it's an irrelevant margin indeed. TBH I liked the patch with the special endianness treatment, but it's up to you to decide!
As you might expect, I like the one where doubleword vectors are used throughout and stored in host endianness in TABLE, because to me it's the most intuitive. For DATA my rationale is that if we want to *treat* it as big-endian doublewords, we should load it as doublewords to make it clearer why and what we need to adjust afterwards. It also avoids the rev64s with BE. I've added some comments with the rationale. I've added a README with an excerpt of the last email as well. Attached are the current patches, the first being your original. What do you think?
As said, I'm up for looking into endianness-specific versions of the macros again. But what were supposed to be the LE versions of PMUL and friends have now become the BE-native versions, and we'd need to come up with variants of them that make the rev64s unnecessary. Any ideas?
Hello Michael,
On Mon, Jan 25, 2021 at 8:45 PM Michael Weiser michael.weiser@gmx.de wrote:
Attached are the current patches, the first being your original. What do you think?
I liked how the patch ended up so far; just give me one or two days to give the patch an additional review before passing it on to Niels.
As said, I'm up for looking into endianness-specific versions of the macros again. But what were supposed to be the LE versions of PMUL and friends have now become the BE-native versions, and we'd need to come up with variants of them that make the rev64s unnecessary. Any ideas?
Are you looking to remove the rev64s on LE? If so, I don't think we can come up with a variant that allows us to continue working on an unsorted register value on LE, as pmull requires the input to be sorted properly, that is, transposed doublewords.
regards, Mamone
Hello Mamone,
On Tue, Jan 26, 2021 at 07:15:22PM +0200, Maamoun TK wrote:
Attached are the current patches, the first being your original. What do you think?
I liked how the patch ended up so far; just give me one or two days to give the patch an additional review before passing it on to Niels.
Perfect.
As said, I'm up for looking into endianness-specific versions of the macros again. But what were supposed to be the LE versions of PMUL and friends have now become the BE-native versions, and we'd need to come up with variants of them that make the rev64s unnecessary. Any ideas?
Are you looking to remove the rev64s on LE? If so, I don't think we can come up with a variant that allows us to continue working on an unsorted register value on LE, as pmull requires the input to be sorted properly, that is, transposed doublewords.
Let's leave it as it is then. I've caused enough effort with my little hobby of running an ARM BE system for now. :)
On Wed, Jan 27, 2021 at 12:45 AM Michael Weiser michael.weiser@gmx.de wrote:
I've caused enough effort with my little hobby of running an ARM BE system for now. :)
Thank you for the great work, we're now able to run the optimized gcm core on big-endian arm64 systems. I enjoyed working with you to get this done, and I also learned a lot about arm endianness.
regards, Mamone
Maamoun TK maamoun.tk@googlemail.com writes:
Are you looking to remove the rev64s on LE? If so, I don't think we can come up with a variant that allows us to continue working on an unsorted register value on LE, as pmull requires the input to be sorted properly, that is, transposed doublewords.
I haven't been following along closely, but it would be nice if gcm_hash could work with a minimum of data shuffling, and let gcm_init_key move the precomputed data around for the best layout.
Regards, /Niels
On Mon, Jan 25, 2021 at 8:45 PM Michael Weiser michael.weiser@gmx.de wrote:
Attached are the current patches.
Everything looks fine to me. I made an additional review and the code seems good for both endianness modes. The patches pass the testsuite on little-endian and big-endian (thanks to Michael Weiser for providing a ready-to-go environment to test the patch in big-endian mode). I made one more patch that adds a proper copyright notice and removes an unused define.
regards, Mamone
Maamoun TK maamoun.tk@googlemail.com writes:
Everything looks fine to me. I made an additional review and the code seems good for both endianness modes. The patches pass the testsuite on little-endian and big-endian (thanks to Michael Weiser for providing a ready-to-go environment to test the patch in big-endian mode). I made one more patch that adds a proper copyright notice and removes an unused define.
Nice!
I've merged the easy parts, machine.m4 and README, onto the arm64 branch. It's not crystal clear to me how the more interesting parts relate, though.
Is 0001-Mamone-s-unmodified-patch.patch the same as https://git.lysator.liu.se/nettle/nettle/-/merge_requests/13? Do you want to update the merge request with recent changes (on top of the current arm64 branch), or should I merge mr13 as is, and then add the other two patches (Michaels's BE support and this "adds proper copyright and removes unused define") on top?
Regards, /Niels
On Sat, Jan 30, 2021 at 6:07 PM Niels Möller nisse@lysator.liu.se wrote:
Is 0001-Mamone-s-unmodified-patch.patch the same as https://git.lysator.liu.se/nettle/nettle/-/merge_requests/13? Do you want to update the merge request with recent changes (on top of the current arm64 branch), or should I merge mr13 as is, and then add the other two patches (Michaels's BE support and this "adds proper copyright and removes unused define") on top?
The merge request is out of date and should be closed. You just need to merge 0001-Mamone-s-unmodified-patch.patch then 0003-aarch64-Adjust-gcm-hash-assembly-for-big-endian-syst.patch on top of the former.
regards, Mamone
This is a new patch to fix the clang build if "armv8-a-crypto" is enabled and should be applied on top of the previous patches.
regards, Mamone
On Sun, Jan 31, 2021 at 1:17 AM Maamoun TK maamoun.tk@googlemail.com wrote:
On Sat, Jan 30, 2021 at 6:07 PM Niels Möller nisse@lysator.liu.se wrote:
Is 0001-Mamone-s-unmodified-patch.patch the same as https://git.lysator.liu.se/nettle/nettle/-/merge_requests/13? Do you want to update the merge request with recent changes (on top of the current arm64 branch), or should I merge mr13 as is, and then add the other two patches (Michaels's BE support and this "adds proper copyright and removes unused define") on top?
The merge request is out of date and should be closed. You just need to merge 0001-Mamone-s-unmodified-patch.patch then 0003-aarch64-Adjust-gcm-hash-assembly-for-big-endian-syst.patch on top of the former.
regards, Mamone
Maamoun TK maamoun.tk@googlemail.com writes:
This is a new patch to fix the clang build if "armv8-a-crypto" is enabled and should be applied on top of the previous patches.
Thanks, merged all the changes to the arm64 branch. Let me know if there's anything I missed. I have a few comments on the main patch, I'll write that in a separate mail.
Regards, /Niels
Michael Weiser michael.weiser@gmx.de writes:
Subject: [PATCH 1/4] Mamone's unmodified patch
Hi, I've merged this, but I have a couple of comments and questions.
--- a/Makefile.in
+++ b/Makefile.in
@@ -616,6 +616,7 @@ distdir: $(DISTFILES)
 	set -e; for d in sparc32 sparc64 x86 \
 	  x86_64 x86_64/aesni x86_64/sha_ni x86_64/fat \
 	  arm arm/neon arm/v6 arm/fat \
+	  arm64 arm64/v8 \
Why the name "v8" for the directory, aren't arm64 and v8 more or less synonyms? I think it would make more sense with a name connected to the extension needed for the pmull instructions.
--- /dev/null
+++ b/arm64/v8/gcm-hash.asm
@@ -0,0 +1,343 @@
+C common macros:
+.macro PMUL in, param1, param2
+    pmull    F.1q,\param2().1d,\in().1d
+    pmull2   F1.1q,\param2().2d,\in().2d
+    pmull    R.1q,\param1().1d,\in().1d
+    pmull2   R1.1q,\param1().2d,\in().2d
+    eor      F.16b,F.16b,F1.16b
+    eor      R.16b,R.16b,R1.16b
+.endm
For consistency, I'd prefer defining all needed macros using m4.
--- a/configure.ac
+++ b/configure.ac
@@ -81,6 +81,10 @@
 AC_ARG_ENABLE(arm-neon,
   AC_HELP_STRING([--enable-arm-neon], [Enable ARM Neon assembly. (default=auto)]),,
   [enable_arm_neon=auto])
+AC_ARG_ENABLE(armv8-a-crypto,
+  AC_HELP_STRING([--enable-armv8-a-crypto], [Enable Armv8-A Crypto extension. (default=no)]),,
+  [enable_armv8_a_crypto=no])
I think this would be more user-friendly without the "a": --enable-armv8-crypto, or --enable-arm64-crypto. Or do you foresee any collision with an incompatible ARMv8-M crypto extension or the like?
  aarch64*)
    if test "$enable_armv8_a_crypto" = yes ; then
      if test "$ABI" = 64 ; then
        CFLAGS="$CFLAGS -Wa,-march=armv8-a+crypto"
(This looks slightly different after merging all the changes).
I think it's unfortunate to have to modify CFLAGS, and in particular using compiler-specific options. Is there any way to use a pseudoop in the .asm file instead, similar to the .fpu neon used in the arm/neon/ files?
One could also consider introducing a separate ASMFLAGS make variable (suggested earlier by Jeffrey Walton, for other reasons).
Regards, /Niels
Hello Niels,
I think this would be more user-friendly without the "a": --enable-armv8-crypto, or --enable-arm64-crypto. Or do you foresee any collision with an incompatible ARMv8-M crypto extension or the like?
FWIW, I like --enable-arm64-crypto because it would nicely match with a directory arm64/crypto for the source and the idea of enabling the crypto extension for the arm64 target of nettle and be in line with --enable-arm-neon and arm/neon as well.
  aarch64*)
    if test "$enable_armv8_a_crypto" = yes ; then
      if test "$ABI" = 64 ; then
        CFLAGS="$CFLAGS -Wa,-march=armv8-a+crypto"
(This looks slightly different after merging all the changes).
I think it's unfortunate to have to modify CFLAGS, and in particular using compiler-specific options. Is there any way to use a pseudoop in the .asm file instead, similar to the .fpu neon used in the arm/neon/ files?
With binutils gas, both .arch and .arch_extension seem to do what you describe. Based on when they appeared in the manual, both are supported in gas since 2.26[4]. I've done a quick test with 2.35.1. I have successfully tried both
.arch armv8-a+crypto (the -a is required here, otherwise errors are still thrown for uses of pmull with just armv8 or armv8-r)
and
.arch_extension crypto
The testsuite still runs with both on LE and BE cross-compiled and run under qemu-user.
binutils 2.26 also knows the crypto extension name and was released in January 2016. aarch64 support seems to have been introduced in 2.23 (October 2012), and with 2.25 (July 2015) the crypto flag to the -march command line option was added. (All based on when it appeared in the documentation.) So we'd likely have a dependency on 2.25 by using the -march option already, and 2.26 wouldn't be a big step.
[4] https://sourceware.org/binutils/docs-2.26/as/AArch64-Directives.html
All this is gas-specific though, I would assume. Some discussion of compatible extensions to llvm-as seems to have happened in 2018 but I have not researched what came out of it[5]. The recent date, and the fact that it's the first search hit with no others linking to documentation or such, doesn't bode well IMO. It might as well be that llvm-as just knows the pmull instruction and assembles it fine but can't check whether the target CPU will be able to run it.
[5] https://lists.llvm.org/pipermail/llvm-dev/2018-September/126346.html
What other assemblers for aarch64 do you have in mind?
On Sun, Jan 31, 2021 at 10:00 PM Michael Weiser michael.weiser@gmx.de wrote:
It might as well be that llvm-as just knows the pmull instruction and assembles it fine but can't check if the target CPU will be able to run it.
llvm-as wouldn't recognize the pmull instruction without adding the -march=armv8-a+crypto flag, at least with the version I use (3.8.1). I tried both .arch armv8-a+crypto and .arch_extension crypto and they worked only for gas, while llvm-as produces errors for the pmull usage.
regards, Mamone
Maamoun TK maamoun.tk@googlemail.com writes:
llvm-as wouldn't recognize the pmull instruction without adding the -march=armv8-a+crypto flag, at least with the version I use (3.8.1). I tried both .arch armv8-a+crypto and .arch_extension crypto and they worked only for gas, while llvm-as produces errors for the pmull usage.
Is there any documentation for llvm-as? Best I could find is the minimal man page https://www.llvm.org/docs/CommandGuide/llvm-as.html, with no info whatsoever on, e.g., supported pseudoops.
Regards, /Niels
Hello Niels,
On Tue, Feb 02, 2021 at 07:40:44AM +0100, Niels Möller wrote:
llvm-as wouldn't recognize the pmull instruction without adding the -march=armv8-a+crypto flag, at least with the version I use (3.8.1).
3.8.1 was released in 2017. It might not support recent aarch64 additions regarding .arch directive and friends.
I tried both .arch armv8-a+crypto and .arch_extension crypto and they worked only for gas, while llvm-as produces errors for the pmull usage.
Is there any documentation for llvm-as? Best I could find is the minimal man page https://www.llvm.org/docs/CommandGuide/llvm-as.html, with no info whatsoever on, e.g., supported pseudoops.
I think my mentioning of llvm-as was a red herring. Looking at the output of clang -v, llvm-as isn't involved at all. This is supported by the man page stating that llvm-as accepts LLVM assembly and emits LLVM bitcode. It appears clang implements the assembler internally, and we'd need documentation on that. The clang man page even says so:
# man clang | grep assembler
Clang also supports the use of an integrated assembler, in which the code generator produces object files directly. This avoids the overhead of generating the ".s" file and of calling the target assembler.
With that info I find [1] which lists the .arch directive including +crypto syntax. armclang seems to be the official ARM toolchain.[2]
[1] https://www.keil.com/support/man/docs/armclang_ref/armclang_ref_hhk151067459... [2] https://developer.arm.com/tools-and-software/embedded/arm-compiler/downloads...
It is unclear to me if it's available upstream as well or an ARM addition to the assembler. I'll try to get clang/llvm installed on my pine64 boards for tests. That might take a few days though. :) I'll see if I can try a prebuilt toolchain in the meantime.
Calling clang on an assembly source with extension .s, it calls itself with the (undocumented) option -cc1as, so likely again the integrated assembler:
# clang -v -c -o t.o t.s
clang version 11.0.1
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/lib/llvm/11/bin
Selected GCC installation: /usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Selected multilib: .;@m64
 "/usr/lib/llvm/11/bin/clang-11" -cc1as -triple x86_64-pc-linux-gnu -filetype obj -main-file-name t.s -target-cpu x86-64 -fdebug-compilation-dir /home/m -dwarf-debug-producer "clang version 11.0.1" -dwarf-version=4 -mrelocation-model static -o t.o t.s
On Tue, Feb 2, 2021 at 8:00 AM Michael Weiser michael.weiser@gmx.de wrote:
llvm-as wouldn't recognize the pmull instruction without adding the -march=armv8-a+crypto flag, at least with the version I use (3.8.1).
3.8.1 was released in 2017. It might not support recent aarch64 additions regarding .arch directive and friends.
I tried both .arch armv8-a+crypto and .arch_extension crypto and they worked only for gas, while llvm-as produces errors for the pmull usage.
Is there any documentation for llvm-as? Best I could find is the minimal man page https://www.llvm.org/docs/CommandGuide/llvm-as.html, with no info whatsoever on, e.g., supported pseudoops.
I think my mentioning of llvm-as was a red herring. Looking at the output of clang -v, llvm-as isn't involved at all. This is supported by the man page stating that llvm-as accepts LLVM assembly and emits LLVM bitcode. It appears clang implements the assembler internally, and we'd need documentation on that. The clang man page even says so:
Clang always uses its integrated assembler unless you pass -fno-integrated-as. If you use -fno-integrated-as, then be sure you have an assembler that supports the ISA you are targeting. On OS X, GNU's AS may not support the ISA.
Clang's assembler is crippled on OS X. Apple's Clang still does not support pmull or crc instructions.
Jeff
On Tue, Feb 2, 2021 at 8:19 AM Jeffrey Walton noloader@gmail.com wrote:
On Tue, Feb 2, 2021 at 8:00 AM Michael Weiser michael.weiser@gmx.de wrote:
llvm-as wouldn't recognize the pmull instruction without adding the -march=armv8-a+crypto flag, at least with the version I use (3.8.1).
3.8.1 was released in 2017. It might not support recent aarch64 additions regarding .arch directive and friends.
I tried both .arch armv8-a+crypto and .arch_extension crypto and they worked only for gas, while llvm-as produces errors for the pmull usage.
Is there any documentation for llvm-as? Best I could find is the minimal man page https://www.llvm.org/docs/CommandGuide/llvm-as.html, with no info whatsoever on, e.g., supported pseudoops.
I think my mentioning of llvm-as was a red herring. Looking at the output of clang -v, llvm-as isn't involved at all. This is supported by the man page stating that llvm-as accepts LLVM assembly and emits LLVM bitcode. It appears clang implements the assembler internally, and we'd need documentation on that. The clang man page even says so:
Clang always uses its integrated assembler unless you pass -fno-integrated-as. If you use -fno-integrated-as, then be sure you have an assembler that supports the ISA you are targeting. On OS X, GNU's AS may not support the ISA.
Clang's assembler is crippled on OS X. Apple's Clang still does not support pmull or crc instructions.
And I forgot to mention... On OS X, when using a port like MacPorts with GCC... You want to pass -Wa,-q to GCC so GCC uses Clang's integrated assembler. Without -Wa,-q, GCC will try to use GNU's AS.
Jeff
Hi all,
On Tue, Feb 02, 2021 at 08:23:39AM -0500, Jeffrey Walton wrote:
I think my mentioning of llvm-as was a red herring. Looking at the output of clang -v, llvm-as isn't involved at all. This is supported by the man page stating that llvm-as accepts LLVM assembly and emits LLVM bitcode. It appears clang implements the assembler internally, and we'd need documentation on that. The clang man page even says so:
Clang always uses its integrated assembler unless you pass -fno-integrated-as. If you use -fno-integrated-as, then be sure you
I've downloaded binary builds of clang for aarch64 from https://releases.llvm.org/download.html. 3.9.1 was the oldest prebuilt toolchain I could find there and 11.0.0 the most recent.
As expected, a one-liner with just a pmull throws errors with gas and the two clangs:
$ cat t.s
pmull v2.1q, v2.1d, v1.1d
$ aarch64-unknown-linux-gnu-as -v -o t.o t.s
GNU assembler version 2.35.1 (aarch64-unknown-linux-gnu) using BFD version (Gentoo 2.35.1 p2) 2.35.1
t.s: Assembler messages:
t.s:1: Error: selected processor does not support `pmull v2.1q,v2.1d,v1.1d'
$ clang+llvm-3.9.1-aarch64-linux-gnu/bin/clang -c -o t.o t.s
t.s:1:1: error: instruction requires: crypto
pmull v2.1q, v2.1d, v1.1d
^
$ clang+llvm-11.0.0-aarch64-linux-gnu/bin/clang -c -o t.o t.s
t.s:1:1: error: instruction requires: aes
pmull v2.1q, v2.1d, v1.1d
^
This can be solved for all three with the -march option:
$ aarch64-unknown-linux-gnu-as -o t.o t.s -march=armv8-a+crypto
$ clang+llvm-3.9.1-aarch64-linux-gnu/bin/clang -c -o t.o t.s -march=armv8-a+crypto
$ clang+llvm-11.0.0-aarch64-linux-gnu/bin/clang -c -o t.o t.s -march=armv8-a+crypto
$
They also all support the .arch directive:
$ cat t.s
.arch armv8-a+crypto
pmull v2.1q, v2.1d, v1.1d
$ aarch64-unknown-linux-gnu-as -o t.o t.s
$ clang+llvm-3.9.1-aarch64-linux-gnu/bin/clang -c -o t.o t.s
$ clang+llvm-11.0.0-aarch64-linux-gnu/bin/clang -c -o t.o t.s
$
clang does not, however, support the .arch_extension directive. 3.9.1 complains about the directive, 11.0.0 seems to silently ignore it:
$ cat t.s
.arch_extension crypto
pmull v2.1q, v2.1d, v1.1d
$ aarch64-unknown-linux-gnu-as -o t.o t.s
$ clang+llvm-3.9.1-aarch64-linux-gnu/bin/clang -c -o t.o t.s
t.s:1:1: error: unknown directive
.arch_extension crypto
^
t.s:2:1: error: instruction requires: crypto
pmull v2.1q, v2.1d, v1.1d
^
$ clang+llvm-11.0.0-aarch64-linux-gnu/bin/clang -c -o t.o t.s
t.s:2:1: error: instruction requires: aes
pmull v2.1q, v2.1d, v1.1d
^
Michael Weiser michael.weiser@gmx.de writes:
I've downloaded binary builds of clang for aarch64 from https://releases.llvm.org/download.html. 3.9.1 was the oldest prebuilt toolchain I could find there and 11.0.0 the most recent.
[...]
They also all support the .arch directive:
$ cat t.s
.arch armv8-a+crypto
pmull v2.1q, v2.1d, v1.1d
$ aarch64-unknown-linux-gnu-as -o t.o t.s
$ clang+llvm-3.9.1-aarch64-linux-gnu/bin/clang -c -o t.o t.s
$ clang+llvm-11.0.0-aarch64-linux-gnu/bin/clang -c -o t.o t.s
Thanks for investigating. The .arch pseudoop it is, then.
I've pushed a change to use that, instead of modifying CFLAGS.
Regards, /Niels
Hello Niels,
On Tue, Feb 02, 2021 at 06:09:42PM +0100, Niels Möller wrote:
I've downloaded binary builds of clang for aarch64 from https://releases.llvm.org/download.html. 3.9.1 was the oldest prebuilt toolchain I could find there and 11.0.0 the most recent.
[...]
They also all support the .arch directive:
$ cat t.s
.arch armv8-a+crypto
pmull v2.1q, v2.1d, v1.1d
$ aarch64-unknown-linux-gnu-as -o t.o t.s
$ clang+llvm-3.9.1-aarch64-linux-gnu/bin/clang -c -o t.o t.s
$ clang+llvm-11.0.0-aarch64-linux-gnu/bin/clang -c -o t.o t.s
Thanks for investigating. The .arch pseudoop it is, then.
I've pushed a change to use that, instead of modifying CFLAGS.
The arm64 branch builds and passes the testsuite on aarch64 and aarch64_be with gcc 10.2 and clang 11.0.1 with and without the optimized assembly routines on my pine64 boards. This is with the .arch directive instead of modifying CFLAGS and the new configure option name --enable-arm64-crypto.
Out of curiosity I've also collected some benchmark numbers for gcm_aes256. (Is that a correct and sensible algorithm for that purpose?)
The speedup from using pmull seems to be around 35% for encrypt/decrypt.
Interestingly, LE is about a cycle per block faster than BE even though it should have quite a few more rev64s to execute than BE. Could this be masked by memory accesses, pipelining or scheduling?
How is the massive speedup in update to be interpreted, and the fact that BE here is indeed quite a bit faster than LE? Do I understand correctly that on update only GCM is run, on unencrypted data, for authentication purposes, so that this number really indicates the pure GCM pmull speedup? If so, it would indicate a 19-fold speedup and an 8.6% advantage for BE.
What's also curious is that the system's openssl 1.1.1i is consistently reported as an order of magnitude faster than nettle. I guess the major factor is that there's no optimized AES for aarch64 in nettle yet, which openssl seems to have. So I built an openssl 1.1.1i without assembly, which produced the last benchmark and supports that guess.
cat /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
performance
cat /sys/devices/system/cpu/cpufreq/policy0/cpuinfo_max_freq
1152000
LD_LIBRARY_PATH=../.lib ./nettle-benchmark -f 1.152e9 gcm_aes256
Algorithm mode Mbyte/s cycles/byte cycles/block
aarch64-le gcc 10.2 with arm64-crypto:
gcm_aes256 encrypt 29.42 37.34 597.41
gcm_aes256 decrypt 29.43 37.34 597.36
gcm_aes256 update 1417.32 0.78 12.40
openssl gcm_aes256 encrypt 391.93 2.80 44.85
openssl gcm_aes256 decrypt 392.35 2.80 44.80
openssl gcm_aes256 update 1246.04 0.88 14.11

aarch64-be gcc 10.2 with arm64-crypto:
gcm_aes256 encrypt 29.35 37.43 598.82
gcm_aes256 decrypt 29.36 37.42 598.77
gcm_aes256 update 1540.34 0.71 11.41
openssl gcm_aes256 encrypt 398.96 2.75 44.06
openssl gcm_aes256 decrypt 397.66 2.76 44.20
openssl gcm_aes256 update 1306.05 0.84 13.46

aarch64-le clang 11.0.1 with arm64-crypto:
gcm_aes256 encrypt 28.76 38.20 611.15
gcm_aes256 decrypt 28.76 38.19 611.10
gcm_aes256 update 1416.17 0.78 12.41
openssl gcm_aes256 encrypt 392.32 2.80 44.81
openssl gcm_aes256 decrypt 392.35 2.80 44.80
openssl gcm_aes256 update 1247.72 0.88 14.09

aarch64-be clang 11.0.1 with arm64-crypto:
gcm_aes256 encrypt 28.70 38.28 612.53
gcm_aes256 decrypt 28.69 38.29 612.59
gcm_aes256 update 1543.87 0.71 11.39
openssl gcm_aes256 encrypt 399.46 2.75 44.00
openssl gcm_aes256 decrypt 398.90 2.75 44.07
openssl gcm_aes256 update 1317.87 0.83 13.34
aarch64-le gcc 10.2 without arm64-crypto:
gcm_aes256 encrypt 21.43 51.27 820.28
gcm_aes256 decrypt 21.43 51.27 820.30
gcm_aes256 update 74.39 14.77 236.30
openssl gcm_aes256 encrypt 391.93 2.80 44.85
openssl gcm_aes256 decrypt 392.17 2.80 44.82
openssl gcm_aes256 update 1245.13 0.88 14.12

aarch64-be gcc 10.2 without arm64-crypto:
gcm_aes256 encrypt 21.71 50.60 809.58
gcm_aes256 decrypt 21.72 50.59 809.43
gcm_aes256 update 79.01 13.90 222.47
openssl gcm_aes256 encrypt 398.43 2.76 44.12
openssl gcm_aes256 decrypt 398.67 2.76 44.09
openssl gcm_aes256 update 1309.52 0.84 13.42

aarch64-le clang 11.0.1 without arm64-crypto:
gcm_aes256 encrypt 18.98 57.89 926.29
gcm_aes256 decrypt 18.98 57.89 926.22
gcm_aes256 update 53.67 20.47 327.53
openssl gcm_aes256 encrypt 392.16 2.80 44.82
openssl gcm_aes256 decrypt 392.17 2.80 44.82
openssl gcm_aes256 update 1248.30 0.88 14.08

aarch64-be clang 11.0.1 without arm64-crypto:
gcm_aes256 encrypt 18.89 58.16 930.49
gcm_aes256 decrypt 18.85 58.28 932.54
gcm_aes256 update 53.67 20.47 327.53
openssl gcm_aes256 encrypt 399.36 2.75 44.02
openssl gcm_aes256 decrypt 398.87 2.75 44.07
openssl gcm_aes256 update 1318.44 0.83 13.33
aarch64-be gcc 10.2 without arm64-crypto and with no-asm openssl:
LD_LIBRARY_PATH=../../openssl-1.1.1i:../.lib ./nettle-benchmark -f 1.152e9 gcm_aes256
Algorithm mode Mbyte/s cycles/byte cycles/block
gcm_aes256 encrypt 21.72 50.59 809.43
gcm_aes256 decrypt 21.72 50.59 809.45
gcm_aes256 update 79.02 13.90 222.45

openssl gcm_aes256 encrypt 21.06 52.17 834.70
openssl gcm_aes256 decrypt 21.34 51.49 823.82
openssl gcm_aes256 update 56.18 19.55 312.87
x86_64 Intel Skylake laptop gcc 10.2 fat as sanity check:
NETTLE_FAT_VERBOSE=1 LD_LIBRARY_PATH=../.lib ./nettle-benchmark -f 4.6e9 aes256
libnettle: fat library initialization.
libnettle: cpu features: vendor:intel,aesni
libnettle: using aes instructions.
libnettle: not using sha_ni instructions.
libnettle: intel SSE2 will be used for memxor.
sha1_compress: 209.50 cycles
salsa20_core: 205.70 cycles
sha3_permute: 918.50 cycles (38.27 / round)
Algorithm mode Mbyte/s cycles/byte cycles/block
aes256 ECB encrypt 4856.60 0.90 14.45
aes256 ECB decrypt 4800.03 0.91 14.62
aes256 CBC encrypt 889.91 4.93 78.87
aes256 CBC decrypt 4331.24 1.01 16.21
aes256 (in-place) 3516.29 1.25 19.96
aes256 CTR 3131.58 1.40 22.41
aes256 (in-place) 2826.07 1.55 24.84

openssl aes256 ECB encrypt 4840.40 0.91 14.50
openssl aes256 ECB decrypt 4835.88 0.91 14.51

gcm_aes256 encrypt 585.60 7.49 119.86
gcm_aes256 decrypt 585.29 7.50 119.92
gcm_aes256 update 697.69 6.29 100.60

openssl gcm_aes256 encrypt 4499.49 0.97 15.60
openssl gcm_aes256 decrypt 4498.84 0.98 15.60
openssl gcm_aes256 update 11383.81 0.39 6.17
Just out of curiosity: I assume there's no aesni-pmull-like GCM implementation for x86_64?
Michael Weiser michael.weiser@gmx.de writes:
The arm64 branch builds and passes the testsuite on aarch64 and aarch64_be with gcc 10.2 and clang 11.0.1 with and without the optimized assembly routines on my pine64 boards. This is with the .arch directive instead of modifying CFLAGS and the new configure option name --enable-arm64-crypto.
Thanks for testing! (My own testing was done with cross-compiler and user-level qemu).
Out of curiosity I've also collected some benchmark numbers for gcm_aes256. (Is that a correct and sensible algorithm for that purpose?)
I think that's appropriate for benchmarking gcm_hash, but the "update" numbers are the ones that reflect gcm_hash performance.
The speedup from using pmull seems to be around 35% for encrypt/decrypt.
Interestingly, LE is about a cycle per block faster than BE even though it should have quite a few more rev64s to execute than BE. Could this be masked by memory accesses, pipelining or scheduling?
For the encrypt/decrypt operations, you also run AES (in CTR mode), which works with little-endian data.
How is the massive speedup in update to be interpreted and that BE here is indeed quite a bit faster than LE? Do I understand correctly that on update only GCM is run on unencrypted data for authentication purposes so that this number really indicates the pure GCM pmull speedup?
That's right, the "update" numbers run only the authentication part of gcm, i.e., gcm_hash. That's useful for benchmarking gcm_hash, but probably not so relevant for real-world applications, since I'd expect it's rare to pass large amounts of "associated data" to gcm.
What's also curious is that the system's openssl 1.1.1i is consistenly reported an order of magnitude faster than nettle. I guess the major factor is that there's no optimized AES for aarch64 yet in nettle which openssl seems to have.
That would be my guess too. And if we look at the update numbers only, the new code appears a bit faster than openssl.
Just out of curiosity: I assume there's no aesni-pmull-like GCM implementation for x86_64?
That's right. There's some assembly code, but using the same algorithm as the C implementation, based on table lookups.
Regards, /Niels
On Tue, 2 Feb 2021, Michael Weiser wrote:
clang does not, however, support the .arch_extension directive. 3.9.1 complains about the directive, 11.0.0 seems to silently ignore it:
$ cat t.s
.arch_extension crypto
pmull v2.1q, v2.1d, v1.1d
$ aarch64-unknown-linux-gnu-as -o t.o t.s
$ clang+llvm-3.9.1-aarch64-linux-gnu/bin/clang -c -o t.o t.s
t.s:1:1: error: unknown directive
.arch_extension crypto
^
t.s:2:1: error: instruction requires: crypto
pmull v2.1q, v2.1d, v1.1d
^
$ clang+llvm-11.0.0-aarch64-linux-gnu/bin/clang -c -o t.o t.s
t.s:2:1: error: instruction requires: aes
pmull v2.1q, v2.1d, v1.1d
^
Clang does actually support .arch_extension for aarch64 in general since Clang 8 - but the "crypto" extension seems to be a bit of a special case, as it expands to a number of other features, including aes and sha2, depending on the base architecture level:
https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AArch64/AsmPa...
This routine is called when enabling extensions in .arch, but not in .arch_extension - which is a bug.
So ".arch_extension aes" would work, but setting the extension via the .arch directive is indeed more compatible.
// Martin
Michael Weiser michael.weiser@gmx.de writes:
FWIW, I like --enable-arm64-crypto because it would nicely match with a directory arm64/crypto for the source and the idea of enabling the crypto extension for the arm64 target of nettle and be in line with --enable-arm-neon and arm/neon as well.
I'll rename both the directory and the configure option then.
Regards, /Niels
On Tue, Feb 2, 2021 at 7:22 PM Niels Möller nisse@lysator.liu.se wrote:
Michael Weiser michael.weiser@gmx.de writes:
FWIW, I like --enable-arm64-crypto because it would nicely match with a directory arm64/crypto for the source and the idea of enabling the crypto extension for the arm64 target of nettle and be in line with --enable-arm-neon and arm/neon as well.
I'll rename both the directory and the configure option then.
I agree with the configure option. I also see directories in x86_64 named after the corresponding features, so the "crypto" name makes sense here too.
regards, Mamone
On Sun, Jan 31, 2021 at 10:35 AM Niels Möller nisse@lysator.liu.se wrote:
--- /dev/null
+++ b/arm64/v8/gcm-hash.asm
@@ -0,0 +1,343 @@
+C common macros:
+.macro PMUL in, param1, param2
+    pmull    F.1q,\param2().1d,\in().1d
+    pmull2   F1.1q,\param2().2d,\in().2d
+    pmull    R.1q,\param1().1d,\in().1d
+    pmull2   R1.1q,\param1().2d,\in().2d
+    eor      F.16b,F.16b,F1.16b
+    eor      R.16b,R.16b,R1.16b
+.endm
For consistency, I'd prefer defining all needed macros using m4.
The macros in the gcm-hash.asm file depend on defines in the same file (shared between the macros and the function implementations), as they are relevant to the implementation context; also, moving those macros to another file would be confusing for the reader IMO.
regards, Mamone
Maamoun TK maamoun.tk@googlemail.com writes:
On Sun, Jan 31, 2021 at 10:35 AM Niels Möller nisse@lysator.liu.se wrote:
For consistency, I'd prefer defining all needed macros using m4.
The macros in the gcm-hash.asm file depend on defines in the same file (shared between the macros and the function implementations), as they are relevant to the implementation context; also, moving those macros to another file would be confusing for the reader IMO.
I'm not suggesting moving them to a different file, just changing the definition to use m4 define, something like (untested):
C PMUL(in, param1, param2)
define(`PMUL', `
    pmull    F.1q, $3.1d, $1.1d
    pmull2   F1.1q, $3.2d, $1.2d
    pmull    R.1q, $2.1d, $1.1d
    pmull2   R1.1q, $2.2d, $1.2d
    eor      F.16b, F.16b, F1.16b
    eor      R.16b, R.16b, R1.16b
')
With the recently added m4-utils.m4, one could also add some checking with m4_assert_numargs(3) at the start of the macro definition, but that's completely optional (other similar macros currently don't do that).
Regards, /Niels
On Fri, Jan 22, 2021 at 1:45 AM Michael Weiser michael.weiser@gmx.de wrote:
Do you think it makes sense to try and adjust the code to work with the BE layout natively and have a full 128-bit reverse after ldr-like loads on LE instead (considering that 99.999% of aarch64 users will run LE)?
If you don't have a use-case, we can suspend big-endian support for the GCM optimization on aarch64 until we get a request or use case, or maybe until aarch64_be gets more support from the main distributions in the future.
regards, Mamone
On Fri, Jan 22, 2021 at 5:48 PM Maamoun TK maamoun.tk@googlemail.com wrote:
On Fri, Jan 22, 2021 at 1:45 AM Michael Weiser michael.weiser@gmx.de wrote:
Do you think it makes sense to try and adjust the code to work with the BE layout natively and have a full 128-bit reverse after ldr-like loads on LE instead (considering that 99.999% of aarch64 users will run LE)?
If you don't have a use-case, we can suspend big-endian support for the GCM optimization on aarch64 until we get a request or use case, or maybe until aarch64_be gets more support from the main distributions in the future.
+1. At minimum, someone needs to produce an image to load on a commodity board. If there are no images for a common board then there's no demand in the market. There's no reason to jump through hoops, like qemu, to solve a problem that does not exist.
Jeff
Hi Mamone, Jeff,
sorry for the duplication, used the wrong sender address for the list again.
On Fri, Jan 22, 2021 at 07:07:46PM -0500, Jeffrey Walton wrote:
Do you think it makes sense to try and adjust the code to work with the BE layout natively and have a full 128-bit reverse after ldr-like loads on LE instead (considering that 99.999% of aarch64 users will run LE)?
If you don't have a use-case, we can suspend big-endian support for the GCM optimization on aarch64 until we get a request or use case, or maybe until aarch64_be gets more support from the main distributions in the future.
I was actually referring to the performance hit for the overwhelming number of users of a possible "mostly natively BE with quite some overhead for converting back and forth on LE" approach compared to "mostly-LE with slight adjustments for BE" as it (seemingly) started out.
But today's session really cleared up for me that it isn't so much LE vs. BE but just vector element order. What endianness remains is just the given interface to the rest of the nettle code being defined as BE. With the last patch from my previous email all this seemed to fall into place nicely.
+1. At minimum, someone needs to produce an image to load on a commodity board. If there are no images for a common board then there's no demand in the market. There's no reason to jump through hoops, like qemu, to solve a problem that does not exist.
My use-case has always been "because I can" and I really appreciate everyone's indulgence so far. So yes, by all means, focus on producing LE asm for arm/aarch64 and I'll either dig into that for BE support or just disable the asm routines on my BE boards.
I also sometimes wonder who is actually producing all this nicely working armeb and aarch64_be support in Qemu and the Linux kernel when there are apparently no users. I can understand that it's 90 or 95% very good programming, where endianness handling for PowerPC or MIPS just also happens to work for BE arm. But the other 5% must come from somewhere. So there must be some demand by someone, but it's certainly very obscure. ;)
On Tue, Jan 5, 2021 at 5:52 PM Maamoun TK maamoun.tk@googlemail.com wrote:
On Tue, Jan 5, 2021 at 3:23 PM Niels Möller nisse@lysator.liu.se wrote:
I wonder which assembly files we should use if target host is aarch64, but ABI=32? I guess the arm/v6/ code can be used unconditionally. Can we also use arm/neon/ code unconditionally?
The reference manual says
Armv8 can support the following levels of support for Advanced SIMD and floating-point instructions:
Full SIMD and floating-point support without exception trapping.
Full SIMD and floating-point support with exception trapping.
No floating-point or SIMD support. This option is licensed only for implementations targeting specialized markets.
As far as I understand, that means Neon should be always available, in both 32-bit and 64-bit mode.
I'll investigate how we can build the existing NEON implementations on 64-bit systems.
I spent some time investigating and testing; it looks like aarch64 gcc cannot handle 32-bit assembly code currently. In order to build 32-bit arm binaries on 64-bit systems, one has to use the 'gcc-arm-linux-gnueabi' or 'gcc-arm-linux-gnueabihf' toolchains. I went through the options available in aarch64 gcc (https://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html) but none of them allow us to use 32-bit assembly code; even '-mabi=ilp32' doesn't do that, as I get the same errors with or without it. I'm afraid we need to re-write the 32-bit assembly code in 64-bit form in order to get those optimizations enabled in 64-bit arm binaries.
regards, Mamone
On Tue, Jan 5, 2021 at 5:52 PM Maamoun TK maamoun.tk@googlemail.com wrote:
On Tue, Jan 5, 2021 at 3:23 PM Niels Möller nisse@lysator.liu.se wrote:
I've made a new branch "arm64" with the configure changes. If you think that looks ok, can you add your new ghash code on top of that?
Great. I'll add the ghash code to the branch once I finish the big-endian support.
(It would be good to also get S390x into the ci system, before adding s390x-specific assembly. I hope that should be easy to do with the same cross setup as for arm, arm64, mips, etc).
This is not possible, since qemu doesn't support the cipher functions: it implements subcode 0 (query) without the actual encipher/decipher operations. Take a look here:
https://git.qemu.org/?p=qemu.git;a=commit;h=be2b567018d987591647935a7c9648e9...
I had a talk with David Edelsohn about this issue and concluded that there is no support for the cipher functions in qemu and it's unlikely to happen anytime soon. However, I updated the testutils to cover the s390x-specific assembly so the patch can easily be tested manually by executing 'make check'. I have also tested every aspect of this patch to make sure everything will go well once it's merged.
I wonder which assembly files we should use if target host is aarch64,
but ABI=32? I guess the arm/v6/ code can be used unconditionally. Can we also use arm/neon/ code unconditionally?
The reference manual says
Armv8 can support the following levels of support for Advanced SIMD and floating-point instructions:
Full SIMD and floating-point support without exception trapping.
Full SIMD and floating-point support with exception trapping.
No floating-point or SIMD support. This option is licensed only for implementations targeting specialized markets.
As far as I understand, that means Neon should be always available, in both 32-bit and 64-bit mode.
I'll investigate how we can build the existing NEON implementations on 64-bit systems.
regards, Mamone