I've just pushed new aes code using intel's aesni instructions. See
https://git.lysator.liu.se/nettle/nettle/blob/530014f3f811d9018ec83a8748fdbc...
It gave a speedup of almost 10 times on the haswell machine where I tested it (and in addition, it should avoid sidechannel leaks in those functions). Clearly, this will be more useful after adding support for fat binaries, detecting presence of these instructions at runtime. For now, it has to be enabled explicitly with the configure argument --enable-x86-aesni.
I have one question, on how to enable support for these instructions in the assembler. For now I added a pseudo-op
.arch bdver2
and that seems to work, but it's a bit too specific for my taste. I would have preferred something like .arch generic64,aes, but I couldn't get that to work. So what's the right way to do this?
I haven't played with the corresponding arch flags to gcc, but I'd prefer to declare within the .asm file itself which instruction set it is intended for.
Feedback on the actual assembler code is also appreciated, of course. It's pretty basic, a dozen lines, no unrolling or other cleverness.
Regards, /Niels
On Sun, Jan 11, 2015 at 3:27 PM, Niels Möller nisse@lysator.liu.se wrote:
I've just pushed new aes code using intel's aesni instructions. See
https://git.lysator.liu.se/nettle/nettle/blob/530014f3f811d9018ec83a8748fdbc... It gave a speedup of almost 10 times on the haswell machine where I tested it (and in addition, it should avoid sidechannel leaks in those functions). Clearly, this will be more useful after adding support for fat binaries, detecting presence of these instructions at runtime. For now, it has to be enabled explicitly with the configure argument --enable-x86-aesni. I have one question, on how to enable support for these instructions in the assembler. For now I added a pseudo-op .arch bdver2
No idea. The openssl code I currently use in gnutls doesn't use the AES instruction mnemonics. It outputs byte sequences such as:
.byte 102,15,56,220,248
.byte 102,68,15,56,220,192
for these instructions. That way the code can be assembled on any system, and the ones with aesni get to execute it. While it works, it means doing the assembler's job by hand.
https://github.com/openssl/openssl/blob/69d5747f90136aa026a96204f26ab39549df...
regards, Nikos
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
No idea. The openssl code I currently use in gnutls, doesn't utilize the AES instructions. It outputs sequences of: .byte 102,15,56,220,248 .byte 102,68,15,56,220,192 for these instructions.
That's a reasonable fallback (gmp does something similar for some instructions). But I'd still like to know the right way to do it...
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
Clearly, this will be more useful after adding support for fat binaries, detecting presence of these instructions at runtime.
I've now had a first go at fat-library support. Checked in on the branch fat-library. See https://git.lysator.liu.se/nettle/nettle/blob/fat-library/x86_64/fat/fat.c
Configuration is a bit clumsy and should probably be reorganized when more functions are added, but it seems to work. One exception: I try to make it more verbose if NETTLE_FAT_VERBOSE is set in the environment, but on the machine where I tested a static library build, there's no output. Maybe the initialization in this case is done before stderr is set up properly. There may be some remaining problems in getting the Makefiles and configure to work with machine-specific files which are C rather than assembly code.
Let me quote the initial comment in fat.c:
/* Fat library initialization works as follows. The main function is fat_init. It tries to do initialization only once, but since it is idempotent and pointer updates are atomic on x86_64, there's no harm if it is in some cases called multiple times from several threads.
The fat_init function checks the cpuid flags, and sets function pointers, e.g, _aes_encrypt_vec, to point to the appropriate implementation.
To get everything hooked in, we use a belt-and-suspenders approach.
When compiling with gcc, we try to register a constructor function which calls fat_init as soon as the library is loaded. If this is unavailable or non-working, we instead arrange fat_init to be called on demand.
For the actual indirection, there are two cases.
If ifunc support is available, function pointers are statically initialized to NULL, and we register resolver functions, e.g., _aes_encrypt_resolve, which calls fat_init, and then returns the function pointer, e.g., the value of _aes_encrypt_vec.
If ifunc is not available, we have to define a wrapper function to jump via the function pointer. (FIXME: For internal calls, we could do this as a macro instead). We statically initialize each function pointer to point to a special initialization function, e.g., _aes_encrypt_init, which calls fat_init, and then invokes the right function. This way, all pointers are setup correctly at the first call to any fat function. */
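To illustrate the non-ifunc case described in the comment, here is a minimal sketch in plain C. It is not the code in fat.c: the names (demo_crypt, crypt_vec, have_aesni, fat_init_calls) are hypothetical, and the two implementations are stand-ins for the real x86_64 and aesni routines.

```c
#include <assert.h>

typedef int crypt_func(int);

/* Stand-ins for the plain x86_64 and the aesni implementations. */
static int crypt_x86_64(int x) { return x + 1; }
static int crypt_aesni(int x)  { return x + 1; }

static int have_aesni = 0;   /* hypothetical result of the cpuid check */
int fat_init_calls = 0;      /* only here so the sketch can be observed */

/* The pointer is statically initialized to the init function. */
static crypt_func crypt_init;
static crypt_func *crypt_vec = crypt_init;

/* Idempotent: every call stores the same pointer values. */
static void fat_init(void)
{
  crypt_vec = have_aesni ? crypt_aesni : crypt_x86_64;
  fat_init_calls++;
}

/* First call lands here, sets up the pointer, then does the real work. */
static int crypt_init(int x)
{
  fat_init();
  assert(crypt_vec != crypt_init);
  return crypt_vec(x);
}

/* Wrapper function that jumps via the function pointer. */
int demo_crypt(int x)
{
  return crypt_vec(x);
}
```

After the first call to demo_crypt, later calls go straight to the selected implementation and fat_init is not entered again.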
Regards, /Niels
On Tue, Jan 13, 2015 at 11:20 AM, Niels Möller nisse@lysator.liu.se wrote:
nisse@lysator.liu.se (Niels Möller) writes:
Clearly, this will be more useful after adding support for fat binaries, detecting presence of these instructions at runtime.
I've now had a first go at fat-library support. Checked in on the branch fat-library. See https://git.lysator.liu.se/nettle/nettle/blob/fat-library/x86_64/fat/fat.c
Looks nice. About the __attribute__((constructor)), you are restricting it to GNUC only, while it seems to be available more widely. In gnutls I use it unconditionally except for sun.
#ifdef __sun
# pragma init(fat_constructor)
# define _CONSTRUCTOR
#else
# define _CONSTRUCTOR __attribute__((constructor))
#endif
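For context, a small sketch of how such a macro might be combined with the on-demand fallback mentioned in the fat.c comment. This assumes a gcc-compatible compiler (or the Sun pragma); fat_initialized, fat_ensure_init and fat_is_initialized are hypothetical names for illustration only.

```c
#include <assert.h>

static int fat_initialized = 0;

static void fat_init(void)
{
  /* The real function would check cpuid and set function pointers. */
  fat_initialized = 1;
}

static void fat_constructor(void);

#ifdef __sun
# pragma init(fat_constructor)
# define _CONSTRUCTOR
#else
# define _CONSTRUCTOR __attribute__((constructor))
#endif

/* Runs when the library (or program) is loaded, before main. */
static void _CONSTRUCTOR fat_constructor(void)
{
  fat_init();
}

/* On-demand fallback, for when the constructor did not run. */
void fat_ensure_init(void)
{
  if (!fat_initialized)
    fat_init();
  assert(fat_initialized);
}

int fat_is_initialized(void)
{
  return fat_initialized;
}
```

If the constructor works, fat_is_initialized() already returns 1 at the start of main; fat_ensure_init() covers the case where it does not.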
It's early, but it would be nice if the arm neon code was part of fat as well.
regards, Nikos
On 13/01/2015 11:34 p.m., Nikos Mavrogiannopoulos wrote:
On Tue, Jan 13, 2015 at 11:20 AM, Niels Möller wrote:
(Niels Möller) writes:
Clearly, this will be more useful after adding support for fat binaries, detecting presence of these instructions at runtime.
I've now had a first go at fat-library support. Checked in on the branch fat-library. See https://git.lysator.liu.se/nettle/nettle/blob/fat-library/x86_64/fat/fat.c
Looks nice. About the __attribute__((constructor)), you are restricting it to GNUC only, while it seems to be available more widely. In gnutls I use it unconditionally except for sun.
#ifdef __sun
# pragma init(fat_constructor)
# define _CONSTRUCTOR
#else
# define _CONSTRUCTOR __attribute__((constructor))
#endif
It's early, but it would be nice if the arm neon code was part of fat as well.
FYI: the recent BSD versions require clang/llvm build support these days.
AYJ
On Tue, Jan 13, 2015 at 11:52 AM, Amos Jeffries squid3@treenet.co.nz wrote:
It's early, but it would be nice if the arm neon code was part of fat as well.
FYI: the recent BSD versions require clang/llvm build support these days.
That's the main reason I enable the constructor attribute unconditionally. It is supported by clang as well.
regards, Nikos
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
Looks nice. About the __attribute__((constructor)), you are restricting it to GNUC only, while it seems to be available more widely.
I do have a configure check, setting HAVE_GCC_ATTRIBUTE, based on compiling a test program using __attribute__((noreturn)). Maybe I could use that, and the __sun hack? I know both intel and llvm compilers attempt to be gcc compatible (sometimes too compatible, I've heard some versions of the intel compiler added __GNUC__ as a predefined...).
Ideally it would be nice to have a configure test that checks that constructors actually work, but that's hard to do when cross compiling.
It's early, but it would be nice if the arm neon code was part of fat as well.
Sure, that's the next step, once I have a structure I think is workable. Does anyone have a pointer to how to check cpu capabilities on ARM?
Another question: We need some kind of memory barrier when writing and/or reading the initialized flag. The (unlikely) failure case is a thread reading the initialized flag, getting 1, and then reading one of the function pointers, and getting a too old value.
What barrier instruction(s) should be used, on x86_64 and ARM? It's probably easiest to add any needed synchronization functions to cpuid.asm, to avoid relying on compiler-specific features.
Regards, /Niels
On Tue, 13 Jan 2015, Niels Möller wrote:
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
It's early, but it would be nice if the arm neon code was part of fat as well.
Sure, that's the next step, once I have a structure I think is workable. Does anyone have a pointer to how to check cpu capabilities on ARM?
Yes - and it's a bit hairy. (I've got a TL;DR version halfway down.)
There's no direct CPU instruction for it, contrary to x86. One way of detecting it via pure code is to try executing the tested instructions, and catching the SIGILL (or similar on other platforms) in case they aren't supported. (Touching signal handlers from within a library isn't necessarily a nice thing to do, though.)
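A rough sketch of the SIGILL approach on a POSIX system follows. All names are hypothetical, and the two fake_ functions only stand in for real probes: an actual probe would execute the instruction under test, while here raise(SIGILL) simulates the unsupported case. The sketch is not thread-safe, and it temporarily replaces the process-wide SIGILL handler, which is exactly the drawback noted above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>

static sigjmp_buf probe_env;

static void probe_sigill(int sig)
{
  (void) sig;
  /* Jump back out of the handler; the saved signal mask is restored. */
  siglongjmp(probe_env, 1);
}

/* Returns 1 if fn() ran to completion, 0 if it triggered SIGILL. */
int probe_instruction(void (*fn)(void))
{
  struct sigaction sa, old;
  volatile int result;

  assert(fn != NULL);
  sa.sa_handler = probe_sigill;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = 0;
  if (sigaction(SIGILL, &sa, &old) != 0)
    return 0;

  if (sigsetjmp(probe_env, 1) == 0)
    {
      fn();
      result = 1;
    }
  else
    result = 0;

  /* Put back whatever handler the application had installed. */
  sigaction(SIGILL, &old, NULL);
  return result;
}

/* Stand-ins for real instruction probes. */
void fake_supported(void) { }
void fake_unsupported(void) { raise(SIGILL); }
```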
Short of trying to run the instructions, some OSes provide this info in another way - Linux is the main case here.
Before going into the Linux case, note that iOS doesn't have such a mechanism, but it isn't really needed there. On iOS, all armv7 configurations include support for NEON, so if you can assemble NEON instructions you don't need any detection. Since this platform uses fat binaries, you could have a separate armv6 slice of your binary (and that's the main way of doing it here - instead of enabling things at runtime within one binary, include separate slices for each intended configuration). The recent Xcode tools no longer support building for armv6 though, and the App Store doesn't accept such submissions any longer.
Similarly for Windows Phone (and WinRT), the tools assume a platform with armv7 including NEON, so this doesn't require any detection. If you'd want to use more exotic instructions that aren't available in this baseline, you'd probably need to have detection via SIGILL/exception handlers.
On Linux, you can open /proc/self/auxv and parse this relatively easily, and check for HWCAP_NEON. This has got the drawback that recent Android kernels may block access to this file [1].
Instead of opening this file, you could use the getauxval function to get the same auxiliary vector. Since this function isn't universally available, you'd also need to check whether you can use it at all (or load it using dlsym). In particular, it has only been available for a relatively short time on Android, so you can't rely on it there.
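As a sketch of what checking the auxiliary vector amounts to: the vector is a sequence of (type, value) pairs terminated by an AT_NULL entry, and the HWCAP bits sit in the AT_HWCAP entry. The code below works on an in-memory array so that it does not depend on any platform header; the demo_ names are hypothetical, and the constants are assumptions copied from the 32-bit ARM Linux headers (AT_HWCAP is 16, HWCAP_NEON is bit 12). Reading /proc/self/auxv into such an array is left out.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_AT_NULL    0
#define DEMO_AT_HWCAP   16
#define DEMO_HWCAP_NEON (1UL << 12)

/* Same layout idea as the native auxv entry: one word type, one word value. */
struct demo_auxv
{
  unsigned long type;
  unsigned long value;
};

/* Returns the value for the given type, or 0 if it is not present. */
unsigned long auxv_lookup(const struct demo_auxv *v, size_t n,
                          unsigned long type)
{
  size_t i;

  assert(v != NULL || n == 0);
  for (i = 0; i < n && v[i].type != DEMO_AT_NULL; i++)
    if (v[i].type == type)
      return v[i].value;
  return 0;
}

int auxv_has_neon(const struct demo_auxv *v, size_t n)
{
  return (auxv_lookup(v, n, DEMO_AT_HWCAP) & DEMO_HWCAP_NEON) != 0;
}
```

Where getauxval is available, the lookup collapses to testing the same bit in getauxval(AT_HWCAP).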
The final fallback is parsing /proc/cpuinfo, which always should work. You can pretty easily find the Features line and look for the features. The line ends with a space, so you can use something as simple as strstr(line, " neon ") to parse it.
The gotcha about /proc/cpuinfo is that it is different for ARMv8 kernels - features like neon, which were optional on ARMv7, aren't optional any longer and thus are omitted. To handle this, you can either parse the "CPU architecture" field, and if this is >= 8, assume neon, or you can look for the "asimd" feature which is printed, which means the same.
To simplify running old 32 bit binaries, the Android ARMv8 kernels have an extra compatibility feature for this, readding the "neon" keyword there. [2] [3] This extra compatibility isn't available in upstream kernels though so it can't be relied on (it was proposed in [4] but not merged yet).
TL;DR - it's mostly only necessary on linux. The simplest solution which works everywhere is parsing /proc/cpuinfo.
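A minimal sketch of the /proc/cpuinfo approach, operating on the file contents already read into a string. The function names are hypothetical. Instead of strstr(line, " neon ") it compares whole whitespace-separated tokens, so it does not depend on the trailing space and "vfp" does not match "vfpv3". It treats "asimd" as equivalent to "neon", as described above for ARMv8 kernels; the alternative check on the "CPU architecture" field is not shown.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if a "Features" line in text lists the given feature. */
int cpuinfo_has_feature(const char *text, const char *feature)
{
  size_t flen;
  const char *line;

  assert(text != NULL && feature != NULL);
  flen = strlen(feature);

  for (line = text; line != NULL && *line != '\0'; )
    {
      const char *eol = strchr(line, '\n');
      size_t len = eol ? (size_t) (eol - line) : strlen(line);

      if (len >= 8 && memcmp(line, "Features", 8) == 0)
        {
          const char *end = line + len;
          const char *p = memchr(line, ':', len);

          if (p != NULL)
            for (p++; p < end; )
              {
                const char *tok;

                /* Skip separators, then scan one token. */
                while (p < end && (*p == ' ' || *p == '\t'))
                  p++;
                tok = p;
                while (p < end && *p != ' ' && *p != '\t')
                  p++;
                if (flen > 0 && (size_t) (p - tok) == flen
                    && memcmp(tok, feature, flen) == 0)
                  return 1;
              }
        }
      line = eol ? eol + 1 : NULL;
    }
  return 0;
}

int cpuinfo_has_neon(const char *text)
{
  return cpuinfo_has_feature(text, "neon")
    || cpuinfo_has_feature(text, "asimd");
}
```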
[1] http://b.android.com/43055
[2] https://android.googlesource.com/kernel/common/+/cba0c6b2913c0d075a7434025f5...
[3] https://android.googlesource.com/kernel/common/+/3868e7f8d47992922756d1aa659...
[4] http://marc.info/?l=linux-arm-kernel&m=139087240101974
Example of /proc/cpuinfo from a pandaboard:
Processor : ARMv7 Processor rev 10 (v7l)
processor : 0
BogoMIPS : 1392.74

processor : 1
BogoMIPS : 1363.33

Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc09
CPU revision : 10

Hardware : OMAP4 Panda board
Revision : 0020
Serial : 0000000000000000
From a Nexus 9:
Processor : NVIDIA Denver 1.0 rev 0 (aarch64)
processor : 0
processor : 1
Features : fp asimd aes pmull sha1 sha2 crc32
CPU implementer : 0x4e
CPU architecture: AArch64
CPU variant : 0x0
CPU part : 0x000
CPU revision : 0

Hardware : Flounder
Revision : 0000
Serial : 0000000000000000
MTS version : 33410787
From a Nexus 9, read from a 32 bit process:
Processor : NVIDIA Denver 1.0 rev 0 (aarch64)
processor : 0
processor : 1
Features : fp asimd aes pmull sha1 sha2 crc32 wp half thumb fastmult vfp edsp neon vfpv3 tlsi vfpv4 idiva idivt
CPU implementer : 0x4e
CPU architecture: 8
CPU variant : 0x0
CPU part : 0x000
CPU revision : 0

Hardware : Flounder
Revision : 0000
Serial : 0000000000000000
MTS version : 33410787
Finally, a few examples on all of this from other libraries:
libvpx, catching illegal instruction exceptions on windows platforms, and parsing /proc/cpuinfo: http://git.chromium.org/gitweb/?p=webm/libvpx.git;a=blob;f=vpx_ports/arm_cpu...
libav, trying /proc/self/auxv, falling back to /proc/cpuinfo: https://git.libav.org/?p=libav.git;a=blob;f=libavutil/arm/cpu.c;h=8bdaa884
OpenH264, with very minimal parsing of /proc/cpuinfo (and a bunch of other things): https://github.com/cisco/openh264/blob/34661f1d8/codec/common/src/cpu.cpp#L2...
The Android cpufeatures library (which tries /proc/self/auxv, tries loading getauxval, and falls back to /proc/cpuinfo): https://android.googlesource.com/platform/ndk/+/13a99c7f/sources/android/cpu...
x264, catching SIGILL: http://git.videolan.org/?p=x264.git;a=blob;f=common/cpu.c;h=cad5f2c2e9
// Martin
Martin Storsjö martin@martin.st writes:
On Tue, 13 Jan 2015, Niels Möller wrote:
Sure, that's the next step, once I have a structure I think is workable. Does anyone have a pointer to how to check cpu capabilities on ARM?
Yes - and it's a bit hairy. (I've got a TL;DR version halfway down.)
Thanks a lot. Sounds like it's easiest to just parse /proc/cpuinfo.
I should also say that I'd like to add some environment variable to override the cpu detection, mainly for testing and benchmarking. I'm thinking that maybe I should use glibc's secure_getenv (which always returns NULL in setuid processes and the like).
The gotcha about /proc/cpuinfo is that it is different for ARMv8 kernels - features like neon, which were optional on ARMv7, aren't optional any longer and thus are omitted. To handle this, you can either parse the "CPU architecture" field, and if this is >= 8, assume neon, or you can look for the "asimd" feature which is printed, which means the same.
I was going to say that there's no support for arm64 yet, but I take it this applies to arm64-systems running 32-bit binaries.
From a Nexus 9: Processor : NVIDIA Denver 1.0 rev 0 (aarch64) Features : fp asimd aes pmull sha1 sha2 crc32 CPU implementer : 0x4e CPU architecture: AArch64
From a Nexus 9, read from a 32 bit process: Processor : NVIDIA Denver 1.0 rev 0 (aarch64) Features : fp asimd aes pmull sha1 sha2 crc32 wp half thumb fastmult vfp edsp neon vfpv3 tlsi vfpv4 idiva idivt CPU architecture: 8
Are you saying that the CPU Architecture: line in /proc/cpuinfo will look different depending on whether the process that opened (or read???) /proc/cpuinfo was 32-bit or 64-bit? I had no idea...
Regards, /Niels
On Tue, 13 Jan 2015, Niels Möller wrote:
Martin Storsjö martin@martin.st writes:
On Tue, 13 Jan 2015, Niels Möller wrote:
Sure, that's the next step, once I have a structure I think is workable. Does anyone have a pointer to how to check cpu capabilities on ARM?
Yes - and it's a bit hairy. (I've got a TL;DR version halfway down.)
Thanks a lot. Sounds like it's easiest to just parse /proc/cpuinfo.
I should also say that I'd like to add some environment variable to override the cpu detection, mainly for testing and benchmarking. I'm thinking that maybe I should use glibc's secure_getenv (which always returns NULL in setuid processes and the like).
That sounds sensible. I guess you don't support Windows Phone and such (yet), but it may be good to keep in mind that getenv(3) isn't available in such environments at all. See e.g. http://git.chromium.org/gitweb/?p=webm/libvpx.git;a=commitdiff;h=20babf6d9d for a case of working around this.
The gotcha about /proc/cpuinfo is that it is different for ARMv8 kernels - features like neon, which were optional on ARMv7, aren't optional any longer and thus are omitted. To handle this, you can either parse the "CPU architecture" field, and if this is >= 8, assume neon, or you can look for the "asimd" feature which is printed, which means the same.
I was going to say that there's no support for arm64 yet, but I take it this applies to arm64-systems running 32-bit binaries.
Exactly - that's why it's a bit problematic - even if you don't care about 64 bit ARM yet, the existing detection may need to be adjusted slightly. Code that parses /proc/self/auxv will work just fine, and Android has added extra compatibility to their arm64 kernels, but for 32 bit binaries on normal linux systems this would be an issue. (Android haven't really "announced" this extra compatibility either, they did update their own cpufeatures library and recommend people to update it.)
Updating cpuinfo parsing to account for this isn't too hard though: https://git.libav.org/?p=libav.git;a=commitdiff;h=7b0c7c916
So this is mainly an issue if you have old code parsing /proc/cpuinfo that you haven't gotten to updating.
From a Nexus 9: Processor : NVIDIA Denver 1.0 rev 0 (aarch64) Features : fp asimd aes pmull sha1 sha2 crc32 CPU implementer : 0x4e CPU architecture: AArch64
From a Nexus 9, read from a 32 bit process: Processor : NVIDIA Denver 1.0 rev 0 (aarch64) Features : fp asimd aes pmull sha1 sha2 crc32 wp half thumb fastmult vfp edsp neon vfpv3 tlsi vfpv4 idiva idivt CPU architecture: 8
Are you saying that the CPU Architecture: line in /proc/cpuinfo will look different depending on whether the process that opened (or read???) /proc/cpuinfo was 32-bit or 64-bit? I had no idea...
Yes, if the process that opens /proc/cpuinfo is a 32 bit process, it lists a bit more features, and changes "CPU architecture" from "AArch64" to "8". This is an Android extension, to make sure that old binaries with cpu feature detection not aware of ARMv8 will still detect NEON on such devices, while vanilla kernels only will print the "normal" version of this (the one you see in 64 bit mode) regardless of the calling process type. (/proc/self/auxv does look different depending on the process bitness as well, even in vanilla kernels.)
See the two android kernel commits I linked in the previous mail, for info about this extra compatibility feature.
// Martin
On Tue, Jan 13, 2015 at 12:22 PM, Niels Möller nisse@lysator.liu.se wrote:
Another question: We need some kind of memory barrier when writing and/or reading the initialized flag. The (unlikely) failure case is a thread reading the initialized flag, getting 1, and then reading one of the function pointers, and getting a too old value. What barrier-instruction(s) should be used, on x86_64 and ARM? It's probably easiest to add any needed sychronization functions to cpuid.asm, to avoid relying on compiler-specific features.
Is that really needed? I mean you are setting these values at the constructor, that is prior to any thread being created, and there shouldn't be multiple CPUs to worry about.
regards, Nikos
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
Is that really needed? I mean you are setting these values at the constructor, that is prior to any thread being created, and there shouldn't be multiple CPUs to worry about.
If the constructor thing works, it's no problem. And if ifunc is supported, I don't know how that really works, but I imagine the dynamic loader serializes calls to the resolver functions, and that whatever magic is used in the case of static libraries also is no problem.
So remains the case of no C extensions, where the initialization is hooked in via the initial values of all function pointers (the way it's done in GMP). Here, all bets on the timing of calls are off, the application can spawn multiple threads, and have the threads all call nettle for the first time.
So the problem is a bit obscure, but I think if we just replace initialized = 1 by _nettle_synchronous_write (&initialized, 1), implemented as
_nettle_synchronous_write:
	mfence
	movl %esi, (%rdi)
	mfence
	ret
it will be safe in all cases.
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
Another question: We need some kind of memory barrier when writing and/or reading the initialized flag. The (unlikely) failure case is a thread reading the initialized flag, getting 1, and then reading one of the function pointers, and getting a too old value.
After discussing this on another forum, I've been told that the x86 architecture is strongly ordered (as long as one doesn't use certain instructions, like "non-temporal store"). So a plain store to a volatile int should do, with no memory barriers.
The case of ARM will be different, since it has a weaker memory model.
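For comparison, this is how the ordering requirement could be written with C11 atomics, which would cover both x86_64 and ARM. Note that this is a swapped-in technique and not what the thread settles on: it relies on a C11 compiler, which is the kind of compiler dependence the proposal to put helpers in cpuid.asm tries to avoid. All names are hypothetical and the two implementations are stand-ins.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef int crypt_func(int);

/* Stand-ins for the C and the assembly implementation. */
static int crypt_c(int x)   { return x ^ 0x5a; }
static int crypt_asm(int x) { return x ^ 0x5a; }

static crypt_func *_Atomic crypt_vec;  /* published by the flag below */
static atomic_int initialized;         /* zero-initialized: not set up */

void fat_init_published(int have_fast)
{
  atomic_store_explicit(&crypt_vec, have_fast ? crypt_asm : crypt_c,
                        memory_order_relaxed);
  /* Release: the pointer store above is visible before the flag is. */
  atomic_store_explicit(&initialized, 1, memory_order_release);
}

int crypt_call(int x)
{
  crypt_func *f;

  /* Acquire: a thread that reads 1 also sees the pointer written
     before the flag, so it cannot get a too old value. */
  if (!atomic_load_explicit(&initialized, memory_order_acquire))
    fat_init_published(0);

  f = atomic_load_explicit(&crypt_vec, memory_order_relaxed);
  assert(f != NULL);
  return f(x);
}
```

Several threads may still run the initialization concurrently; that is harmless here because it is idempotent, as in the fat.c comment.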
See also http://lwn.net/Articles/252110/
Regards, /Niels
On Tue, Jan 13, 2015 at 11:34 AM, Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com wrote:
On Tue, Jan 13, 2015 at 11:20 AM, Niels Möller nisse@lysator.liu.se wrote:
nisse@lysator.liu.se (Niels Möller) writes:
Clearly, this will be more useful after adding support for fat binaries, detecting presence of these instructions at runtime.
I've now had a first go at fat-library support. Checked in on the branch fat-library. See https://git.lysator.liu.se/nettle/nettle/blob/fat-library/x86_64/fat/fat.c
A quick and dirty patch to enable SSE2 instructions for memxor() on Intel CPUs is attached. I tried to follow the logic in the fat.c file, but I may have missed something. I've not added memxor3() because it is actually slower with SSE2.
SSE2:
memxor aligned 26081.83
memxor unaligned 25893.69

No-SSE2:
memxor aligned 17806.94
memxor unaligned 16581.48
regards, Nikos
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
A quick and dirty patch to enable SSE2 instructions for memxor() on Intel CPUs is attached. I tried to follow the logic in the fat.c file, but I may have missed something. I've not added memxor3() because it is actually slower with SSE2.
Cool!
SSE2: memxor aligned 26081.83 memxor unaligned 25893.69
No-SSE2: memxor aligned 17806.94 memxor unaligned 16581.48
How confident are you that the intel vs amd check is the right way to enable sse2? I guess we could add a check on the particular cpu model later, if needed. Which model(s) did you benchmark on?
It would be nice in a way if we could share code with x86_64/memxor.asm. E.g., by defining x86_64/fat/memxor-1.asm and x86_64/fat/memxor-2.asm which each include the same file with a different setting of USE_SSE2.
But I haven't looked at that carefully, it might be better to have a unified x86_64/fat/memxor.asm with two entry points, like you do.
I've also been considering m4 hacks to let a single fat .asm file include multiple other .asm files, or including the same file twice, without labels or m4 definitions colliding, but I'm not sure that's worth the effort. The foo-1.asm, foo-2.asm, ... scheme is a bit inelegant, but it is easy to understand.
- _nettle_cpuid (0, cpuid_data);
- if (memcmp(&cpuid_data[1], "Genu", 4) == 0 &&
memcmp(&cpuid_data[3], "ineI", 4) == 0 &&
memcmp(&cpuid_data[2], "ntel", 4) == 0) {
This could also be written as a single memcmp call, or 3 comparisons of integers.
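A sketch of the single-memcmp variant, assuming the layout used in the patch (cpuid_data holds eax, ebx, ecx, edx from cpuid leaf 0). The vendor string is spread over ebx, edx, ecx in that order, so the three words are first copied into one 12-byte buffer. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if the leaf 0 result carries the given 12-character vendor id,
   e.g. "GenuineIntel" or "AuthenticAMD". */
int cpuid_vendor_is(const uint32_t cpuid_data[4], const char *vendor)
{
  char id[12];

  assert(strlen(vendor) == 12);
  memcpy(id, &cpuid_data[1], 4);      /* ebx: "Genu" */
  memcpy(id + 4, &cpuid_data[3], 4);  /* edx: "ineI" */
  memcpy(id + 8, &cpuid_data[2], 4);  /* ecx: "ntel" */
  return memcmp(id, vendor, 12) == 0;
}
```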
Regards, /Niels
On Fri, 2015-01-16 at 22:18 +0100, Niels Möller wrote:
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
A quick and dirty patch to enable SSE2 instructions for memxor() on Intel CPUs is attached. I tried to follow the logic in the fat.c file, but I may have missed something. I've not added memxor3() because it is actually slower with SSE2.
Cool!
SSE2: memxor aligned 26081.83 memxor unaligned 25893.69
No-SSE2: memxor aligned 17806.94 memxor unaligned 16581.48
How confident are you that the intel vs amd check is the right way to enable sse2? I guess we could add check on the particular cpu model later, if needed. Which model(s) did you benchmark on?
The benchmarks (if it is the same as the older code I sent you a few years ago) were done on an intel i7, an i5 and a xeon. All of them showed an improvement. The benchmark above is on the i7.
As for it not improving on AMD, I have no more data than what I wrote you last time (which was a few years ago). No idea if newer AMD processors behave better.
It would be nice in a way if we could share code with x86_64/memxor.asm. E.g., by defining x86_64/fat/memxor-1.asm and x86_64/fat/memxor-2.asm which each include the same file with a different setting of USE_SSE2. But I haven't looked at that carefully, it might be better to have a unified x86_64/fat/memxor.asm with two entry points, like you do. I've also been considering m4 hacks to let a single fat .asm file include multiple other .asm files, or including the same file twice, without labels or m4 definitions colliding, but I'm not sure that's worth the effort. The foo-1.asm, foo-2.asm, ... scheme is a bit inelegant, but it is easy to understand.
I didn't like the duplication of code either. I'm not very skilled in m4, but I thought that x86_64/ could include the fat variant and use the non-sse2 variant.
The code in fat.c is quite elaborate on the cases it handles. The more functions added the more unmanageable the code will become. Would it make sense to restrict that support to the systems where ifunc is available? Then the addition of new optimized functions becomes very simple.
regards, Nikos
Nikos Mavrogiannopoulos nmav@gnutls.org writes:
The benchmarks (if it is same as the older code I've sent you few years ago), have been done on intel i7, i5 and a xeon. In all of them there was an improvement. The benchmark above is on i7.
About that not improving on AMD I have no more data than what I've wrote you last time (which was few years ago). No idea if newer AMD processors behave better.
I don't remember much of this benchmarking (and things may have changed, anyway). I think I'm going to add an environment variable to override the cpu detection, so different variants can be checked easily at runtime. So we'll see later on if some finer granularity is needed.
I didn't like the duplication of code either. I'm not very skilled in m4, but I though that x86_64/ could include the fat variant and use the non-sse2 variant.
I think I'd prefer to do it the other way around, with memxor-1.asm and memxor-2.asm both including x86_64/memxor.asm, just defining USE_SSE2 differently. With little actual code under fat/. Do you see any problem with that approach?
The code in fat.c is quite elaborate on the cases it handles. The more functions added the more unmanageable the code will become. Would it make sense to restrict that support to the systems where ifunc is available? Then the addition of new optimized functions becomes very simple.
I agree that as more functions are added, we need some macros for the boilerplate code. But I think that can be done without dropping support for the non-ifunc systems. Basically, use an alternative definition of your DEFINE_FAT_FUNC which defines a wrapper function and an init function, instead of a resolver function.
Regards, /Niels
On Sat, 2015-01-17 at 09:42 +0100, Niels Möller wrote:
I didn't like the duplication of code either. I'm not very skilled in m4, but I though that x86_64/ could include the fat variant and use the non-sse2 variant.
I think I'd prefer to do it the other way around, with memxor-1.asm and memxor-2.asm both including x86_64/memxor.asm, just defining USE_SSE2 differently. With little actual code under fat/. Do you see any problem with that approach?
No (but no idea how to implement it).
The code in fat.c is quite elaborate on the cases it handles. The more functions added the more unmanageable the code will become. Would it make sense to restrict that support to the systems where ifunc is available? Then the addition of new optimized functions becomes very simple.
I agree that as more functions are added, we need some macros for the boilerplate code. But I think that can be done without dropping support for the non-ifunc systems. Basically, use an alternative definition of your DEFINE_FAT_FUNC which defines a wrapper function and an init function, instead of a resolver function.
I realized that non-ifunc systems are desirable, or windows support goes away. I couldn't make wrapper functions using macros. What I'm thinking is a perl script which auto-generates the wrapper functions by reading fat.c and the header files. What would you think of that?
regards, Nikos
Nikos Mavrogiannopoulos nmav@gnutls.org writes:
On Sat, 2015-01-17 at 09:42 +0100, Niels Möller wrote:
I think I'd prefer to do it the other way around, with memxor-1.asm and memxor-2.asm both including x86_64/memxor.asm, just defining USE_SSE2 differently. With little actual code under fat/. Do you see any problem with that approach?
No (but no idea how to implement it).
Pushed in now.
I realized that non-ifunc systems are desirable, or windows support goes away. I couldn't make wrapper functions using macros. What I'm thinking is a perl script which auto-generates the wrapper functions by reading fat.c and the header files. What would you think of that?
Let me give it a try using the C preprocessor first.
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
Let me give it a try using the C preprocessor first.
Done now, starting from your macros.
Argument lists have to be passed both with and without types, which looks a bit ugly, but I think it's good enough. E.g, this is the code specific to _aes_encrypt:
DECLARE_FAT_FUNC(_nettle_aes_encrypt, aes_crypt_internal_func)
DECLARE_FAT_FUNC_VAR(aes_encrypt, aes_crypt_internal_func, x86_64)
DECLARE_FAT_FUNC_VAR(aes_encrypt, aes_crypt_internal_func, aesni)

DEFINE_FAT_FUNC(_nettle_aes_encrypt, void,
  (unsigned rounds, const uint32_t *keys,
   const struct aes_table *T,
   size_t length, uint8_t *dst,
   const uint8_t *src),
  (rounds, keys, T, length, dst, src))
The nice thing is that these definitions are not arch-specific, so the macros could be moved to fat.h and reused when writing fat-arm.c.
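To show why the argument list has to be passed twice, here is a simplified guess at how such macros could look for the non-ifunc case. These DEMO_ macros are hypothetical and are not the definitions in the fat-library branch: the typed list is needed for the function headers, the untyped one for forwarding the call through the pointer.

```c
#include <assert.h>

typedef int add_func(int a, int b);

/* Stand-ins for a C and an assembly implementation. */
int add_c(int a, int b)    { return a + b; }
int add_fast(int a, int b) { return a + b; }

static int demo_have_fast = 0;  /* hypothetical cpu feature flag */
static void demo_fat_init(void);

/* Declares the public function, its init function and the pointer,
   which initially points at the init function. */
#define DEMO_DECLARE_FAT_FUNC(name, ftype)   \
  ftype name;                                \
  static ftype name##_init;                  \
  static ftype *name##_vec = name##_init;

/* Defines the wrapper and the init function. "prototype" is the
   argument list with types, "args" the same list without types. */
#define DEMO_DEFINE_FAT_FUNC(name, rtype, prototype, args)  \
  rtype name prototype                                      \
  {                                                         \
    return name##_vec args;                                 \
  }                                                         \
  static rtype name##_init prototype                        \
  {                                                         \
    demo_fat_init();                                        \
    assert(name##_vec != name##_init);                      \
    return name##_vec args;                                 \
  }

DEMO_DECLARE_FAT_FUNC(demo_add, add_func)

static void demo_fat_init(void)
{
  demo_add_vec = demo_have_fast ? add_fast : add_c;
}

DEMO_DEFINE_FAT_FUNC(demo_add, int, (int a, int b), (a, b))
```

A function returning void, like _nettle_aes_encrypt, would need a variant of the wrapper without the return statement to stay within strict ISO C.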
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
I think I'm going to add an environment variable to override the cpu detection, so different variants can be checked easily at runtime.
Say we add a secure_getenv("NETTLE_FAT_OVERRIDE") in the fat initialization. What should it look like? Some alternatives:
1. Specify substitute values to replace result of the cpuid calls?
2. A list of feature keywords?
3. A list of function:variant, where each entry specifies an override for the particular function?
Regards, /Niels
On Sun, 2015-01-18 at 00:04 +0100, Niels Möller wrote:
nisse@lysator.liu.se (Niels Möller) writes:
I think I'm going to add an environment variable to override the cpu detection, so different variants can be checked easily at runtime.
Say we add a secure_getenv("NETTLE_FAT_OVERRIDE") in the fat initialization. What should it look like? Some alternatives:
- Specify substitute values to replace result of the cpuid calls?
- A list of feature keywords?
- A list of function:variant, where each entry specifies an override for the particular function?
For gnutls I have an environment variable which is interpreted as an alternative CPUID. E.g. you can put a flag with the following in GNUTLS_CPUID_OVERRIDE:
0x1: Disable all run-time detected optimizations
0x2: Enable AES-NI
0x4: Enable SSSE3
0x8: Enable PCLMUL
0x100000: Enable VIA padlock
0x200000: Enable VIA PHE
0x400000: Enable VIA PHE SHA512
nisse@lysator.liu.se (Niels Möller) writes:
Say we add a secure_getenv("NETTLE_FAT_OVERRIDE") in the fat initialization. What should it look like? Some alternatives:
Implemented (for systems that have secure_getenv). Syntax as follows: The value is a comma separated list. Entries are either single keywords, e.g., "neon" or "aesni", or keyword:value, e.g., "vendor:intel" or "arch:7".
Setting NETTLE_FAT_VERBOSE displays the values used in the same format.
If the environment variable is set, it completely overrides automatic detection.
It's a bit unclear if it works with static libraries.
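The override syntax described above (a comma separated list of keywords and keyword:value pairs) can be parsed along these lines. This is a sketch under stated assumptions: struct sketch_features and the function names are hypothetical, only the four example keywords from the description are recognized, unknown entries are silently ignored, and reading the environment variable itself is left out.

```c
#include <string.h>

/* Hypothetical result structure for this sketch. */
struct sketch_features {
  int have_aesni;
  int have_neon;
  char vendor[16];
  char arch[16];
};

/* True if the n bytes at s spell exactly the given keyword. */
static int sketch_key_is(const char *s, size_t n, const char *key)
{
  return strlen(key) == n && memcmp(s, key, n) == 0;
}

/* Copy n bytes into dst, truncating to fit, and NUL-terminate. */
static void sketch_copy(char *dst, size_t size, const char *src, size_t n)
{
  if (n >= size)
    n = size - 1;
  memcpy(dst, src, n);
  dst[n] = '\0';
}

/* Parse e.g. "aesni,vendor:intel" or "neon,arch:7". */
static void sketch_parse_override(const char *s, struct sketch_features *f)
{
  memset(f, 0, sizeof(*f));
  while (*s != '\0')
    {
      size_t len = strcspn(s, ",");            /* length of this entry */
      const char *colon = memchr(s, ':', len); /* NULL for plain keywords */
      size_t klen = colon ? (size_t) (colon - s) : len;

      if (colon == NULL)
        {
          if (sketch_key_is(s, klen, "aesni"))
            f->have_aesni = 1;
          else if (sketch_key_is(s, klen, "neon"))
            f->have_neon = 1;
        }
      else
        {
          size_t vlen = len - klen - 1;
          if (sketch_key_is(s, klen, "vendor"))
            sketch_copy(f->vendor, sizeof(f->vendor), colon + 1, vlen);
          else if (sketch_key_is(s, klen, "arch"))
            sketch_copy(f->arch, sizeof(f->arch), colon + 1, vlen);
        }
      s += len;
      if (*s == ',')
        s++;
    }
}
```

Because the structure is zeroed before parsing, a set variable replaces detection entirely instead of adding to it, matching the "completely overrides" behaviour described above.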
Regards, /Niels
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
It's early, but it would be nice if the arm neon code was part of fat as well.
I've pushed a start for fat binary support on arm. A complication, not yet handled, is that for some functions (in fact, all neon-related code, I think), the runtime choice is between the C implementation and an assembly implementation. So we need to do some additional name mangling, e.g.,
#define sha3_permute _nettle_sha3_permute_c
during the compilation of sha3-permute.c. Just a question on where to configure that. Maybe one could add something like
#ifdef FAT_RENAME
#include FAT_RENAME
#endif
and substitute FAT_RENAME in config.h, e.g., to fat-arm-rename.h. Hmm, or maybe it's good enough to just check the HAVE_NATIVE_foo. (In the case of non-fat builds, when that define is set, the file with the C implementation is never compiled at all).
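The effect of such a rename can be shown in a few lines. This is a sketch only: SKETCH_FAT_BUILD stands in for whatever config.h would define, and sketch_rotl is a made-up function playing the role of sha3_permute.

```c
#include <stdint.h>

/* Hypothetical stand-in for a setting that config.h would provide
   in a fat build. */
#define SKETCH_FAT_BUILD 1

#if SKETCH_FAT_BUILD
/* Rename the C implementation so that it can coexist with an
   assembly variant of the same function. */
# define sketch_rotl sketch_rotl_c
#endif

/* The source file is written with the plain name; in a fat build the
   preprocessor turns this into a definition of sketch_rotl_c. */
uint32_t sketch_rotl(uint32_t x)
{
  return (x << 1) | (x >> 31);
}
```

With the define active, the object file exports only the _c name, leaving the plain name free for a wrapper that picks an implementation at runtime.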
Regards, /Niels
On Mon, 19 Jan 2015, Niels Möller wrote:
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
It's early, but it would be nice if the arm neon code was part of fat as well.
I've pushed a start for fat binary support on arm.
Unfortunately I don't have much opinion on the other things you mentioned in your mail, but I did have a brief look at the arm feature detection.
I see you're looking at the CPU architecture field as well. There's a big gotcha related to that one; some ARMv6 CPUs report CPU architecture: 7. See http://code.google.com/p/android/issues/detail?id=10812 and https://android.googlesource.com/platform/ndk/+/13a99c7f/sources/android/cpu... (lines 716-737) for more details about this. (Unfortunately I don't have any better pointers to the kernel source/discussions for an explanation of this.)
For example a raspberry pi has got the following /proc/cpuinfo:
processor : 0
model name : ARMv6-compatible processor rev 7 (v6l)
Features : swp half thumb fastmult vfp edsp java tls
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xb76
CPU revision : 7

Hardware : BCM2708
Revision : 0002
Serial : 00000000d605188c
If you only need to decide whether to enable ARMv6 specific instructions, it should be just fine, but in case you'd use it for enabling ARMv7 stuff as well, you'd need some sort of workaround for this.
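One possible workaround, sketched here under the assumption that the "(v6l)" marker seen in the raspberry pi output above reliably identifies such CPUs, is to downgrade a reported architecture of 7 when that marker is present. The function name is hypothetical and the input is the text of /proc/cpuinfo already read into memory.

```c
#include <stdlib.h>
#include <string.h>

/* Return the effective ARM architecture version found in the given
   /proc/cpuinfo text, or 0 if the field is missing. */
static int sketch_arm_arch(const char *cpuinfo)
{
  const char *p = strstr(cpuinfo, "CPU architecture");
  int arch;

  if (p == NULL)
    return 0;
  p = strchr(p, ':');
  if (p == NULL)
    return 0;
  arch = atoi(p + 1);   /* atoi skips the blank after the colon */

  /* Heuristic (an assumption of this sketch): some ARMv6 cores report
     architecture 7 but still carry "(v6l)" in the processor name. */
  if (arch >= 7 && strstr(cpuinfo, "(v6l)") != NULL)
    arch = 6;
  return arch;
}
```

A table keyed on the "CPU part" and "CPU implementer" fields would be more precise, at the cost of maintaining the table.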
// Martin
Martin Storsjö martin@martin.st writes:
I see you're looking at the CPU architecture field as well. There's a big gotcha related to that one; some ARMv6 CPUs report CPU architecture: 7.
Oops. To get it right, maybe one has to look up the "CPU part" field (and "CPU implementer"?) in a table.
But currently, it's used only to check for existence of armv6 instructions. I don't remember all the details, but one of the instructions seems to be uxtb, used in the aes code.
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
I've pushed a start for fat binary support on arm.
And now some more, including choice between C and neon implementations (currently there's neon code for salsa20, sha512, sha3, and umac).
Testing appreciated.
I haven't done the memory barrier thing yet; it appears to be more complicated than I had hoped. The manual I have says that the dmb instruction (data memory barrier) is available only with armv7 and later, and that armv6 uses writes to CP15 registers (I haven't yet tried to figure out what that means, or whether this method also works on later versions).
For pre armv6, maybe memory was strongly ordered, or there was no multi-processor support at all?
So it seems we may need some arch type detection to find out if and how to do a memory barrier!
I've had a quick look at what linux does (https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/ar...), and it seems messy.
I wonder if there's some different approach to get the C compiler, the C library, or the kernel to do a memory barrier for us? I'd prefer not to link with any thread libraries.
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
I haven't done the memory barrier thing yet; it appears to be more complicated than I had hoped. The manual I have says that the dmb instruction (data memory barrier) is available only with armv7 and later, and that armv6 uses writes to CP15 registers (I haven't yet tried to figure out what that means, or whether this method also works on later versions).
I think I've found a simple solution. I deleted the initialized flag in fat_init, instead I let each caller read the particular function pointer it is interested in, and check if it is already properly initialized or not. I.e., check if the current value equals its static initializer, and if so, call fat_init.
This way, store order consistency between threads no longer matters, and we won't need any memory barriers.
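The idea can be sketched as follows. All names here are hypothetical and the sketch is single-threaded: the function pointer starts out pointing at an init stub, and the stub compares the pointer against that static initializer instead of consulting a separate initialized flag.

```c
typedef int sketch_func(int x);

static sketch_func sketch_f_init;   /* stub, used as static initializer */
static sketch_func sketch_f_c;      /* the real implementation */

/* The pointer initially refers to the stub. */
static sketch_func *sketch_f_vec = sketch_f_init;

/* Counts initializations; only here so the behaviour can be observed. */
static int sketch_init_calls;

static int sketch_f_c(int x)
{
  return x + 1;
}

/* Stand-in for fat_init: always stores the same values, so running
   it more than once is harmless. */
static void sketch_fat_init(void)
{
  sketch_init_calls++;
  sketch_f_vec = sketch_f_c;
}

/* Stub: if the pointer still equals its static initializer, no
   initialization has been observed yet, so run it now. */
static int sketch_f_init(int x)
{
  if (sketch_f_vec == sketch_f_init)
    sketch_fat_init();
  return sketch_f_vec(x);
}

/* Public entry point: always dispatches through the pointer. */
int sketch_f(int x)
{
  return sketch_f_vec(x);
}
```

Each caller depends only on the one pointer it reads, which is why the order in which other pointers are stored stops mattering.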
I'd like to merge this code on the master branch soon. It would be nice if anyone else could give it a little testing, in particular on various ARM devices. I've tested it on a few different x86_64 PCs and an ARMv7 pandaboard, all running GNU/Linux.
Regards, /Niels
On Fri, 23 Jan 2015, Niels Möller wrote:
nisse@lysator.liu.se (Niels Möller) writes:
I haven't done the memory barrier thing yet; it appears to be more complicated than I had hoped. The manual I have says that the dmb instruction (data memory barrier) is available only with armv7 and later, and that armv6 uses writes to CP15 registers (I haven't yet tried to figure out what that means, or whether this method also works on later versions).
I think I've found a simple solution. I deleted the initialized flag in fat_init, instead I let each caller read the particular function pointer it is interested in, and check if it is already properly initialized or not. I.e., check if the current value equals its static initializer, and if so, call fat_init.
This way, store order consistency between threads no longer matters, and we won't need any memory barriers.
I'd like to merge this code on the master branch soon. It would be nice if anyone else could give it a little testing, in particular on various ARM devices. I've tested it on a few different x86_64 PCs and an ARMv7 pandaboard, all running GNU/Linux.
I tested it on a raspberry pi (ARMv6), and it seems to work pretty much as intended - I was able to do a fat build with neon, while executing the testsuite works (so the detection seems to work as intended).
I also tested building for ARMv5 using the android NDK, and I noted that arm/v6/aes*.asm require a ".arch armv6" at the start, otherwise they fail to assemble in that configuration. (The neon sources seem to have ".fpu neon" similarly already. I'm not sure if some of the neon source perhaps would require an ".arch armv7-a" as well, but they did seem to build just fine in my test so perhaps it isn't necessary.)
To test this for yourself in case you're interested, add <ndk>/toolchains/arm-linux-androideabi-4.6/prebuilt/*x86*/bin to your path, and configure with this line:
SYSROOT=<ndk>/platforms/android-3/arch-arm/ CC="arm-linux-androideabi-gcc --sysroot=$SYSROOT" CXX="arm-linux-androideabi-g++ --sysroot=$SYSROOT" ./configure --host=arm-linux-gnueabi --enable-fat
Other than that, building with --enable-fat does seem to do the right thing - much better than the current setup. (E.g. currently, if cross-compiling for raspberry pi, it fails to enable the v6 routines, since the host triplet is arm-bcm2708hardfp-linux-gnueabi even though it's an armv6 device. When building on such a device, config.guess gives armv6l-unknown-linux-gnueabihf instead.)
I take it you've tested building for windows? Although the x86 detection should be much simpler, so it's only the absence of ifunc that'd be tested there.
// Martin
Martin Storsjö martin@martin.st writes:
I tested it on a raspberry pi (ARMv6), and it seems to work pretty much as intended - I was able to do a fat build with neon, while executing the testsuite works (so the detection seems to work as intended).
Good.
I also tested building for ARMv5 using the android NDK, and I noted that arm/v6/aes*.asm require a ".arch armv6" at the start, otherwise they fail to assemble in that configuration.
I'll apply that patch.
To test this for yourself in case you're interested, add <ndk>/toolchains/arm-linux-androideabi-4.6/prebuilt/*x86*/bin to your path, configure with this line:
I have some experience in building things for android. But it's a bit of a hassle to get the testsuite over to a device for testing.
Other than that, building with --enable-fat does seem to do the right thing - much better than the current setup. (E.g. currently, if cross-compiling for raspberry pi, it fails to enable the v6 routines, since the host triplet is arm-bcm2708hardfp-linux-gnueabi even though it's an armv6 device. When building on such a device, config.guess gives armv6l-unknown-linux-gnueabihf instead.)
To me, that sounds like the crosscompiler setup is a bit strange. It ought to be possible to configure with --host=armv6l-unknown-linux-gnueabihf, and still get the right cross tools, right?
I take it you've tested building for windows?
I can cross compile for 32-bit and 64-bit windows, but my wine setup doesn't support 64-bit executables. So testing on x86_64 windows (and macosx) is appreciated.
Thanks a lot for the testing, /Niels
On Sat, 24 Jan 2015, Niels Möller wrote:
Martin Storsjö martin@martin.st writes:
Other than that, building with --enable-fat does seem to do the right thing - much better than the current setup. (E.g. currently, if cross-compiling for raspberry pi, it fails to enable the v6 routines, since the host triplet is arm-bcm2708hardfp-linux-gnueabi even though it's an armv6 device. When building on such a device, config.guess gives armv6l-unknown-linux-gnueabihf instead.)
To me, that sounds like the crosscompiler setup is a bit strange. It ought to be possible to configure with --host=armv6l-unknown-linux-gnueabihf, and still get the right cross tools, right?
Yeah, it'd just require a bit more typing (setting CC/CXX manually). But it's no big issue anyway since the option for fat builds takes care of it nicely.
// Martin
On Wed, 21 Jan 2015, Niels Möller wrote:
nisse@lysator.liu.se (Niels Möller) writes:
I've pushed a start for fat binary support on arm.
And now some more, including choice between C and neon implementations (currently there's neon code for salsa20, sha512, sha3, and umac).
I noticed that arm/v6/sha1-compress and arm/v6/sha256-compress aren't hooked up in fat builds yet - is that intentional?
For pre armv6, maybe memory was strongly ordered, or there was no multi-processor support at all?
AFAIK there was no multiprocessor support before that at all. But your latest solution seems to be simple and robust enough, and simple is always good.
// Martin
Martin Storsjö martin@martin.st writes:
I noticed that arm/v6/sha1-compress and arm/v6/sha256-compress aren't hooked up in fat builds yet - is that intentional?
No, that's unintentional. I should fix it.
Regards, /Niels
Martin Storsjö martin@martin.st writes:
I noticed that arm/v6/sha1-compress and arm/v6/sha256-compress aren't hooked up in fat builds yet - is that intentional?
Fixed now. On the master branch, where everything is merged now.
Regards, /Niels