I quickly discovered that I needed the round-reduced variants of Salsa20. Here they are. I didn't update the assembler part, so I would need help finishing that part.
/Simon
Simon Josefsson simon@josefsson.org writes:
I quickly discovered that I needed the round-reduced variants of Salsa20. Here they are. I didn't update the assembler part, so I would need help finishing that part.
Nice. Key setup is the same, just the number of iterations in the main loop is different?
+static void +salsa20r_crypt(struct salsa20_ctx *ctx,
int rounds,
unsigned length,
uint8_t *c,
const uint8_t *m)
This is the function which should be implemented in assembly. Do you think this should be a public, documented function (in which case salsa20r_crypt sounds like a reasonable name), or internal, in which case nettle convention would be something like _salsa20_crypt?
--- a/testsuite/salsa20-test.c +++ b/testsuite/salsa20-test.c @@ -13,13 +13,18 @@ memzero_p (const uint8_t *p, size_t n) return 1; }
+typedef void (*salsa20_func) (struct salsa20_ctx *ctx,
unsigned length, uint8_t *dst,
const uint8_t *src);
A very minor point... I've decided to try to stick to non-pointer typedefs for function types. I.e.,
typedef void salsa20_func (struct salsa20_ctx *ctx, unsigned length, uint8_t *dst, const uint8_t *src);
That way, it is possible (although maybe not useful very often) to use the typedef for declaring functions, like
salsa20_func salsa20_crypt; salsa20_func salsa20r8_crypt; salsa20_func salsa20r12_crypt;
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
Simon Josefsson simon@josefsson.org writes:
I quickly discovered that I needed the round-reduced variants of Salsa20. Here they are. I didn't update the assembler part, so I would need help finishing that part.
Nice. Key setup is the same, just the number of iterations in the main loop is different?
Yes.
+static void +salsa20r_crypt(struct salsa20_ctx *ctx,
int rounds,
unsigned length,
uint8_t *c,
const uint8_t *m)
This is the function which should be implemented in assembly.
Actually, sleeping on this, I realized that we really want to export the Salsa20 core primitive (this was what I actually needed), and that is the primitive that should be implemented in assembler. I've fixed this in the attached patch.
The Salsa20 core is a hash function (not your typical hash function though) described here:
If we implement that quickly in assembler, with a variable round parameter, that will be sufficient to build fast C code around.
All of the Salsa20, Salsa20/8, Salsa20/12 and the stream handling code could then be written in C.
Do you think this should be a public, documented function (in which case salsa20r_crypt sounds like a reasonable name), or internal, in which case nettle convention would be something like _salsa20_crypt?
It should be an internal function, I've fixed this in the patch below.
--- a/testsuite/salsa20-test.c +++ b/testsuite/salsa20-test.c @@ -13,13 +13,18 @@ memzero_p (const uint8_t *p, size_t n) return 1; }
+typedef void (*salsa20_func) (struct salsa20_ctx *ctx,
unsigned length, uint8_t *dst,
const uint8_t *src);
A very minor point... I've decided to try to stick to non-pointer typedefs for function types. I.e.,
typedef void salsa20_func (struct salsa20_ctx *ctx, unsigned length, uint8_t *dst, const uint8_t *src);
That way, it is possible (although maybe not useful very often) to use the typedef for declaring functions, like
salsa20_func salsa20_crypt; salsa20_func salsa20r8_crypt; salsa20_func salsa20r12_crypt;
Sure.
Updated patch below.
/Simon
Simon Josefsson simon@josefsson.org writes:
Actually, sleeping on this, I realized that we really want to export the Salsa20 core primitive (this was what I actually needed), and that is the primitive that should be implemented in assembler. I've fixed this in the attached patch.
The Salsa20 core is a hash function (not your typical hash function though) described here:
I guess it could be named salsa20_hash, then? (I think there was such a function in a previous version of the code).
If we implement that quickly in assembler, with a variable round parameter, that will be sufficient to build fast C code around.
Then you'd first write the hash output to memory, then read it back to xor it with the message. Since sals20 is pretty fast, I think you'll get a measurablle performance penalty compared to the currrent code which keeps the hash output in registers until it is xored to the message. You really need to get just the hash output, without xoring it to anything?
It would definitely be cleaner to have the hash function separately.
+salsa20_core (uint32_t src[_SALSA20_INPUT_LENGTH],
uint32_t dst[_SALSA20_INPUT_LENGTH],
unsigned rounds)
[...]
- for (i = 0;i < _SALSA20_INPUT_LENGTH;++i)
- {
uint32_t t = x[i] + src[i];
dst[i] = LE_SWAP32 (t);
- }
+}
This makes for a very peculiar interface for a non-internal function. It would make more sense from an interface perspectivve to either not do these byte swaps, or have the output parameter be of type uint8_t *. Or do something like the union gcm_block in gcm.h (although that's also not pretty), if we want to be able to store the byte swapped value with a word-sized store.
I don't remember precisely the background of the current implementation, but I think the point was to do as much as possible of the processing as word operations, including the byte swapping.
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
Simon Josefsson simon@josefsson.org writes:
Actually, sleeping on this, I realized that we really want to export the Salsa20 core primitive (this was what I actually needed), and that is the primitive that should be implemented in assembler. I've fixed this in the attached patch.
The Salsa20 core is a hash function (not your typical hash function though) described here:
I guess it could be named salsa20_hash, then? (I think there was such a function in a previous version of the code).
The name of the hash is "Salsa20 core" but I think little effort has gone into tightening up the documentation around the Salsa20 hash (for example, there are no test vectors that I could find). salsa20_hash works for me, but could be confusing as it isn't a normal hash.
If we implement that quickly in assembler, with a variable round parameter, that will be sufficient to build fast C code around.
Then you'd first write the hash output to memory, then read it back to xor it with the message. Since sals20 is pretty fast, I think you'll get a measurablle performance penalty compared to the currrent code which keeps the hash output in registers until it is xored to the message.
Right, good point.
You really need to get just the hash output, without xoring it to anything?
Yes, although if necessary I could xor it to a zero buffer if there were no other way... however I'll loose performance, and my application (scrypt) would benefit from good performance.
It would definitely be cleaner to have the hash function separately.
I agree.
+salsa20_core (uint32_t src[_SALSA20_INPUT_LENGTH],
uint32_t dst[_SALSA20_INPUT_LENGTH],
unsigned rounds)
[...]
- for (i = 0;i < _SALSA20_INPUT_LENGTH;++i)
- {
uint32_t t = x[i] + src[i];
dst[i] = LE_SWAP32 (t);
- }
+}
This makes for a very peculiar interface for a non-internal function. It would make more sense from an interface perspectivve to either not do these byte swaps, or have the output parameter be of type uint8_t *. Or do something like the union gcm_block in gcm.h (although that's also not pretty), if we want to be able to store the byte swapped value with a word-sized store.
Let's use uint8_t. The first sentence of the Salsa20 core webpage is:
The Salsa20 core is a function from 64-byte strings to 64-byte strings: the Salsa20 core reads a 64-byte string x and produces a 64-byte string Salsa20(x).
So that is consistent with uint8_t.
I don't remember precisely the background of the current implementation, but I think the point was to do as much as possible of the processing as word operations, including the byte swapping.
Yes that will be faster I suppose.
/Simon
Simon Josefsson simon@josefsson.org writes:
Yes, although if necessary I could xor it to a zero buffer if there were no other way...
I was going to suggest using something like
void salsa20_core (const uint32_t *input, unsigned rounds, unsigned length, uint8_t *dst, const uint8_t *src)
which would compute the hash of the INPUT. If SRC is NULL, it would just store the LENGTH first bytes at DST. And if SRC is non-NULL, it would xor that data before storing.
A bit clumsy, but reasonably general.
But then I remembered that for high performance encryption, it's not sufficient with an assembly function which does only one block. You can gain a lot of performance (current code doesn't do that) by hashing two input blocks in parallel.
So at the moment, I'm inclined to let the assembly function do encryption, including xoring data and incrementing the counter. And if you only need the hash, you'd need a wrapper which encrypts a zero buffer and undoes the incrementing (the assembly function could leave the input block unmodified and maintain the updated counter locally; that would be a bit extra hassle but I don't think it would impact performance). Then we have sufficient freedom for optimizing salsa20 encryption, with a small performance disadvantage for using just the hash.
however I'll loose performance, and my application (scrypt) would benefit from good performance.
Will you be hashing several independent blocks? If so, for highest perforamnce you would also neeed an interface which lets the assembly routines do several blocks in parallel.
Regards, /Niels
Simon Josefsson simon@josefsson.org writes:
Actually, sleeping on this, I realized that we really want to export the Salsa20 core primitive (this was what I actually needed), and that is the primitive that should be implemented in assembler.
Now that this is in, did you make any use of it yet? I don't quite remember what your application was.
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
Simon Josefsson simon@josefsson.org writes:
Actually, sleeping on this, I realized that we really want to export the Salsa20 core primitive (this was what I actually needed), and that is the primitive that should be implemented in assembler.
Now that this is in, did you make any use of it yet? I don't quite remember what your application was.
For some reason I had missed that this went in, thanks for commiting this. How about the attached final cleanup patch, to export the Salsa20 core? I think that is consistent with the approach you outlined.
My application was scrypt, and I'll update my implementation to use the latest Nettle interfaces soon. I'll submit the scrypt part later on...
/Simon
Also when updating NEWS I noticed that it says:
* Support for the salsa20 block cipher, including x86_64
Obviously, Salsa20 is a stram cipher, not a block cipher. Do you want to fix that (it was for the already released version 2.5 release)?
/Simon
Simon Josefsson simon@josefsson.org writes:
Also when updating NEWS I noticed that it says:
- Support for the salsa20 block cipher, including x86_64
Obviously, Salsa20 is a stram cipher, not a block cipher. Do you want to fix that (it was for the already released version 2.5 release)?
Fixed that now. A bit silly to edit an old NEWS entry, but typos and errors like this should not be left around.
Regards, /Niels
Simon Josefsson simon@josefsson.org writes:
For some reason I had missed that this went in, thanks for commiting this. How about the attached final cleanup patch, to export the Salsa20 core? I think that is consistent with the approach you outlined.
I've had a quick look, and I think it makes sense. But I'm not sure if it gets byteorder right.
+void +salsa20_core (uint8_t *dst,
const uint8_t *src,
unsigned rounds)
+{
[...]
- _salsa20_core (dst32, src32, rounds);
- for (i = 0; i < _SALSA20_INPUT_LENGTH; i++)
- LE_WRITE_UINT32(&dst[i * sizeof (uint32_t)], dst32[i]);
+}
_salsa20_core produces byteswapped output (a somewhat peculiar interface, but it's an internal function, the reason is to be able do do any needed swap with word operations). So I think it should be a plain memcpy above.
Testing on both little-endian and big-endian would be nice (I should try to get xenofarm up and running again, but last time I tried the git support seemed somewhat lacking).
Regards, /Niels
On 11/06/2012 12:27 PM, Niels Möller wrote:
_salsa20_core produces byteswapped output (a somewhat peculiar interface, but it's an internal function, the reason is to be able do do any needed swap with word operations). So I think it should be a plain memcpy above.
Testing on both little-endian and big-endian would be nice
I've tested on a powerpc machine running debian sid, starting at the current HEAD (1c7d767b4ca224e26b7d7d3875d2d89201accb29); at the HEAD, "make check" passes all tests with no problems.
With simon's 0001-Add-salsa20_core-function.patch applied (i manually tuned the Changelog entry), i get:
PASS: gosthash94 PASS: ripemd160 Assert failed 193: MEMEQ (SALSA20_BLOCK_SIZE, dst, H("a41f859c6608cc993b81cacb020cef05" "044b2181a2fd337dfd7b1c6396682f29" "b4393168e3c9e6bcfe6bc5b7a06d96ba" "e424cc102c91745c24ad673dc7618f81")) Aborted FAIL: salsa20 PASS: sha1
I'd be happy to figure out how to give access to this powerpc machine if any of y'all want it, or to make other changes and test them on this big-endian machine.
hth,
--dkg
Daniel Kahn Gillmor dkg@fifthhorseman.net writes:
I'd be happy to figure out how to give access to this powerpc machine if any of y'all want it,
I have access to a couple of big endian machines (the most convenient for me being a sparc), but I don't run tests on them very regularly.
or to make other changes and test them on this big-endian machine.
Now we get to patching patches, which gets a bit messy, but I suspect it will work fine if you replace
for (i = 0; i < _SALSA20_INPUT_LENGTH; i++) LE_WRITE_UINT32(&dst[i * sizeof (uint32_t)], dst32[i]);
at the end of Simon's salsa20_core.c:salsa20_core by
/* _salsa20_core does any needed byteswapping of the output. A memcpy is needed to support unaligned dst; simply casting like _salsa20_core ((uint32_t *) dst, src32, rounds) is not portable. */ memcpy (dst, dst32, sizeof (dst32));
Regards, /Niels
On 11/27/2012 04:20 PM, Niels Möller wrote:
Now we get to patching patches, which gets a bit messy, but I suspect it will work fine if you replace
for (i = 0; i < _SALSA20_INPUT_LENGTH; i++) LE_WRITE_UINT32(&dst[i * sizeof (uint32_t)], dst32[i]);
at the end of Simon's salsa20_core.c:salsa20_core by
/* _salsa20_core does any needed byteswapping of the output. A memcpy is needed to support unaligned dst; simply casting like _salsa20_core ((uint32_t *) dst, src32, rounds) is not portable. */ memcpy (dst, dst32, sizeof (dst32));
Thanks, that's now tested and "make check" passes on both powerpc and i386.
I've attached a revised version of the patch containing the above change.
Regards,
--dkg
Daniel Kahn Gillmor dkg@fifthhorseman.net writes:
/* _salsa20_core does any needed byteswapping of the output. A memcpy is needed to support unaligned dst; simply casting like _salsa20_core ((uint32_t *) dst, src32, rounds) is not portable. */ memcpy (dst, dst32, sizeof (dst32));
Thanks, that's now tested and "make check" passes on both powerpc and i386.
Thanks for testing. I was about to update this patch too, but then it occured to me that this interface:
+void +salsa20_core (uint8_t *dst,
const uint8_t *src,
unsigned rounds)
is not ideal -- the reason is that the Salsa20 core is not defined with a parametrised number of rounds, so the interface is somewhat of a bastardisation.
In my work space, I have used the namespace 'salsa20r_core' instead. This opens up for later addition of a true 'salsa20_core' function which would use the official 20 rounds.
What do you think?
The patch below is update to apply against latest master.
/Simon
Simon Josefsson simon@josefsson.org writes:
+void +salsa20_core (uint8_t *dst,
const uint8_t *src,
unsigned rounds)
is not ideal -- the reason is that the Salsa20 core is not defined with a parametrised number of rounds, so the interface is somewhat of a bastardisation.
Naming is difficult, it's awkward to use the prefix "salsa20" for a function which is "salsa20, but not really 20"... In sed syntax it would be salsa20_sx20xrx ;-)
In my work space, I have used the namespace 'salsa20r_core' instead. This opens up for later addition of a true 'salsa20_core' function which would use the official 20 rounds.
What do you think?
I have no better suggestions for naming. But if we think of salsa20r_core as mostly for internal use, maybe we don't need it?
If I understood you correctly, your primary use case is scrypt, which you intend to implement in Nettle? Then maybe you would be better off without an extra wrapper function around _salsa20_core? If nothing else, you could then make sure you have proper alignment so you don't need an extra memcpy.
I hesitate a bit to add, document and support a new "obscure" function until there's a clear external use case.
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
Simon Josefsson simon@josefsson.org writes:
+void +salsa20_core (uint8_t *dst,
const uint8_t *src,
unsigned rounds)
is not ideal -- the reason is that the Salsa20 core is not defined with a parametrised number of rounds, so the interface is somewhat of a bastardisation.
Naming is difficult, it's awkward to use the prefix "salsa20" for a function which is "salsa20, but not really 20"... In sed syntax it would be salsa20_sx20xrx ;-)
It could be argued that the name of the function is actually "Salsa20 core" and that reduced round versions would be "Salsa20/r core" or similar. That would be consistent with the naming scheme for the Salsa20 stream cipher -- there Salsa20/12 means the Salsa20 stream cipher but with 12 rounds. It is not called Salsa12.
In my work space, I have used the namespace 'salsa20r_core' instead. This opens up for later addition of a true 'salsa20_core' function which would use the official 20 rounds.
What do you think?
I have no better suggestions for naming. But if we think of salsa20r_core as mostly for internal use, maybe we don't need it?
If I understood you correctly, your primary use case is scrypt, which you intend to implement in Nettle? Then maybe you would be better off without an extra wrapper function around _salsa20_core? If nothing else, you could then make sure you have proper alignment so you don't need an extra memcpy.
I hesitate a bit to add, document and support a new "obscure" function until there's a clear external use case.
Good points. Generally, I think it would be nice to have public interfaces for the Salsa20 core function, but currently there really aren't any use-cases for it except for scrypt that I am aware of. The conservative approach then is to not expose it right now.
What is driving this particular patch, though, is that the scrypt code needs a salsa20 core function that takes uint8_t's and return uint8_t's. I think there are two ways to address that need:
1) Implement the salsa20r_core uint8_t function in the scrypt file.
2) Add another internal function in salsa20.h that implement the uint8_t interface.
I prefer the second alternative, because then the implementation is done in the right place if there is ever another need for the salsa20 core uint8_t interface within Nettle (or externally). Thoughts?
I'm thinking it could be something like:
void _salsa20_core8(uint8_t *dst, const uint8_t *src, unsigned rounds);
possibly the current _salsa20_core should be renamed _salsa20_core32 for consistency. Having the uint32_t interface be called _salsa20_core even if it is an internal function seems a bit confusing -- the proper interface to it is uint8_t's.
/Simon
Simon Josefsson simon@josefsson.org writes:
It could be argued that the name of the function is actually "Salsa20 core" and that reduced round versions would be "Salsa20/r core" or similar. That would be consistent with the naming scheme for the Salsa20 stream cipher -- there Salsa20/12 means the Salsa20 stream cipher but with 12 rounds. It is not called Salsa12.
I think the name "salsa20r_core" is good enough. If we ever do some function(s) specifically for 12 rounds, what would they be called? salsa20r12_*? Make some sense to me.
What is driving this particular patch, though, is that the scrypt code needs a salsa20 core function that takes uint8_t's and return uint8_t's. I think there are two ways to address that need:
Implement the salsa20r_core uint8_t function in the scrypt file.
Add another internal function in salsa20.h that implement the
uint8_t interface.
I prefer the second alternative, because then the implementation is done in the right place if there is ever another need for the salsa20 core uint8_t interface within Nettle (or externally). Thoughts?
I think I prefer 1) for now. At this point, I don't want to rule out the possibility that it turns out to be preferable to adapt the scrypt code to call the more awkward _salsa20_core directly (if we can avoid a memcpy, it may even give a modest performance improvement).
And it's easy to move salsa20r_core to a separate file and make it non-static later on, whenever the need arises.
I'm thinking it could be something like:
void _salsa20_core8(uint8_t *dst, const uint8_t *src, unsigned rounds);
I see no reason to use a different name than salsa20r_core, as you suggested above. In nettle, leading underscore names are mainly for functions with awkward interfaces, dictated by what's suitable for assembly implementation or other implementation specific issues, and where it is likely that the interface will change as the implementation evolves. And the salsa20 core function you want is not of that character, whether documented or not.
Regards, /Niels
nettle-bugs@lists.lysator.liu.se