Hello,
The attached patch implements the XTS block cipher mode, as specified in IEEE P1619. The interface is split into a generic pair of functions for encryption and decryption and additional AES-128/AES-256 variants.
The function signatures follow the same pattern used by other block-cipher modes like ctr, cfb, ccm, etc.
Basic tests using a small selection of NIST CAVS vectors are provided.
XTS is used in several disk encryption algorithms and programs because, thanks to ciphertext stealing, it allows a block cipher to be used even when the input length is not an exact multiple of the cipher's block size.
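As a rough usage sketch (the function and type names here follow the interface discussed later in this thread, so treat them as illustrative rather than as the final API), encrypting one disk sector with AES-128 could look like this:

  #include <nettle/aes.h>
  #include <nettle/nettle-types.h>
  #include <nettle/xts.h>   /* header added by the patch */

  /* Sketch: data_key/tweak_key are the two XTS keys, sector_number is the
     16-byte tweak/IV; aes128_* are the standard nettle AES calls. */
  static void
  encrypt_sector(const uint8_t *data_key, const uint8_t *tweak_key,
                 const uint8_t *sector_number,
                 size_t length, uint8_t *dst, const uint8_t *src)
  {
    struct aes128_ctx data_ctx, tweak_ctx;

    aes128_set_encrypt_key(&data_ctx, data_key);
    /* the tweak is always encrypted, also when decrypting data */
    aes128_set_encrypt_key(&tweak_ctx, tweak_key);

    xts_encrypt_message(&data_ctx, &tweak_ctx,
                        (nettle_cipher_func *) aes128_encrypt,
                        sector_number, length, dst, src);
  }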
Thanks to Daiki Ueno for initial review.
Feedback is appreciated.
Simo.
Simo Sorce simo@redhat.com writes:
The attached patch implements the XTS block cipher mode, as specified in IEEE P1619. The interface is split into a generic pair of functions for encryption and decryption and additional AES-128/AES-256 variants.
Thanks. Sorry for the late response.
The function signatures follow the same pattern used by other block-cipher modes like ctr, cfb, ccm, etc.
But it looks like one has to pass the complete message to one call? Other modes support incremental encryption (with the requirement that all calls but the last must be an integral number of blocks). I.e., calling sequence like
  xts_aes128_set_key
  xts_aes128_set_iv
  xts_aes128_encrypt ...   // 1 or more times
  xts_aes128_set_iv        // Start new message
  xts_aes128_encrypt ...   // 1 or more times
+The @code{n} plaintext blocks are transformed into @code{n} ciphertext blocks
+@code{C_1},@dots{} @code{C_n} as follows.
+
+For a plaintext length that is a perfect multiple of the XTS block size:
+@example
+T_1 = E_k2(IV) MUL a^0
+C_1 = E_k1(P_1 XOR T_1) XOR T_1
+@dots{}
+T_n = E_k2(IV) MUL a^(n-1)
+C_n = E_k1(P_n XOR T_n) XOR T_n
+@end example
+
+For any other plaintext lengths:
+@example
+T_1 = E_k2(IV) MUL a^0
+C_1 = E_k1(P_1 XOR T_1) XOR T_1
+@dots{}
+T_(n-2) = E_k2(IV) MUL a^(n-3)
+C_(n-2) = E_k1(P_(n-2) XOR T_(n-2)) XOR T_(n-2)
+
+T_(n-1) = E_k2(IV) MUL a^(n-2)
+CC_(n-1) = E_k1(P_(n-1) XOR T_(n-1)) XOR T_(n-1)
+
+T_n = E_k2(IV) MUL a^(n-1)
+PP = [1..m]Pn | [m+1..128]CC_(n-1)
+C_(n-1) = E_k1(PP XOR T_n) XOR T_n
+
+C_n = [1..m]CC_(n-1)
+@end example
So the second key, with E_k2, is only ever used to encrypt the IV? If you add a set_iv function, that could do this encryption and only store E_k2(IV).
--- /dev/null
+++ b/xts.c
@@ -0,0 +1,219 @@
[...]
+static void
+xts_shift(uint8_t *T)
+{
+  uint8_t carry;
+  uint8_t i;
+  for (i = 0, carry = 0; i < XTS_BLOCK_SIZE; i++)
+    {
+      uint8_t msb = T[i] & 0x80;
+      T[i] = T[i] << 1;
+      T[i] |= carry;
+      carry = msb >> 7;
+    }
+  if (carry)
+    T[0] ^= 0x87;
+}
I think this is the same as block_mulx, in cmac.c. (Also same byte order, right?)
Since the block size is fixed to 128 bits, I think it makes sense to use the nettle_block16 type for all blocks but the application's src and destination areas. Then we get proper alignment, and can easily use operations on larger units.
BTW, for side-channel silence, we should change
  if (carry)
    T[0] ^= 0x87;
to something like
T[0] ^= 0x87 & - carry;
(and similarly for the cmac version).
+  fblen = length - (length % XTS_BLOCK_SIZE);
+  XTSENC(twk_ctx, T, tweak);
+  /* the zeroth power of alpha is the initial ciphertext value itself, so we
+     skip shifting and do it at the end of each block operation instead */
+  for (i = 0; i < fblen; i += XTS_BLOCK_SIZE)
+    {
In other places, loops like this are often written as
  for (; length >= BLOCK_SIZE;
       length -= BLOCK_SIZE, src += BLOCK_SIZE, dst += BLOCK_SIZE)
Then there's no need for the up-front division length % BLOCK_SIZE. Doesn't matter much in this case, since the block size is a constant power of two, but in general, division is quite expensive.
+      C = &dst[i];
+      XTSCPY(P, &src[i]);
+      XTSXOR(P, T); /* P -> PP */
+      XTSENC(enc_ctx, C, P); /* CC */
+      XTSXOR(C, T); /* CC -> C */
I think it would be clearer with encf being an explicit argument to the macros that need it (or maybe do it without the macros, if they expand to only a single call each).
Regards, /Niels
On Fri, 2019-03-15 at 13:14 +0100, Niels Möller wrote:
Simo Sorce simo@redhat.com writes:
The attached patch implements the XTS block cipher mode, as specified in IEEE P1619. The interface is split into a generic pair of functions for encryption and decryption and additional AES-128/AES-256 variants.
Thanks. Sorry for the late response.
The function signatures follow the same pattern used by other block-cipher modes like ctr, cfb, ccm, etc.
But it looks like one has to pass the complete message to one call?
Yes, due to ciphertext stealing, XTS needs to know which are the last two blocks, or at the very least needs to withhold the last processed block so it can be changed if a final partial block is provided. This means inputs and outputs would not be symmetrical, and I felt that would make the API somewhat hard to deal with. In general XTS is used for block storage, where the input is always fully available (and relatively small, either around 512 bytes or 4k).
Other modes support incremental encryption (with the requirement that all calls but the last must be an integral number of blocks). I.e., calling sequence like
  xts_aes128_set_key
  xts_aes128_set_iv
  xts_aes128_encrypt ...   // 1 or more times
  xts_aes128_set_iv        // Start new message
  xts_aes128_encrypt ...   // 1 or more times
+The @code{n} plaintext blocks are transformed into @code{n} ciphertext blocks
+@code{C_1},@dots{} @code{C_n} as follows.
+
+For a plaintext length that is a perfect multiple of the XTS block size:
+@example
+T_1 = E_k2(IV) MUL a^0
+C_1 = E_k1(P_1 XOR T_1) XOR T_1
+@dots{}
+T_n = E_k2(IV) MUL a^(n-1)
+C_n = E_k1(P_n XOR T_n) XOR T_n
+@end example
+
+For any other plaintext lengths:
+@example
+T_1 = E_k2(IV) MUL a^0
+C_1 = E_k1(P_1 XOR T_1) XOR T_1
+@dots{}
+T_(n-2) = E_k2(IV) MUL a^(n-3)
+C_(n-2) = E_k1(P_(n-2) XOR T_(n-2)) XOR T_(n-2)
+
+T_(n-1) = E_k2(IV) MUL a^(n-2)
+CC_(n-1) = E_k1(P_(n-1) XOR T_(n-1)) XOR T_(n-1)
+
+T_n = E_k2(IV) MUL a^(n-1)
+PP = [1..m]Pn | [m+1..128]CC_(n-1)
+C_(n-1) = E_k1(PP XOR T_n) XOR T_n
+
+C_n = [1..m]CC_(n-1)
+@end example
So the second key, with E_k2, is only ever used to encrypt the IV? If you add a set_iv function, that could do this encryption and only store E_k2(IV).
What would be the advantage? I guess it may make sense if we were to allow calling the encryption function multiple times, but as explained above I am not sure that is desirable. It may also invite misuse where people set the same IV for all encryption operations, which would be catastrophic; that could probably be handled by clearing the stored IV when the encryption is finalized.
--- /dev/null
+++ b/xts.c
@@ -0,0 +1,219 @@
[...]
+static void
+xts_shift(uint8_t *T)
+{
+  uint8_t carry;
+  uint8_t i;
+  for (i = 0, carry = 0; i < XTS_BLOCK_SIZE; i++)
+    {
+      uint8_t msb = T[i] & 0x80;
+      T[i] = T[i] << 1;
+      T[i] |= carry;
+      carry = msb >> 7;
+    }
+  if (carry)
+    T[0] ^= 0x87;
+}
I think this is the same as block_mulx, in cmac.c. (Also same byte order, right?)
Looks the same indeed, should I share it? Just copy it from cmac? Something else?
Since the block size is fixed to 128 bits, I think it makes sense to use the nettle_block16 type for all blocks but the application's src and destination areas. Then we get proper alignment, and can easily use operations on larger units.
Ok.
BTW, for side-channel silence, we should change
  if (carry)
    T[0] ^= 0x87;
to something like
T[0] ^= 0x87 & - carry;
(and similarly for the cmac version).
I can do it for xts.c, and provide a separate patch for cmac.c too, or use a common function for both and handle it there.
+  fblen = length - (length % XTS_BLOCK_SIZE);
+  XTSENC(twk_ctx, T, tweak);
+  /* the zeroth power of alpha is the initial ciphertext value itself, so we
+     skip shifting and do it at the end of each block operation instead */
+  for (i = 0; i < fblen; i += XTS_BLOCK_SIZE)
+    {
In other places, loops like this are often written as
  for (; length >= BLOCK_SIZE;
       length -= BLOCK_SIZE, src += BLOCK_SIZE, dst += BLOCK_SIZE)
Then there's no need for the up-front division length % BLOCK_SIZE. Doesn't matter much in this case, since the block size is a constant power of two, but in general, division is quite expensive.
Ok, I can change that.
+      C = &dst[i];
+      XTSCPY(P, &src[i]);
+      XTSXOR(P, T); /* P -> PP */
+      XTSENC(enc_ctx, C, P); /* CC */
+      XTSXOR(C, T); /* CC -> C */
I think it would be clearer with encf being an explicit argument to the macros that need it (or maybe do it without the macros, if they expand to only a single call each).
Ok, will drop the macros; they seemed clearer, but now that I am rereading the code I find myself looking at their implementation more often than I thought necessary.
Simo.
Simo Sorce simo@redhat.com writes:
But it looks like one has to pass the complete message to one call?
Yes, due to ciphertext stealing, XTS needs to know which are the last two blocks, or at the very least needs to withhold the last processed block so it can be changed if a final partial block is provided. This means inputs and outputs would not be symmetrical, and I felt that would make the API somewhat hard to deal with. In general XTS is used for block storage, where the input is always fully available (and relatively small, either around 512 bytes or 4k).
I see. Can you mention this (and the fact that messages must be at least one full block) in the manual?
Would it make sense to rename functions to xts_encrypt_message, xts_decrypt_message, to emphasize that they operate on complete messages? That would be analogous with the ccm message functions.
So the second key, with E_k2, is only ever used to encrypt the IV? If you add a set_iv function, that could do this encryption and only store E_k2(IV).
What would be the advantage ?
With an api for complete messages only, my suggestion makes little sense.
Looks the same indeed, should I share it? Just copy it from cmac? Something else?
Would be nice to share it, but since it's so short, duplication is no big deal.
Also re-reading the cmac version, block_mulx, I think it's unfortunate to use READ_UINT64 and WRITE_UINT64, since arguments are aligned. It would be preferable to load 64-bit values and use __builtin_bswap64 when needed and available (see ctr.c for a similar hack). But that's an independent improvement.
Regards, /Niels
On Fri, 2019-03-15 at 17:30 +0100, Niels Möller wrote:
Simo Sorce simo@redhat.com writes:
But it looks like one has to pass the complete message to one call?
Yes, due to ciphertext stealing, XTS needs to know which are the last two blocks, or at the very least needs to withhold the last processed block so it can be changed if a final partial block is provided. This means inputs and outputs would not be symmetrical, and I felt that would make the API somewhat hard to deal with. In general XTS is used for block storage, where the input is always fully available (and relatively small, either around 512 bytes or 4k).
I see. Can you mention this (and the fact that messages must be at least one full block) in the manual?
I will.
Would it make sense to rename functions to xts_encrypt_message, xts_decrypt_message, to emphasize that they operate on complete messages? That would be analogous with the ccm message functions.
Good idea.
So the second key, with E_k2, is only ever used to encrypt the IV? If you add a set_iv function, that could do this encryption and only store E_k2(IV).
What would be the advantage ?
With an api for complete messages only, my suggestion makes little sense.
Looks the same indeed, should I share it? Just copy it from cmac? Something else?
Would be nice to share it, but since it's so short, duplication is no big deal.
Also re-reading the cmac version, block_mulx, I think it's unfortunate to use READ_UINT64 and WRITE_UINT64, since arguments are aligned. It would be preferable to load 64-bit values and use __builtin_bswap64 when needed and available (see ctr.c for a similar hack). But that's an independent improvement.
I can make a new function and attempt to make these improvements there, then it can either be shared or copied back to cmac at a later time.
Simo.
On Fri, 2019-03-15 at 09:27 -0400, Simo Sorce wrote:
I think this is the same as block_mulx, in cmac.c. (Also same byte order, right?)
Looks the same indeed, should I share it? Just copy it from cmac? Something else?
Turns out the algorithm is not equivalent: the shift is applied to the array as if it were a big 128-bit little-endian value, so the endianness of the two differs.
I changed the implementation to a much simpler form that shows the difference:
/* shift one and XOR with 0x87. */
/* src and dest can point to the same buffer for in-place operations */
static void
xts_shift(union nettle_block16 *dst,
          const union nettle_block16 *src)
{
  uint8_t carry = src->b[15] >> 7;
  dst->u64[1] = (src->u64[1] << 1) | (src->u64[0] >> 63);
  dst->u64[0] = src->u64[0] << 1;
  dst->b[0] ^= 0x87 & -carry;
}
Simo.
On Fri, 2019-03-15 at 16:33 -0400, Simo Sorce wrote:
On Fri, 2019-03-15 at 09:27 -0400, Simo Sorce wrote:
I think this is the same as block_mulx, in cmac.c. (Also same byte order, right?)
Looks the same indeed, should I share it? Just copy it from cmac? Something else?
Turns out the algorithm is not equivalent: the shift is applied to the array as if it were a big 128-bit little-endian value, so the endianness of the two differs.
I changed the implementation to a much simpler form that shows the difference:
/* shift one and XOR with 0x87. */
/* src and dest can point to the same buffer for in-place operations */
static void
xts_shift(union nettle_block16 *dst,
          const union nettle_block16 *src)
{
  uint8_t carry = src->b[15] >> 7;
  dst->u64[1] = (src->u64[1] << 1) | (src->u64[0] >> 63);
  dst->u64[0] = src->u64[0] << 1;
  dst->b[0] ^= 0x87 & -carry;
}
Simo.
Attached find patch with all the changes I understood we agreed on in the thread.
For easier review I also have a tree with the original patch and discrete steps in form of separate commits that I rebased and merged together: https://gitlab.com/simo5/nettle/tree/xts_exploded
Note the documentation already warns that at least a block of data is required and less than 16 bytes of input is illegal, so no change has been applied in that regard.
HTH, Simo.
Simo Sorce simo@redhat.com writes:
Turns out the algorithm is not equivalent: the shift is applied to the array as if it were a big 128-bit little-endian value, so the endianness of the two differs.
Ah, I see.
/* shift one and XOR with 0x87. */
/* src and dest can point to the same buffer for in-place operations */
static void
xts_shift(union nettle_block16 *dst,
          const union nettle_block16 *src)
{
  uint8_t carry = src->b[15] >> 7;
  dst->u64[1] = (src->u64[1] << 1) | (src->u64[0] >> 63);
  dst->u64[0] = src->u64[0] << 1;
  dst->b[0] ^= 0x87 & -carry;
}
This will then work only on little-endian systems?
I think it would be nice with a structure like
  b0 = src->u64[0]; b1 = src->u64[1];  /* Load inputs */
  ... swap if big-endian ...
  uint64_t carry = (b1 >> 63);
  b1 = (b1 << 1) | (b0 >> 63)
  b0 = (b0 << 1) ^ (0x87 & -carry);
  ... swap if big-endian ...
  dst->u64[0] = b0; dst->u64[1] = b1;  /* Store output */
I.e., no memory accesses smaller than 64-bits.
Possibly with load + swap and swap + store done with some system-dependent macros.
But it's not essential for a first version of xts; copying block_mulx and just replacing READ_UINT64 with LE_READ_UINT64 and similarly for WRITE would be ok for now. There are more places with potential for micro-optimizations related to endianness. While I think the READ/WRITE_UINT macros are adequate in most places where unaligned application data is read and written by C code.
Regards, /Niels
On Fri, 2019-03-15 at 22:33 +0100, Niels Möller wrote:
Simo Sorce simo@redhat.com writes:
Turns out the algorithm is not equivalent: the shift is applied to the array as if it were a big 128-bit little-endian value, so the endianness of the two differs.
Ah, I see.
/* shift one and XOR with 0x87. */
/* src and dest can point to the same buffer for in-place operations */
static void
xts_shift(union nettle_block16 *dst,
          const union nettle_block16 *src)
{
  uint8_t carry = src->b[15] >> 7;
  dst->u64[1] = (src->u64[1] << 1) | (src->u64[0] >> 63);
  dst->u64[0] = src->u64[0] << 1;
  dst->b[0] ^= 0x87 & -carry;
}
This will then work only on little-endian systems?
I think it would be nice with a structure like
  b0 = src->u64[0]; b1 = src->u64[1];  /* Load inputs */
  ... swap if big-endian ...
  uint64_t carry = (b1 >> 63);
  b1 = (b1 << 1) | (b0 >> 63)
  b0 = (b0 << 1) ^ (0x87 & -carry);
  ... swap if big-endian ...
  dst->u64[0] = b0; dst->u64[1] = b1;  /* Store output */
I.e., no memory accesses smaller than 64-bits.
Possibly with load + swap and swap + store done with some system-dependent macros.
But it's not essential for a first version of xts; copying block_mulx and just replacing READ_UINT64 with LE_READ_UINT64 and similarly for WRITE would be ok for now. There are more places with potential for micro-optimizations related to endianness. While I think the READ/WRITE_UINT macros are adequate in most places where unaligned application data is read and written by C code.
I will add the macros to swap endianess, and resend a new version.
Simo.
On Fri, 2019-03-15 at 17:46 -0400, Simo Sorce wrote:
On Fri, 2019-03-15 at 22:33 +0100, Niels Möller wrote:
Simo Sorce simo@redhat.com writes:
Turns out the algorithm is not equivalent: the shift is applied to the array as if it were a big 128-bit little-endian value, so the endianness of the two differs.
Ah, I see.
/* shift one and XOR with 0x87. */
/* src and dest can point to the same buffer for in-place operations */
static void
xts_shift(union nettle_block16 *dst,
          const union nettle_block16 *src)
{
  uint8_t carry = src->b[15] >> 7;
  dst->u64[1] = (src->u64[1] << 1) | (src->u64[0] >> 63);
  dst->u64[0] = src->u64[0] << 1;
  dst->b[0] ^= 0x87 & -carry;
}
This will then work only on little-endian systems?
I think it would be nice with a structure like
  b0 = src->u64[0]; b1 = src->u64[1];  /* Load inputs */
  ... swap if big-endian ...
  uint64_t carry = (b1 >> 63);
  b1 = (b1 << 1) | (b0 >> 63)
  b0 = (b0 << 1) ^ (0x87 & -carry);
  ... swap if big-endian ...
  dst->u64[0] = b0; dst->u64[1] = b1;  /* Store output */
I.e., no memory accesses smaller than 64-bits.
Possibly with load + swap and swap + store done with some system-dependent macros.
But it's not essential for a first version of xts; copying block_mulx and just replacing READ_UINT64 with LE_READ_UINT64 and similarly for WRITE would be ok for now. There are more places with potential for micro-optimizations related to endianness. While I think the READ/WRITE_UINT macros are adequate in most places where unaligned application data is read and written by C code.
I will add the macros to swap endianess, and resend a new version.
New patch attached, the diff has also been applied as an additional commit to my xts_exploded tree in gitlab.
Simo Sorce simo@redhat.com writes:
New patch attached, the diff has also been applied as an additional commit to my xts_exploded tree in gitlab.
Thanks. Looks pretty good, a few more comments below.
I didn't look so closely at the tests, but it would be good to also test in-place operation with the AES-specific functions.
For performance (which we don't need to focus that much on for the initial version), one thing to do would be to fill a buffer with consecutive T_k values, and do the encrypt/decrypt and memxor operations on several blocks at a time.
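A rough sketch of that idea (not part of the patch; xts_shift, XTS_BLOCK_SIZE and union nettle_block16 as in the code quoted below, and the batch size is made up):

  /* Sketch only: compute a batch of consecutive tweaks T_k up front, so the
     cipher and memxor calls can then operate on BATCH blocks at a time. */
  #define BATCH 32

  static void
  xts_fill_tweaks(union nettle_block16 *ts, union nettle_block16 *t, size_t n)
  {
    size_t i;
    ts[0] = *t;
    for (i = 1; i < n; i++)
      xts_shift(&ts[i], &ts[i-1]);   /* T_k = T_(k-1) MUL a */
    xts_shift(t, &ts[n-1]);          /* starting tweak for the next batch */
  }

  /* The bulk loop could then do, per batch of BATCH full blocks, something like
       memxor3(tmp, src, ts[0].b, BATCH * XTS_BLOCK_SIZE);   (P XOR T)
       encf(enc_ctx, BATCH * XTS_BLOCK_SIZE, dst, tmp);
       memxor(dst, ts[0].b, BATCH * XTS_BLOCK_SIZE);         (... XOR T)
     instead of one cipher call per block. */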
+For a plaintext length that is a perfect multiple of the XTS block size:
+@example
+T_1 = E_k2(IV) MUL a^0
+C_1 = E_k1(P_1 XOR T_1) XOR T_1
+@dots{}
+T_n = E_k2(IV) MUL a^(n-1)
+C_n = E_k1(P_n XOR T_n) XOR T_n
+@end example
Nit: I'd write the T updates as
  T_1 = E_k2(IV)
  T_k = T_(k-1) MUL a
since this is how to implement it, and no powering operation is needed.
+The functions @var{encf} @var{decf} are of type
+@code{void f (void *@var{ctx}, size_t @var{length}, uint8_t *@var{dst},
+const uint8_t *@var{src})},
It seems the manual doesn't really have an entry for the nettle_cipher_func type. Maybe it should. Anyway, the type for the first argument is _const_ void *.
+/* shift one and XOR with 0x87. */
+/* src and dest can point to the same buffer for in-place operations */
+static void
+xts_shift(union nettle_block16 *dst,
+          const union nettle_block16 *src)
+{
+  uint8_t carry = src->b[15] >> 7;
+  uint64_t b0 = LE_READ_UINT64(src->b);
+  uint64_t b1 = LE_READ_UINT64(src->b+8);
+  b1 = (b1 << 1) | (b0 >> 63);
+  b0 = b0 << 1;
+  LE_WRITE_UINT64(dst->b, b0);
+  LE_WRITE_UINT64(dst->b+8, b1);
+  dst->b[0] ^= 0x87 & -carry;
+}
Looks right. It can be optimized later on.
+/*
+ * prev is the block to steal from
+ * curr is the input block to the last step
+ * length is the partial block length
+ * dst is the destination partial block
+ * src is the source partial block
+ *
+ * In the Encryption case:
+ *   prev -> the output of the N-1 encryption step
+ *   curr -> the input to the Nth step (will be encrypted as Cn-1)
+ *   dst -> the final Cn partial block
+ *   src -> the final Pn partial block
+ *
+ * In the decryption case:
+ *   prev -> the output of the N-1 decryption step
+ *   curr -> the input to the Nth step (will be decrypted as Pn-1)
+ *   dst -> the final Pn partial block
+ *   src -> the final Cn partial block
+ */
+static void
+xts_steal(uint8_t *prev, uint8_t *curr,
+          size_t length, uint8_t *dst, const uint8_t *src)
+{
+  /* copy the remaining in the current input block */
+  memcpy(curr, src, length);
+  /* fill the current block with the last blocksize - length
+     bytes of the previous block */
+  memcpy(&curr[length], &prev[length], XTS_BLOCK_SIZE - length);
+  /* This must be last or inplace operations will break
+     copy 'length' bytes of the previous block in the
+     destination block, which is the final partial block
+     returned to the caller */
+  memcpy(dst, prev, length);
+}
This is a bit confusing; first we write to the curr block, using data from src and prev. And next we copy from prev to dst; is that the swapping of the last two blocks? Maybe this could be made simpler, and a copy eliminated, if the main loops were
for (;length >= 2*XTS_BLOCK_SIZE;
For the arguments, it looks like prev could be const, and curr could be nettle_block16 (but changing the latter is perhaps a bit pointless, since we only treat it as a byte array with no obvious benefits from the alignment).
+static void
+check_length(size_t length, uint8_t *dst)
+{
+  assert(length >= XTS_BLOCK_SIZE);
+  /* asserts may be compiled out, try to save the user by zeroing the dst in
+     case the buffer contains sensitive data (like the clear text for inplace
+     encryption) */
+  if (length < XTS_BLOCK_SIZE)
+    memxor(dst, dst, length);
+}
+/* works also for inplace encryption/decryption */
+void
+xts_encrypt_message(const void *enc_ctx, const void *twk_ctx,
+                    nettle_cipher_func *encf,
+                    const uint8_t *tweak, size_t length,
+                    uint8_t *dst, const uint8_t *src)
+{
+  union nettle_block16 T;
+  union nettle_block16 P;
+  check_length(length, dst);
+  encf(twk_ctx, XTS_BLOCK_SIZE, T.b, tweak);
+  /* the zeroth power of alpha is the initial ciphertext value itself, so we
+     skip shifting and do it at the end of each block operation instead */
+  for (;length >= XTS_BLOCK_SIZE;
+       length -= XTS_BLOCK_SIZE, src += XTS_BLOCK_SIZE, dst += XTS_BLOCK_SIZE)
+    {
+      memcpy(P.b, src, XTS_BLOCK_SIZE);
+      memxor(P.b, T.b, XTS_BLOCK_SIZE); /* P -> PP */
This is what memxor3 is for. (If it's more efficient in this case is not entirely obvious, though).
+/* XTS Mode with AES-128 */
+struct xts_aes128_ctx {
+    struct aes128_ctx cipher;
+    struct aes128_ctx tweak_cipher;
+};
Could consider renaming it to xts_aes128_key, somewhat analogous to struct eax_key and struct gcm_key. This represents message-independent data, and then the name xts_aes128_ctx could be used in case we add an interface with a buffer and incremental encryption and decryption. Not clear to me if that's likely or not to ever be useful, though.
Regards, /Niels
Hi Niels, attached find two patches, comments inline.
On Tue, 2019-03-19 at 07:31 +0100, Niels Möller wrote:
Simo Sorce simo@redhat.com writes:
New patch attached, the diff has also been applied as an additional commit to my xts_exploded tree in gitlab.
Thanks. Looks pretty good, a few more comments below.
I didn't look so closely at the tests, but it would be good to also test in-place operation with the AES-specific functions.
For performance (which we don't need to focus that much on for the initial version), one thing to do would be to fill a buffer with consecutive T_k values, and do the encrypt/decrypt and memxor operations on several blocks at a time.
+For a plaintext length that is a perfect multiple of the XTS block size:
+@example
+T_1 = E_k2(IV) MUL a^0
+C_1 = E_k1(P_1 XOR T_1) XOR T_1
+@dots{}
+T_n = E_k2(IV) MUL a^(n-1)
+C_n = E_k1(P_n XOR T_n) XOR T_n
+@end example
Nit: I'd write the T updates as
  T_1 = E_k2(IV)
  T_k = T_(k-1) MUL a
since this is how to implement it, and no powering operation is needed.
Ok changed it.
+The functions @var{encf} @var{decf} are of type
+@code{void f (void *@var{ctx}, size_t @var{length}, uint8_t *@var{dst},
+const uint8_t *@var{src})},
It seems the manual doesn't really have an entry for the nettle_cipher_func type. Maybe it should. Anyway, the type for the first argument is _const_ void *.
Fixed.
+/* shift one and XOR with 0x87. */
+/* src and dest can point to the same buffer for in-place operations */
+static void
+xts_shift(union nettle_block16 *dst,
+          const union nettle_block16 *src)
+{
+  uint8_t carry = src->b[15] >> 7;
+  uint64_t b0 = LE_READ_UINT64(src->b);
+  uint64_t b1 = LE_READ_UINT64(src->b+8);
+  b1 = (b1 << 1) | (b0 >> 63);
+  b0 = b0 << 1;
+  LE_WRITE_UINT64(dst->b, b0);
+  LE_WRITE_UINT64(dst->b+8, b1);
+  dst->b[0] ^= 0x87 & -carry;
+}
Looks right. It can be optimized later on.
Just for curiosity, would it be ok to change LE_READ_UINT64/LE_WRITE_UINT64 to use uint64_t and have the macro use __bswap64 if available, falling back to the original unoptimized code?
(__bswap64 already knows how to use __builtin_bswap64 or falls back to similar code as in LE_READ_UINT64)
Or do we need a new macro that expects aligned input/output and leave the existing macro alone, to be used for unaligned input/output ?
+/*
+ * prev is the block to steal from
+ * curr is the input block to the last step
+ * length is the partial block length
+ * dst is the destination partial block
+ * src is the source partial block
+ *
+ * In the Encryption case:
+ *   prev -> the output of the N-1 encryption step
+ *   curr -> the input to the Nth step (will be encrypted as Cn-1)
+ *   dst -> the final Cn partial block
+ *   src -> the final Pn partial block
+ *
+ * In the decryption case:
+ *   prev -> the output of the N-1 decryption step
+ *   curr -> the input to the Nth step (will be decrypted as Pn-1)
+ *   dst -> the final Pn partial block
+ *   src -> the final Cn partial block
+ */
+static void
+xts_steal(uint8_t *prev, uint8_t *curr,
+          size_t length, uint8_t *dst, const uint8_t *src)
+{
+  /* copy the remaining in the current input block */
+  memcpy(curr, src, length);
+  /* fill the current block with the last blocksize - length
+     bytes of the previous block */
+  memcpy(&curr[length], &prev[length], XTS_BLOCK_SIZE - length);
+  /* This must be last or inplace operations will break
+     copy 'length' bytes of the previous block in the
+     destination block, which is the final partial block
+     returned to the caller */
+  memcpy(dst, prev, length);
+}
This is a bit confusing; first we write to the curr block, using data from src and prev. And next we copy from prev to dst; is that the swapping of the last two blocks? Maybe this could be made simpler, and a copy eliminated, if the main loops were
for (;length >= 2*XTS_BLOCK_SIZE;
For the arguments, it looks like prev could be const, and curr could be nettle_block16 (but changing the latter is perhaps a bit pointless, since we only treat it as a byte array with no obvious benefits from the alignment).
Ok, I took a stab at removing xts_steal completely in the second patch; let me know what you think. I think I may like it better than my original code, and it uses nettle_block16 for temporary storage to avoid a copy.
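For reference, a sketch of the shape that tail can take (an illustration of the approach, not the actual second patch; it assumes the main loop has stopped with at least one and less than two full blocks of input left, and that T still holds the tweak for the next block):

  union nettle_block16 C1, PP;

  if (length > XTS_BLOCK_SIZE)    /* a partial final block follows */
    {
      /* CC_(n-1) = E_k1(P_(n-1) XOR T_(n-1)) XOR T_(n-1) */
      memxor3(C1.b, src, T.b, XTS_BLOCK_SIZE);
      encf(enc_ctx, XTS_BLOCK_SIZE, C1.b, C1.b);
      memxor(C1.b, T.b, XTS_BLOCK_SIZE);

      length -= XTS_BLOCK_SIZE;
      src += XTS_BLOCK_SIZE;

      /* Assemble PP = P_n | tail of CC_(n-1) in a temporary before touching
         dst, so in-place operation keeps working. */
      memcpy(PP.b, src, length);
      memcpy(PP.b + length, C1.b + length, XTS_BLOCK_SIZE - length);
      memcpy(dst + XTS_BLOCK_SIZE, C1.b, length);   /* C_n */

      xts_shift(&T, &T);                            /* T_(n-1) -> T_n */
      memxor(PP.b, T.b, XTS_BLOCK_SIZE);
      encf(enc_ctx, XTS_BLOCK_SIZE, dst, PP.b);
      memxor(dst, T.b, XTS_BLOCK_SIZE);             /* C_(n-1) */
    }
  else
    {
      /* exactly one full block left: same as a normal loop iteration */
      memxor3(PP.b, src, T.b, XTS_BLOCK_SIZE);
      encf(enc_ctx, XTS_BLOCK_SIZE, dst, PP.b);
      memxor(dst, T.b, XTS_BLOCK_SIZE);
    }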
+static void
+check_length(size_t length, uint8_t *dst)
+{
+  assert(length >= XTS_BLOCK_SIZE);
+  /* asserts may be compiled out, try to save the user by zeroing the dst in
+     case the buffer contains sensitive data (like the clear text for inplace
+     encryption) */
+  if (length < XTS_BLOCK_SIZE)
+    memxor(dst, dst, length);
+}
+/* works also for inplace encryption/decryption */
+void
+xts_encrypt_message(const void *enc_ctx, const void *twk_ctx,
+                    nettle_cipher_func *encf,
+                    const uint8_t *tweak, size_t length,
+                    uint8_t *dst, const uint8_t *src)
+{
+  union nettle_block16 T;
+  union nettle_block16 P;
+  check_length(length, dst);
+  encf(twk_ctx, XTS_BLOCK_SIZE, T.b, tweak);
+  /* the zeroth power of alpha is the initial ciphertext value itself, so we
+     skip shifting and do it at the end of each block operation instead */
+  for (;length >= XTS_BLOCK_SIZE;
+       length -= XTS_BLOCK_SIZE, src += XTS_BLOCK_SIZE, dst += XTS_BLOCK_SIZE)
+    {
+      memcpy(P.b, src, XTS_BLOCK_SIZE);
+      memxor(P.b, T.b, XTS_BLOCK_SIZE); /* P -> PP */
This is what memxor3 is for. (If it's more efficient in this case is not entirely obvious, though).
I did not know about memxor3, it makes the code better, esp in the second patch, I think.
+/* XTS Mode with AES-128 */
+struct xts_aes128_ctx {
+    struct aes128_ctx cipher;
+    struct aes128_ctx tweak_cipher;
+};
Could consider renaming it to xts_aes128_key, somewhat analogous to struct eax_key and struct gcm_key. This represents message-independent data, and then the name xts_aes128_ctx could be used in case we add an interface with a buffer and incremental encryption and decryption. Not clear to me if that's likely or not to ever be useful, though.
I did this, the reasoning makes sense to me.
HTH, Simo.
Simo Sorce simo@redhat.com writes:
Just for curiosity, would it be ok to change LE_READ_UINT64/LE_WRITE_UINT64 to use uint64_t and have the macro use __bswap64 if available, falling back to the original unoptimized code?
I think LE_READ_UINT64 and related macros should be left as is, and used for unaligned reads and writes.
For xts_shift, there are three cases, and I think they could be handled like this:
1. Little-endian system: Use 64-bit reads and writes, no swaps needed.
2. Big-endian system, __builtin_bswap64 detected (HAVE_BUILTIN_BSWAP64): Use 64-bit reads and writes + __builtin_bswap64.
3. Big-endian system, no __builtin_bswap64. Here we can either use the current code, with byte accesses only. Or attempt to define byteswap without builtins and follow 2. I'd lean towards using the current code, unless there's some system where something more complicated is known to improve performance.
In case (1) or (2), use 64-bit operations exclusively, also for the carry logic. And then the macros and #ifdefs should be arranged to make it not too ugly. E.g., cases (1) and (2) could perhaps be combined, with a BSWAP64_IF_BE macro or the like. Or a macro like LE_READ_ALIGNED_UINT64 (with an uint64_t * argument)
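Something along these lines, for example (the macro name is illustrative; WORDS_BIGENDIAN and HAVE_BUILTIN_BSWAP64 as detected by configure):

  #if !WORDS_BIGENDIAN
  # define BSWAP64_IF_BE(x) (x)
  #elif HAVE_BUILTIN_BSWAP64
  # define BSWAP64_IF_BE(x) __builtin_bswap64(x)
  #endif

  #ifdef BSWAP64_IF_BE
  /* cases (1) and (2): aligned 64-bit loads/stores, swap only if big-endian */
  static void
  xts_shift(union nettle_block16 *dst,
            const union nettle_block16 *src)
  {
    uint64_t b0 = BSWAP64_IF_BE(src->u64[0]);
    uint64_t b1 = BSWAP64_IF_BE(src->u64[1]);
    uint64_t carry = b1 >> 63;

    b1 = (b1 << 1) | (b0 >> 63);
    b0 = (b0 << 1) ^ (0x87 & -carry);

    dst->u64[0] = BSWAP64_IF_BE(b0);
    dst->u64[1] = BSWAP64_IF_BE(b1);
  }
  #else
  /* case (3): big-endian without a usable bswap; keep the byte-oriented
     LE_READ_UINT64 / LE_WRITE_UINT64 version */
  #endif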
Only current use of __builtin_bswap64 is in ctr.c, the ctr_fill16 function.
An older example is in salsa20-core-internal.c, with a LE_SWAP32 macro. That one could take advantage of __builtin_bswap32, if available. I don't know if the compiler can recognize the current expression as a byte swap.
And it probably is more important to optimize xts_shift if/when one also arranges the code to produces multiple blocks of T_k values at a time (say, 32 blocks, 512 bytes of stack-allocated temporary storage), without reading the intermediate values back from memory to registers. That has been a significant optimization for both ctr mode and cbc decrypt.
I haven't reviewed the new version of the patch yet, I hope to get to that in a day or two.
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
- Big-endian system, no __builtin_bswap64. Here we can either use the current code, with byte accesses only. Or attempt to define byteswap without builtins and follow 2. I'd lean towards using the current code, unless there's some system where something more complicated is known to improve performance.
And another possible trick for big-endian is to do an "opposite-endian" left shift as
  ((x & 0x7f7f7f7f7f7f7f7f) << 1) | ((x & 0x8080808080808080) >> 15)

where the 0x80 bit of the least significant byte, shifted out by the >> 15, is the carry out.
Regards, /Niels
On Wed, 2019-03-20 at 06:14 +0100, Niels Möller wrote:
nisse@lysator.liu.se (Niels Möller) writes:
- Big-endian system, no __builtin_bswap64. Here we can either use the current code, with byte accesses only. Or attempt to define byteswap without builtins and follow 2. I'd lean towards using the current code, unless there's some system where something more complicated is known to improve performance.
And another possible trick for big-endian is to do an "opposite-endian" left shift as
  ((x & 0x7f7f7f7f7f7f7f7f) << 1) | ((x & 0x8080808080808080) >> 15)

where the 0x80 bit of the least significant byte, shifted out by the >> 15, is the carry out.
This would allow us to avoid copies at the cost of more complicated code.
Which do you prefer? Using endian.h where available, or having two separate code paths depending on the endianness of the machine?
Simo.
Simo Sorce simo@redhat.com writes:
On Wed, 2019-03-20 at 06:14 +0100, Niels Möller wrote:
And another possible trick for big-endian is to do an "opposite-endian" left shift as
  ((x & 0x7f7f7f7f7f7f7f7f) << 1) | ((x & 0x8080808080808080) >> 15)

where the 0x80 bit of the least significant byte, shifted out by the >> 15, is the carry out.
This would allow us to avoid copies at the cost of more complicated code.
Which do you prefer? Using endian.h where available, or having two separate code paths depending on the endianness of the machine?
If it matters for performance, use the fastest variant. Using separate implementations of xts_shift, with #if:s depending on endianness and compiler support, is fine.
I'd expect the opposite-endian shift to be more efficient when bswap is particularly slow, and implemented in terms of shifting and masking.
A bit difficult to determine, though. Neither existence of endian.h macros or __builtin_bswap64 implies that the byte swapping is cheap. Are there any interesting platforms these days that lack an efficient bswap instruction? And are big-endian? Does mips have a bswap instruction?
Regards, /Niels
On Wed, 2019-03-20 at 14:46 +0100, Niels Möller wrote:
Simo Sorce simo@redhat.com writes:
On Wed, 2019-03-20 at 06:14 +0100, Niels Möller wrote:
And another possible trick for big-endian is to do an "opposite-endian" left shift as
  ((x & 0x7f7f7f7f7f7f7f7f) << 1) | ((x & 0x8080808080808080) >> 15)

where the 0x80 bit of the least significant byte, shifted out by the >> 15, is the carry out.
This would allow us to avoid copies at the cost of more complicated code.
Which do you prefer? Using endian.h where available, or having two separate code paths depending on the endianness of the machine?
If it matters for performance, use the fastest variant. Using separate implementations of xts_shift, with #if:s depending on endianness and compiler support, is fine.
I'd expect the opposite-endian shift to be more efficient when bswap is particularly slow, and implemented in terms of shifting and masking.
A bit difficult to determine, though. Neither existence of endian.h macros or __builtin_bswap64 implies that the byte swapping is cheap. Are there any interesting platforms these days that lack an efficient bswap instruction? And are big-endian? Does mips have a bswap instruction?
In the end I went with the opposite-endian swapping solution and two separate implementations for LE and BE. My reasoning is that the compiler can definitely optimize the LE version better, and the BE version done this way is probably not slower than byte swapping even when an optimized bswap is available. The secondary reason is that I feel this version is more readable.
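For the record, the big-endian variant built on that trick looks roughly like this (a sketch derived from the masks quoted above, assuming u64[0] holds bytes 0-7 with byte 0 in the most significant position; the actual patch may differ in details):

  /* big-endian host: per-byte "opposite-endian" left shift, no byte swap */
  static void
  xts_shift(union nettle_block16 *dst,
            const union nettle_block16 *src)
  {
    /* the carry out of the whole block is the top bit of byte 15,
       i.e. the low byte of u64[1] on a big-endian load */
    uint64_t carry = (src->u64[1] & 0x80) >> 7;

    dst->u64[1] = ((src->u64[1] & 0x7f7f7f7f7f7f7f7fULL) << 1)
                | ((src->u64[1] & 0x8080808080808080ULL) >> 15)
                | ((src->u64[0] & 0x80) << 49);   /* carry byte 7 -> byte 8 */
    dst->u64[0] = ((src->u64[0] & 0x7f7f7f7f7f7f7f7fULL) << 1)
                | ((src->u64[0] & 0x8080808080808080ULL) >> 15);
    dst->u64[0] ^= 0x8700000000000000ULL & -carry; /* reduce into byte 0 */
  }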
I am attaching all 3 patches anew, as I also fixed the other issues you mentioned in a previous email; namely, I improved the non-stealing case for encryption/decryption by removing the duplicate last-block handling, and replaced the memory-clearing memxor with a memset.
HTH, Simo.
Simo Sorce simo@redhat.com writes:
I am attaching all 3 patches anew as I also fixed the other issues you mentioned in a previous email.
Thanks. I'm about to merge. I've run cross-compile+qemu tests also on big-endian mips, and it seems to work fine.
+@subsubsection @acronym{XTS}-@acronym{AES} interface
+The @acronym{AES} @acronym{XTS} functions provide an API for using the
+@acronym{XTS} mode with the @acronym{AES} block ciphers. The parameters
+all have the same meaning as the general interface, except that the
+@var{enc_ctx}, @var{dec_ctx}, @var{twk_ctx}, @var{encf} and @var{decf} are
+replaced with an @acronym{AES} context structure called @var{ctx}, and an
+appropriate set-key function must be called before using any of the encryption
+or decryption functions in this interface.
+
+@deftp {Context struct} {struct xts_aes128_ctx}
+Holds state corresponding to the AES-128 block cipher.
+@end deftp
+
+@deftp {Context struct} {struct xts_aes256_ctx}
+Holds state corresponding to the AES-256 block cipher.
+@end deftp
These structs were renamed from _ctx to _key, right?
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
Simo Sorce simo@redhat.com writes:
I am attaching all 3 patches anew as I also fixed the other issues you mentioned in a previous email.
Thanks. I'm about to merge. I've run cross-compile+qemu tests also on big-endian mips, and it seems to work fine.
Merged to master branch now. Thanks!
/Niels
On Sun, 2019-03-24 at 19:57 +0100, Niels Möller wrote:
nisse@lysator.liu.se (Niels Möller) writes:
Simo Sorce simo@redhat.com writes:
I am attaching all 3 patches anew as I also fixed the other issues you mentioned in a previous email.
Thanks. I'm about to merge. I've run cross-compile+qemu tests also on big-endian mips, and it seems to work fine.
Merged to master branch now. Thanks!
Thanks a lot! I see you also fixed the _ctx -> _key suffix in the docs, sorry for not catching it myself, and thank you for dealing with it!
Cheers, Simo.
On Tue, 2019-03-19 at 22:36 +0100, Niels Möller wrote:
Simo Sorce simo@redhat.com writes:
Just for curiosity, would it be ok to change LE_READ_UINT64/LE_WRITE_UINT64 to use uint64_t and have the macro use __bswap64 if available, falling back to the original unoptimized code?
I think LE_READ_UINT64 and related macros should be left as is, and used for unaligned reads and writes.
For xts_shift, there are three cases, and I think they could be handled like this:
Little-endian system: Use 64-bit reads and writes, no swaps needed.
Big-endian system, __builtin_bswap64 detected (HAVE_BUILTIN_BSWAP64): Use 64-bit reads and writes + __builtin_bswap64.
Big-endian system, no __builtin_bswap64. Here we can either use the current code, with byte accesses only. Or attempt to define byteswap without builtins and follow 2. I'd lean towards using the current code, unless there's some system where something more complicated is known to improve performance.
In case (1) or (2), use 64-bit operations exclusively, also for the carry logic. And then the macros and #ifdefs should be arranged to make it not too ugly. E.g., cases (1) and (2) could perhaps be combined, with a BSWAP64_IF_BE macro or the like. Or a macro like LE_READ_ALIGNED_UINT64 (with an uint64_t * argument)
Only current use of __builtin_bswap64 is in ctr.c, the ctr_fill16 function.
An older example is in salsa20-core-internal.c, with a LE_SWAP32 macro. That one could take advantage of __builtin_bswap32, if available. I don't know if the compiler can recognize the current expression as a byte swap.
And it probably is more important to optimize xts_shift if/when one also arranges the code to produces multiple blocks of T_k values at a time (say, 32 blocks, 512 bytes of stack-allocated temporary storage), without reading the intermediate values back from memory to registers. That has been a significant optimization for both ctr mode and cbc decrypt.
So the way I would do this is by using the glibc macros le64toh() and htole64(); they already do the right thing depending on which architecture we are on.
If it is ok to have configure check whether those macros are available (/usr/include/endian.h), I would use them, or provide replacements that just call into the existing macros.
I haven't reviewed the new version of the patch yet, I hope to get to that in a day or two.
Regards, /Niels
Simo Sorce simo@redhat.com writes:
If it is ok to have configure check whether those macros are available (/usr/include/endian.h), I would use them, or provide replacements that just call into the existing macros.
Sure, provided there's a reasonable portable fallback. On non-glibc systems, I think we should use __builtin_bswap64 when provided by gcc or its look-alikes.
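A sketch of that kind of fallback chain (HAVE_ENDIAN_H and the macro names are made up for illustration; WORDS_BIGENDIAN and HAVE_BUILTIN_BSWAP64 as detected by configure):

  #if HAVE_ENDIAN_H
  # include <endian.h>
  # define LE64_TO_HOST(x) le64toh(x)
  # define HOST_TO_LE64(x) htole64(x)
  #elif !WORDS_BIGENDIAN
  # define LE64_TO_HOST(x) (x)
  # define HOST_TO_LE64(x) (x)
  #elif HAVE_BUILTIN_BSWAP64
  # define LE64_TO_HOST(x) __builtin_bswap64(x)
  # define HOST_TO_LE64(x) __builtin_bswap64(x)
  #else
  /* last resort: portable byte accesses, e.g. LE_READ_UINT64 on src->b */
  #endif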
Regards, /Niels
Simo Sorce simo@redhat.com writes:
Ok, I took a stab at removing xts_steal completely in the second patch; let me know what you think. I think I may like it better than my original code, and it uses nettle_block16 for temporary storage to avoid a copy.
I like the version without xts_steal.
It's slightly annoying to repeat duplicate code for a final complete block, but no big deal. Alternative ways to do the final block of the non-stealing case (including the case of exactly one block) are
  for (; length >= 2 * XTS_BLOCK_SIZE || length == XTS_BLOCK_SIZE; ...)
    {
      ...
    }
  if (length > 0)
    {
      ... steal ...
    }
or (since we require at least one block)
  do
    {
      ...
      length -= XTS_BLOCK_SIZE;
      if (!length)
        return;
    }
  while (length >= 2*XTS_BLOCK_SIZE);
Do what you think makes it clearest.
For the tests, have you checked that there's coverage for the special wraparound? I.e., that tests fail if the line
dst->b[0] ^= 0x87 & -carry;
is changed. Since there are a very small number of test vectors with more than one block, we could be unlucky and have carry == 0 all the time when xts_shift is called from the tests...
+static void
+check_length(size_t length, uint8_t *dst)
+{
+  assert(length >= XTS_BLOCK_SIZE);
+  /* asserts may be compiled out, try to save the user by zeroing the dst in
+     case the buffer contains sensitive data (like the clear text for inplace
+     encryption) */
+  if (length < XTS_BLOCK_SIZE)
+    memxor(dst, dst, length);
+}
Why memxor rather than memset?
Regards, /Niels
On Wed, 2019-03-20 at 06:52 +0100, Niels Möller wrote:
Simo Sorce simo@redhat.com writes:
Ok, I took a stab at removing xts_steal completely in the second patch; let me know what you think. I think I may like it better than my original code, and it uses nettle_block16 for temporary storage to avoid a copy.
I like the version without xts_steal.
It's slightly annoying to repeat duplicate code for a final complete block, but no big deal. Alternative ways to do the final block of the non-stealing case (including the case of exactly one block) are
  for (; length >= 2 * XTS_BLOCK_SIZE || length == XTS_BLOCK_SIZE; ...)
    {
      ...
    }
  if (length > 0)
    {
      ... steal ...
    }
or (since we require at least one block)
  do
    {
      ...
      length -= XTS_BLOCK_SIZE;
      if (!length)
        return;
    }
  while (length >= 2*XTS_BLOCK_SIZE);
Do what you think makes it clearest.
Makes sense, I'll pick one.
For the tests, have you checked that there's coverage for the special wraparound? I.e., that tests fail if the line
dst->b[0] ^= 0x87 & -carry;
is changed. Since there are a very small number of test vectors with more than one block, we could be unlucky and have carry == 0 all the time when xts_shift is called from the tests...
When I botched my change to xts_shift() I always got errors in the tests; I will double-check this is covered by simply removing that instruction again.
+static void
+check_length(size_t length, uint8_t *dst)
+{
+  assert(length >= XTS_BLOCK_SIZE);
+  /* asserts may be compiled out, try to save the user by zeroing the dst in
+     case the buffer contains sensitive data (like the clear text for inplace
+     encryption) */
+  if (length < XTS_BLOCK_SIZE)
+    memxor(dst, dst, length);
+}
Why memxor rather than memset?
Ah, no good reason, I'll switch it.
Simo.
On 03/15/19 16:33 , Simo Sorce wrote:
On Fri, 2019-03-15 at 09:27 -0400, Simo Sorce wrote:
I think this is the same as block_mulx, in cmac.c. (Also same byte order, right?)
Looks the same indeed, should I share it? Just copy it from cmac? Something else?
Turns out the algorithm is not equivalent: the shift is applied to the array as if it were a big 128-bit little-endian value, so the endianness of the two differs.
I changed the implementation to a much simpler form that shows the difference:
/* shift one and XOR with 0x87. */
/* src and dest can point to the same buffer for in-place operations */
static void
xts_shift(union nettle_block16 *dst,
          const union nettle_block16 *src)
{
  uint8_t carry = src->b[15] >> 7;
  dst->u64[1] = (src->u64[1] << 1) | (src->u64[0] >> 63);
  dst->u64[0] = src->u64[0] << 1;
  dst->b[0] ^= 0x87 & -carry;
Nitpick: mixing different-sized accesses (esp. writes) to the same memory is problematic for modern cpus (it confuses the speculative engine):

  uint64_t carry = src->u64[1] >> 63;
  dst->u64[1] = (src->u64[1] << 1) | (src->u64[0] >> 63);
  dst->u64[0] = src->u64[0] << 1;
  dst->u64[0] ^= 0x87 & -carry;
}
On Sat, 2019-03-16 at 16:26 +0300, Yuriy M. Kaminskiy wrote:
On 03/15/19 16:33 , Simo Sorce wrote:
On Fri, 2019-03-15 at 09:27 -0400, Simo Sorce wrote:
I think this is the same as block_mulx, in cmac.c. (Also same byte order, right?)
Looks the same indeed, should I share it? Just copy it from cmac? Something else?
Turns out the algorithm is not equivalent: the shift is applied to the array as if it were a big 128-bit little-endian value, so the endianness of the two differs.
I changed the implementation to a much simpler form that shows the difference:
/* shift one and XOR with 0x87. */
/* src and dest can point to the same buffer for in-place operations */
static void
xts_shift(union nettle_block16 *dst,
          const union nettle_block16 *src)
{
  uint8_t carry = src->b[15] >> 7;
  dst->u64[1] = (src->u64[1] << 1) | (src->u64[0] >> 63);
  dst->u64[0] = src->u64[0] << 1;
  dst->b[0] ^= 0x87 & -carry;
Nitpick: mixing different-sized accesses (esp. writes) to the same memory is problematic for modern cpus (it confuses the speculative engine):
In what way does it confuse speculation?
  uint64_t carry = src->u64[1] >> 63;
  dst->u64[1] = (src->u64[1] << 1) | (src->u64[0] >> 63);
  dst->u64[0] = src->u64[0] << 1;
  dst->u64[0] ^= 0x87 & -carry;
This is equivalent, but I would like to know what the exact problem is before changing it. Confusing speculation may actually be a feature in this day and age :-)
Simo.