On 13/12/2014 7:49 a.m., Niels Möller wrote:
Amos Jeffries squid3@treenet.co.nz writes:
The attached patch implements a proposed API/ABI extension adding support for RFC 4648 section 5, "Base 64 Encoding with URL and Filename Safe Alphabet".
Thanks. It makes sense to me to add support for this in one way or the other.
External code simply calls the init() function relevant to the alphabet it needs to encode or decode with. Internally, the library uses the context to select which lookup table to use for later base64 function calls.
Passing the alphabet to the init function and recording it in the context is nice in some ways, but it breaks the ABI (since the size of the context struct is part of the ABI). Now, maybe we have to break the ABI for 3.1 anyway, to introduce versioned symbols. I don't yet know for sure if an soname change is necessary or not.
The alternative to storing this in the context, is to introduce some general encoding/decoding functions which support arbitrary alphabets and take the needed table as an additional argument. Then base64_* and base64url_* would be simple wrapper functions passing the right table.
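For illustration, that wrapper arrangement might look roughly like the sketch below (the _base64_encode_update_table() name, the alphabet strings and the exact prototypes are assumptions for the example, not part of the patch or the current API):

#include <stddef.h>
#include <stdint.h>
#include <nettle/base64.h>   /* for struct base64_encode_ctx */

/* Hypothetical internal worker that takes the alphabet explicitly. */
size_t
_base64_encode_update_table(struct base64_encode_ctx *ctx,
                            uint8_t *dst, size_t length,
                            const uint8_t *src,
                            const char *alphabet);

static const char base64_alphabet[65] =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
static const char base64url_alphabet[65] =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

/* The public functions leave the context struct untouched and just
   forward to the worker with the right alphabet. */
size_t
base64_encode_update(struct base64_encode_ctx *ctx,
                     uint8_t *dst, size_t length, const uint8_t *src)
{
  return _base64_encode_update_table(ctx, dst, length, src,
                                     base64_alphabet);
}

size_t
base64url_encode_update(struct base64_encode_ctx *ctx,
                        uint8_t *dst, size_t length, const uint8_t *src)
{
  return _base64_encode_update_table(ctx, dst, length, src,
                                     base64url_alphabet);
}

base64_decode_update() could be handled the same way, with the 256-entry decode table as the extra argument.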
The base64_encode_raw() and base64_encode_group() functions which do not use contexts are left untouched for now.
Both functions are undocumented, so we can consider changing them if needed. As for encode_group, that seems to be used only in the openpgp code, which is not really in a working state. So we can probably ignore that one.
But base64_encode_raw is used by the main encoding function, base64_encode_update. So to support other alphabets one either needs to pass some kind of alphabet argument to base64_encode_raw, or introduce a new function with such an argument.
Does anyone know if applications are using base64_encode_raw, despite its status as undocumented?
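For comparison, the new-function option could be as small as this (the _nettle_base64_encode_raw name and its alphabet argument are invented for the example; only base64_encode_raw exists today):

/* Existing, undocumented entry point, kept as-is. */
void
base64_encode_raw(uint8_t *dst, size_t length, const uint8_t *src);

/* Hypothetical variant taking the alphabet explicitly; the old
   function would just call this with the standard alphabet. */
void
_nettle_base64_encode_raw(uint8_t *dst, size_t length,
                          const uint8_t *src, const char *alphabet);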
+/* which alphabet to use */
+#define BASE64_ALPHABET 0
+#define BASE64URL_ALPHABET 1
I think I'd prefer to avoid an enumeration of available alphabets. And in the API you sketch, applications will never use these constants, right?
Yes.
/* Base64 encoding */
 /* Maximum length of output for base64_encode_update. NOTE: Doesn't
@@ -73,11 +79,17 @@
 struct base64_encode_ctx
 {
   unsigned word;   /* Leftover bits */
   unsigned bits;   /* Number of bits, always 0, 2, or 4. */
+  unsigned alphabet; /* which alphabet to use for encoding */
 };
Assuming we go with adding stuff to the context struct, I'd prefer to add a pointer directly to the alphabet (and for decoding, a pointer directly to the lookup table). So we don't need any switch on an alphabet enum.
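Concretely, the context structs might then look something like this (a sketch only; the member names and exact types are not settled):

struct base64_encode_ctx
{
  const char *alphabet;     /* 64-character encoding alphabet */
  unsigned word;            /* Leftover bits */
  unsigned bits;            /* Number of bits, always 0, 2, or 4. */
};

struct base64_decode_ctx
{
  const signed char *table; /* 256-entry decoding table */
  unsigned word;            /* Leftover bits */
  unsigned bits;            /* Number of buffered bits */
  unsigned padding;         /* Number of padding characters seen */
};

That still grows the structs, of course, so the ABI question above remains either way.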
diff --git a/base64-decode.c b/base64-decode.c
index f622baa..fbaf54f 100644
--- a/base64-decode.c
+++ b/base64-decode.c
@@ -43,7 +43,7 @@
 #define TABLE_END -3
 
 static const signed char
-decode_table[0x100] =
+default_decode_table[0x100] =
 {
   /* White space is HT, VT, FF, CR, LF and SPC */
   -1, -1, -1, -1, -1, -1, -1, -1, -1, -2, -2, -2, -2, -2, -1, -1,
@@ -64,10 +64,40 @@ decode_table[0x100] =
   -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 };
+static const signed char
+urlextended_decode_table[0x100] =
+{
+  /* White space is HT, VT, FF, CR, LF and SPC */
+  -1, -1, -1, -1, -1, -1, -1, -1, -1, -2, -2, -2, -2, -2, -1, -1,
+  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+  -2, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 62, -1, -1,
+  52, 53, 54, 55, 56, 57, 58, 59, 60, 61, -1, -1, -1, -3, -1, -1,
+  -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
+  15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, -1, -1, -1, -1, 63,
+  -1, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
+  41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, -1, -1, -1, -1, -1,
+  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+};
I think the base64url decoding functions, and its table, should be put in its own object file. Maybe static linking is obscure these days, but it's desirable that if one uses only the standard base64 alphabet, one shouldn't get dependencies causing the linker to drag in tables for other alphabets.
Are you sure you want that?
Except for the init function and the table, the remainder of the code is all shared. Separating the tables from the code means we would end up with four extra files just holding lookup tables, and we would have to move the table symbols/pointers into a .h.
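For concreteness, the split would presumably look something like this (file and symbol names are only illustrative):

#include <stdint.h>

/* base64-internal.h (hypothetical): table declarations shared
   between their own object files and the init/update code */
extern const signed char _base64_decode_table[0x100];
extern const signed char _base64url_decode_table[0x100];
extern const uint8_t _base64_encode_table[64];
extern const uint8_t _base64url_encode_table[64];

Each base64url table and its init function would then live in its own .c file, so a static link of a program that only uses the standard alphabet never pulls them in; the cost is the extra files and the internal header.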
 void
 base64_decode_init(struct base64_decode_ctx *ctx)
 {
   ctx->word = ctx->bits = ctx->padding = 0;
+  ctx->alphabet = BASE64_ALPHABET;
+}
+
+void
+base64url_decode_init(struct base64_decode_ctx *ctx)
+{
+  ctx->word = ctx->bits = ctx->padding = 0;
+  ctx->alphabet = BASE64URL_ALPHABET;
 }
And then the assignment of ctx->alphabet would be
ctx->table = base64_decode_table;
and
ctx->table = base64url_decode_table;
in the respective init functions.
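Spelled out, the two init functions would then read (still assuming the table-pointer member):

void
base64_decode_init(struct base64_decode_ctx *ctx)
{
  ctx->word = ctx->bits = ctx->padding = 0;
  ctx->table = base64_decode_table;
}

void
base64url_decode_init(struct base64_decode_ctx *ctx)
{
  ctx->word = ctx->bits = ctx->padding = 0;
  ctx->table = base64url_decode_table;
}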
 int
@@ -76,8 +106,11 @@
 base64_decode_single(struct base64_decode_ctx *ctx,
                      uint8_t src)
 {
   int data;
 
-  data = decode_table[src];
+  if (ctx->alphabet == BASE64URL_ALPHABET)
+    data = urlextended_decode_table[src];
+  else
+    data = default_decode_table[src];
And this piece of code would then simplify to
data = ctx->table[src];
@@ -155,6 +171,7 @@ base64_encode_single(struct base64_encode_ctx *ctx,
   unsigned done = 0;
   unsigned word = ctx->word << 8 | src;
   unsigned bits = ctx->bits + 8;
+  const uint8_t *encode_table = (ctx->alphabet == BASE64URL_ALPHABET ? urlextended_encode_table : default_encode_table);
 
   while (bits >= 6)
     {
I think this works, but only because the ENCODE macro, which refers to the name "encode_table", now picks up a local variable instead of the global table it referred to in the old code. At a casual look, the local variable appears unused. I think it's easier to understand if this dependency were made more explicit.
Would you like that done as code documentation, by manually unrolling ENCODE, or by replacing it with a static function?
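If the static-function route is taken, it could be as small as the sketch below (assuming ENCODE(x) is essentially encode_table[0x3F & (x)]):

/* Explicit replacement for the ENCODE macro: the table is a
   parameter instead of whatever "encode_table" happens to be in
   scope at the point of expansion. */
static uint8_t
encode_6bits(const uint8_t *table, unsigned bits)
{
  return table[0x3F & bits];
}

Each ENCODE(...) use would then become encode_6bits(encode_table, ...), so the dependency on the selected table is visible at the call site.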
Finally, are there other alphabets that are in use? You have mentioned, earlier, the alphabet used for crypt(3) (where is that documented???), and iirc, SRP files also use some unusual alphabet for base64.
Maybe we should make it reasonably easy to let applications supply custom alphabets? This should be almost trivial for encoding, while for decoding, we would either need to document the format of the lookup table, or provide a function to generate a decoding table for a given alphabet.
When the alphabet pointer is stored in the context, the caller is able to change it however and whenever they wish. For my first submission I thought that was slightly too dangerous to propose.
But if you are happy to go that way, it both removes the need for the alphabet enumeration and gives callers the ability to use custom alphabets.
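For what it's worth, the table-generating helper you mention could be roughly the following (the function name and prototype are invented; the marker values match the quoted table: -1 invalid, -2 white space, -3 end/'='):

void
base64_decode_table_init(signed char *table, const char *alphabet)
{
  unsigned i;

  /* Everything is invalid until proven otherwise. */
  for (i = 0; i < 0x100; i++)
    table[i] = -1;

  /* White space: HT, LF, VT, FF, CR and SPC. */
  table['\t'] = table['\n'] = table['\v'] = table['\f']
    = table['\r'] = table[' '] = -2;

  /* Padding / end marker. */
  table['='] = -3;

  /* The 64 alphabet characters map to their 6-bit values. */
  for (i = 0; i < 64; i++)
    table[(unsigned char) alphabet[i]] = i;
}

With something like that, a caller with an unusual alphabet (crypt(3), SRP) could build its own table and point the context at it.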
Amos