On 13/12/2014 7:49 a.m., Niels Möller wrote:
Amos Jeffries squid3@treenet.co.nz writes:
The attached patch implements a proposed API/ABI extension adding support for RFC 4648 section 5, "Base 64 Encoding with URL and Filename Safe Alphabet".
Thanks. It makes sense to me to add support for this in one way or the other.
External code simply calls the init() function relevant to the alphabet it needs to encode or decode with. Internally, the library uses the context to select which lookup table to use for later base64 function calls.
Passing the alphabet to the init function and recording it in the context is nice in some ways, but it breaks the ABI (since the size of the context struct is part of the ABI). Now, maybe we have to break the ABI for 3.1 anyway, to introduce versioned symbols. I don't yet know for sure if an soname change is necessary or not.
The alternative to storing this in the context, is to introduce some general encoding/decoding functions which support arbitrary alphabets and take the needed table as an additional argument. Then base64_* and base64url_* would be simple wrapper functions passing the right table.
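For illustration, that wrapper arrangement might look roughly like the sketch below (the _base64_encode_update_table() name, the alphabet strings and the exact prototypes are assumptions for the example, not part of the patch or the current API):

#include <stddef.h>
#include <stdint.h>
#include <nettle/base64.h>   /* for struct base64_encode_ctx */

/* Hypothetical internal worker that takes the alphabet explicitly. */
size_t
_base64_encode_update_table(struct base64_encode_ctx *ctx,
                            uint8_t *dst, size_t length,
                            const uint8_t *src,
                            const char *alphabet);

static const char base64_alphabet[65] =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
static const char base64url_alphabet[65] =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

/* The public functions leave the context struct untouched and just
   forward to the worker with the right alphabet. */
size_t
base64_encode_update(struct base64_encode_ctx *ctx,
                     uint8_t *dst, size_t length, const uint8_t *src)
{
  return _base64_encode_update_table(ctx, dst, length, src,
                                     base64_alphabet);
}

size_t
base64url_encode_update(struct base64_encode_ctx *ctx,
                        uint8_t *dst, size_t length, const uint8_t *src)
{
  return _base64_encode_update_table(ctx, dst, length, src,
                                     base64url_alphabet);
}

base64_decode_update() could be handled the same way, with the 256-entry decode table as the extra argument.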
The base64_encode_raw() and base64_encode_group() functions which do not use contexts are left untouched for now.
Both functions are undocumented, so we can consider changing them if needed. As for encode_group, that seems to be used only in the openpgp code, which is not really in a working state. So we can probably ignore that one.
But base64_encode_raw is used by the main encoding function, base64_encode_update. So to support other alphabets one either needs to pass some kind of alphabet argument to base64_encode_raw, or introduce a new function with such an argument.
Does anyone know if applications are using base64_encode_raw, despite its status as undocumented?
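For comparison, the new-function option could be as small as this (the _nettle_base64_encode_raw name and its alphabet argument are invented for the example; only base64_encode_raw exists today):

/* Existing, undocumented entry point, kept as-is. */
void
base64_encode_raw(uint8_t *dst, size_t length, const uint8_t *src);

/* Hypothetical variant taking the alphabet explicitly; the old
   function would just call this with the standard alphabet. */
void
_nettle_base64_encode_raw(uint8_t *dst, size_t length,
                          const uint8_t *src, const char *alphabet);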
+/* which alphabet to use */
+#define BASE64_ALPHABET 0
+#define BASE64URL_ALPHABET 1
I think I'd prefer to avoid an enumeration of available alphabets. And in the API you sketch, applications will never use these constants, right?
Yes.
/* Base64 encoding */
 /* Maximum length of output for base64_encode_update. NOTE: Doesn't
@@ -73,11 +79,17 @@
 struct base64_encode_ctx
 {
   unsigned word;   /* Leftover bits */
   unsigned bits;   /* Number of bits, always 0, 2, or 4. */
+  unsigned alphabet; /* which alphabet to use for encoding */
 };
Assuming we go with adding stuff to the context struct, I'd prefer to add a pointer directly to the alphabet (and for decoding, a pointer directly to the lookup table). So we don't need any switch on an alphabet enum.
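Concretely, the context structs might then look something like this (a sketch only; the member names and exact types are not settled):

struct base64_encode_ctx
{
  const char *alphabet;     /* 64-character encoding alphabet */
  unsigned word;            /* Leftover bits */
  unsigned bits;            /* Number of bits, always 0, 2, or 4. */
};

struct base64_decode_ctx
{
  const signed char *table; /* 256-entry decoding table */
  unsigned word;            /* Leftover bits */
  unsigned bits;            /* Number of buffered bits */
  unsigned padding;         /* Number of padding characters seen */
};

That still grows the structs, of course, so the ABI question above remains either way.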
diff --git a/base64-decode.c b/base64-decode.c
index f622baa..fbaf54f 100644
--- a/base64-decode.c
+++ b/base64-decode.c
@@ -43,7 +43,7 @@
 #define TABLE_END -3
 
 static const signed char
-decode_table[0x100] =
+default_decode_table[0x100] =
 {
   /* White space is HT, VT, FF, CR, LF and SPC */
   -1, -1, -1, -1, -1, -1, -1, -1, -1, -2, -2, -2, -2, -2, -1, -1,
@@ -64,10 +64,40 @@ decode_table[0x100] =
   -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 };
+static const signed char
+urlextended_decode_table[0x100] =
+{
+  /* White space is HT, VT, FF, CR, LF and SPC */
+  -1, -1, -1, -1, -1, -1, -1, -1, -1, -2, -2, -2, -2, -2, -1, -1,
+  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+  -2, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 62, -1, -1,
+  52, 53, 54, 55, 56, 57, 58, 59, 60, 61, -1, -1, -1, -3, -1, -1,
+  -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
+  15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, -1, -1, -1, -1, 63,
+  -1, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
+  41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, -1, -1, -1, -1, -1,
+  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+};
I think the base64url decoding functions, and its table, should be put in its own object file. Maybe static linking is obscure these days, but it's desirable that if one uses only the standard base64 alphabet, one shouldn't get dependencies causing the linker to drag in tables for other alphabets.
Are you sure you want that?
Except for the init function and the table, the remainder of the code is all shared. Separating the tables from the code means we would end up with four extra files just holding lookup tables, and we would have to move the table symbols/pointers into a .h.
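For concreteness, the split would presumably look something like this (file and symbol names are only illustrative):

#include <stdint.h>

/* base64-internal.h (hypothetical): table declarations shared
   between their own object files and the init/update code */
extern const signed char _base64_decode_table[0x100];
extern const signed char _base64url_decode_table[0x100];
extern const uint8_t _base64_encode_table[64];
extern const uint8_t _base64url_encode_table[64];

Each base64url table and its init function would then live in its own .c file, so a static link of a program that only uses the standard alphabet never pulls them in; the cost is the extra files and the internal header.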
 void
 base64_decode_init(struct base64_decode_ctx *ctx)
 {
   ctx->word = ctx->bits = ctx->padding = 0;
+  ctx->alphabet = BASE64_ALPHABET;
+}
+
+void
+base64url_decode_init(struct base64_decode_ctx *ctx)
+{
+  ctx->word = ctx->bits = ctx->padding = 0;
+  ctx->alphabet = BASE64URL_ALPHABET;
 }
And then the assignment of ctx->alphabet would be
ctx->table = base64_decode_table;
and
ctx->table = base64url_decode_table;
in the respective init functions.
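Spelled out, the two init functions would then read (still assuming the table-pointer member):

void
base64_decode_init(struct base64_decode_ctx *ctx)
{
  ctx->word = ctx->bits = ctx->padding = 0;
  ctx->table = base64_decode_table;
}

void
base64url_decode_init(struct base64_decode_ctx *ctx)
{
  ctx->word = ctx->bits = ctx->padding = 0;
  ctx->table = base64url_decode_table;
}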
 int
@@ -76,8 +106,11 @@
 base64_decode_single(struct base64_decode_ctx *ctx,
                      uint8_t src)
 {
   int data;
 
-  data = decode_table[src];
+  if (ctx->alphabet == BASE64URL_ALPHABET)
+    data = urlextended_decode_table[src];
+  else
+    data = default_decode_table[src];
And this piece of code would then simplify to
data = ctx->table[src];
@@ -155,6 +171,7 @@ base64_encode_single(struct base64_encode_ctx *ctx,
   unsigned done = 0;
   unsigned word = ctx->word << 8 | src;
   unsigned bits = ctx->bits + 8;
+  const uint8_t *encode_table = (ctx->alphabet == BASE64URL_ALPHABET ? urlextended_encode_table : default_encode_table);
 
   while (bits >= 6)
     {
I think this works, but only because the ENCODE macro, which refers to the name "encode_table", now picks up a local variable instead of the global table it referred to in the old code. At a casual look, the local variable appears unused. I think it's easier to understand if this dependency were made more explicit.
Would you like that done as code documentation, by manually unrolling ENCODE, or by replacing it with a static function?
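If the static-function route is taken, it could be as small as the sketch below (assuming ENCODE(x) is essentially encode_table[0x3F & (x)]):

/* Explicit replacement for the ENCODE macro: the table is a
   parameter instead of whatever "encode_table" happens to be in
   scope at the point of expansion. */
static uint8_t
encode_6bits(const uint8_t *table, unsigned bits)
{
  return table[0x3F & bits];
}

Each ENCODE(...) use would then become encode_6bits(encode_table, ...), so the dependency on the selected table is visible at the call site.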
Finally, are there other alphabets that are in use? You have mentioned, earlier, the alphabet used for crypt(3) (where is that documented???), and iirc, SRP files also use some unusual alphabet for base64.
Maybe we should make it reasonably easy to let applications supply custom alphabets? This should be almost trivial for encoding, while for decoding, we would either need to document the format of the lookup table, or provide a function to generate a decoding table for a given alphabet.
When the alphabet pointer is stored in the context, the caller is able to change it however and whenever they wish. For my first submission I thought that was slightly too dangerous to propose.
But if you are happy to go that way, it both removes the need for the alphabet enumeration and gives callers the ability to use custom alphabets.
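For what it's worth, the table-generating helper you mention could be roughly the following (the function name and prototype are invented; the marker values match the quoted table: -1 invalid, -2 white space, -3 end/'='):

void
base64_decode_table_init(signed char *table, const char *alphabet)
{
  unsigned i;

  /* Everything is invalid until proven otherwise. */
  for (i = 0; i < 0x100; i++)
    table[i] = -1;

  /* White space: HT, LF, VT, FF, CR and SPC. */
  table['\t'] = table['\n'] = table['\v'] = table['\f']
    = table['\r'] = table[' '] = -2;

  /* Padding / end marker. */
  table['='] = -3;

  /* The 64 alphabet characters map to their 6-bit values. */
  for (i = 0; i < 64; i++)
    table[(unsigned char) alphabet[i]] = i;
}

With something like that, a caller with an unusual alphabet (crypt(3), SRP) could build its own table and point the context at it.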
Amos