On Wed, Nov 23, 2016 at 10:30 PM, Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) @ Pike (-) developers forum 10353@lyskom.lysator.liu.se wrote:
I think you are conflating range with interpretation. Both a Latin1 string and a UTF-8 encoded one are 8-bit strings (with a 0-255 range). What would be useful is a datatype that declares that the elements are not Unicode characters (as they are in the Latin1 string case) but some raw binary encoding (as they are in the UTF-8 case), optionally also specifying which encoding. This has been suggested before (with "buffer" as a suggestion for the name of the new datatype), but it has never been implemented due to the difficulty of introducing such a datatype in a consistent way while still retaining backward compatibility.
I agree, but using string(8bit) to mean "binary data" is something that's 100% backward compatible. Unicode text would always be referred to as string(21bit), even if it happens to contain nothing but Latin-1 characters.
FWIW, I would support an actual division of data types, such that you cannot concatenate one onto the other. But having seen what happened with Python 2 -> Python 3, I would expect this to be a fairly significant backward compatibility break. It'd probably be something for Pike 9.0 or even 10.0. There would be an opportunity to learn from Python here, though, and maybe do things more smoothly. In any case, the first step is to *right now* think about binary data and textual data as different things, and distinguish them and convert them as appropriate.
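For anyone who hasn't seen the Python 3 split in action, here is a rough sketch of it. It is Python, not Pike, and only illustrates the idea; the variable and function names are made up for the example. It shows that Latin-1 and UTF-8 data occupy the same 0-255 range while meaning different things, and that Python 3 refuses to concatenate bytes onto text.

```python
# Illustrative only: all names here are hypothetical.
text = "na\u00efve"                 # Unicode text, 5 code points

latin1 = text.encode("latin-1")     # 5 bytes, one per character
utf8 = text.encode("utf-8")         # 6 bytes, U+00EF takes two

# Same range (0-255) in both cases; only the interpretation differs.
same_range = all(0 <= b <= 255 for b in latin1 + utf8)

# Interpreting UTF-8 bytes as if they were Latin-1 text gives mojibake.
mojibake = utf8.decode("latin-1")

def mixing_raises():
    """Return True if adding text onto bytes raises TypeError."""
    try:
        utf8 + text
    except TypeError:
        return True
    return False

# Conversion has to be explicit, with a named encoding.
round_trip = utf8.decode("utf-8")
```

The point is the last two steps: the type system rejects the accidental mix, and the only way across the boundary is an explicit encode or decode.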
ChrisA