On Wed, Nov 23, 2016 at 10:00 PM, Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) @ Pike (-) developers forum 10353@lyskom.lysator.liu.se wrote:
If there are no character values >127, then the encoding step is a no-op, so skipping it buys you nothing except making your code harder to read.
I envision a future time when we can usefully distinguish between string(21bit) and string(8bit), with string(7bit) being trivially transformed into either of the above. As such, calling utf8_to_string or string_to_utf8 would be an actual type conversion (albeit of a subtype, rather than actually requiring any major changes).
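The no-op claim is easy to check. A minimal sketch, using only the built-in string_to_utf8; the helper name utf8_is_noop is made up for illustration:

```pike
// Hypothetical helper: returns 1 if UTF-8 encoding leaves s unchanged.
int(0..1) utf8_is_noop(string s)
{
  return string_to_utf8(s) == s;
}

int main()
{
  string ascii  = "Hello, world!";         // every character <= 127
  string latin1 = "r\344ksm\366rg\345s";   // "räksmörgås" via octal escapes

  write("%d\n", utf8_is_noop(ascii));      // 1: nothing to encode
  write("%d\n", utf8_is_noop(latin1));     // 0: three characters are > 127

  // Each of the three non-ASCII characters becomes two bytes.
  write("%d -> %d\n", sizeof(latin1), sizeof(string_to_utf8(latin1)));  // 10 -> 13
  return 0;
}
```

So for 7-bit data the call costs nothing semantically, and the code keeps working when a character above 127 turns up.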
ChrisA
I think you are conflating range with interpretation. Both a Latin1 string and a UTF-8 encoded one are 8-bit strings (with a 0-255 range). What would be useful is a datatype that declares that the elements are not Unicode characters (as they are in the Latin1 string case) but some raw binary encoding (as they are in the UTF-8 case), optionally also specifying which encoding. This has been suggested before (with "buffer" as a suggestion for the name of the new datatype), but it has never been implemented due to the difficulty of introducing such a datatype in a consistent way while still retaining backward compatibility.
(The idea for the new datatype was that it would be used for I/O, which always needs to be encoded somehow, and that it would not internalize (hash) the values since this is generally less useful for encoded strings.)
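To make the range/interpretation distinction concrete, here is a small sketch using only string_to_utf8 and utf8_to_string. The helper is_valid_utf8 is a made-up name, not an existing function:

```pike
// Hypothetical helper: 1 if s decodes as UTF-8, 0 if utf8_to_string throws.
int(0..1) is_valid_utf8(string s)
{
  mixed err = catch { utf8_to_string(s); };
  return !err;
}

int main()
{
  string text  = "\344";                // one character, U+00E4
  string bytes = string_to_utf8(text);  // its encoding, two bytes

  // Both are 8-bit strings; nothing in the type tells them apart.
  write("%d %d\n", sizeof(text), sizeof(bytes));   // 1 2

  // Read as characters, bytes is just two Latin1 characters.
  write("%d\n", bytes == "\303\244");              // 1

  // Read as UTF-8, it decodes back to the original character.
  write("%d\n", utf8_to_string(bytes) == text);    // 1

  // The character string itself is not valid UTF-8: a lone 0xE4
  // starts a multi-byte sequence that never completes.
  write("%d %d\n", is_valid_utf8(bytes), is_valid_utf8(text));  // 1 0
  return 0;
}
```

Only the programmer knows which of the two values holds characters and which holds encoded bytes, which is the gap a separate datatype would close.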
Strings with a known encoding that can be converted into other strings with a known encoding easily and readably (and in some cases without any explicit interaction) would be useful.
For instance,
Stdio.FILE x = ...;
x->set_encoding("utf8");

string s = "räksmörgås";
String t = String.JP2022("\33(BHello, world!");
x->write(s);
x->write(t);
x->write(s+t);
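For comparison, roughly the same effect has to be spelled out by hand today. The following is only a sketch: set_encoding and String.JP2022 above are proposals, not existing API, and I am assuming that Charset.decoder accepts the name "iso-2022-jp" and that the Charset module is available at top level (older versions have it as Locale.Charset). The helper from_iso2022jp is a made-up name.

```pike
// Hypothetical helper: decode ISO-2022-JP bytes into a character string.
// Assumes Charset.decoder knows the charset name "iso-2022-jp".
string from_iso2022jp(string raw)
{
  return Charset.decoder("iso-2022-jp")->feed(raw)->drain();
}

int main()
{
  string s = "r\344ksm\366rg\345s";                  // characters
  string t = from_iso2022jp("\33(BHello, world!");   // bytes -> characters

  // Every write has to remember to encode at the I/O boundary.
  write(string_to_utf8(s) + "\n");
  write(string_to_utf8(t) + "\n");
  write(string_to_utf8(s + t) + "\n");
  return 0;
}
```

With encoding-aware strings and files, the decode and encode calls would disappear from the calling code, and forgetting one of them would no longer be possible.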
pike-devel@lists.lysator.liu.se