On Thu, Nov 24, 2016 at 12:20 AM, Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) @ Pike (-) developers forum 10353@lyskom.lysator.liu.se wrote:
\U12345678 possibly should be an error, as it's not valid Unicode.
It's valid Pike. Pike supports the full ISO/IEC 10646 31-bit range, plus an equally large negative range.
so you could use string(32bit) for those sorts of non-textual strings.
Not string(31bit)?
My statement about Unicode text specifically excludes anything that isn't valid according to the Unicode standard.
Which makes it even worse since the set of valid characters changes with each release of the Unicode standard...
AIUI the Unicode Consortium has declared that they will never define any characters beyond 0x10FFFF, as that would destroy UTF-16 as a valid encoding. And you're right that "valid according to the Unicode standard" is a bit too restrictive, but certainly "valid within the declaration of character range" should be safe.
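To illustrate why 0x10FFFF is the ceiling, here is a rough sketch in Python (not Pike) of the UTF-16 surrogate-pair arithmetic. The function name utf16_surrogates and the constant MAX_UTF16 are made up for this example; only the arithmetic comes from the UTF-16 encoding itself.

```python
def utf16_surrogates(cp):
    """Split a code point above 0xFFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in the surrogate-pair range")
    offset = cp - 0x10000              # at most 20 bits of payload
    high = 0xD800 + (offset >> 10)     # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go in the low surrogate
    return high, low

# Two 10-bit payloads give 2**20 values above the 16-bit plane,
# so the largest code point UTF-16 can reach is 0x10FFFF.
MAX_UTF16 = 0x10000 + (1 << 20) - 1

high, low = utf16_surrogates(MAX_UTF16)
```

Anything larger, such as \U12345678, simply has no surrogate pair available, which is why that limit is unlikely to ever move.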
What type would "Foo" have? How would you specify a UTF-8 encoded literal?
Now, these are questions that can't truly be answered with the current system. I would like the former to be string(7bit),
Then you are contradicting yourself, since you claimed that Unicode text would _always_ be referred to as string(21bit), and "Foo" is definitely Unicode text (both 'F' and 'o' have been part of the Unicode standard since the first version).
string(7bit) would be implicitly upcastable to string(21bit), since ASCII text can be represented validly as either bytes (with the top bit clear) or as Unicode codepoint sequences.
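As a rough illustration of the upcast idea, again in Python rather than Pike: pure-ASCII text has the same numeric values whether you view it as bytes or as code points, so it fits in the narrowest class either way. The helper width_class is hypothetical and only models the 7/8/21-bit categories discussed in this thread (it ignores Pike's negative range).

```python
def width_class(values):
    """Return the narrowest of the 7/8/21/32-bit classes that holds every value."""
    top = max(values, default=0)
    for bits in (7, 8, 21):
        if top < (1 << bits):
            return bits
    return 32

text = "Foo"
code_points = [ord(c) for c in text]    # Unicode view of the text
raw_bytes = list(text.encode("ascii"))  # byte view of the same text

same_values = code_points == raw_bytes  # ASCII: both views are identical
ascii_class = width_class(code_points)

# Non-ASCII text differs between the two views: one code point above
# 0xFF becomes several bytes, each of them above 0x7F.
euro_points = width_class([ord(c) for c in "\u20ac"])
euro_utf8 = width_class(list("\u20ac".encode("utf-8")))
```

Under these assumptions, "Foo" lands in the 7-bit class in both views, while the euro sign needs the 21-bit class as text but only the 8-bit class once it is UTF-8 encoded.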
and the latter would be either string(7bit) or string(8bit) depending on whether there are non-ASCII characters in it.
But how would the compiler know that the characters are UTF-8 encoded, so that it does not assign a type of string(21bit) instead?
Right, and that's something that can't be done in the current standard. Hence this entire proposal has to wait until some major changes can be done.
In Python, it's done with a prefix - u"asdf" is a Unicode string, and b"asdf" is a byte string. It would need to be something similarly syntactic, once the two become actually different types. For today's Pikes, though, that's not possible, so the safest way is to simply keep track of it in the programmer's mind, without language support.
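For reference, a small sketch of how the prefixes behave in Python 3 (where the u prefix is optional, since unprefixed literals are already Unicode strings). This only shows Python's behaviour; it says nothing about how a future Pike syntax would look.

```python
text = u"asdf"   # Unicode string: a sequence of code points
data = b"asdf"   # byte string: a sequence of values 0..255

text_type = type(text).__name__
data_type = type(data).__name__

# Converting between the two is always explicit and names an encoding.
round_trip = data.decode("ascii") == text and text.encode("ascii") == data

# Mixing the two types by accident is an error rather than a silent guess.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

# A UTF-8 encoded non-ASCII character shows up as several bytes.
encoded = "\u00e5".encode("utf-8")
```

The point is that the distinction lives in the type, so the compiler never has to guess whether a literal holds text or encoded bytes.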
ChrisA