On Thu, Nov 24, 2016 at 12:20 AM, Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) @ Pike (-) developers forum 10353@lyskom.lysator.liu.se wrote:
>> \U12345678 possibly should be an error, as it's not valid Unicode.
> It's valid Pike. Pike supports the full ISO/IEC 10646 31-bit range, plus an equally large negative range.
>> so you could use string(32bit) for those sorts of non-textual strings.
> Not string(31bit)?
>> My statement about Unicode text specifically excludes anything that isn't valid according to the Unicode standard.
> Which makes it even worse since the set of valid characters changes with each release of the Unicode standard...
AIUI the Unicode Consortium has declared that they will never define any characters beyond 0x10FFFF, as that would destroy UTF-16 as a valid encoding. And you're right that "valid according to the Unicode standard" is a bit too restrictive, but certainly "valid within the declaration of character range" should be safe.
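The 0x10FFFF ceiling falls out of how UTF-16 surrogate pairs are laid out. A minimal sketch in Python (chosen only because Python comes up later in this thread; the function names are made up for illustration and are not part of any Pike or Python API):

```python
# A surrogate pair carries 10 payload bits in the high surrogate
# (0xD800..0xDBFF) and 10 in the low surrogate (0xDC00..0xDFFF),
# offset by 0x10000, the first code point past the BMP.
def max_utf16_codepoint() -> int:
    high_payload = 0xDBFF - 0xD800  # 0x3FF
    low_payload = 0xDFFF - 0xDC00   # 0x3FF
    return 0x10000 + (high_payload << 10) + low_payload


def in_declared_unicode_range(cp: int) -> bool:
    """True if cp lies within the declared range, whether or not it is assigned."""
    return 0 <= cp <= 0x10FFFF
```

The largest value, 0x10FFFF, needs 21 bits, which is where the string(21bit) figure comes from; \U12345678 falls well outside it.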
>>> What type would "Foo" have? How would you specify a UTF-8 encoded literal?
>> Now, these are questions that can't truly be answered with the current system. I would like the former to be string(7bit),
> Then you are contradicting yourself, since you claimed that Unicode text would _always_ be referred to as string(21bit), and "Foo" is definitely Unicode text (both 'F' and 'o' have been part of the Unicode standard since the first version).
string(7bit) would be implicitly upcastable to string(21bit), since ASCII text can be represented validly as either bytes (with the top bit clear) or as Unicode codepoint sequences.
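To illustrate why the widening is harmless, here is a rough Python sketch. `min_width_bits` is a hypothetical helper that only mimics the idea of a string element width; it is not how Pike computes its string types.

```python
def min_width_bits(s: str) -> int:
    """Smallest n such that every code point in s fits in n bits (illustrative only)."""
    return max((ord(c).bit_length() for c in s), default=0)


def ascii_values_match(text: str) -> bool:
    """For pure ASCII, the byte values and the code point values are identical."""
    raw = text.encode("ascii")  # raises UnicodeEncodeError for non-ASCII input
    return list(raw) == [ord(c) for c in text]
```

For "Foo" every element fits in 7 bits and the two representations carry the same numbers, so treating it as 21-bit Unicode text loses nothing. The reverse direction is not safe in general, since a single non-ASCII character pushes the width past 7 bits.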
>> and the latter would be either string(7bit) or string(8bit) depending on whether there are non-ASCII characters in it.
> But how would the compiler know that the characters are UTF-8 encoded, so that it does not assign a type of string(21bit) instead?
Right, and that's something that can't be done in the current standard. Hence this entire proposal has to wait until some major changes can be done.
In Python, it's done with a prefix - u"asdf" is a Unicode string, and b"asdf" is a byte string. It would need to be something similarly syntactic, once the two become actually different types. For today's Pikes, though, that's not possible, so the safest way is to simply keep track of it in the programmer's mind, without language support.
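For anyone who hasn't used it, this is roughly what the Python 3 distinction looks like in practice (a small sketch; the variable and function names are arbitrary):

```python
text = u"asdf"  # str: a sequence of Unicode code points (the u prefix is optional in Python 3)
data = b"asdf"  # bytes: a sequence of integers 0..255


def try_concat(a, b):
    """Return a + b, or None if the two types refuse to mix."""
    try:
        return a + b
    except TypeError:
        return None


# Moving between the two types always goes through an explicit encoding.
utf8 = "sm\u00f6rg\u00e5s".encode("utf-8")  # "smörgås": 7 code points, 9 bytes
```

The two literals look alike but are different types that never compare equal and cannot be concatenated, which is the kind of separation being asked for here.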
ChrisA
> Right, and that's something that can't be done in the current standard. Hence this entire proposal has to wait until some major changes can be done.
Yup. And then those changes should not be a repurposing of an existing mechanism (element ranges on the string type) but something more appropriate for the goals.
> In Python, it's done with a prefix - u"asdf" is a Unicode string, and b"asdf" is a byte string.
Since nominally strings are Unicode (with the extended ISO 10646 range) strings now, I think "asdf" can be left as the syntax for that, and we only need a new syntax for the byte string ("buffer") type. We can also look at Java, which has byte[] as the type for byte strings, requiring literals like {'a','s','d','f'}, but I would like to see something a bit more convenient to use. :-)
Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) @ Pike (-) developers forum wrote:
> can also look at Java, which has byte[] as the type for byte strings, requiring literals like {'a','s','d','f'}, but I would like to see
In the EngineIO implementation I currently abuse Stdio.Buffer to fulfill this binary data type here and there. It's not ideal, but it works.
Well, I'm not sure that's actually abusing it; Stdio.Buffer is a sort of compromise for getting some of the benefits of a native buffer type while not getting all of the problems (it does not affect compatibility as it uses a separate set of APIs, and while that does lead to inconsistency it's not too bad when a class does it). So in cases where a native buffer type would have helped you, I think using Stdio.Buffer as a substitute (provided it actually is able to help you in the same way) is valid.
pike-devel@lists.lysator.liu.se