On Wed, Nov 23, 2016 at 11:10 PM, Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) @ Pike (-) developers forum 10353@lyskom.lysator.liu.se wrote:
I agree, but using string(8bit) to mean "binary data" is something that's 100% backward compatible.
It would not be backwards compatible, since that is not what string(8bit) means today.
By "binary data", I mean eight-bit strings of arbitrary bytes - like you'd read from a file or something. Currently, functions like Stdio.read_file simply return "string", but they'll effectively be returning string(8bit).
Unicode text would always be referred to as string(21bit), even if it happens to contain nothing but Latin-1 characters.
That doesn't really make sense. So you say that "R\xe4ksm\xf6rg\xe5s" would have type string(21bit)? What type would "\U12345678" have?
\U12345678 possibly should be an error, as it's not valid Unicode. Maybe the Pike string type can be used for other things, but they're not Unicode text - so you could use string(32bit) for those sorts of non-textual strings. (I don't know of any use cases, so I can't say beyond that.) My statement about Unicode text specifically excludes anything that isn't valid according to the Unicode standard.
What type would "Foo" have? How would you specify a UTF-8 encoded literal?
Now, these are questions that can't truly be answered with the current system. I would like the former to be string(7bit), and the latter would be either string(7bit) or string(8bit) depending on whether there are non-ASCII characters in it. But they're probably both just type 'string' at the moment.
ChrisA
By "binary data", I mean eight-bit strings of arbitrary bytes - like you'd read from a file or something. Currently, functions like Stdio.read_file simply return "string", but they'll effectively be returning string(8bit).
No, Stdio.read_file currently returns string(8bit). That simply means that each element will be in the range 0-255. If you were to change the meaning to something else, you would create compatibility issues by making some currently valid assignments involving string(8bit) invalid.
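As a minimal sketch of the point above (assuming Pike 8.0 or later, where the string(8bit) syntax is available; the variable names are only illustrative), the type restricts the range of each element and says nothing about whether the content is text or binary:

```pike
// Sketch only: string(8bit) constrains element values, not their meaning.
int main()
{
  // Latin-1 text: every element is in the range 0-255.
  string(8bit) latin1 = "R\xe4ksm\xf6rg\xe5s";

  // Arbitrary bytes fit exactly the same type.
  string(8bit) bytes = "\xff\xfe\0\1";

  // String.width() reports the storage width needed for the widest element.
  write("%d %d\n", String.width(latin1), String.width(bytes));

  // A character outside 0-255 (here U+20AC) needs a wider string.
  string wide = "\u20ac";
  write("%d\n", String.width(wide));
  return 0;
}
```

String.width should report 8 for the first two strings and 16 for the last one, which is why redefining string(8bit) as "binary data only" would change what existing declarations accept.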
\U12345678 possibly should be an error, as it's not valid Unicode.
It's valid Pike. Pike supports the full ISO/IEC 10646 31-bit range, plus an equally large negative range.
so you could use string(32bit) for those sorts of non-textual strings.
Not string(31bit)?
My statement about Unicode text specifically excludes anything that isn't valid according to the Unicode standard.
Which makes it even worse, since the set of valid characters changes with each release of the Unicode standard...
What type would "Foo" have? How would you specify a UTF-8 encoded literal?
Now, these are questions that can't truly be answered with the current system. I would like the former to be string(7bit),
Then you are contradicting yourself, since you claimed that Unicode text would _always_ be referred to as string(21bit), and "Foo" is definitely Unicode text (both 'F' and 'o' have been part of the Unicode standard since the first version).
and the latter would be either string(7bit) or string(8bit) depending on whether there are non-ASCII characters in it.
But how would the compiler know that the characters are UTF-8 encoded, so that it does not assign a type of string(21bit) instead?
It's valid Pike. Pike supports the full ISO/IEC 10646 31-bit range, plus an equally large negative range.
Also note that Pike strings don't necessarily contain Unicode, even if they usually do. They _could_ just as well contain RGB pixels or random memory access data from a 12-bit-word system.
Yup, the thing we were discussing was how it would be nice to actually be able to declare when they contain something else. :-) But it is a valid point that binary encoded data is not necessarily 8-bit. You should definitely be allowed to declare something as buffer(12bit) if you want to store 12-bit values in it.
I think it would be a good idea as well, see 21907878.
The only thing that should have to care about the encoding should be the endpoints.
How are string constants handled today? If I do
string s = "räksmörgås";
am I guaranteed a certain encoding of s?
Yes, s will be Unicode. Of course, you need to declare the character encoding of your source file using a #charset tag (or use a BOM to indicate UTF encoding).
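A minimal sketch of how this fits with keeping encodings at the endpoints, assuming the source file really is saved as UTF-8 and declared as such; the variable names are illustrative only:

```pike
#charset utf-8
// Sketch only: the literal is decoded to Unicode at compile time,
// and encoding happens explicitly at the I/O boundary.
int main()
{
  string s = "räksmörgås";               // Unicode characters, not UTF-8 bytes

  string(8bit) wire = string_to_utf8(s);  // encode at the output endpoint
  write("%d characters, %d bytes\n", sizeof(s), sizeof(wire));

  string back = utf8_to_string(wire);     // decode at the input endpoint
  write("%s\n", back == s ? "round trip ok" : "mismatch");
  return 0;
}
```

Under that assumption s holds 10 characters, while the UTF-8 encoded form is 13 bytes, since each of the three non-ASCII characters takes two bytes.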
pike-devel@lists.lysator.liu.se