I'm considering adding support to the type system for differentiating between strings of different widths. This would be useful for tracking down places where wide strings are passed where narrow strings are expected. The question is what the syntax should be. Please note that the syntax should aim to be 100% backward compatible, so adding new keywords is probably not a good idea.
My suggestion:
string(8)   8-bit (aka narrow)
string(16)  At most 16-bit wide
string(32)  At most 32-bit wide (default)
This would allow for the possible extension of
string(7) 7-bit (aka USASCII)
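A quick sketch of how such declarations might read (hypothetical at this point, since none of it is implemented yet):

    string(8) narrow = "hello";          // fits in 8 bits
    string(16) bmp = narrow + "\x2603";  // at most 16-bit characters
    string(32) wide = bmp;               // widening is always allowed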
Comments?
Yes, but apparently people have chosen to forget the arguments against it which were raised then...
A binary datatype would be fine too, but how would you find non-conflicting type names now?
string(8) works on the technical level (it's easy to check), but it doesn't really work on the semantic level, I agree. But it *does* solve the problem better than nothing... I think. (Unless ->write et al are in big trouble now.)
Well, why not use the syntax string(binary) for a binary string, string(utf8) for one which is UTF-8 encoded, etc.? I.e. you can put any identifier there, and for two such types to match, the identifiers need to be the same.
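For instance (still just a sketch of the hypothetical syntax):

    string(binary) raw;   // raw octets, no particular charset implied
    string(utf8) enc;     // known to be UTF-8-encoded
    // enc = raw;         // would be a type error: the identifiers differ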
Using the current definition of identifiers (which is probably a good idea), it does. That's why I wrote "utf8" and not "utf-8". But iso_8859_1 would work of course. I agree that - looks better, but I'd rather keep the syntax consistent.
What type would write() take in such a scenario? I assume that's given by the locale (LC_CTYPE, or similar) and not known at coding time.
If _typeof(write) ends up being something even more horrible than the corresponding type of map(), I'd argue it would probably do more harm (to debuggability) than good.
(My thinking is of course that function(string(utf8)|string(...all encoding permutations known to Locale.Charset), mixed ... : int) is a seriously useless type, and would choke type system and programmer alike. / _typeof() ate my scrollback)
It would presumably take binary, since that's the best guess that can be made statically (especially since you don't know what the file descriptor is connected to). It would probably be difficult to have Locale.Charset.encoder return anything other than binary too.
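With today's API that would look something like this (a sketch; the string(8) annotation is the hypothetical part, the rest is the existing Locale.Charset idiom):

    string text = "blandsaft";
    // The encoder hands back a byte string, so binary/string(8) is the
    // natural static type for its result:
    string(8) bytes = Locale.Charset.encoder("iso-8859-1")->feed(text)->drain();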
It would be cool if you could declare a Stdio.File(utf8) though. :)
It's the only use case that comes to mind for me where this suggestion would be substantially better than the crude width-based system. It's worth noting that neither proposal precludes the other, though; they are syntax-compatible with one another.
No, I can think of many examples where a contract-based string type could be useful. Enforced character encoding is probably less useful than a subtype devoid of illegal characters, e.g. string(windows_filename), string(xml_cdata), etc. The problem is that none of these strings can be modified without the "type owner" re-checking them to see if they still conform to the string subtype. Any modification could of course clear the subtype, and Pike's shared strings would act as a subtype cache.
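As a sketch of that "type owner" idea in today's Pike (a hypothetical wrapper class, since no such subtype syntax exists):

    // Only this class may bless a string as CDATA-safe; any modified
    // string has to go back through create() to be re-checked.
    class XmlCdata {
      protected string value;
      protected void create(string s) {
        if (has_value(s, "]]>"))
          error("Not valid as XML CDATA.\n");
        value = s;
      }
      string get() { return value; }
    }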
True; those would be useful, with some well thought out supporting tech.
Is there opposition to having both, in some distant future when the more ambitious contract code has presumably been written by someone who cares for both the idea and its implementation, the way Grubba apparently does for string(<integer>)?
I find string(8) next week much more useful than neither variant for months or years from now.
I'm opposed to having string(8) for much the same reason I was opposed to string_width(): It invites abuse. Suddenly someone thinks "Ok, here I need an 8-bit string for this API, let's check if the input is 8-bit and call string_to_utf8() if not!" and you get a really weird function which will arbitrarily encode text as either iso-8859-1 or UTF-8. (Example taken from reality, but no need to mention names.)
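Sketched in code, that abuse looks like this (hypothetical function name; String.width() and string_to_utf8() are the real calls):

    string(8) make_narrow(string s) {
      if (String.width(s) > 8)
        return string_to_utf8(s);  // wide input comes out as UTF-8...
      return s;                    // ...narrow input stays iso-8859-1
    }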
Subtyped strings are only really useful if you can guarantee that a string indeed has that subtype (e.g. doesn't contain NULL). That means that only the "owner" of the subtype should be able to set a string to that subtype.
Isn't it really a set you would want, like string("upper_cased"|"no_nulls")? Or should you throw away/regain type information for calls to functions that accept different types?
Ok, now implemented in Pike 7.7:
typeof("");
(1) Result: string
_typeof("");
(2) Result: string(0)
_typeof("foobar");
(3) Result: string(8)
_typeof("\x2000");
(4) Result: string(16)
_typeof("\x200000");
(5) Result: string
_typeof("") <= _typeof("foo");
(6) Result: 1
_typeof("foo") <= _typeof("foo");
(7) Result: 1
_typeof("foo") <= _typeof("");
(8) Result: 0
Seems to work...
Now it's just a question of strengthening the types of various functions.
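For example, something along these lines (a sketch; exactly which functions get strengthened is still open):

    // write() operates on bytes, so its first argument could be narrowed:
    int write(string(8) fmt, mixed ... args);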
We'll still be able to call write() (etc.) with data not typed more strictly than "string" when the run-time type is compatible with string(8), I hope?
I'm against this. The proposal seems to be guided by the implementation rather than by design. The question is: why are these functions "expecting narrow strings"? Isn't it because they operate on binary data rather than text strings? Or, alternatively, on text strings encoded in some particular transport encoding? What we should be able to declare is that "this function takes text" or "this function takes binary data", both of which are currently covered by the type "string". Being able to declare that something takes "string, but no values larger than 255" still doesn't make this distinction. And additionally being able to make other arbitrary restrictions like "no values larger than 4095" (string(12)) is just silly.
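With hypothetical type names, the distinction argued for here would read something like:

    void emit_text(text t);      // any text string, full Unicode range
    void emit_raw(binary data);  // raw octets, no charset implied
    // string(8) captures neither intent: it admits "text that happens
    // to be narrow" just as readily as genuine binary data.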
I'm for re-adding the "buffer" datatype from LPC, but this proposal is bad.
I agree here. string(8) should really be called 'buffer' or 'data' or something similar.
What's the use of string(16)?
UTF-16-encoded data is typically string(16), isn't it? Isn't that what Windows is using for N things?
Well, no, not really (UTF-16, that is), since it's byte-order dependent.
string(16) is a string with 16-bit integers in it, aka the lower half of Unicode.
It's an internal distinction that's not really important.
utf-16-be and utf-16-le would both be encoded as string(8), presumably.
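For example, with the existing builtin (string_to_unicode() emits big-endian UTF-16 as bytes; the string(8) annotation is the hypothetical part):

    string wide = "\x2603";                       // one 16-bit character
    string(8) utf16be = string_to_unicode(wide);  // two 8-bit bytes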
So that means that, more or less, we have string(binary) and string(wide)?
Where wide means "internal, unencoded" and binary is a subset of wide?
Well. Sort of.
The internal strings in Pike are supposed to be Unicode at all times.
Binary data has a rather undefined charset.
I guess what I am hinting at is that a real 'binary' (or 'buffer' or 'data' or whatever) type (with a lot of compatibility for old programs) would be really useful.
Yes. Sharing is rather optional; it would not really hurt all that much if it's shared (see also: performance of String.Buffer vs. just string), but it most likely won't help all that much either, unless it's still string(8) internally, which would keep the internal code changes to a minimum.
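(For reference, the String.Buffer pattern alluded to above, in current Pike:)

    String.Buffer buf = String.Buffer();
    buf->add("chunk one, ");
    buf->add("chunk two");
    string result = buf->get();  // an ordinary shared string from here on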
The first two functions I looked at for changing from string to string(x) both lacked proper wide-string checks...