I'm considering adding support to the type system for differentiating between strings of different widths. This would be useful for tracking down places where wide strings are passed where narrow strings are expected. The question is what the syntax should be. Please note that the syntax should aim to be 100% backward compatible, so adding new keywords is probably not a good idea.
My suggestion:
string(8)   8-bit (aka narrow)
string(16)  At most 16-bit wide
string(32)  At most 32-bit wide (default)
This would allow for the possible extension of
string(7) 7-bit (aka USASCII)
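A quick sketch of how such declarations might read (hypothetical at this point, since none of it is implemented yet):

    string(8) narrow = "hello";          // fits in 8 bits
    string(16) bmp = narrow + "\x2603";  // at most 16-bit characters
    string(32) wide = bmp;               // widening is always allowed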
Comments?
Yes, but apparently people have chosen to forget the arguments against it which were raised then...
A binary datatype would be fine too, but how would you find non-conflicting type names now?
string(8) works on the technical level (it's easy to check), but it doesn't really work on the semantic level, I agree. But it *does* solve the problem better than nothing... I think. (Unless ->write et al are in big trouble now.)
Well, why not use the syntax string(binary) for a binary string, string(utf8) for one which is UTF-8 encoded, etc.? I.e. you can put any identifier there, and for two such types to match, the identifiers need to be the same.
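For instance (still just a sketch of the hypothetical syntax):

    string(binary) raw;   // raw octets, no particular charset implied
    string(utf8) enc;     // known to be UTF-8-encoded
    // enc = raw;         // would be a type error: the identifiers differ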
Using the current definition of identifiers (which is probably a good idea), it does. That's why I wrote "utf8" and not "utf-8". But iso_8859_1 would work of course. I agree that - looks better, but I'd rather keep the syntax consistent.
What type would write() take in such a scenario? I assume that's given by the locale (LC_CTYPE, or similar) and not known at coding time.
If _typeof(write) ends up being something even more horrible than the corresponding type of map(), I'd argue it would probably do more harm (to debuggability) than good.
(My thinking is of course that function(string(utf8)|string(...all encoding permutations known to Locale.Charset), mixed ... : int) is a seriously useless type, and would choke type system and programmer alike. / _typeof() ate my scrollback)
It would presumably take binary, since that's the best guess that can be made statically (especially since you don't know what the file descriptor is connected to). It would probably be difficult to have Locale.Charset.encoder return anything other than binary too.
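With today's API that would look something like this (a sketch; the string(8) annotation is the hypothetical part, the rest is the existing Locale.Charset idiom):

    string text = "blandsaft";
    // The encoder hands back a byte string, so binary/string(8) is the
    // natural static type for its result:
    string(8) bytes = Locale.Charset.encoder("iso-8859-1")->feed(text)->drain();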
It would be cool if you could declare a Stdio.File(utf8) though. :)
It's the only use case that comes to mind for me where this suggestion would be substantially better than the crude width-based system. It's worth noting that neither proposal precludes the other, though; they are syntax-compatible with one another.
No, I can think of many examples where a contract-based string type could be useful. Enforced character encoding is probably less useful than a subtype devoid of illegal characters, e.g. string(windows_filename), string(xml_cdata), etc. The problem is that none of these strings can be modified without the "type owner" re-checking them to see if they still conform to the string subtype. Any modification could of course clear the subtype, and Pike's shared strings would act as a subtype cache.
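As a sketch of that "type owner" idea in today's Pike (a hypothetical wrapper class, since no such subtype syntax exists):

    // Only this class may bless a string as CDATA-safe; any modified
    // string has to go back through create() to be re-checked.
    class XmlCdata {
      protected string value;
      protected void create(string s) {
        if (has_value(s, "]]>"))
          error("Not valid as XML CDATA.\n");
        value = s;
      }
      string get() { return value; }
    }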
True; those would be useful, with some well thought out supporting tech.
Is there opposition to having both, in some distant future when the more ambitious contract code has presumably been written by someone who cares for both the idea and its implementation, the way Grubba apparently does for string(<integer>)?
I find string(8) next week much more useful than neither variant for months or years from now.
I'm opposed to having string(8) for much the same reason I was opposed to string_width(): It invites abuse. Suddenly someone thinks "Ok, here I need an 8-bit string for this API, let's check if the input is 8-bit and call string_to_utf8() if not!" and you get a really weird function which will arbitrarily encode text as either iso-8859-1 or UTF-8. (Example taken from reality, but no need to mention names.)
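Sketched in code, that abuse looks like this (hypothetical function name; String.width() and string_to_utf8() are the real calls):

    string(8) make_narrow(string s) {
      if (String.width(s) > 8)
        return string_to_utf8(s);  // wide input comes out as UTF-8...
      return s;                    // ...narrow input stays iso-8859-1
    }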
Subtyped strings are only really useful if you can guarantee that a string indeed has that subtype (e.g. doesn't contain NULL). That means that only the "owner" of the subtype should be able to set a string to that subtype.
Isn't it really a set you would want, like string("upper_cased"|"no_nulls")? Or should you throw away/regain type information for calls to functions that accept different types?
Ok, now implemented in Pike 7.7:
typeof("");
(1) Result: string
_typeof("");
(2) Result: string(0)
_typeof("foobar");
(3) Result: string(8)
_typeof("\x2000");
(4) Result: string(16)
_typeof("\x200000");
(5) Result: string
_typeof("") <= _typeof("foo");
(6) Result: 1
_typeof("foo") <= _typeof("foo");
(7) Result: 1
_typeof("foo") <= _typeof("");
(8) Result: 0
Seems to work...
Now it's just a question of strengthening the types of various functions.
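For example, something along these lines (a sketch; exactly which functions get strengthened is still open):

    // write() operates on bytes, so its first argument could be narrowed:
    int write(string(8) fmt, mixed ... args);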
We'll still be able to call write() (etc.) with data not typed more strictly than "string" when the run-time type is compatible with string(8), I hope?
I'm against this. The proposal seems to be guided by the implementation rather than by design. The question is: why are these functions "expecting narrow strings"? Isn't it because they operate on binary data rather than text strings? Or, alternatively, on text strings encoded in some particular transport encoding? What we should be able to declare is that "this function takes text" or "this function takes binary data", both of which are currently covered by the type "string". Being able to declare that something takes "string, but no values larger than 255" still doesn't make this distinction. And additionally being able to make other arbitrary restrictions like "no values larger than 4095" (string(12)) is just silly.
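With hypothetical type names, the distinction argued for here would read something like:

    void emit_text(text t);      // any text string, full Unicode range
    void emit_raw(binary data);  // raw octets, no charset implied
    // string(8) captures neither intent: it admits "text that happens
    // to be narrow" just as readily as genuine binary data.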
I'm for re-adding the "buffer" datatype from LPC, but this proposal is bad.
I agree here. string(8) should really be called 'buffer' or 'data' or something similar.
What's the use of string(16)?
UTF-16-encoded data is typically string(16), isn't it? Isn't that what Windows is using for N things?
Well, no, not really (UTF-16, that is), since it's byte-order dependent.
string(16) is a string with 16-bit integers in it, aka the lower half of Unicode.
It's an internal distinction that's not really important.
utf-16-be and utf-16-le would both be encoded as string(8), presumably.
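For example, with the existing builtin (string_to_unicode() emits big-endian UTF-16 as bytes; the string(8) annotation is the hypothetical part):

    string wide = "\x2603";                       // one 16-bit character
    string(8) utf16be = string_to_unicode(wide);  // two 8-bit bytes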
So that means that, more or less, we have string(binary) and string(wide)?
Where wide means "internal, unencoded" and binary is a subset of wide?
Well. Sort of.
The internal strings in Pike are supposed to be Unicode at all times.
Binary data has a rather undefined charset.
I guess what I am hinting at is that a real 'binary' (or 'buffer' or 'data' or whatever) type (with a lot of compatibility for old programs) would be really useful.
Yes. Sharing is rather optional; it would not really hurt all that much if it's shared (see also: performance of String.Buffer vs. just string), but it most likely won't help all that much either, unless it's still string(8) internally, which would keep the internal code changes to a minimum.
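(For reference, the String.Buffer pattern alluded to above, in current Pike:)

    String.Buffer buf = String.Buffer();
    buf->add("chunk one, ");
    buf->add("chunk two");
    string result = buf->get();  // an ordinary shared string from here on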
The first two functions I looked at for changing from string to string(x) both lacked proper wide-string checks...