Re: why is pike not using utf-8 internally?

Mirar ＠ Pike developers forum

16 Dec 2006

2:20 p.m.

I think lots of systems are using UTF8 internally because adding support for widestrings is a *lot* of work... and for a lot of string operations it doesn't matter that much, for instance regexps. (But even there you lose the ability to easily match characters that don't fit in ASCII, for instance with "[åäö]", since each of them takes up multiple bytes in the UTF8 string.)
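
To illustrate the "[åäö]" point, here is a minimal sketch in Python (chosen only for illustration; the thread itself is about Pike, and the sample word is an assumption). It runs the same character class once against a decoded string and once against the UTF-8 bytes.

```python
import re

text = "smörgås"
encoded = text.encode("utf-8")

# Decoded string: the class is a set of three characters.
chars_found = re.findall("[åäö]", text)

# UTF-8 bytes: the same class becomes a set of four distinct byte values
# (0xC3, 0xA5, 0xA4, 0xB6), so every match is a single byte.
byte_class = "[åäö]".encode("utf-8")
bytes_found = re.findall(byte_class, encoded)

# False positive: "é" is encoded as 0xC3 0xA9, and 0xC3 is in the byte class.
false_hit = re.search(byte_class, "é".encode("utf-8"))

print(chars_found)            # whole characters
print(bytes_found)            # individual bytes
print(false_hit is not None)  # whether "é" wrongly matched
```

On the byte string the class no longer means "one of these three letters"; it means "one of these four byte values", which is why it both splits characters and matches unrelated ones.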

Johan Sundström (Achtung Liebe!) ＠ Pike (-) developers forum

16 Dec 2006

4:10 p.m.

New subject: why is pike not using utf-8 internally?

...

I think lots of systems are using UTF8 internally because adding support for widestrings is a *lot* of work... and for a lot of string operations it doesn't matter that much, for instance regexps.

I wildly disagree with the regexp example (character classes and UTF8 just don't blend at all, for instance), though you are probably right about your "more effort than we care for" guess.

Mirar ＠ Pike developers forum

4:40 p.m.

Yes, that was my example too (I think), but operations such as separating fields on tabs, numerical sorting (sort -n) and concatenating strings have no real need to decode UTF8.
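
A small Python sketch of the tab-splitting and concatenation case, working purely on bytes. The sample record is made up for illustration. It relies on a property of UTF-8: every byte of a multi-byte sequence is 0x80 or higher, so an ASCII byte such as tab (0x09) can never appear inside an encoded character.

```python
# A tab-separated record, kept as UTF-8 bytes the whole time.
line = "Åsa\tMalmö\t42".encode("utf-8")

# Splitting on the ASCII tab byte cannot cut a multi-byte character in half.
fields = line.split(b"\t")

# Concatenating valid UTF-8 fragments yields valid UTF-8 again.
joined = b"; ".join(fields)

# Decoding happens only here, to show that the result is still well-formed.
print(joined.decode("utf-8"))
```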

In the regexp example you could make classes anyway ({\201å}|{\201ä}|{\201ö}), but I think I'd still rather use a widestring...
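
The alternation workaround can be sketched in Python as follows. This is only an illustration of the idea, not Pike syntax: each alternative is the complete encoded byte sequence of one character, so a match always covers a whole character.

```python
import re

# Build (?:å|ä|ö) where each alternative is a full UTF-8 byte sequence.
alternatives = b"|".join(ch.encode("utf-8") for ch in "åäö")
pattern = re.compile(b"(?:" + alternatives + b")")

encoded = "smörgås".encode("utf-8")
matches = [m.decode("utf-8") for m in pattern.findall(encoded)]

# "é" shares its first byte (0xC3) with å, ä and ö, but not the second one.
false_hit = pattern.search("é".encode("utf-8"))

print(matches)
print(false_hit)
```

It works, but the pattern has to be generated or written out by hand for every class, which is the readability cost discussed in this thread.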

Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) ＠ Pike (-) developers forum

4:50 p.m.

Numerical sorting might be a bad example, as there are plenty of non-ASCII digits in Unicode...
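
A hedged Python sketch of the non-ASCII digit issue. The Arabic-Indic digits are just one example of Unicode decimal digits, picked for illustration. On decoded strings the digits are recognised as numbers; a byte-level check that only knows ASCII digits sees nothing numeric in their UTF-8 encoding.

```python
import re

arabic_42 = "\u0664\u0662"      # ARABIC-INDIC DIGIT FOUR, DIGIT TWO
encoded = arabic_42.encode("utf-8")

# Decoded: int() and \d both understand Unicode decimal digits.
decoded_value = int(arabic_42)
decoded_is_number = re.fullmatch(r"\d+", arabic_42) is not None

# Encoded: in a bytes pattern \d only matches ASCII 0-9.
encoded_is_number = re.fullmatch(rb"\d+", encoded) is not None

# Numeric sort over mixed digit scripts, done on decoded strings.
values = ["10", "\u0669", "2"]  # "\u0669" is ARABIC-INDIC DIGIT NINE
by_value = sorted(values, key=int)

print(decoded_value, decoded_is_number, encoded_is_number)
print(by_value)
```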

Johan Sundström (Achtung Liebe!) ＠ Pike (-) developers forum

5:15 p.m.

I'm thinking more of negated character classes, which you can't do properly on UTF8, unless all the characters you want to exclude are ASCII, a special case in which the character class will still work.
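
A minimal Python sketch of the negated-class problem, with sample strings chosen for illustration. Excluding a non-ASCII character on the byte level excludes its individual bytes, so the class ends up matching fragments of other characters; excluding an ASCII character still behaves sensibly when the class is repeated.

```python
import re

text = "öl"
encoded = text.encode("utf-8")          # b"\xc3\xb6l"

# Decoded: "anything but å" matches the two whole characters.
decoded_hits = re.findall("[^å]", text)

# Encoded: the class now excludes the bytes 0xC3 and 0xA5, so it skips the
# first byte of "ö" and matches its second byte on its own.
byte_class = "[^å]".encode("utf-8")
byte_hits = re.findall(byte_class, encoded)

# ASCII special case: excluding "," keeps every multi-byte character intact.
fields = re.findall(b"[^,]+", "å,ö".encode("utf-8"))
decoded_fields = [f.decode("utf-8") for f in fields]

print(decoded_hits)
print(byte_hits)
print(decoded_fields)
```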

The best single example is probably the regexp ".", which on real strings means "a character", but on a UTF8 string becomes something like "one byte's worth of a character". The UTF8 counterpart for "." isn't nearly as readable, and regexps can be complicated enough even before trying to apply them to some encoded string.
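
To make the "." comparison concrete, here is a Python sketch. The byte-level pattern is a simplified assumption of what a UTF-8 counterpart for "." could look like: it checks lead and continuation byte ranges only, does not reject overlong forms or surrogates, and unlike "." it also matches a newline.

```python
import re

text = "åb€\U0001F600"          # 2-byte, 1-byte, 3-byte and 4-byte characters
encoded = text.encode("utf-8")

# On the decoded string, "." is one character.
chars = re.findall(".", text)

# On the bytes, "." is one byte.
single_bytes = re.findall(b".", encoded)

# Simplified byte-level stand-in for "one character".
UTF8_CHAR = re.compile(
    rb"[\x00-\x7f]"                   # 1 byte: ASCII
    rb"|[\xc2-\xdf][\x80-\xbf]"       # 2 bytes
    rb"|[\xe0-\xef][\x80-\xbf]{2}"    # 3 bytes
    rb"|[\xf0-\xf4][\x80-\xbf]{3}"    # 4 bytes
)
sequences = [m.decode("utf-8") for m in UTF8_CHAR.findall(encoded)]

print(len(chars), len(single_bytes), len(sequences))
```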

Anyway, working with UTF8 encoded data is a leaky abstraction: basic assumptions about how operations work don't hold, so you need a higher level of understanding of which operations still work as they would on proper strings. It's a plentiful source of breakage and bugs that is better avoided, unless you have written a UTF8 {en|de}coder yourself on some occasion and know the pitfalls by heart.
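
Two of the basic assumptions that stop holding, length and slicing, can be shown with a short Python sketch (the sample word is an assumption for illustration):

```python
text = "räksmörgås"
encoded = text.encode("utf-8")

char_count = len(text)      # counts characters
byte_count = len(encoded)   # counts bytes; ä, ö and å take two each

# Slicing the decoded string keeps characters whole.
prefix = text[:2]

# Slicing the bytes at the same index cuts "ä" in half.
broken = encoded[:2]
try:
    broken.decode("utf-8")
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False

print(char_count, byte_count, prefix, cut_is_valid)
```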

pike-devel@lists.lysator.liu.se