I think lots of systems use UTF8 internally because adding support for widestrings is a *lot* of work... and for a lot of string operations, regexps for instance, it doesn't matter that much. (But even there you lose the ability to easily match a set of characters that don't fit in ASCII, like "[åäö]", since each of them takes up multiple bytes in the UTF8 string.)
I wildly disagree with the regexp example (character classes and UTF8 just don't blend at all, for instance), though you are probably right about your "more effort than we care for" guess.
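To make the character class point concrete, here is a small sketch in Python (not Pike; the sample word and variable names are made up for illustration). It applies the same class once to a decoded string and once to its UTF8 bytes.

```python
import re

text = "sm\u00f6rg\u00e5s"             # "smörgås" as a decoded string
raw = text.encode("utf-8")             # the same text as UTF8 bytes

# On the decoded string the class matches whole characters.
decoded_hits = re.findall("[\u00e5\u00e4\u00f6]", text)

# On UTF8 bytes the class degrades to a set of single bytes
# (C3, A5, A4, B6), so every hit is only part of a character.
byte_class = b"[" + "\u00e5\u00e4\u00f6".encode("utf-8") + b"]"
byte_hits = re.findall(byte_class, raw)

# False positive: "é" is encoded as C3 A9 and shares the lead byte C3.
false_positive = re.search(byte_class, "\u00e9".encode("utf-8")) is not None
```

The decoded version finds the two characters ö and å; the byte version finds four separate bytes and also fires on "é", which is not in the class at all.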
Yes, that was my example too (I think), but operations such as separating fields on tabs, numerical sorting (sort -n), and concatenating strings have no real need to decode the UTF8.
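A sketch of why tab splitting and concatenation are safe on undecoded data, again in Python with made-up sample data: every byte of a multi-byte UTF8 sequence is 0x80 or above, so an ASCII byte such as tab (0x09) can never occur inside an encoded character.

```python
# A tab-separated record, kept as UTF8 bytes throughout.
row = "Malm\u00f6\t42\tk\u00f6ttbullar".encode("utf-8")

# Splitting on the ASCII tab byte never cuts through a character.
fields = row.split(b"\t")

# Each field is still valid UTF8 and decodes on its own.
decoded = [f.decode("utf-8") for f in fields]

# Concatenating valid UTF8 pieces yields valid UTF8 again.
joined = b"\t".join(fields)
```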
In the regexp example you could make classes anyway ({\201å}|{\201ä}| {\201ö}), but I think I'd still rather use a widestring...
Numerical sorting might be a bad example, as there are plenty of non-ASCII digits in Unicode...
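For what it's worth, a quick Python illustration of the non-ASCII digit point (the Arabic-Indic digits here are just one arbitrary choice): a decoded string can be recognised and parsed as a number, while a byte-level digit check on the UTF8 data sees nothing numeric.

```python
digits = "\u0664\u0662"      # ARABIC-INDIC DIGIT FOUR, ARABIC-INDIC DIGIT TWO

# Working on the decoded string: recognised as digits and parsed as 42.
text_is_digit = digits.isdigit()
value = int(digits)

# Working on the UTF8 bytes: bytes.isdigit() only knows ASCII 0-9.
bytes_is_digit = digits.encode("utf-8").isdigit()
```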
I'm thinking more about negated character classes, which you can't do properly on UTF8 unless all the characters you want to exclude are ASCII, a special case in which the character class still works.
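The negated class problem can be sketched like this in Python (sample strings are invented for the example). Excluding å, ä and ö at byte level really excludes the bytes C3, A5, A4 and B6, so an unrelated character such as "é" (C3 A9) is matched only by its trailing byte.

```python
import re

excluded = "\u00e5\u00e4\u00f6"            # å ä ö
sample = "\u00e9"                          # é, encoded as C3 A9

# Decoded string: "é" is not excluded, so it matches as one character.
text_hits = re.findall("[^" + excluded + "]", sample)

# UTF8 bytes: C3 is excluded, A9 is not, so only half of "é" matches.
neg_bytes = b"[^" + excluded.encode("utf-8") + b"]"
byte_hits = re.findall(neg_bytes, sample.encode("utf-8"))

# The ASCII-only special case: excluding tab still gives whole fields.
ascii_hits = re.findall(rb"[^\t]+", "k\u00f6tt\tbullar".encode("utf-8"))
```

The single byte returned in `byte_hits` is not valid UTF8 on its own, whereas the pieces in `ascii_hits` decode cleanly.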
The best single example is probably the regexp ".", which on real strings means "a character", but on a UTF8 string becomes something like "some number of bits of a character". The UTF8 counterpart for "." isn't nearly as readable and regexps can be complicated enough even before trying to apply them to some encoded string.
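A rough Python sketch of the "." case. The byte-level pattern below is a simplified assumption of what a UTF8 counterpart could look like: it does not reject overlong forms or surrogates, and unlike "." it also matches newline.

```python
import re

word = "\u00e5r"                  # "år"
raw = word.encode("utf-8")        # bytes C3 A5 72

# "." on a decoded string matches one character.
dot_on_text = re.match(".", word).group()

# "." on UTF8 bytes matches one byte, i.e. only part of "å".
dot_on_bytes = re.match(b".", raw).group()

# Simplified byte-level stand-in for "one encoded character".
utf8_char = re.compile(
    rb"[\x00-\x7f]"                    # 1 byte: ASCII
    rb"|[\xc2-\xdf][\x80-\xbf]"        # 2 bytes
    rb"|[\xe0-\xef][\x80-\xbf]{2}"     # 3 bytes
    rb"|[\xf0-\xf4][\x80-\xbf]{3}"     # 4 bytes
)
chars = utf8_char.findall(raw)
```

Even in this simplified form the replacement for a one-character pattern runs to four alternatives, which is the readability cost being described.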
Anyway, working with UTF8 encoded data is a leaky abstraction: basic assumptions about how operations work don't hold, so you need a deeper understanding of which operations behave as they would on proper strings. It's a plentiful source of breakage and bugs that is better avoided, unless you have written a UTF8 {en|de}coder yourself at some point and know the pitfalls by heart.
pike-devel@lists.lysator.liu.se