On Wed, Nov 03, 2004 at 11:15:02AM +0100, Mirar @ Pike developers forum wrote:
Not at all. UTF-8 was made to encode 8-bit characters as well as 16-bit.
There is little (if any) sense in encoding 8-bit values into another 8-bit representation, expanding the string size along the way, don't you think?
There is no way to distinguish between an 8-bit wide string and a UTF-8-encoded string.
That's why the decision about conversion should be left to the application/user.
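To illustrate the ambiguity, a minimal sketch (the literal values are made up for the example):

  // Both of these are plain 8-bit strings as far as Pike is concerned:
  string raw = "\xe1\x88\xb4";            // three arbitrary 8-bit characters
  string enc = string_to_utf8("\x1234");  // UTF-8 encoding of U+1234
  write("%d %d\n", String.width(raw), String.width(enc));  // prints "8 8"
  write("%O\n", raw == enc);              // prints 1 - byte-for-byte identical

Nothing in the string itself tells you whether it should be decoded or left alone.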
Note that the following must *always* be true:
| str == utf8_to_string(string_to_utf8(str));
... unless str is _already_ UTF-8 encoded and contains character codes > 0x7f.
string_to_utf8() assumes that either a) str is 16- or 32-bit wide, or
b) it contains 7-bit characters only; if not, it won't work as expected/intended.
Try:
  str = string_to_utf8("\x1234\x1234");
  str = utf8_to_string(string_to_utf8(str));

What will be in str? "\x1234\x1234"? Wrong. Try it :) That's exactly what is happening in SQLite, BTW.
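For the record, the same round trip with the intermediate values spelled out (the byte values simply follow from the standard UTF-8 encoding of U+1234):

  str = string_to_utf8("\x1234\x1234");
  // str is now the 8-bit string "\xe1\x88\xb4\xe1\x88\xb4".
  str = utf8_to_string(string_to_utf8(str));
  // The inner string_to_utf8() re-encodes every byte > 0x7f as two bytes,
  // and utf8_to_string() only undoes that second encoding, so str is still
  // "\xe1\x88\xb4\xe1\x88\xb4" - not the original "\x1234\x1234".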
If Sqlite doesn't work, fix Sqlite or the glue to it.
It does work - as advertised. Sqlite just assumes that _any_ string is (probably) UTF-8, i.e. it performs no conversions itself, so it makes little sense (and even creates problems) when the conversion is done implicitly.
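For illustration, this is roughly what leaving the conversion to the application would look like (a minimal sketch; `db' is a hypothetical Sql connection object whose glue passes strings through unmodified, and the table/column names are made up):

  string wide = "\x1234\x1234";
  // The application encodes exactly once, right before handing data to SQLite:
  db->query("INSERT INTO t (v) VALUES (%s)", string_to_utf8(wide));
  // ... and decodes exactly once when reading it back:
  array rows = db->query("SELECT v FROM t");
  string back = utf8_to_string(rows[0]->v);
  // back == wide, and nothing was encoded twice.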
Fixing the glue is not a problem - but before I commit the changes I would like to be sure that nobody will be hurt, and I would like to understand why it is done the way it is now (so far it seems to me that it was a mistake or a misunderstanding of the documentation).
Regards, /Al