On Wed, Nov 03, 2004 at 06:15:01AM +0100, Martin Nilsson (DivX Networks) @ Pike (-) developers forum wrote:
first. The 8-bit string "ä" is utf-8 encoded as "Ã¤", so if string_to_utf8 were altered to not encode "ä" it would perform incorrectly.
I can't imagine anyone trying to encode 8-bit strings to UTF-8, since 8-bit strings usually contain characters in some character set or code page, and so must be converted to UTF-8 in some other way (except, of course, when they are binary).
Well, let's assume that there is some (remote) application which expects a UTF-8 (hence 8-bit) string as input. The user passes a UTF-8 string to Pike, Pike applies string_to_utf8(), and the remote application gets this double-encoded UTF-8 string, but it performs _no_ extra decoding (since plain UTF-8 is what it expects). So, is this correct? Or, to illustrate:
utf8_to_string(string_to_utf8(string_to_utf8("\x1234\x4321")));

You see what I mean? The result of utf8_to_string() is _wrong_: it is still a UTF-8 encoded 8-bit string, not the original wide string.
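To make the double-encoding effect concrete, here is a minimal sketch in Python rather than Pike. The two helper functions are stand-ins I wrote for this illustration, not Pike's real implementations, and they assume that a Pike 8-bit string can be modelled as a Python str whose code points are all below 256.

```python
# Hypothetical Python stand-ins for Pike's string_to_utf8() / utf8_to_string().
# Assumption: an 8-bit string is modelled as a str with all code points < 256.

def string_to_utf8(s: str) -> str:
    """Encode s as UTF-8 and return the resulting bytes as an 8-bit string."""
    return s.encode("utf-8").decode("latin-1")

def utf8_to_string(s: str) -> str:
    """Decode an 8-bit string holding UTF-8 bytes back into a wide string."""
    return s.encode("latin-1").decode("utf-8")

wide = "\u1234\u4321"            # the wide string from the example
once = string_to_utf8(wide)      # what the user already has: valid UTF-8
twice = string_to_utf8(once)     # what results when it is encoded again

decoded = utf8_to_string(twice)  # a single decode, as a reader would do

print(decoded == wide)                  # one decode does not restore the text
print(decoded == once)                  # it only undoes the second encoding
print(utf8_to_string(decoded) == wide)  # a second decode is needed
```

A receiver that expects plain UTF-8 never performs that second decode, which is the whole problem.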
you pass a utf8-encoded string to big_query it will be encoded a second time only to be decoded directly on the receiving side and be stored as the original utf-8 string in the database.
Huh? SQLite expects UTF-8: "The only difference between them is that the second argument, specifying the SQL statement to compile, is assumed to be encoded in UTF-8 for the sqlite3_prepare() function and UTF-16 for sqlite3_prepare16()."
And more: "In the current implementation of SQLite, the SQL parser only works with UTF-8 text. So if you supply UTF-16 text it will be converted. This is just an implementation issue and there is nothing to prevent future versions of SQLite from parsing UTF-16 encoded SQL natively."
And even more: "SQLite is not particular about the text it receives and is more than happy to process text strings that are not normalized or even well-formed UTF-8 or UTF-16. Thus, programmers who want to store ISO8859 data can do so using the UTF-8 interfaces. As long as no attempts are made to use a UTF-16 collating sequence or SQL function, the byte sequence of the text will not be modified in any way."
So, when you pass a double-encoded UTF-8 string to SQLite, it will be stored "as is". Conversion is performed if (and only if) the database encoding (UTF-8) differs from the one you used (UTF-16, for instance); in all other cases it is not done. So once a string is double-encoded, anything that accesses the SQLite database will get incorrect data (while expecting UTF-8) - see my illustration above.
When another application reads the string it will again be utf-8 encoded by sqlite and returned to the user utf8-encoded twice.
No way - see above. It will not be converted unless you use the sqlite3_*16() functions - it will be passed as is (in the current version of SQLite, at least).
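The "stored as is" behaviour can be checked from outside Pike. The following is only a sketch using Python's standard sqlite3 module, with a made-up table t and column v; it assumes the default UTF-8 database encoding. It relies on Python binding a str as UTF-8 TEXT, so passing the Latin-1 view of already-encoded bytes reproduces the second encoding step.

```python
import sqlite3

wide = "\u1234\u4321"
utf8_bytes = wide.encode("utf-8")            # what the user already has
eight_bit = utf8_bytes.decode("latin-1")     # the same bytes seen as an 8-bit string
double_bytes = eight_bit.encode("utf-8")     # result of encoding it a second time

conn = sqlite3.connect(":memory:")           # default database encoding is UTF-8
conn.execute("CREATE TABLE t (v TEXT)")      # hypothetical table for this sketch
# Binding a str stores its UTF-8 encoding, i.e. exactly double_bytes.
conn.execute("INSERT INTO t VALUES (?)", (eight_bit,))

# CAST(... AS BLOB) exposes the stored bytes without any text decoding.
stored, kind = conn.execute("SELECT CAST(v AS BLOB), typeof(v) FROM t").fetchone()

print(kind)                                  # storage class of the column value
print(stored == double_bytes)                # SQLite kept the bytes unchanged
print(stored.decode("utf-8") == wide)        # a UTF-8 reader does not get the text back
```

Any other application reading this column as UTF-8 gets the 8-bit view of the encoded bytes, not the two characters that were meant to be stored.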
If that is not what you want, don't utf-8 encode the string. Note that for the %s API, 8-bit strings are stored as BLOBs unencoded, but they will still be utf-8 encoded if read as text.
This differs from what I see in the code. If the column type is BLOB, it will not be converted from UTF-8:
if( sqlite3_column_type(stmt, i)==SQLITE_TEXT ) f_utf8_to_string(1);
And the column type will be BLOB if: 1) I use bound values; 2) those are 8-bit strings. Alternatively, 1) can be avoided by using the X'123456' syntax in the query string.
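The SQLite side of this can be observed with typeof(). This is a sketch in Python, not in Pike: Python binds bytes as BLOB and str as TEXT, which only approximates the rule described above for Pike's bound 8-bit strings. The storage_type helper is a name invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def storage_type(sql, params=()):
    """Return the first column of the first row, here a typeof() result."""
    return conn.execute(sql, params).fetchone()[0]

# A bound text value arrives as TEXT, a bound byte string as BLOB.
bound_text = storage_type("SELECT typeof(?)", ("abc",))
bound_blob = storage_type("SELECT typeof(?)", (b"\x12\x34\x56",))

# The X'...' literal yields a BLOB without any bound value.
literal_blob = storage_type("SELECT typeof(X'123456')")

print(bound_text, bound_blob, literal_blob)
```

Only values that come back with the TEXT type would hit the f_utf8_to_string() call quoted above; BLOB values are returned untouched.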
BTW, just finished some stuff - it took me a long time to figure out that in SQLite/testsuite.in some "illegal" characters were not correctly processed: ({ "ble", "ble" }). Maybe this is something that was intended to work with the implicit and unconditional conversion of 8-bit strings to utf8, but (again, see above) this is not quite the correct way to go.
Actually, this implicit conversion prevents anybody from using UTF-8 encoded strings when working with the SQLite module (it is the only module which does this type of conversion) - the user is forced to use Pike's wide strings to get correct UTF-8 encoding, which is not always a good idea (the user may already have prepared UTF-8 strings, say, from external sources). It also prevents the use of the native ISO-8859-1 charset (characters with the high bit set) in SQLite if the database is accessed not only from Pike.
Regards, /Al