Hi,
What should be the correct behavior of string_to_utf8() when the source string is not a wide string? Currently it just converts all 8-bit values to their UTF-8 representation, which is not (really) the Right Thing (tm) - imagine a UTF-8 string being passed in... It will be scrambled... Yes, I know, no one would do this intentionally, but "to err is human" - one live example is the SQLite module, where this conversion is performed by default, so a UTF-8 string may (and eventually will) be passed to SQLite, will be scrambled, and... :)
Maybe it makes sense to do nothing (return the source string) when the source is an 8-bit string? Or (at least) check that the source is a valid UTF-8 stream (I wouldn't choose this way, though)? Regards, /Al
Converting 8-bit strings to utf8 is very much what string_to_utf8() should do. How else would you convert them into utf8-strings?
On Wed, Nov 03, 2004 at 04:20:02AM +0100, Martin Nilsson (DivX Networks) @ Pike (-) developers forum wrote:
Converting 8-bit strings to utf8 is very much what string_to_utf8() should do. How else would you convert them into utf8-strings?
What about the case when the string is already utf8? utf8 is 8-bit by definition, so there is no way to convert utf8 to utf8, or do I miss something?
Regards, /Al
If I UTF8-encode a string three times and UTF8-decode a string three times I expect to get the original string.
On Wed, Nov 03, 2004 at 04:50:02AM +0100, Martin Nilsson (DivX Networks) @ Pike (-) developers forum wrote:
If I UTF8-encode a string three times and UTF8-decode a string three times I expect to get the original string.
Right, but if (in the case of SQLite, for example) I pass a UTF8 string to big_query(), it will be encoded a second time, so the value in the database will be incorrect (and longer than the original) - to an external application.
If I use bound parameters, everything will be OK (no conversion is done for 8-bit strings), but the query text itself may be a UTF8 string, and will be converted by the unconditional call to f_string_to_utf8().
Basically, any call to f_string_to_utf8() will scramble an existing UTF8 encoding, so it will (obviously) be decoded correctly only if Pike alone is used (which will always decode it); however, anything other than Pike, reading values stored by Pike, may fail.
Example:
  query = string_to_utf8("INSERT INTO x VALUES('\x1234')");
  ... 100 lines later ...
  big_query(query)->fetch_row();
So now we have an incorrectly encoded value in the table - an external application will read it and make no conversion, since it is expected to be in UTF8 already. It will (still) be correctly encoded, but not something that makes any sense.
Regards, /Al
Now you are on a different topic, so let's finish the previous one first. The 8-bit string "ä" is utf-8 encoded as "Ã¤", so if string_to_utf8 were altered to not encode "ä" it would perform incorrectly.
Over to SQLite. The SQLite interface takes utf-8 encoded strings. So if you pass a utf8-encoded string to big_query it will be encoded a second time only to be decoded directly on the receiving side and be stored as the original utf-8 string in the database. When another application reads the string it will again be utf-8 encoded by sqlite and returned to the user utf8-encoded twice. If that is not what you want, don't utf-8 encode the string. Note that for the %s API, 8-bit strings are stored as BLOBs unencoded, but they will still be utf-8 encoded if read as text.
On Wed, Nov 03, 2004 at 06:15:01AM +0100, Martin Nilsson (DivX Networks) @ Pike (-) developers forum wrote:
first. The 8-bit string "ä" is utf-8 encoded as "Ã¤", so if string_to_utf8 were altered to not encode "ä" it would perform incorrectly.
I couldn't imagine that anyone would try to encode 8-bit strings to UTF-8 (since 8-bit strings usually contain characters in some character set, code page, whatever - so those must be converted to UTF-8 in some other way, except, of course, when they are binary).
Well, let's assume that there is some (remote) application which expects a UTF-8 (hence, 8-bit) string on input. The user passes a UTF-8 string to Pike, Pike applies string_to_utf8(), and the remote application gets this double-encoded UTF-8 string but performs _no_ decoding from UTF-8 to UTF-8 (since UTF-8 is expected). So, is this correct? Or, to illustrate:
  utf8_to_string(string_to_utf8(string_to_utf8("\x1234\x4321")));
You see what I mean? The result of utf8_to_string() is _wrong_.
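To spell out the intermediate values (a small sketch; the byte values are just the standard UTF-8 encodings of U+1234 and U+4321):

| string orig  = "\x1234\x4321";          // wide string, 2 characters
| string once  = string_to_utf8(orig);    // "\xe1\x88\xb4\xe4\x8c\xa1" - 6 bytes
| string twice = string_to_utf8(once);    // 12 bytes - every byte >= 0x80 is encoded again
| utf8_to_string(twice) == once;          // true  - one decode only peels off the outer layer
| utf8_to_string(twice) == orig;          // false - the result is still UTF-8 encoded

A single utf8_to_string() on the receiving side gives back the already-encoded string, not the original characters.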
you pass a utf8-encoded string to big_query it will be encoded a second time only to be decoded directly on the receiving side and be stored as the origianl utf-8 string in the databse.
Huh? SQLite expects UTF-8: "The only difference between them is that the second argument, specifying the SQL statement to compile, is assumed to be encoded in UTF-8 for the sqlite3_prepare() function and UTF-16 for sqlite3_prepare16()."
And more: "In the current implementation of SQLite, the SQL parser only works with UTF-8 text. So if you supply UTF-16 text it will be converted. This is just an implementation issue and there is nothing to prevent future versions of SQLite from parsing UTF-16 encoded SQL natively."
And even more: "SQLite is not particular about the text it receives and is more than happy to process text strings that are not normalized or even well-formed UTF-8 or UTF-16. Thus, programmers who want to store ISO8859 data can do so using the UTF-8 interfaces. As long as no attempts are made to use a UTF-16 collating sequence or SQL function, the byte sequence of the text will not be modified in any way."
So, when you pass a double-encoded UTF-8 string to SQLite, it will be stored "as is" - conversion will be performed if (and only if) the database encoding (UTF-8) differs from the one that you used (UTF-16, for instance); in all other cases it will not be done, so once a string is double-encoded, anything that accesses the sqlite database will get incorrect data (expecting it to be UTF-8) - see my illustration above.
When another application reads the string it will again be utf-8 encoded by sqlite and returned to the user utf8-encoded twice.
No way - see above. It will not be converted unless you use sqlite3_*16() functions - it will be passed as is (in current version of SQLite, at least).
If that is not what you want, don't utf-8 encode the string. Note that for the %s API, 8-bit strings are stored as BLOBs unencoded, but they will still be utf-8 encoded if read as text.
This differs from what I see in the code. If column type is BLOB, it will not be converted from UTF-8:
if( sqlite3_column_type(stmt, i)==SQLITE_TEXT ) f_utf8_to_string(1);
And the column type will be blob if: 1) I use bound values; and 2) those are 8-bit strings. Alternatively, 1) could be avoided by using the X'123456' syntax in the query string.
BTW, just finished some stuff - it took me a long time to figure out that in SQLite/testsuite.in some "illegal" characters were not correctly processed: ({ "ble", "ble" }). Maybe this is something that was intended to work with the implicit and unconditional conversion of 8-bit strings to utf8, but (again, see above) this is not quite the correct way to go.
Actually, this implicit conversion prevents anybody from using UTF-8 encoded strings when working with the SQLite module (this is the only module which does this type of conversion) - the user is forced to use Pike's wide strings to get correct UTF-8 encoding, which is not always a good idea (the user may already have prepared UTF-8 strings, say, from external sources). It also prevents using the native ISO-8859-1 charset in SQLite (with the high bit set) if the database is accessed not only from Pike.
Regards, /Al
All strings are widestrings. 8-bit strings are just a tiny bit less wide.
Even the result from string_to_utf8 is a widestring. It's up to you to remember that it's been encoded.
On Wed, Nov 03, 2004 at 07:10:02AM +0100, Mirar @ Pike developers forum wrote:
It's up to you to remember that it's been encoded.
Well, this is exactly my point :) I (and I guess I am not the only one) don't expect any implicit conversions behind the scenes, especially when they cannot be (easily) avoided (or controlled), are not necessary, and lead to interoperability problems.
Regards, /Al
But that sounds like a problem with SQLite (was it?), not string_to_utf8.
If a particular library throws around the conversion, you just have to document what exactly it does, if you can't generalize it...
On Wed, Nov 03, 2004 at 10:40:05AM +0100, Mirar @ Pike developers forum wrote:
But that sounds like a problem with SQLite (was it?), not string_to_utf8.
Not really, it is just the SQLite module that uses string_to_utf8() implicitly. Just to note - MySql also supports UTF-8, but no implicit conversion is done there; strings are passed as is.
And the (current) behavior of string_to_utf8() may still cause problems - once it is used somewhere else. AFAIK, UTF-8 encoding was not intended to encode 8-bit wide characters (this simply makes no sense), so when the argument is an 8-bit wide string, nothing should be done (well, at most - check that the input is a valid UTF-8 stream) - this seems logical, or?
If a particular library throws around the conversion, you just have to document what exactly it does, if you can't generalize it...
The problem is (in the particular case of the SQLite module) that with the implicit conversion in place it will not be possible to use encodings other than UTF-8 (while the library allows it).
What is worse, it is required to use 16- or 32-bit wide strings to store a UTF-8 string into the database - i.e., any external UTF-8 string (user input, for instance) which should be passed to sqlite must be converted to a 16- or 32-bit Pike string, and only then passed to the SQLite functions (where it will again be converted to UTF-8) - otherwise the conversion will scramble it.
... just grepped through sources - there are very few places where utf8 conversions are performed - SQLite, PCRE, some xml stuff and (naturally) charset handling modules.
I am not against conversion, but I strongly believe that any conversion should be controlled by the user (application). Implicit conversion (unless it is unobtrusive - which is not the case) is Very Bad Thing (tm)...
NB: This all is not only a theory - I've a real application which cannot use SQLite "as is", i.e. with this conversion. I can use MySql or Informix without any problems, though - just wanted to get rid of it...
Regards, /Al
UTF-8 stream) - this seems logical, or?
Not at all. UTF-8 was made to encode 8-bit characters as well as 16-bit.
There is no way to distinguish an 8-bit wide string and an UTF-8-encoded string.
UTF-8 was *not* made for encoding 7-bit wide string, and subsequently doesn't encode 7-bit-wide strings.
Note that the following must *always* be true:
| str == utf8_to_string(string_to_utf8(str));
or to generalize:
| str == utf8_to_string(utf8_to_string(string_to_utf8(string_to_utf8(str))));
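For example (a quick sketch - this holds for any string, including an 8-bit one that already happens to be UTF-8 encoded):

| string str = string_to_utf8("\x1234");                 // an 8-bit string
| utf8_to_string(string_to_utf8(str)) == str;            // always true
| utf8_to_string(utf8_to_string(
|   string_to_utf8(string_to_utf8(str)))) == str;        // also always true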
NB: This all is not only a theory - I've a real application which cannot use SQLite "as is", i.e. with this conversion. I can use MySql or Informix without any problems, though - just wanted to get rid of it...
If Sqlite doesn't work, fix Sqlite or the glue to it.
On Wed, Nov 03, 2004 at 11:15:02AM +0100, Mirar @ Pike developers forum wrote:
Not at all. UTF-8 was made to encode 8-bit characters as well as 16-bit.
There is little (if any) sense to encode 8-bit values into 8-bit values, expanding string (size) on the way, don't you think so?
There is no way to distinguish an 8-bit wide string and an UTF-8-encoded string.
That's why decision about conversion should be left to application/user.
Note that the following must *always* be true:
| str == utf8_to_string(string_to_utf8(str));
... unless str is _already_ UTF-8 encoded and contains character codes > 0x7f. string_to_utf8() assumes that: a) str is 16- or 32-bit wide; or b) it is 7-bit only; if not - it won't work as expected/intended.
Try:
  str = string_to_utf8("\x1234\x1234");
  str = utf8_to_string(string_to_utf8(str));
What will be in str? "\x1234\x1234"? Wrong. Try it :) That's exactly what is happening in SQLite, BTW.
If Sqlite doesn't work, fix Sqlite or the glue to it.
It does work - as advertised. Sqlite just assumes that _any_ string is (probably) UTF-8, i.e. it makes no conversions, so it makes little sense (and even produces problems) when conversion is made implicitly.
This is not a problem to fix the glue - but before I commit the changes I would like to be sure that nobody will be hurt, and I would like to understand why it is done as it is now (so far it seems to me that it was a mistake or misunderstanding of documentation).
Regards, /Al
... unless str is _already_ UTF-8 encoded and contains character codes > 0x7f. string_to_utf8() assumes that: a) str is 16- or 32-bit wide; or b) it is 7-bit only; if not - it won't work as expected/intended.
No. It's *always* true.
str = string_to_utf8("\x1234\x1234");
str == "á\210´á\210´"
utf8_to_string(string_to_utf8(str));
== "á\210´á\210´", which is what is expected.
On Wed, Nov 03, 2004 at 11:55:02AM +0100, Mirar @ Pike developers forum wrote:
No. It's *always* true.
OK, you are right. That is true. But:
  s1 = string_to_utf8("\x1234\x1234");
  s2 = string_to_utf8(s1);
s1 != s2 // Right? Obviously.
So what do we have in case if:
- User provides UTF-8 string as input (s1)
- It gets encoded with string_to_utf8()
- Result (s2) is passed to another application, which expects UTF-8 and doesn't use any conversion (operates directly on UTF-8 strings - comparisons, etc).
So... this will break unless that other application (in our case sqlite, or something that uses the same db as sqlite) uses the appropriate conversion, right?
I.e. - I store some (UTF-8 encoded) string into a column, it will be encoded a second time (in the glue), then I retrieve it from perl-sqlite (since I use the UTF-8 encoded value directly, I don't decode it) and (say) compare it to some constant (also UTF-8). It won't match, or?
(Well, it could happen that I am crazy and completely miss something - but I don't see where :)
Regards, /Al
If the SQLite glue in Pike expects unencoded Pike strings (since it will do the UTF8 conversion internally) you should feed it unencoded Pike strings and not UTF8 data.
- User provides UTF-8 string as input (s1)
If you know your input is UTF8, simply call utf8_to_string() before handing the data to the Pike glue.
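I.e. something like this (just a sketch - the input source and the db object below are placeholders):

| string raw = read_external_utf8_data();   // hypothetical source: 8-bit data known to be UTF-8
| db->big_query(utf8_to_string(raw));       // decode first; the glue re-encodes internally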
There is little (if any) sense to encode 8-bit values into 8-bit values, expanding string (size) on the way, don't you think so?
A generic 8-bit string doesn't contain any info that tells what kind of encoding has been used. That's something you need to keep track of elsewhere. If you use Pike strings all the way you needn't worry about this; it's only when you deal with I/O (files, http, keyboard input etc) that it becomes an issue, and then various methods have been devised to handle it (the <?xml version="1.0" encoding="..."?> header is one example).
On Wed, Nov 03, 2004 at 01:01:51PM +0100, Jonas Walldén @ Pike developers forum wrote:
If you know your input is UTF8, simply call utf8_to_string() before handing the data to the Pike glue.
It means - more CPU usage, more load, more memory for buffers, and all this only to pass data through Pike? Does it make _any_ difference to Pike what the meaning (contents) of any _binary_ string is? Why should I make all those conversions when I can simply pass the data as is?
BTW, most (if not all) current database interfaces (in Pike) will (most probably) fail if I use strings wider than 8 bits.
Maybe this is my specific case, but I have an application with a really high load (gigabytes of data), so obviously I don't want any unnecessary conversions if I can avoid them (most, but not all, of the data is not processed nor checked, just pumped through).
And, back to the original problem, in spite of recent comments - if Pike _forces_ the user to use Unicode strings (16- or 32-bit wide), why is this not enforced _everywhere_ but only in SQLite? :)
Regards, /Al
Isn't the binary data processed anyway (e.g. quoted to be inserted into a SQL statement)? Perhaps bindings in some SQL drivers can avoid that step?
Anyway, if the SQL glue knows about the data types involved in the query (and in the table definition) then I suppose that info may be used to apply UTF-8 conversion selectively.
If there's a conversion step that is unnecessary in your case, you can perhaps add a flag somewhere to explicitly turn it off (but I suggest that you make some actual measurements of what you'll gain with it first, since it isn't unlikely that there are much bigger cpu hoggers elsewhere).
Anyway, either it UTF8 encodes completely or it doesn't do it at all. Anything else is invariably bogus (which reminds me of the wml-url encoding in Roxen which is bogus in exactly this way - I'd still like to get some sort of answer from Nilsson on that).
Well, come to think of it, a third option is actually to do it conditionally and also store a flag which tells whether it was UTF8 encoded or not. In any case, when it's time to decode the string you have to _know_ if it's to be UTF8 decoded or not.
On Wed, Nov 03, 2004 at 03:00:01PM +0100, Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
perhaps add a flag somewhere to explicitly turn it off (but I suggest that you make some actual measurements of what you'll gain with it
The gain is:
1) Less memory - I don't need temporary buffers for conversion. Every new string creation is a pain - when you have a lot of them...
2) Less CPU usage - on short sequences this is almost invisible, but when you have gigabytes of data it creates a visible load.
Well, come to think of it, a third option is actually to do it conditionally and also store a flag which tells whether it was UTF8 encoded or not.
Unfortunately, there is no place for such a flag in the database - sqlite doesn't support any tagging (only 4 basic types, 2 of them integers, 1 binary and 1 text). I would prefer a flag (in the connection/object) like "pass data as is" or (preferably) "make conversion" (so conversion won't be the default, since this is not done anywhere else by default), i.e. without any implicit conversions, and handle results on my own.
In any case, when it's time to decode the string you have to _know_ if it's to be UTF8 decoded or not.
Sure, but better if I can control when (and where) it is done.
If flag (method) is OK, then it is OK for me (I wonder, is there someone who actually uses (or plans to) SQLite?) :)
Regards, /Al
The gain is: /.../
Yes, I can also reason theoretically about what the gain is. What I meant was to actually _measure_ it. Is it a 0.1% speed/memory gain? 1%? 10%? I'm not saying that you haven't done so, but if you have you should be able to give a fairly exact figure. If it's only 1% or thereabouts you'll probably get more "bang for the buck" by attacking something else. Maybe the query formatter; it copies and quotes the whole strings too after all, unless SQLite supports bindings (does it?).
If flag (method) is OK, then it is OK for me /.../
How do you intend to implement the flag in that case?
Space out! Then it's not unlikely that the UTF8 conversion makes quite a lot of difference, I guess.
On Wed, Nov 03, 2004 at 07:50:03PM +0100, Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
Yes, I can also reason theoretically what the gain is. What I meant was to actually _measure_ it. Is it 0.1% speed/memory gain? 1%? 10%?
OK, the facts. A simple test case - ca. 50M of data (UTF8), in 256-byte chunks, reading from a file, converting, processing, converting again, writing to a file (buffered, i.e. hdd latency is not counted, only CPU time is measured, using gauge{}). P4-1.7, 768M RAM, IDE HDD, Linux.
Results for read => UTF8 => UTF16 => processing => UTF8 => write:
Preparing file... Done! Measuring... 52428915 bytes processed; time spent: 2.900; 0.058000 s/M
Results for read => processing (without conversion) => write:
Preparing file... Done! Measuring... 52428915 bytes processed; time spent: 0.780; 0.015600 s/M
As you can see, the conversion takes 3.6 times more CPU time than plain processing without conversion. This is not 1% and not even 100%. Yes, I ran it several times and times shown above are average for all runs.
The test case is simple but reflects the behavior (more or less) of my real application. "Processing" in this test case was simulated by a search() for something non-existent. The actual amount of data processed is counted in tens of gigabytes, and the actual processing is similar to search but uses regular expressions and XML processing, and involves data extraction and manipulation (= more time & memory spent on 16-bit wide strings).
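Roughly, the measured loop looked like this (a simplified sketch, not the exact test code; the prepared file consists of UTF-8-clean 256-byte chunks, so utf8_to_string() never sees a sequence split across a chunk boundary):

| Stdio.File in  = Stdio.File("test.dat", "r");
| Stdio.File out = Stdio.File("out.dat", "wct");
| mixed spent = gauge {
|   string chunk;
|   while ((chunk = in->read(256)) && sizeof(chunk)) {
|     string wide = utf8_to_string(chunk);   // UTF8 => wide string
|     search(wide, "\x4242");                // simulated "processing": search for something non-existent
|     out->write(string_to_utf8(wide));      // wide string => UTF8
|   }
| };
| write("time spent: %O\n", spent);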
Maybe, when 1T of memory and 128GHz CPUs cost $500, I won't make any benchmarks, but... :)
How do you intend to implement the flag in that case?
sql->set_encoding(), maybe.
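I.e. something along these lines (purely hypothetical - neither the method nor its argument values exist yet, this only illustrates the idea):

| Sql.Sql db = Sql.Sql("sqlite://my.db");   // connection URL only for illustration
| db->set_encoding("none");                 // hypothetical: pass strings through untouched
| db->big_query(raw_utf8_query);            // raw_utf8_query: already UTF-8 encoded 8-bit query (placeholder)
| // or, to get today's behavior explicitly:
| db->set_encoding("utf8");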
Regards, /Al
Indeed a sizeable difference. A flag is in order.
But I don't understand the UTF16 step there. Is it the internal widestring format you label that way? If so, it's not UTF16. One could perhaps call it "dynamically chosen UCS2 or UCS4" (if they are what I recall them to be).
Umm, yup. Didn't read carefully. The test is a bit meaningless unless it measures read/write chain to an actual SQLite connection.
On Thu, Nov 04, 2004 at 02:35:01AM +0100, Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
Umm, yup. Didn't read carefully. The test is a bit meaningless unless it measures read/write chain to an actual SQLite connection.
The sqlite (and any db) overhead is much higher, of course, but the point is that string conversions are expensive. After all, sqlite is not the only application where UTF8 may be used.
I don't agree that something which is (relatively) "inexpensive" should be ignored completely - it is inexpensive only when you do a little bit of work with it.
It is like... the difference of 1 cent per 1 liter of fuel - it is less than 1% of a liter's price, but when you tank a lot, have an army of trucks and count your spending over 1 year - it makes a difference.
That's why I prefer search_reverse(x,y) (written in C) instead of using search(reverse(x),reverse(y)), which is "almost invisible" when you do it once or twice, but becomes visible when you do it a lot and on huge strings. The same applies to conversions...
I just don't want to pay extra price when I can easily avoid this, that's all.
Back to SQLite - I am against the _implicit_ conversion (which cannot be turned off), because it forces me (= leaves no choice) to use conversions.
Regards, /Al
A language like Pike is filled with stuff that is more aimed to make things simple rather than optimally fast. That's the whole point with it. This is just another such instance, and so the added complexity of a flag should be put in relation with the actual benefit it'll do in this case.
Just measuring conversion in some other case doesn't tell anything. You could just as well compare
gauge (a = b);
with
gauge (a = utf8_to_string (string_to_utf8 (b)));
and arrive at the conclusion that the conversion is practically infinitely more expensive.
Now, the added complexity of a flag isn't very much, so the speed gain doesn't have to be very much either. But it should at least be clearly measurable, I think. Otherwise people will see the flag, think that it'll have a worthwhile effect and maybe start troubling themselves with attempts to make use of it, and in reality they won't gain anything for the effort. That's why I asked for actual measurements.
On Thu, Nov 04, 2004 at 06:05:02PM +0100, Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
Now, the added complexity of a flag isn't very much, so the speed gain doesn't have to be very much either.
It is not about speed gain (in the case of SQLite) - it is about a conversion which takes place implicitly. One must know that the strings passed must not be UTF8 strings, so the conversion won't hurt them. This is neither checked nor documented, and even if documented, it forces the user to do something to fulfill this requirement.
measurable, I think. Otherwise people will see the flag, think that it'll have a worthwhile effect and maybe start troubling themselves
I am not that kind of person. I follow the principle "If you don't know what it is for, you don't need it" :) And I am quite sure that I am not alone. Just try to extend this analogy to the C library (for instance): "...people will see the function, think that it'll have a worthwhile effect and maybe start troubling themselves". See? There are thousands of functions available in different libraries, but their mere existence doesn't mean that they should be used, nor does it trouble anyone.
Same for Pike - I see the module (say) DVB, but it doesn't make me trouble myself in questions like "Perhaps, I can make use of it?".
All this is not about "what is Pike for and what it is not" - it is a generic (so far) interpreted language, so it might be (and is) used for anything and everything. In turn, it means that more control over what is going on and how is better than less control.
Live example - Stdio.FILE()->set_charset() - it is there, but I never troubled myself asking "How can I use it?" - I just ignored it.
So why would the mere existence of set_charset() or set_encoding() in SQLite trouble anyone? Why do you try to forecast what other people will think or do, instead of giving them the freedom of choice? :)
Or tell me that "Pike is only for this and this, use it like this and never like that" - I can live with it. But never, ever judge what Pike users expect or not - eventually you will be wrong in your judgment.
I just want to understand _why_ SQLite uses implicit conversions when no other DB module does this, that's ALL. If the core team or module author is against my proposal to add this flag - this is OK, just _tell_ me that in _clear text_, instead of trying to convince me that I am wrong, or at least _prove_ that I am wrong. That will be fine - I'll make my _own_ module and use it silently (i.e. no discussions or publishing) in my applications, because I know better what is good (and right) for _my_ applications (I am not talking about anyone else here, just me).
Regards, /Al
It is not about speed gain (in case of SQLite) - it is about conversion which takes place implicitly. One must know that passed strings must not be UTF8 strings, so conversion won't hurt them. This is neither
Conversion to UTF8 doesn't hurt anything (except space).
checked nor documented, and even if documented, it forces the user to do something to fulfill this requirement.
The user won't have to do anything, since the strings will be decoded on extraction.
On Thu, Nov 04, 2004 at 06:50:01PM +0100, Henrik Grubbström (Lysator) @ Pike (-) developers forum wrote:
Conversion to UTF8 doesn't hurt anything (except space).
Unless input is UTF8 encoded already. That's what I am trying to tell during this discussion.
The user won't have to do anything, since the strings will be decoded on extraction.
The user must not provide UTF8 encoded strings to the module, otherwise they will be double-encoded.
In the current implementation, any string which is 8-bit wide and inserted into sqlite using bindings will be flagged as datatype "blob", which is not always a good idea and may differ from the user's intentions.
Regards, /Al
On Thu, Nov 04, 2004 at 06:50:01PM +0100, Henrik Grubbström (Lysator) @ Pike (-) developers forum wrote:
Conversion to UTF8 doesn't hurt anything (except space).
Unless input is UTF8 encoded already. That's what I am trying to tell during this discussion.
UTF8-encoding something twice doesn't hurt; just UTF8-decode twice, and you're back where you started.
Regards, /Al
It seems to me that the problem is that a) it seems like an unnecessary duplication of work to be double-encoding and decoding, and b) it could be a problem for interoperability with other non-pike applications that share the same dataset (which don't have the same encoding behavior). Personally, I'd rather have the automatic encoding turned off by default, but would accept an option to do so manually.
Bill
On Thu, Nov 04, 2004 at 07:25:01PM +0100, Henrik Grubbström (Lysator) @ Pike (-) developers forum wrote:
UTF8-encoding something twice doesn't hurt; just UTF8-decode twice, and you're back where you started.
1) it expands string length;
2) I need to take this into account when I use data outside Pike application, which is not always possible (db is accessed not only from Pike).
3) sqlite won't be able to compare strings using collation sequences, since twice-encoded UTF8 strings won't produce correct Unicode characters.
I don't use Pike for writing mud-clients or servers for a small community. I use it for real-life applications, where interoperability is a must. If this (interoperability) is not something that worries the Pike team - OK, just let me know, but please, don't try to tell me that I am doing something wrong just because I don't do it the way others do :)
Regards, /Al
On 2004-11-04 18:42:50 +0100, Alexander Demenshin wrote:
I just want to understand _why_ SQLite uses implicit conversions
Pike strings are defined as strings of 32-bit values. If you expect to store a string in SQLite and read it back exactly as it was, you have to encode it in some way. Using UTF-8 (which is standardized and in wide use) seems to be better than inventing the umpteenth proprietary encoding.
when no other DB module does this,
I haven't checked if there is really no other DB module which does this, but it is possible that these modules are older than wide strings in Pike (I don't remember when wide strings in Pike were introduced, but I'm fairly sure that at least some database modules already existed at that time) and nobody bothered to update them, or that the author of the module didn't think about what would happen if you tried to store a wide string in a database. Or they did think about it and decided that that should be decided by the application.
I'll have to check what the Oracle module does. I think it should do implicit conversions for varchar2 and clob fields, but not for blob fields (does it matter for numeric fields?)
If the core team or module author is against my proposal to add this flag - this is OK, just _tell_ me that in _clear text_, instead of trying to convince me that I am wrong, or at least _prove_ that I am wrong.
I don't understand this. If they are against your proposal why shouldn't they argue their point? Why is "shut up, we don't like it" better than "we don't like it, because ..."?
hp
Pike strings are defined as strings of 32-bit values. If you expect to store a string in SQLite and read it back exactly as it was, you have to encode it in some way. Using UTF-8 (which is standardized and in wide use) seems to be better than inventing the umpteenth proprietary encoding.
Hm, I have no idea how this is solved in other databases, but generally it seems like this encoding should be forced upon the writer of the program, not implicitly in the glue...
Nothing else that I know of has this encoding implicitly, maybe with the exception of Protocols.HTTP. You can't send strings to, for instance, write or Image.JPEG.decode if they aren't 8-bit already.
What kind of strings does SQLite expect? Can it use UTF8 strings? Does it support some kind of pattern search that are actually broken by UTF8 encoding?
You can't send strings to for instance write or Image.JPEG.decode if they aren't 8-bit already.
In the case of Image.JPEG.decode it's because the decoding is defined for an octet stream only, of course. In the case of write, or the other generic I/O functions, it's because there's no universally accepted way to write wide chars.
I would like to characterize this whole discussion as yet another time the confusion between byte arrays and strings messes things up. The last time was when PCRE was integrated, unless I remember incorrectly.
There are, btw, more or less standardized methods to write and read unicode files on some platforms; a method to get a platform-unicode-to-pikestring file object would be fairly useful.
Windows has its widechar (files are mostly stored as utf16-le), and modern unixen tend to use utf-8 (for mainly american reasons).
On Thu, Nov 04, 2004 at 07:23:26PM +0100, Peter J. Holzer wrote:
Pike strings are defined as strings of 32-bit values.
Could you please provide the source of this information? AFAIK, Pike strings may hold characters 8, 16 or 32 bits in length, according to the documentation.
was, you have to encode it in some way. Using UTF-8 (which is standardized and in wide use) seems to be better than inventing the umpteenth proprietary encoding.
Again... :( Is my English so bad, or is something else wrong? Currently, the SQLite module will apply the string_to_utf8() function to _any_ string which is passed to big_query(), except when bindings are in use and the string is 8-bit wide.
See what happens:
1) I supply big_query() with a UTF8 encoded string.
2) The SQLite module converts it (again) to UTF8, which scrambles the Unicode. Note that this conversion is implicit and cannot be turned off, unless I (manually) apply utf8_to_string() beforehand and pass its result to big_query().
3) Any external application which expects UTF8 encoded Unicode characters in the sqlite database will get it wrong. sqlite itself (with caseless comparison) will be unable to handle it right as well.
Don't you see the problem?
I haven't checked if there is really no other DB module which does this,
I did. Only SQLite does this.
module didn't think about what would happen if you tried to store a wide string in a database. Or they did think about it and decided that that should be decided by the application.
It should be decided by the application, not the SQL module, _always_. That's what I am trying to say here.
I don't understand this. If they are against your proposal why shouldn't they argue their point? Why is "shut up, we don't like it" better than "we don't like it, because ..."?
Because there is no "because". I say that the implicit conversion breaks things (see above), they tell me that "it won't hurt, just decode it twice".
Regards, /Al
I suspect you lump me in with "they" there, so I'd like to point out that I haven't made that suggestion. It's clearly not the right way to do it.
You can get it perfectly well working by decoding your UTF8 encoded queries before feeding them to the glue. Then you get an unnecessary decode/encode cycle, but nothing more serious than that.
On Thu, Nov 04, 2004 at 08:25:00PM +0100, Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
I suspect you lump me in with "they" there, so I'd like to point out that I haven't made that suggestion. It's clearly not the right way to do it.
"they" are everybody who did, no personal pointers :)
You can get it perfectly well working by decoding your UTF8 encoded queries before feeding them to the glue. Then you get an unnecessary decode/encode cycle, but nothing more serious than that.
That's an extra price I don't want to pay. After all, it might not be UTF8 but anything else that fits in 8 bits (any charset which fits in 8 bits will be scrambled by the implicit encoding, while the native interface won't do any conversion).
If the SQL modules were a "high-level" API to any SQL DBMS, I would agree with this, but since they are not, this (how it is done now) is not the right way to go, IMHO. Or, in other words, I don't expect that wrapper glue will do anything but wrap.
Regards, /Al
The Sql interfaces have their problems, but by and large they're about as high level as such interfaces can get while still leaving the formatting of the sql queries to the user. I think that's enough to make it appropriate for them to take care of encodings whenever it can be done, and in the case of SQLite there's clearly enough of a UTF8 policy.
So I agree with Nilssons choice to make UTF8 conversions implicitly. A flag is motivated, but doing conversions should be the default, I think.
I just took a look at the SQLite API documentation at http://www.sqlite.org/capi3ref.html and it seems that just about everywhere it expects data to be entered/extracted as either UTF-8 or UTF-16, and since Nilsson decided to use the UTF-8 variants of the API I also agree with the "implicit" UTF-8 conversions (the only reasonable alternative would have been "implicit" UTF-16 conversion). Note that BLOB fields naturally should not be converted.
But is the choice between UTF8, UTF16, UTF16BE and UTF16LE transparent? Otherwise, isn't it necessary to have a setting for that, for the sake of other clients?
Well, since the db can't know whether data inserted as UTF-8 will be extracted as UTF-16, it has to perform the normalization internally, so it's most likely a non-issue.
You're right, but it could still be good to have one for performance tuning:
If the text representation specified by the database file (in the file header) does not match the text representation required by the interface routines, then text is converted on-the-fly. Constantly converting text from one representation to another can be computationally expensive, so it is suggested that programmers choose a single representation and stick with it throughout their application.
But then again, it has limited effect right now:
In the current implementation of SQLite, the SQL parser only works with UTF-8 text. So if you supply UTF-16 text it will be converted. This is just an implementation issue and there is nothing to prevent future versions of SQLite from parsing UTF-16 encoded SQL natively.
A toggle between UTF8, UTF16(LE|BE) and none is my suggestion.
On Fri, Nov 05, 2004 at 01:21:12PM +0100, Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
with UTF-8 text. So if you supply UTF-16 text it will be converted.
Just a followup - this applies if *16() functions are used. Regular functions won't convert anything.
A toggle between UTF8, UTF16(LE|BE) and none is my suggestion.
Toggle on which level? sqlite (the library) provides two interfaces - one for UTF-8 and one for UTF-16. The internal (in-file) representation is UTF-8 only (currently, and I believe it will take half a year to change).
Regards, /Al
On Fri, Nov 05, 2004 at 01:21:12PM +0100, Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
with UTF-8 text. So if you supply UTF-16 text it will be converted.
Just a followup - this applies if *16() functions are used. Regular functions won't convert anything.
Yes, so how do you know that data you have inserted with the non *16 functions won't be extracted with some *16 function?
Regards, /Al
On Fri, Nov 05, 2004 at 01:45:25PM +0100, Henrik Grubbström (Lysator) @ Pike (-) developers forum wrote:
Yes, so how do you know that data you have inserted with the non *16 functions won't be extracted with some *16 function?
I don't know, and sqlite doesn't either. There will be an attempt to convert from UTF-8 if a *16() function is used; obviously, it may fail - but there is nothing we can do about this - invalid data may be inserted outside Pike as well, so Pike's conversion will fail too. That's why an option to turn the conversion off is more good than evil (terrible things happen when utf8_to_string() fails inside the SQLite module during query processing - I tried this already).
Regards, /Al
On Fri, Nov 05, 2004 at 12:35:39PM +0100, Henrik Grubbström (Lysator) @ Pike (-) developers forum wrote:
http://www.sqlite.org/capi3ref.html and it seems it just about everywhere expects data to be entered/extracted as either UTF-8 or
It doesn't check the validity of encoding nor makes any conversions internally.
UTF-16, and since Nilsson decided to use the UTF-8 variants of the API I also agree with the "implicit" UTF-8 conversions (the only
Again, this prevents direct usage of UTF-8 encoded strings, because then they will be encoded twice. Yes, we already discussed that "one should not use UTF-8 when working with the SQLite module", but I strongly disagree with this policy - I have already explained why, many times.
Note that BLOB fields naturally should not be converted.
The current SQLite module implementation assumes that a field is a BLOB if (and only if) it is an 8-bit wide string passed to the statement using bindings. This way, a UTF-8 encoded string cannot be stored as a text-type string using bindings.
If the opposition is so strong - OK, I'll leave Nilsson's module (in CVS) as is and use a modified version, which will be simply a wrapper, not any kind of "intellectual decision machine knowing what to do better than the user" (sorry, but currently it is - any uncontrollable implicitness will be like this).
After all, it seems that I am the only real user of sqlite in Pike, at least the only one who intends to use it in production mode, and the current implementation is too restrictive because of this implicitness.
It is one thing to implement something just as a "proof of concept", or "to declare that it exists", but completely another to actually use it...
Just to summarize why current SQLite is restrictive:
1) Already prepared UTF-8 strings cannot be used directly;
2) Anything but UTF-8 cannot be used while sqlite allows this;
3) Enforced conversion adds additional overhead - it doesn't matter how small it is, but it is there, while it can be avoided.
While (2) and (3) are not very important (at this stage), (1) is _extremely_ important (in my case, at least). No, I don't want to run strings through utf8_to_string() first, before passing them to SQLite, just because of this implicit conversion. There is an alternative, though - don't make any conversion if the string is 8-bit wide (my initial proposal) - this won't hurt anybody, and those who will use 16- or 32-bit strings (because nobody does right now) will see no difference.
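In glue terms my proposal is simply this (a sketch of the intended behavior, in Pike terms, not the actual C glue code):

| string maybe_encode(string s)
| {
|   if (String.width(s) <= 8)
|     return s;                  // 8-bit: the caller already encoded it (or it is binary) - pass as is
|   return string_to_utf8(s);    // wide string: encode to UTF-8, as today
| }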
Regards, /Al
On Fri, Nov 05, 2004 at 01:40:38PM +0100, Henrik Grubbström (Lysator) @ Pike (-) developers forum wrote:
It doesn't? How does it handle the case where you have inserted data with one of the *16 functions, and want to extract it with one of the non *16 functions then?
When I insert the data with a *16() function, it will be converted to UTF-8 before being stored to the file; then, when I extract it with a non-*16() function, no conversion nor checks will be done at all. If I don't use *16() on insertion, then no conversion nor checks are done (see below).
As far as I can see, it does perform conversions internally, and they will most likely fail if the inserted data isn't properly encoded.
It does so only if you use the *16() functions, or if (not implemented yet) the database file is not in UTF-8. From the docs:
"SQLite is not particular about the text it receives and is more than happy to process text strings that are not normalized or even well-formed ^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^ UTF-8 or UTF-16. Thus, programmers who want to store IS08859 data can do so using the UTF-8 interfaces. As long as no attempts are made to use a UTF-16 collating sequence or SQL function, the byte sequence of the text will not be modified in any way."
This is from http://www.sqlite.org/version3.html "Support for UTF-8 and UTF-16".
Regards, /Al
On Fri, Nov 05, 2004 at 01:40:38PM +0100, Henrik Grubbström (Lysator) @ Pike (-) developers forum wrote:
It doesn't? How does it handle the case where you have inserted data with one of the *16 functions, and want to extract it with one of the non *16 functions then?
When I insert the data with a *16() function, it will be converted to UTF-8 before being stored to the file; then, when I extract it with a non-*16() function, no conversion nor checks will be done at all. If I don't use *16() on insertion, then no conversion nor checks are done (see below).
My question concerned the other way around. ie inserting invalid UTF-8, and attempting to extract it with one of the *16 functions.
As far as I can see, it does perform conversions internally, and they will most likely fail if the inserted data isn't properly encoded.
It does so only if you use the *16() functions, or if (not implemented yet) the database file is not in UTF-8. From the docs:
"SQLite is not particular about the text it receives and is more than happy to process text strings that are not normalized or even well-formed ^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^ UTF-8 or UTF-16. Thus, programmers who want to store IS08859 data can do so using the UTF-8 interfaces. As long as no attempts are made to use a UTF-16 collating sequence or SQL function, the byte sequence of the text will not be modified in any way."
Yes, that's a natural consequence of using UTF-8 as the native storage format. Note that it doesn't mention what happens if you have stored iso8859 data with the UTF-8 interface, and attempt to retrieve it with the UTF-16 interface.
Regards, /Al
On Fri, Nov 05, 2004 at 02:01:05PM +0100, Henrik Grubbström (Lysator) @ Pike (-) developers forum wrote:
My question concerned the other way around. ie inserting invalid UTF-8, and attempting to extract it with one of the *16 functions.
As I mentioned in another reply - this invalid data may be inserted elsewhere, so Pike's attempts to "do everything right" won't be completely successful, right?
format. Note that it doesn't mention what happens if you have stored iso8859 data with the UTF-8 interface, and attempt to retrieve it with the UTF-16 interface.
... same as above. Pike has no control over complete database...
Regards, /Al
It doesn't check the validity of encoding nor makes any conversions internally.
Afaics it does, both when necessary in the communication with clients, and when collation etc calls for it. It's clear as day that it has unicode written all over it, and just because it is strictly possible to ignore that doesn't detract from this.
Why would anyone want to store invalid UTF strings in TEXT fields when BLOB fields are available? Besides proving some kind of point to do it just because it can be done?
If opposition is so strong - OK, I'll leave Nillson's module (in CVS) as is and use modified version,
You seem to ignore that as the discussion has progressed, no one has opposed adding a flag to turn it off. Isn't that enough for you? Or do you just continue this kind of sulky the-world-against-me attitude for the sake of it?
- Already prepared UTF-8 strings cannot be used directly;
This point can be reduced to (3) by just decoding the strings before entry. In other words, it's not a matter of versatility but one of performance.
- Anything but UTF-8 cannot be used while sqlite allows this;
I wouldn't say it's allowed just because it doesn't check for invalid strings. Everywhere I've looked in the docs it says UTF8 or UTF16, period. Is there any guarantee that they won't add a validity checker at some point?
- Enforced conversion adds additional overhead - it doesn't matter how small it is, but it is there, while it can be avoided.
Valid point, although it would still be nice to see how much overhead the extra conversion incurs.
There is alternative, though - don't make any conversion if string is 8-bit wide (my initial proposal) - this won't hurt anybody, and those who will (because nobody does right now) use 16- or 32-bit strings will see no difference.
Oh my will this hurt! This is definitely the one thing I absolutely and utterly oppose. How do you know if the string is to be UTF8/16 decoded when you get it back? Using some kind of dwim by trying to decode it and just pass it through if that fails? Then there's always the possibility that it'll decode eight bit raw strings that just happen to not be invalid UTF-8. What if you want to use the sqlite collation functions etc on those strings? They sure as hell won't work correctly on unencoded eight bit chars.
On Fri, Nov 05, 2004 at 01:50:34PM +0100, Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
It doesn't check the validity of encoding nor makes any conversions internally.
Afaics it does, both when necessary in the communication with clients, and when collation etc calls for it.
Documentation clearly says it doesn't - I posted a quote before.
You seem to ignore that as the discussion has progressed, no one has opposed adding a flag to turn it off. Isn't that enough for you?
It is enough for me, but if no one opposed, why does it still continue? :)
Valid point, although it would still be nice to see how much overhead the extra conversion incurs.
Why? Is mere fact that I (or anyone else) don't want _any_ (no matter how small) overhead not enough? :)
and utterly oppose. How do you know if the string is to be UTF8/16 decoded when you get it back?
How do you know that a string you wrote to a file is valid UTF-8 when read back? How do you know what kind of data is stored in a file or any other external source if you didn't write it?
happen to not be invalid UTF-8. What if you want to use the sqlite collation functions etc on those strings? They sure as hell won't work correctly on unencoded eight bit chars.
There is an extremely nice feature in sqlite - I can define my own functions and collation sequences :)
Regards, /Al
Documentation clearly says it doesn't - I posted a quote before.
True. I got that message some time later.
Valid point, although it would still be nice to see how much overhead the extra conversion incurs.
Why? Is mere fact that I (or anyone else) don't want _any_ (no matter how small) overhead not enough? :)
I've already elaborated on that. Also, I said it would be _nice_, not _necessary_.
How do you know that a string you wrote to a file is valid UTF-8 when read back? How do you know what kind of data is stored in a file or any other external source if you didn't write it?
I know from other data (e.g. a user-provided file name) and/or conventions (e.g. standard file extensions) in those cases. In this case there would be neither - nothing that is able to tell whether the string should be decoded or not.
After all, it seems that I am the only real user of sqlite in Pike, at least the only one who intends to use it in production mode, and the current implementation is too restrictive because of this implicitness.
It is one thing to implement something just as a "proof of concept", or "to declare that it exists", but completely another to actually use it...
You have a very high horse, sir.
On Fri, Nov 05, 2004 at 05:45:16PM +0100, Martin Nilsson (DivX Networks) @ Pike (-) developers forum wrote:
You have a very high horse, sir.
No offense, please :)
If I use something, I want to use it the way I am prepared to, or the way my application is designed to. I respect your right to do something the way you want to, but in return I expect that my way of doing something will be respected too, instead of attempts to teach me how to program or how to use an API "in the right way", right? :)
I simply asked - is this OK or not (my proposal), I explained why I need this, and instead of a simple, plain, clear answer "Do this" or "Don't do this" I got a lecture that my way of thinking is... well, different :)
This is _not_ what I want. Next time I'll just publish my intention, ask for opinions, and anything that is not a clear, straight "OK, we agree" I'll interpret as "We don't agree" or "We don't understand why you need this [because we don't need this|because we can live without this, so you can too]".
Regards, /Al
I simply asked - is this OK or not (my proposal), I explained why I need this, and instead of a simple, plain, clear answer "Do this" or "Don't do this" I got a lecture that my way of thinking is... well, different :)
You proposed to make string_to_utf8 unusable, so naturally there was some concerns.
I suggest that instead of making a flag, make a RawSQLite class with all the common code and access code without conversions and then inherit it into SQLite and put code with conversions there.
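Roughly like this (a structural sketch only - the class and method names are made up to illustrate the split):

| class RawSQLite
| {
|   // common glue code; strings pass through untouched
|   string run(string q) { /* call the sqlite library here */ return q; }
| }
|
| class SQLite
| {
|   inherit RawSQLite;
|   // conversion layer on top: encode queries on the way in, decode results on the way out
|   string run(string q) { return utf8_to_string(::run(string_to_utf8(q))); }
| }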
Believe it or not, but I discuss it out of respect for you. If I didn't respect your thinking, I'd just say something like "stupid, go away". But since I believe you have put some thought into it I'm trying to have a constructive argument with you, to discern the overall best thing to do. That implies to put forth reasons and to defend them, for both sides.
After all, it seems that I am the only real user of sqlite in Pike, at least the only one who intends to use it in production mode, and the current implementation is too restrictive because of this implicitness.
You certainly aren't the only "real user" of SQLite. I use it at work all the time nowadays and it works just fine.
Seriously, I'm getting tired of this discussion. If you cannot live with the overhead of converting your input data to Pike strings, perhaps you should write your code in C or even assembler. You seem to be in great need of performance for your application, and in that case Pike simply isn't the best choice at all times. Another thing that makes me wonder why you want to use Pike for this particular application is the fact that you don't seem to do anything with your data in Pike. From your posts, you just seem to read UTF-8 data from some source and shove it into SQLite. Doing that is about 50 lines of C code, and you don't get much more efficient than that...
Could you please provide the source of this information? AFAIK, Pike strings may hold characters 8, 16 or 32 bits in length, according to the documentation.
That's just an implementation detail. Conceptually, a Pike string is a sequence of integers in the range 0..2147483647 representing ISO-10646-1 characters.
That's just an implementation detail. Conceptually, a Pike string is a sequence of integers in the range 0..2147483647 representing ISO-10646-1 characters.
Actually -2147483648..2147483647, where 0..2147483647 represent ISO-10646-1 characters, and the others are for application use.
"\x80000000"[0];
Result: -2147483648
"\x7fffffff"[0];
Result: 2147483647
Note that using negative "characters" forces the string width to 32:
String.width((string)({-1}));
Result: 32
/.../ One must know that the strings passed must not be UTF8 strings, so the conversion won't hurt them. This is neither checked nor documented, and even if documented, it forces the user to do something to fulfill this requirement.
It depends on your point of view. If the point of view is that you have decoded strings internally, which is customary in Pike, then it's exactly the opposite - the implicit conversion enables you to feed and get back decoded strings without fuss.
If there were a universal way to encode wide chars in normal files then you could be assured that the Stdio functions also would do that encoding implicitly so that you wouldn't have to meddle with encoded strings in your pike program when using the Stdio module.
/.../ There are thousands of functions available in different libraries, but their mere existence doesn't mean that they should be used, nor does it trouble anyone.
That analogy is faulty. The case is rather the choice between <simple straightforward method> and <more complex method>. If the <more complex method> exists, one will think it must do that for a reason. Thus it's worth investigating it and maybe deploy it even though the code gets more clunky. If the <more complex method> doesn't produce any measurable gain at all compared to the other one, it has no reason to exist.
But by now I think this discussion in itself has become a waste of too much effort compared to either alternative, so the flag is perfectly fine by me for that reason only. I.e. for me you're welcome to go ahead and add it. There are certainly plenty of things in Pike already that are a lot worse any way you look at them but which got in without nearly this amount of debate.
All this is not about "what is Pike for and what it is not" - it is a generic (so far) interpreted language, so it might be (and is) used for anything and everything. In turn, it means that more control over what is going on and how is better than less control.
I don't agree. There are a lot of things you don't control in Pike. E.g. whether strings gets hashed or not. One could certainly squeeze out a bit of extra performance by telling the interpreter not to hash certain strings. Same thing goes for the mandatory runtime argument checks, the limits of native integers, the hash method used in mappings, the allocation strategies in countless places, the reference counting policy, etc, etc, etc. All these things you don't control, and that is for a reason: To keep the language simple to use and - in some cases - simple to implement.
Also, it's not like these things, nor the implicit encoding in the SQLite interface, impacts the expressiveness of the language. You can still get the work done, just not _precisely_ in the way you might want to do it.
I just want to understand _why_ SQLite uses implicit conversions when no other DB module does this, that's ALL.
This is a good question, and note that I haven't argued this aspect so far.
The reason the other db modules don't do it is that there's no set policy in those db's what the encoding is. I take it that SQLite got such a policy and that it's UTF8 (please correct me if I'm wrong here).
So an argument against the SQLite glue doing any conversion is that the others don't do it and they are all accessible through a uniform interface. But otoh that interface doesn't really say anything about providing a uniform string encoding behavior. Actually, you can't do much at all through the Sql.Sql interface without knowing the database you're connected to, since so many details differ anyway.
So all in all, I think it's good that the glue to any db with a well defined encoding policy enforces that policy by default, since it makes the interface to those db's simpler to use. I've heard that newer MySQLs have a Unicode policy too. I'd be positive to embed encoding in that glue too (but there's of course the compatibility aspect to worry about in that case).
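To make the difference concrete, this is roughly what the caller has to do with a db module that does no implicit encoding (just a sketch - the connection URL, table and column names are made up):

  Sql.Sql db = Sql.Sql("mysql://localhost/test");   // hypothetical connection
  string wide = "räksmörgås";                       // a decoded pike string
  // Encode explicitly on the way in...
  db->big_query("INSERT INTO t (v) VALUES ('" +
                db->quote(string_to_utf8(wide)) + "')");
  // ...and decode explicitly on the way out.
  string back = utf8_to_string(db->query("SELECT v FROM t")[0]->v);

With an enforced encoding policy in the glue, both conversions disappear from the application code.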
On Thu, Nov 04, 2004 at 08:20:02PM +0100, Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
If there were a universal way to encode wide chars in normal files
Even in this case we would need a way to differentiate between binary and text files, so the decision would be on the user/application side.
That analogy is faulty. The case is rather the choice between <simple straightforward method> and <more complex method>. If the <more complex method> exists, one will think it must do that for a reason.
Not really. I can use read() or readv() in C - both have advantages and disadvantages, but most people use just read(). And most people in Pike, I guess, use Stdio.FILE() without calling set_charset(), despite its presence there.
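For reference, the set_charset() route looks roughly like this, as far as I remember (file name made up):

  Stdio.FILE f = Stdio.FILE("data.txt", "r");
  f->set_charset("utf-8");        // decode transparently while reading
  string line = f->gets();        // already a decoded pike string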
I don't agree. There are a lot of things you don't control in Pike.
Yes, I can't control language internals, but I want to control anything beyond them. How files/databases are encoded, at least.
still get the work done, just not _precisely_ in the way you might want to do it.
Yes, I can use asm to get my work done, just not precisely as fast as I can do this in Pike or C :)
The reason the other db modules don't do it is that there's no set policy in those db's what the encoding is.
There is no set policy in sqlite either. In version 3, it is stated that sqlite will treat any [text] string sent to it as UTF8 encoded (for comparisons etc), but it won't check whether it actually is valid UTF8, thus allowing any encoding the user wants to use. Also, it won't make any conversions internally, on either input or output, except for the UTF16 interface, which is not used by the SQLite module. Thus, the implicit conversion applied by the SQLite module restricts the possible applications.
In Informix and MySQL, the policy is defined outside of the DB interface, and it is up to the user to decide which one is in effect and to apply it (yes, it is dynamic), so I don't see any reason why it should be different in SQLite.
providing a uniform string encoding behavior. Actually, you can't do much at all through the Sql.Sql interface without knowing the database you're connected to, since so many details differ anyway.
That's why I am against implicit conversions at all, regardless of module, if this module communicates with something that is external to Pike (files, DBs, sockets, etc).
Regards, /Al
That analogy is faulty. The case is rather the choice between <simple straightforward method> and <more complex method>. If the <more complex method> exists, one will think it must do that for a reason.
Not really. I can use read() or readv() in C - both have advantages and disadvantages, /.../
Now you're making the same faulty analogy again, since, as you so clearly point out, in that case each alternative has _both_ advantages _and_ disadvantages. Sorry, but I won't make another attempt to get my point across on this.
There is no set policy in sqlite either. In version 3, it is stated that sqlite will treat any [text] string sent to it as UTF8 encoded (for comparisons etc), but it won't check whether it actually is valid UTF8, thus allowing any encoding the user wants to use.
Imho that's about as set as a policy can get without getting anal about it.
Tell me, what do comparisons etc do when they come across invalid UTF8 sequences? Is the recovery procedure well defined?
/.../ Thus, the implicit conversion applied by the SQLite module restricts the possible applications.
Indeed it does. Ok, that's also a valid point for a flag to turn it off, in case one really wants to abuse the db that way. Just out of curiosity, is that what you're going to do in your application?
Another interesting question: Does SQLite also provide data types for octet sequences that don't imply UTF8 encoding?
/.../ Actually, you can't do much at all through the Sql.Sql interface without knowing the database you're connected to, since so many details differ anyway.
That's why I am against implicit conversions at all, regardless of module, if this module communicates with something that is external to Pike (files, DBs, sockets, etc).
I'm sorry, but I completely fail to understand the "that's why" between the differing details in the db backends and this conclusion.
On Thu, Nov 04, 2004 at 02:15:01AM +0100, Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
But I don't understand the UTF16 step there. Is it the internal widestring format you label that way?
Yes, it is. The result of converting a UTF8 encoded string (which contains 16-bit values) with utf8_to_string().
Regards, /Al
The result of utf8_to_string doesn't have to contain 16 bit characters and it is certainly not utf16.
On Thu, Nov 04, 2004 at 05:40:01PM +0100, Martin Nilsson (DivX Networks) @ Pike (-) developers forum wrote:
The result of utf8_to_string doesn't have to contain 16 bit characters and it is certainly not utf16.
It doesn't have to unless the source contains 16 bit characters (regardless of encoding).
When one says that a "string is UTF-16 encoded" - this implies that a single character in this string is 16 bits wide - or?
Regards, /Al
Saying that a "string is UTF-16 encoded" implies that characters outside the Basic Multilingual Plane are encoded using surrogates (pairs of two 16-bit values).
On Thu, Nov 04, 2004 at 08:05:03PM +0100, Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) @ Pike (-) developers forum wrote:
Saying that a "string is UTF-16 encoded" implies that characters outside the Basic Multilingual Plane are encoded using surrogates (pairs of two 16-bit values).
"In computing, UTF-16 is a 16-bit Unicode Transformation Format, a character encoding form that provides a way to represent a series of abstract characters from Unicode and ISO/IEC 10646 as a series of 16-bit words suitable for storage or transmission via data networks."
And from the RFC: "In the UTF-16 encoding, characters are represented using either one or two unsigned 16-bit integers, depending on the character value." (http://www.ietf.org/rfc/rfc2781.txt)
As I said, UTF-16 implies 16-bit wide characters, hence, 16-bit wide strings in Pike, which clearly explains why I use this term.
Regards, /Al
But strings in pike are not 16bit, nor are they in a format suitable for transmission, nor are characters represented as one or two 16-bit units depending on the code-point.
So there is exactly nothing in common with utf-16 and pikestrings.
Except that you can store UTF16 strings in pikestrings perfectly well, if you really want to. :-)
Well, yes, but then again, you can store an image or movie just as well. So are pikestrings movies?
To clarify: Strings in pike are an array of 32 bit signed values.
Sometimes they are compressed to be an array of 8 or 16 bit values, but that's only a memory optimization.
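For example (the result comments are from memory):

  String.width("hello");               // 8  - all chars fit in octets
  String.width("h\x1234llo");          // 16 - at least one char needs 16 bits
  String.width((string)({0x10400}));   // 32 - characters above 16 bits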
And most people who think they can get something useful from String.width() are wrong. I was against the introduction of that function, it just causes confusion like the one we are seeing now.
Well, it's useful to see if the string you are going to send to write (or Image.JPEG.decode) is a legal argument, but that's more or less it. :)
It could be seen more clearly if "sequence of octets" was a different datatype from "sequence of ISO-10646-1 characters". Then you'd get a compilation error if you did it wrong.
I'm all for a 'buffer' type.
Combined with #pike compatibility that causes automatic conversion, it would even be possible, I think. A lot of work, though.
To make an automatic conversion that really works, you'd have to go through all C modules though, since it can't always be seen from the declared type if a conversion is needed.
I don't like stuff that is done *just* to lessen confusion or give compile errors in certain conditions.
Is there any other reason to use a buffer type?
That is true. Hm, didn't I make some sort of Memory or Data or something class for that kind of strings?
Is there *a lot* of benefit in not sharing buffer strings?
On Thu, Nov 04, 2004 at 09:10:03PM +0100, Mirar @ Pike developers forum wrote:
Is there *a lot* of benefit in not sharing buffer strings?
There is - in data pumping applications, or when you want to filter data and store it in an intermediate buffer. The current Buffer() is much faster than string addition operations, but it is sometimes useful to access the buffered data in the middle, i.e. without the need to call Buffer()->get() first (which will create a shared string).
One of my apps, for instance, reads data from a remote source and periodically checks whether the stream accumulated so far is a valid packet (packets may be as big as a few megabytes). I cannot use Buffer() for this, since I can't search in buffers, and every buffer += makes things slower and slower with each new chunk...
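The pattern is roughly this (only a sketch - read_chunk() and looks_like_complete_packet() are made-up placeholders for what the app really does):

  String.Buffer buf = String.Buffer();
  string chunk;
  while ((chunk = read_chunk()) && sizeof(chunk)) {
    buf->add(chunk);
    // To search in what has been accumulated I have to flatten it first:
    string so_far = buf->get();   // creates a shared string and empties the buffer
    buf->add(so_far);             // put it back so accumulation can continue
    if (looks_like_complete_packet(so_far))
      break;
  }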
Mutable strings would be nice too, BTW - again, the performance gain would be significant.
Regards, /Al
I think it was only introduced to be able to calculate the approximate size of pike data structures. Not that you will get very useful values anyway, since the overhead is sort of hard to calculate.
On Thu, Nov 04, 2004 at 09:00:12PM +0100, Per Hedbor () @ Pike (-) developers forum wrote:
Actually, removing it and adding a Debug.size( whatever ) would be sort of useful.
Why Debug? Taking the size (of the external representation) of an object in bytes is useful not only for debugging.
BTW, measuring current memory usage is also useful not only for debugging.
Something like Pike.sizeof() or Pike.memory_usage() would be more appropriate - who knows, maybe one day the existence of the Debug module will depend on a compile-time switch :)
Regards, /Al
Well, chars are sometimes unsigned too, notably in character escapes. I.e. it's "\37777777777" and not "-1" or something.
I also remember there was a bit of confusion wrt this in %c to sprintf and sscanf quite some time ago, but it seems to be sorted out now.
%c is a character, %1c is an unsigned octet. I think the confusion was with %4c, which was a signed quad-octet. This is long since fixed though, and %4c is now unsigned just like %2c, %3c and %4711c.
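So nowadays, as far as I remember, it behaves like this:

  int c, x;
  sscanf("A", "%c", c);            // c == 'A' == 65
  sscanf("\1\2\3\4", "%4c", x);    // x == 0x01020304, unsigned big-endian
  sprintf("%4c", 0x01020304);      // gives back "\1\2\3\4"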
And from the RFC: "In the UTF-16 encoding, characters are represented using either one or two unsigned 16-bit integers, depending on the character value." (http://www.ietf.org/rfc/rfc2781.txt)
As I said, UTF-16 implies 16-bit wide characters, hence, 16-bit wide strings in Pike, which clearly explains why I use this term.
Read again, UTF-16 implies 16-bit integers where the characters are built from one *or two* integers.
utf8_to_string does not unpack to UTF-16; values that don't fit inside 16 bits are not stored in two integers; they are stored in one integer.
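For example (the four bytes below are the UTF-8 encoding of U+10400; results from memory):

  string s = utf8_to_string("\xf0\x90\x90\x80");
  sizeof(s);          // 1  - a single character, not a surrogate pair
  String.width(s);    // 32 - since 0x10400 doesn't fit in 16 bits
  s[0] == 0x10400;    // 1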
On Thu, Nov 04, 2004 at 08:25:01PM +0100, Mirar @ Pike developers forum wrote:
utf8_to_string does not unpack to UTF-16; values that don't fit inside 16 bits are not stored in two integers; they are stored in one integer.
But values which do fit will be encoded as 16-bit integers, which still makes it a perfectly valid UTF-16 string (if there are only values of this kind).
Regards, /Al
But values which do fit will be encoded as 16-bit integers, which still makes it a perfectly valid UTF-16 string (if there are only values of this kind).
Sorry, no, that's not correct. Pikestrings can be 16bit and invalid utf-16 at the same time.
On Thu, Nov 04, 2004 at 08:55:03PM +0100, Per Hedbor () @ Pike (-) developers forum wrote:
Sorry, no, that's not correct. Pikestrings can be 16bit and invalid utf-16 at the same time.
But they can be valid, OTOH.
Unfortunately, there is no (native) way in Pike to tag a string as encoded in UTF-8, UTF-16 or in some charset. This is, in turn, the (true) reason for this discussion...
If there is String.width(), there should be String.encoding() too :)
But strings are not (real) objects in Pike, so...
Regards, /Al
Strings are by definition always unicode, so a String.encoding would be trivial to implement.
On Thu, Nov 04, 2004 at 09:05:03PM +0100, Per Hedbor () @ Pike (-) developers forum wrote:
Strings are by definition always unicode, so a String.encoding would be trivial to implement.
Where is this definition? Why is it not enforced? And since in real life we have to deal with encodings, we need some way to store, somewhere, the encoding of strings which are read from external sources... But because this is not done at the language level, I would prefer to prohibit all implicit conversions everywhere (or enforce them everywhere, but in that case, I guess, I'll look for another language :).
Regards, /Al
Maybe "definition" isn't the right word, rather "design goal". It's of course not enforceable. How could it be? Still, it's the reason why the string implementation works the way it does, and why many string handling modules are the way they are.
By design pikestrings are _always_ handled as unicode by all functions that do not use them as simple data (such as read/write etc).
Hence, all data in pike programs should be unicode except for at the interface between the system (file and socket I/O).
Thus, if you have a utf8-string read from a file or the network, convert it to unicode before passing it to any pike functions expecting strings.
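I.e. roughly like this (file names made up):

  string raw = Stdio.read_file("input.txt");              // octets from disk
  string text = utf8_to_string(raw);                      // decoded pike string
  // ... work on text with the normal string functions ...
  Stdio.write_file("output.txt", string_to_utf8(text));   // back to octets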
On Wed, Nov 03, 2004 at 01:20:01PM +0100, Per Hedbor () @ Pike (-) developers forum wrote:
Hence, all data in pike programs should be unicode except for at the interface between the system (file and socket I/O).
IMHO, SQL database is not really different to file/socket I/O :)
Regards, /Al
It is if it uses unicode internally. Really.
There is a difference between APIs processing data and APIs processing strings. The problem in pike is that strings are often used as raw data containers (we might want to change that, really, and go for the java approach).
Hi,
What should be the correct behavior of string_to_utf8() in case if source string is not wide-string? Currently it just converts all 8-bit values to UTF-8 representation, which is not (really) Right Thing (tm) - imagine UTF-8 string passed on... It will be scrambled...
Yes, that is the _right_ thing to do.
May be it makes sense to do nothing (return source string) in case if source is 8-bit string?
No
? Or (at least) check that source is valid UTF-8 stream (I wouldn't choose this way, though)?
No.