Re: string_to_utf8() behavior on non-wide strings

5 Nov 2004

      ...
It neither checks the validity of the encoding nor makes any conversions
  internally.
Afaics it does, both when necessary in the communication with clients
and when collation etc. calls for it. It's clear as day that it's got
Unicode written all over it, and the fact that it's strictly possible
to ignore that doesn't detract from this.
Why would anyone want to store invalid UTF strings in TEXT fields when
BLOB fields are available? Other than to prove some kind of point by
doing it just because it can be done?
...
If opposition is so strong - OK, I'll leave Nillson's module (in CVS) as
  is and use a modified version,
You seem to ignore that, as the discussion has progressed, no one has
opposed adding a flag to turn it off. Isn't that enough for you? Or do
you just continue this kind of sulky the-world-against-me attitude for
the sake of it?
...

Already prepared UTF-8 strings cannot be used directly;

This point can be reduced to (3) by just decoding the strings before
entry. In other words, it's not a matter of versatility but one of
performance.
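
A minimal sketch of that reduction, in Python rather than Pike for
illustration only. The name `store` is hypothetical: it stands in for a
binding that is assumed to always UTF-8 encode text on entry.

```python
def store(text: str) -> bytes:
    """Hypothetical stand-in for a binding that always UTF-8 encodes on entry."""
    return text.encode("utf-8")


# The caller already holds a prepared UTF-8 byte string.
prepared = "na\u00efve caf\u00e9".encode("utf-8")

# Decode before entry, then let the binding encode it again.
stored = store(prepared.decode("utf-8"))

# The bytes that end up stored are identical to the prepared ones;
# the only cost is the extra decode/encode round trip.
same_bytes = stored == prepared
```

So the prepared string is still usable; what is lost is only the time
spent on the extra decode and encode.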
...

Anything but UTF-8 cannot be used while sqlite allows this;

I wouldn't say it's allowed just because it doesn't check for invalid
strings. Everywhere I've looked in the docs, it says UTF-8 or UTF-16,
period. Is there any guarantee that they won't add a validity checker
at some point?
...

Enforced conversion adds additional overhead - it doesn't matter
how small it is, but it is there, while it can be avoided.

Valid point, although it would still be nice to see how much overhead
the extra conversion actually incurs.
...
There is an alternative, though - don't make any conversion if the string
  is 8-bit wide (my initial proposal) - this won't hurt anybody, and
  those who will (because nobody does right now) use 16- or 32-bit
  strings will see no difference.
Oh my, will this hurt! This is definitely the one thing I absolutely
and utterly oppose. How do you know whether the string is to be UTF-8/16
decoded when you get it back? By using some kind of dwim that tries to
decode it and just passes it through if that fails? Then there's always
the possibility that it'll decode eight-bit raw strings that just
happen to be valid UTF-8. What if you want to use the sqlite
collation functions etc. on those strings? They sure as hell won't work
correctly on unencoded eight-bit chars.
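
To make the ambiguity concrete, here is a minimal sketch in Python
(not Pike). The function `dwim_decode` is hypothetical and only models
the "try to decode, pass through on failure" idea; it is not taken from
any existing module.

```python
def dwim_decode(raw: bytes) -> str:
    """Hypothetical dwim reader: try UTF-8, else pass through as 8-bit chars."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte to the code point with the same value,
        # which models handing the 8-bit string back untouched.
        return raw.decode("latin-1")


# An 8-bit string of two characters, U+00C3 and U+00A9, stored unconverted.
original = "\u00c3\u00a9"
raw = original.encode("latin-1")      # the bytes C3 A9

# Those bytes happen to be valid UTF-8 for the single character U+00E9,
# so the reader decodes them and hands back a different string.
result = dwim_decode(raw)
roundtrip_ok = result == original

# A different 8-bit string, the lone byte E9, is not valid UTF-8 and is
# passed through, yet it reads back as the very same U+00E9.
collides = dwim_decode(b"\xe9") == result
```

Two distinct stored byte strings come back as the same one-character
string, and the first of them does not survive the round trip.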
