Re: Codepage decoding tables have incorrect data


Chris Angelico

4 Aug 2014

9:57 p.m.

On Tue, Aug 5, 2014 at 6:55 AM, Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) @ Pike (-) developers forum 10353@lyskom.lysator.liu.se wrote:

...

And I'd be inclined to fix any actual errors, rather than blindly following one or the other. :-)

Just so you know, the MAPPINGS files on unicode.org are not part of the Unicode standard, so they are no more a standard than RFC 1345 is.

Oh! Okay. What is the standard? Where would I find an authoritative set of codepage-to-Unicode character set replacements?

All I can confirm is that, with the changes I put through, Pike agrees with Python 3.4, and codepage 437 "looks right", neither of which is anything official.

ChrisA


Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) ＠ Pike (-) developers forum

4 Aug

10:10 p.m.


...

Oh! Okay. What is the standard? Where would I find an authoritative set of codepage-to-Unicode character set replacements?

The authority on the mapping for a particular codepage would be the owner of said codepage. So in the case of e.g. IBM 437 it would be IBM. The owner of a codepage has no obligation to provide such a mapping, so in some cases no authoritative mapping exists.

...

All I can confirm is that, with the changes I put through, Pike agrees with Python 3.4, and codepage 437 "looks right", neither of which is anything official.

As I said, I'll make a script to check all the box drawing characters (converting thick stroke to double stroke). If it agrees with your changes, I think we can be pretty confident it's the correct fix.
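The script itself is not shown in the thread. As a rough sketch of the idea only, assuming the faulty tables used HEAVY box drawing characters where the codepage really has double-stroke ones, the Unicode character names can be rewritten (HEAVY to DOUBLE, and LIGHT to SINGLE in mixed names) and looked up again. The function names heavy_to_double and suspicious_bytes are made up for illustration, and Python's bundled codecs stand in for Pike's tables here:

```python
import unicodedata

def heavy_to_double(ch):
    """Return the double-stroke counterpart of a heavy box drawing character.

    Returns ch unchanged if it is not a heavy box drawing character, and
    None if Unicode has no character with the derived double-stroke name.
    """
    name = unicodedata.name(ch, "")
    if not name.startswith("BOX DRAWINGS") or "HEAVY" not in name:
        return ch
    # Mixed names pair HEAVY with LIGHT; the double-stroke names pair
    # DOUBLE with SINGLE instead.
    candidate = name.replace("HEAVY", "DOUBLE").replace("LIGHT", "SINGLE")
    try:
        return unicodedata.lookup(candidate)
    except KeyError:
        return None

def suspicious_bytes(encoding):
    """List (byte, char, suggestion) for heavy box characters in a codec's table."""
    hits = []
    for b in range(256):
        try:
            ch = bytes([b]).decode(encoding)
        except UnicodeDecodeError:
            continue  # byte is undefined in this codepage
        fixed = heavy_to_double(ch)
        if fixed != ch:
            hits.append((b, ch, fixed))
    return hits

if __name__ == "__main__":
    for b, ch, fixed in suspicious_bytes("cp437"):
        print(f"0x{b:02X}: {unicodedata.name(ch)} -> {fixed!r}")
```

A table that is already correct produces no hits; any hit is a candidate for the thick-to-double replacement and still needs to be compared against the proposed patch by hand.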

Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) ＠ Pike (-) developers forum

7 Aug

4:30 p.m.

I have pushed fixes for the box drawing characters, and for the incorrect forms of Arabic characters. This includes fixes to IBM 851 and IBM 868, which were not part of your patch.

Many of the remaining changes suggested are a bit dubious though.

For example, in IBM 437 you proposed the following changes:

0xe1: GREEK SMALL LETTER BETA -> LATIN SMALL LETTER SHARP S
0xe6: GREEK SMALL LETTER MU -> MICRO SIGN
0xed: EMPTY SET -> GREEK SMALL LETTER PHI

Now, as is noted on e.g. http://en.wikipedia.org/wiki/IBM_437, the Greek characters are used for multiple purposes. This means that a "correct" translation (i.e. one that captures the intent of the document author) needs some kind of context.

If we ignore this fact, and just look at what IBM has defined these code points to be, we find (ftp://ftp.software.ibm.com/software/globalization/gcoc/attachments/CP00437.txt)

E1 LS610000 Sharp s Small
E6 GM010000 Mu Small
ED GF010001 Phi Small (Closed Form)

Looking up these GCGIDs in http://www-01.ibm.com/software/globalization/gcgid/gcgid.html, we find

LS610000 Sharp s Small                 U00000DF LATIN SMALL LETTER SHARP S
GM010000 Mu Small - (resembles SM17)   U00003BC GREEK SMALL LETTER MU
GF010001 Phi Small (Closed Form)       U00003C6 GREEK SMALL LETTER PHI

So the changes for code points E1 and ED agree with the authoritative source, but the change for code point E6 goes against it...
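For anyone who wants to reproduce the comparison, here is a small sketch that checks the three code points against Python's bundled cp437 codec (the one the original patch was matched against). The IBM_CP437 dict only holds the three names quoted from the GCGID lookup above, and compare_with_python is a made-up helper name, not anything from Pike or IBM:

```python
import unicodedata

# Unicode names taken from the IBM GCGID lookups quoted in this message.
IBM_CP437 = {
    0xE1: "LATIN SMALL LETTER SHARP S",
    0xE6: "GREEK SMALL LETTER MU",
    0xED: "GREEK SMALL LETTER PHI",
}

def compare_with_python(table, encoding="cp437"):
    """Return (byte, ibm_name, python_name, names_match) for each entry."""
    rows = []
    for byte, ibm_name in sorted(table.items()):
        py_name = unicodedata.name(bytes([byte]).decode(encoding))
        rows.append((byte, ibm_name, py_name, ibm_name == py_name))
    return rows

if __name__ == "__main__":
    for byte, ibm_name, py_name, same in compare_with_python(IBM_CP437):
        status = "ok" if same else "DIFFERS"
        print(f"0x{byte:02X}  IBM: {ibm_name:28}  Python: {py_name:28}  {status}")
```

Python decodes 0xE6 as MICRO SIGN, so agreeing with Python and agreeing with IBM's own table are two different things for that code point.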

Martin Nilsson (Opera Mini - AFK!) ＠ Pike (-) developers forum

4:50 p.m.


You have pushed it where?

Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) ＠ Pike (-) developers forum

4:55 p.m.


Ooops, there was an "-n" on that command line... :-)

Fixed now.

pike-devel@lists.lysator.liu.se
