Was doing some codepage work (playing with CP437), and found that the Pike CP-437 decoder actually produces wrong output, according to the unicode.org template file. I'm not sure what's going on here; is there some alternate standard that it's adhering to?
Here are two scripts to compare codepage decoding in Pike and Python:
charsets.pike:

int main(int argc, array(string) argv)
{
    object decoder = Charset.decoder(argv[1]);
    for (int byte = 128; byte < 255; ++byte)
        write("%02x: %04x\n", byte, decoder->feed((string)({byte}))->drain()[0]);
}
charsets.py:

import sys
for byte in range(128, 255):
    print("%02x: %04x" % (byte, ord(bytes([byte]).decode(sys.argv[1], errors="replace"))))
$ diff <(pike charsets.pike 437) <(python3 charsets.py 437)
This should in theory be absolutely silent, but in practice it reports a number of differences in the decoded values. Wikipedia cites a file on www.unicode.org as its source:
http://en.wikipedia.org/wiki/Code_page_437
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP437.TXT
and that file agrees with Python's decoding. I've put together a script that downloads a mapping file from the above site and patches the info into the appropriate place in src/modules/_Charset/misc.c, and have run it on everything in the MICSFT/PC/ directory; the resulting patch is attached, as is a patch adding the script itself.
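For anyone who wants to sanity-check the mappings without applying the patch, something like the following (a rough illustrative sketch, not the attached script; the hard-coded URL and output format are just for demonstration) fetches a mapping table and prints the same "byte: codepoint" pairs as the test scripts above, so it can be diffed against either of them:

import sys
import urllib.request

# Illustrative only -- downloads a unicode.org mapping table and prints
# "byte: codepoint" pairs for the high half, in the same format as the
# Pike/Python test scripts above.
URL = "http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP437.TXT"

def parse_mapping(text):
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # strip comments
        if not line:
            continue
        fields = line.split()
        if len(fields) >= 2:                   # skip undefined entries
            mapping[int(fields[0], 16)] = int(fields[1], 16)
    return mapping

if __name__ == "__main__":
    url = sys.argv[1] if len(sys.argv) > 1 else URL
    text = urllib.request.urlopen(url).read().decode("latin-1")
    mapping = parse_mapping(text)
    for byte in range(128, 255):
        print("%02x: %04x" % (byte, mapping[byte]))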
ChrisA
I haven't yet checked whether the codepage changes you suggest make sense (the current tables are generated from RFC1345, so any discrepancy should be investigated more closely), but I don't really see any point in adding your script to the repository...
Hm, the issue seems to be that RFC1345 does not distinguish between heavy strokes and double strokes in the box drawing characters. I'll make a script to go through those and check them.
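A rough sketch of what such a check could look like (illustrative only, in Python rather than whatever ends up in the tree; it reuses the charsets.pike script from earlier in the thread and assumes the MS-DOS codepages should only contain LIGHT and DOUBLE box-drawing strokes, never HEAVY):

import subprocess
import sys
import unicodedata

# Sketch: run the Pike test script from earlier in the thread and flag any
# decoded box-drawing character whose Unicode name says HEAVY -- the DOS
# codepages use only LIGHT and DOUBLE strokes, so HEAVY suggests an
# ambiguous RFC1345 mnemonic was resolved the wrong way.
def check_codepage(name):
    out = subprocess.run(["pike", "charsets.pike", name],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        byte, codepoint = line.split(": ")
        uname = unicodedata.name(chr(int(codepoint, 16)), "")
        if uname.startswith("BOX DRAWINGS") and "HEAVY" in uname:
            print("%s %s -> U+%s %s" % (name, byte, codepoint.upper(), uname))

if __name__ == "__main__":
    for cp in sys.argv[1:] or ["437"]:
        check_codepage(cp)

Any line that prints would point at a table entry where a heavy stroke appears in place of a double one.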