Was doing some codepage work (playing with CP437), and found that the Pike CP437 decoder actually produces wrong output, according to the unicode.org mapping file. I'm not sure what's going on here; is there some alternate standard that it's adhering to?
Here are two scripts, charsets.pike and charsets.py, to compare codepage decoding in Pike and Python:
charsets.pike:

int main(int argc, array(string) argv)
{
    object decoder = Charset.decoder(argv[1]);
    for (int byte = 128; byte < 255; ++byte)
        write("%02x: %04x\n", byte, decoder->feed((string)({byte}))->drain()[0]);
}
charsets.py:

import sys

for byte in range(128, 255):
    print("%02x: %04x" % (byte, ord(bytes([byte]).decode(sys.argv[1], errors="replace"))))
$ diff <(pike charsets.pike 437) <(python3 charsets.py 437)
This should, in theory, be absolutely silent, but in practice it reports a number of differences in the decoded output. Wikipedia cites a file on www.unicode.org as its source:
http://en.wikipedia.org/wiki/Code_page_437
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP437.TXT
and that file agrees with Python's decoding. I've put together a script that downloads a mapping file from the above URL and patches the data into the appropriate place in src/modules/_Charset/misc.c, and I've run it on everything in the MICSFT/PC/ directory; the resulting patch is attached, as is a patch adding that script.
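For reference, here is a rough sketch of that kind of cross-check (this is not the attached script; the tab-separated "0xNN<TAB>0xNNNN<TAB># name" line format is my assumption about how the unicode.org mapping files are laid out). It fetches CP437.TXT from the URL above and compares the mapping against Python's own cp437 codec:

import urllib.request

URL = "http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP437.TXT"

# Parse the mapping file: byte value -> Unicode code point.
mapping = {}
with urllib.request.urlopen(URL) as f:
    for line in f.read().decode("latin-1").splitlines():
        if line.startswith("#") or not line.strip():
            continue                      # skip comments and blank lines
        fields = line.split("\t")
        byte = int(fields[0], 16)
        # Some codepages leave slots undefined (empty second field).
        if len(fields) > 1 and fields[1].strip().startswith("0x"):
            mapping[byte] = int(fields[1], 16)

# Report any byte where the file and Python's decoder disagree.
for byte in range(128, 256):
    expected = mapping.get(byte)
    got = ord(bytes([byte]).decode("cp437", errors="replace"))
    if expected is not None and expected != got:
        print("%02x: file says %04x, python says %04x" % (byte, expected, got))

If the file really does agree with Python's decoding, that prints nothing.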
ChrisA