Was doing some codepage work (playing with CP437), and found that the Pike CP437 decoder actually produces wrong output, according to the unicode.org mapping file. I'm not sure what's going on here; is there some alternate standard that it's adhering to?
Here are two scripts, charsets.pike and charsets.py, to compare codepage decoding in Pike and Python:
charsets.pike:

int main(int argc, array(string) argv)
{
    object decoder = Charset.decoder(argv[1]);
    for (int byte = 128; byte < 255; ++byte)
        write("%02x: %04x\n", byte, decoder->feed((string)({byte}))->drain()[0]);
}
charsets.py:

import sys

for byte in range(128, 255):
    print("%02x: %04x" % (byte, ord(bytes([byte]).decode(sys.argv[1], errors="replace"))))
$ diff <(pike charsets.pike 437) <(python3 charsets.py 437)
This should, in theory, be absolutely silent, but in practice it reports a number of differences in the decoded output. Wikipedia cites a file on www.unicode.org as its source:
http://en.wikipedia.org/wiki/Code_page_437
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP437.TXT
and that file agrees with Python's decoding. I've put together a script that downloads a mapping file from the above URL and patches the data into the appropriate place in src/modules/_Charset/misc.c, and I've run it on everything in the MICSFT/PC/ directory; the resulting patch is attached, as is a patch adding that script.
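For reference, here is a rough sketch of that kind of cross-check (this is not the attached script; the tab-separated "0xNN<TAB>0xNNNN<TAB># name" line format is my assumption about how the unicode.org mapping files are laid out). It fetches CP437.TXT from the URL above and compares the mapping against Python's own cp437 codec:

import urllib.request

URL = "http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP437.TXT"

# Parse the mapping file: byte value -> Unicode code point.
mapping = {}
with urllib.request.urlopen(URL) as f:
    for line in f.read().decode("latin-1").splitlines():
        if line.startswith("#") or not line.strip():
            continue                      # skip comments and blank lines
        fields = line.split("\t")
        byte = int(fields[0], 16)
        # Some codepages leave slots undefined (empty second field).
        if len(fields) > 1 and fields[1].strip().startswith("0x"):
            mapping[byte] = int(fields[1], 16)

# Report any byte where the file and Python's decoder disagree.
for byte in range(128, 256):
    expected = mapping.get(byte)
    got = ord(bytes([byte]).decode("cp437", errors="replace"))
    if expected is not None and expected != got:
        print("%02x: file says %04x, python says %04x" % (byte, expected, got))

If the file really does agree with Python's decoding, that prints nothing.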
ChrisA