Was doing some codepage work (playing with CP437), and found that the Pike CP-437 decoder actually produces wrong output, according to the unicode.org template file. I'm not sure what's going on here; is there some alternate standard that it's adhering to?
Here are two scripts to compare codepage decoding in Pike and Python:
charsets.pike:

int main(int argc, array(string) argv)
{
    object decoder = Charset.decoder(argv[1]);
    for (int byte = 128; byte < 255; ++byte)
        write("%02x: %04x\n", byte, decoder->feed((string)({byte}))->drain()[0]);
}
charsets.py:

import sys
for byte in range(128, 255):
    print("%02x: %04x" % (byte, ord(bytes([byte]).decode(sys.argv[1], errors="replace"))))
$ diff <(pike charsets.pike 437) <(python3 charsets.py 437)
This should in theory be absolutely silent, but in practice it reports a number of differences in the decoded values. Wikipedia cites a file on www.unicode.org as its source:
http://en.wikipedia.org/wiki/Code_page_437
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP437.TXT
and that file agrees with Python's decoding. I've put together a script that downloads a mapping file from the above site and patches the info into the appropriate place in src/modules/_Charset/misc.c, and have run it on everything in the MICSFT/PC/ directory; the resulting patch is attached, as is a patch adding the script itself.
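For anyone who wants to sanity-check the mappings without applying the patch, something like the following (a rough illustrative sketch, not the attached script; the hard-coded URL and output format are just for demonstration) fetches a mapping table and prints the same "byte: codepoint" pairs as the test scripts above, so it can be diffed against either of them:

import sys
import urllib.request

# Illustrative only -- downloads a unicode.org mapping table and prints
# "byte: codepoint" pairs for the high half, in the same format as the
# Pike/Python test scripts above.
URL = "http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP437.TXT"

def parse_mapping(text):
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # strip comments
        if not line:
            continue
        fields = line.split()
        if len(fields) >= 2:                   # skip undefined entries
            mapping[int(fields[0], 16)] = int(fields[1], 16)
    return mapping

if __name__ == "__main__":
    url = sys.argv[1] if len(sys.argv) > 1 else URL
    text = urllib.request.urlopen(url).read().decode("latin-1")
    mapping = parse_mapping(text)
    for byte in range(128, 255):
        print("%02x: %04x" % (byte, mapping[byte]))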
ChrisA
I haven't yet checked whether the codepage changes you suggest make sense (the current tables are generated from RFC1345, so any discrepancy should be investigated more closely), but I don't really see any point in adding your script to the repository...
Hm, the issue seems to be that RFC1345 does not distinguish between heavy strokes and double strokes in the box drawing characters. I'll make a script to go through those and check them.
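A rough sketch of what such a check could look like (illustrative only, in Python rather than whatever ends up in the tree; it reuses the charsets.pike script from earlier in the thread and assumes the MS-DOS codepages should only contain LIGHT and DOUBLE box-drawing strokes, never HEAVY):

import subprocess
import sys
import unicodedata

# Sketch: run the Pike test script from earlier in the thread and flag any
# decoded box-drawing character whose Unicode name says HEAVY -- the DOS
# codepages use only LIGHT and DOUBLE strokes, so HEAVY suggests an
# ambiguous RFC1345 mnemonic was resolved the wrong way.
def check_codepage(name):
    out = subprocess.run(["pike", "charsets.pike", name],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        byte, codepoint = line.split(": ")
        uname = unicodedata.name(chr(int(codepoint, 16)), "")
        if uname.startswith("BOX DRAWINGS") and "HEAVY" in uname:
            print("%s %s -> U+%s %s" % (name, byte, codepoint.upper(), uname))

if __name__ == "__main__":
    for cp in sys.argv[1:] or ["437"]:
        check_codepage(cp)

Any line that prints would point at a table entry where a heavy stroke appears in place of a double one.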