Apparently a lot of people feel that there aren't enough character encodings available. I've run through the IANA list of character encoding aliases and come up with a list of encodings that Pike does not support. If you're looking for a bite-sized project to spend some minutes on, then this list is a todo for you ;) I've listed them approximately in order of how accessible the standard is and how interesting they are to include in Pike. Let's start with the RFCs. We do want to support the RFCs.
RFC1345: us-dk dk-us MNEMONIC (mnemonic+ascii+38) MNEM (mnemonic+ascii+8200)
RFC1428: UNKNOWN-8BIT
RFC1456: VISCII VIQR
RFC1556: ISO_8859-6-E ISO_8859-6-I ISO_8859-8-E ISO_8859-8-I
RFC1641: UNICODE-1-1
RFC1642: UNICODE-1-1-UTF-7
RFC1815: ISO-10646-Unicode-Latin1 ISO-10646-J-1
RFC1842,1843: HZ-GB-2312
Then the nice people at Unicode never cease to impress each other with new and wider encodings of new and wider characters. We should probably try to support these as well.
Unicode TR19 (http://www.unicode.org/unicode/reports/tr19) UTF-32 UTF-32BE UTF-32LE
Unicode TR26 (http://www.unicode.org/unicode/reports/tr26) CESU-8
Unicode TN6 (http://www.unicode.org/notes/tn6/) BOCU-1
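To give a feel for the size of such a project, here is a minimal Python sketch of CESU-8 as I read TR26: characters outside the BMP are first split into a UTF-16 surrogate pair, and each surrogate is then written as a three-byte UTF-8 sequence. The function names cesu8_encode and cesu8_decode are made up for this illustration, and this is not how Pike would implement it.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (sketch based on a reading of Unicode TR26)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP characters are encoded exactly as in UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Supplementary characters become a surrogate pair first,
            # and each surrogate is then encoded as three UTF-8 bytes.
            cp -= 0x10000
            high = 0xD800 | (cp >> 10)
            low = 0xDC00 | (cp & 0x3FF)
            out += chr(high).encode("utf-8", "surrogatepass")
            out += chr(low).encode("utf-8", "surrogatepass")
    return bytes(out)


def cesu8_decode(data: bytes) -> str:
    """Decode CESU-8 by reading the 16-bit units, then joining surrogate pairs."""
    units = data.decode("utf-8", "surrogatepass")
    # Round-tripping through UTF-16 merges each surrogate pair into one character.
    return units.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


if __name__ == "__main__":
    print(cesu8_encode("\U00010400").hex())
```

For U+10400 this yields the six bytes ED A0 81 ED B0 80, where plain UTF-8 would use the four bytes F0 90 90 80.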
IANA is the body that registers "legal" charsets to be used in XML documents, emails and other things people use, so we should make an attempt to support these as well. Reading these granted applications really makes you want to write an RFC that forces people to write better applications. (Oh, wait. Someone did (RFC2278), but didn't bother to update the old information.)
http://www.iana.org/assignments/charset-reg/IBM00858 http://www.iana.org/assignments/charset-reg/IBM00924 http://www.iana.org/assignments/charset-reg/IBM01140 http://www.iana.org/assignments/charset-reg/IBM01141 http://www.iana.org/assignments/charset-reg/IBM01142 http://www.iana.org/assignments/charset-reg/IBM01143 http://www.iana.org/assignments/charset-reg/IBM01144 http://www.iana.org/assignments/charset-reg/IBM01145 http://www.iana.org/assignments/charset-reg/IBM01146 http://www.iana.org/assignments/charset-reg/IBM01147 http://www.iana.org/assignments/charset-reg/IBM01148 http://www.iana.org/assignments/charset-reg/IBM01149 http://www.iana.org/assignments/charset-reg/Big5-HKSCS http://www.iana.org/assignments/charset-reg/PTCP154 http://www.iana.org/assignments/charset-reg/SCSU http://www.iana.org/assignments/charset-reg/GBK http://www.iana.org/assignments/charset-reg/GB18030
Charsets thought up at ISO are also legal in IANA's eyes. I have no reference for these, but I guess that you either pay ISO money or google.
ISO-8859-16 ISO-10646-UCS-Basic "ASCII subset of Unicode. Basic Latin = collection 1 See ISO 10646, Appendix A"
Now we are in the land of really bizarre, but granted, character encodings.
HP PCL 5 Comparison Guide (P/N 5021-0329): IBM775 (pp B-13, 1996) ISO-8859-1-Windows-3.0-Latin-1 (PCL id 9U) ISO-8859-1-Windows-3.1-Latin-1 (PCL id 19U) ISO-8859-2-Windows-Latin-2 (PCL id 9E) ISO-8859-9-Windows-Latin-5 (PCL id 5T) Ventura-US (PCL id 14J) Ventura-International (PCL id 13J) PC8-Danish-Norwegian (PCL id 11U) PC8-Turkish (PCL id 9T) HP-Legal (PCL id 1U) HP-Pi-font (PCL id 15U) HP-Math8 (PCL id 8M) HP-DeskTop (PCL id 7J) Ventura-Math (PCL id 6M) Microsoft-Publishing (PCL id 6J)
PostScript Language Reference by Adobe Systems Incorporated, Addison-Wesley, 1990: Adobe-Standard-Encoding (PCL id 10J) Adobe-Symbol-Encoding (PCL id 5M)
ABOUT TYPE: IBM's Technical Reference for Core Interchange Digitized Type, publication number S544-3708-01: IBM-Symbols (CPGID 259) IBM-Thai (CPGID 838)
Thai Industrial Standards Institute (TISI): TIS-620
These encodings have no standard reference, but are only described in the IANA list of character encodings.
ISO-10646-UTF-1 "Universal Transfer Format (1), this is the multibyte encoding, that subsets ASCII-7. It does not have byte ordering issues."
ISO-10646-UCS-2 "the 2-octet Basic Multilingual Plane, aka Unicode this needs to specify network byte order: the standard does not specify (it is a 16-bit integer space)"
ISO-10646-UCS-4 "the full code space. (same comment about byte order, these are 31-bit numbers."
JIS_Encoding "JIS X 0202-1991. Uses ISO 2022 escape sequences to shift code sets as documented in JIS X 0202-1991."
Extended_UNIX_Code_Fixed_Width_for_Japanese "Used in Japan. Each character is 2 octets. code set 0: US-ASCII (a single 7-bit byte set) 1st byte = 00 2nd byte = 20-7E code set 1: JIS X0208-1990 (a double 7-bit byte set) restricted to A0-FF in both bytes code set 2: Half Width Katakana (a single 7-bit byte set) 1st byte = 00 2nd byte = A0-FF code set 3: JIS X0212-1990 (a double 7-bit byte set) restricted to A0-FF in the first byte and 21-7E in the second byte"
ISO-Unicode-IBM-1261 "IBM Latin-2, -3, -5, Extended Presentation Set, GCSGID: 1261"
ISO-Unicode-IBM-1268 "IBM Latin-4 Extended Presentation Set, GCSGID: 1268"
ISO-Unicode-IBM-1276 "IBM Cyrillic Greek Extended Presentation Set, GCSGID: 1276"
ISO-Unicode-IBM-1264 "IBM Arabic Presentation Set, GCSGID: 1264"
ISO-Unicode-IBM-1265 "IBM Hebrew Presentation Set, GCSGID: 1265"
Windows-31J "Windows Japanese. A further extension of Shift_JIS to include NEC special characters (Row 13), NEC selection of IBM extensions (Rows 89 to 92), and IBM extensions (Rows 115 to 119). The CCS's are JIS X0201:1997, JIS X0208:1997, and these extensions. This charset can be used for the top-level media type "text", but it is of limited or specialized use (see RFC2278). PCL Symbol Set id: 19K"
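Some of these descriptions are detailed enough to code from directly. As a rough Python sketch, the byte ranges quoted above for Extended_UNIX_Code_Fixed_Width_for_Japanese can be turned into a classifier that says which code set each 2-octet unit belongs to. The ranges are taken literally from the IANA text, and the function names classify_unit and code_sets are invented for this example. It only identifies the code set and does not map anything to Unicode.

```python
from typing import List, Optional


def classify_unit(b1: int, b2: int) -> Optional[int]:
    """Return the code set (0-3) of one 2-octet unit, or None if it fits none.

    The ranges follow the IANA description quoted in the list above, as written.
    """
    if b1 == 0x00 and 0x20 <= b2 <= 0x7E:
        return 0  # US-ASCII
    if b1 == 0x00 and 0xA0 <= b2 <= 0xFF:
        return 2  # half-width katakana
    if 0xA0 <= b1 <= 0xFF and 0xA0 <= b2 <= 0xFF:
        return 1  # JIS X0208-1990
    if 0xA0 <= b1 <= 0xFF and 0x21 <= b2 <= 0x7E:
        return 3  # JIS X0212-1990
    return None


def code_sets(data: bytes) -> List[Optional[int]]:
    """Classify every 2-octet unit in data."""
    if len(data) % 2 != 0:
        raise ValueError("fixed-width data must have an even number of octets")
    return [classify_unit(data[i], data[i + 1]) for i in range(0, len(data), 2)]


if __name__ == "__main__":
    print(code_sets(bytes.fromhex("0041 a4a2 00b1 b021")))
```

Because every character is exactly two octets, the code set can be decided per unit without any state, which is the main attraction over the ISO 2022 escape sequences of JIS_Encoding.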
The charset IBM1047 (EBCDIC Latin 1/Open Systems) has a standard reference to a PDF on an IBM server. That URL now returns a 404.
Finally, for the interested: of the 802 registered IANA encodings and aliases, Pike supports 626, assuming my Hilfe commands were bug-free.
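The same kind of count is easy to repeat for other environments. The sketch below is not the Hilfe session behind the numbers above (that is not shown here). It is a Python analogue that asks the standard codecs module whether it knows each name. The function name split_supported and the short sample list are made up for illustration, and a real run would feed in the full IANA list of names and aliases.

```python
import codecs
from typing import Iterable, List, Tuple


def split_supported(names: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split charset names into those the codecs module knows and those it does not."""
    supported, missing = [], []
    for name in names:
        try:
            codecs.lookup(name)  # raises LookupError for unknown names
        except LookupError:
            missing.append(name)
        else:
            supported.append(name)
    return supported, missing


if __name__ == "__main__":
    # A few names from the lists above, purely as sample input.
    sample = ["UTF-32", "CESU-8", "BOCU-1", "HZ-GB-2312", "TIS-620", "VISCII"]
    supported, missing = split_supported(sample)
    print(len(supported), "of", len(sample), "supported")
    print("missing:", " ".join(missing))
```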