_Charset

8 Apr 2003


Apparently there are a lot of people who don't feel that there are
enough character encodings available. I've run through the IANA list
of character encoding aliases and come up with a list of encodings that
Pike does not support. If you're looking for a bite-sized project to
spend some minutes on, then this list is a todo for you ;) I've listed
them approximately in order of how accessible the standard is and how
interesting it would be to include in Pike. Let's start with the RFCs.
We do want to support the RFCs.
RFC1345:
us-dk
dk-us
MNEMONIC (mnemonic+ascii+38)
MNEM (mnemonic+ascii+8200)
RFC1428:
UNKNOWN-8BIT
RFC1456:
VISCII
VIQR
RFC1556:
ISO_8859-6-E
ISO_8859-6-I
ISO_8859-8-E
ISO_8859-8-I
RFC1641:
UNICODE-1-1
RFC1642:
UNICODE-1-1-UTF-7
RFC1815:
ISO-10646-Unicode-Latin1
ISO-10646-J-1
RFC1842,1843:
HZ-GB-2312
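To give a feeling for how small some of these are, here is a minimal sketch of an HZ-GB-2312 decoder following the escape rules in RFC 1843. It is written in Python rather than Pike, the name `hz_decode` is made up for this example, and it assumes that a GB 2312 table is already available (here Python's built-in `gb2312` codec).

```python
def hz_decode(data: bytes) -> str:
    """Decode HZ-GB-2312 bytes (RFC 1843). Illustrative sketch only."""
    out = []
    gb_mode = False          # False: ASCII mode, True: GB 2312 mode
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        nxt = data[i + 1] if i + 1 < n else None
        if b == 0x7E:                        # '~' introduces an escape
            if gb_mode and nxt == 0x7D:      # "~}" returns to ASCII mode
                gb_mode = False
                i += 2
                continue
            if not gb_mode:
                if nxt == 0x7E:              # "~~" is a literal tilde
                    out.append("~")
                    i += 2
                    continue
                if nxt == 0x7B:              # "~{" enters GB mode
                    gb_mode = True
                    i += 2
                    continue
                if nxt == 0x0A:              # "~\n" is a line continuation
                    i += 2
                    continue
            raise ValueError(f"bad escape at offset {i}")
        if gb_mode:
            if nxt is None:
                raise ValueError("truncated two-byte character")
            # Two 7-bit bytes; setting the high bits gives the EUC-CN form.
            out.append(bytes((b | 0x80, nxt | 0x80)).decode("gb2312"))
            i += 2
        else:
            out.append(bytes((b,)).decode("ascii"))
            i += 1
    return "".join(out)
```

All the real work is in the table lookup, so a charset like this is mostly a thin state machine on top of an encoding that is already supported.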
Then the nice people at Unicode never stop impressing each other with
new and wider encodings of new and wider characters. We should
probably try to support these as well.
Unicode TR19 (http://www.unicode.org/unicode/reports/tr19)
UTF-32
UTF-32BE
UTF-32LE
Unicode TR26 (http://www.unicode.org/unicode/reports/tr26)
CESU-8
Unicode TN6 (http://www.unicode.org/notes/tn6/)
BOCU-1
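Of the Unicode entries, CESU-8 is probably the quickest to do. As TR26 describes it, characters in the Basic Multilingual Plane are encoded exactly as in UTF-8, while supplementary characters are first split into a UTF-16 surrogate pair and each surrogate is then written as a three-byte sequence. A minimal Python sketch of the encoding direction, with the made-up name `cesu8_encode`, could look like this:

```python
def cesu8_encode(text: str) -> bytes:
    """Encode a string as CESU-8 (Unicode TR26). Illustrative sketch only."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP characters are encoded the same way as in UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Split into a UTF-16 surrogate pair, then write each
            # surrogate as its own three-byte sequence.
            cp -= 0x10000
            high = 0xD800 | (cp >> 10)
            low = 0xDC00 | (cp & 0x3FF)
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)
```

For U+10400 this gives the six bytes ED A0 81 ED B0 80, where UTF-8 would use the four bytes F0 90 90 80.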
IANA is the one that registers "legal" charsets to be used in XML
documents, emails and other stuff people use, so we should make an
attempt to support these as well. When reading these granted
applications you really would like to write an RFC that forces people
to write better applications. (Oh, wait. Someone did (RFC 2278), but
didn't bother to update the old information.)
http://www.iana.org/assignments/charset-reg/IBM00858
http://www.iana.org/assignments/charset-reg/IBM00924
http://www.iana.org/assignments/charset-reg/IBM01140
http://www.iana.org/assignments/charset-reg/IBM01141
http://www.iana.org/assignments/charset-reg/IBM01142
http://www.iana.org/assignments/charset-reg/IBM01143
http://www.iana.org/assignments/charset-reg/IBM01144
http://www.iana.org/assignments/charset-reg/IBM01145
http://www.iana.org/assignments/charset-reg/IBM01146
http://www.iana.org/assignments/charset-reg/IBM01147
http://www.iana.org/assignments/charset-reg/IBM01148
http://www.iana.org/assignments/charset-reg/IBM01149
http://www.iana.org/assignments/charset-reg/Big5-HKSCS
http://www.iana.org/assignments/charset-reg/PTCP154
http://www.iana.org/assignments/charset-reg/SCSU
http://www.iana.org/assignments/charset-reg/GBK
http://www.iana.org/assignments/charset-reg/GB18030
Charsets thought up at ISO are also legal in IANA's eyes. I have no
reference for these, but I guess that you either pay ISO money or
google.
ISO-8859-16
ISO-10646-UCS-Basic
  "ASCII subset of Unicode.  Basic Latin = collection 1
   See ISO 10646, Appendix A"
Now we are in the land of really bizarre, but granted, character
encodings.
HP PCL 5 Comparison Guide (P/N 5021-0329):
IBM775 (pp B-13, 1996)
ISO-8859-1-Windows-3.0-Latin-1 (PCL id 9U)
ISO-8859-1-Windows-3.1-Latin-1 (PCL id 19U)
ISO-8859-2-Windows-Latin-2 (PCL id 9E)
ISO-8859-9-Windows-Latin-5 (PCL id 5T)
Ventura-US (PCL id 14J)
Ventura-International (PCL id 13J)
PC8-Danish-Norwegian (PCL id 11U)
PC8-Turkish (PCL id 9T)
HP-Legal (PCL id 1U)
HP-Pi-font (PCL id 15U)
HP-Math8 (PCL id 8M)
HP-DeskTop (PCL id 7J)
Ventura-Math (PCL id 6M)
Microsoft-Publishing (PCL id 6J)
PostScript Language Reference by Adobe Systems Incorporated,
Addison-Wesley, 1990:
Adobe-Standard-Encoding (PCL id 10J)
Adobe-Symbol-Encoding (PCL id 5M)
ABOUT TYPE: IBM's Technical Reference for Core Interchange Digitized Type,
publication number S544-3708-01:
IBM-Symbols (CPGID 259)
IBM-Thai (CPGID 838)
Thai Industrial Standards Institute (TISI):
TIS-620
These encodings have no standard reference and are only described in
the IANA list of character encodings.
ISO-10646-UTF-1
  "Universal Transfer Format (1), this is the multibyte encoding, that
   subsets ASCII-7. It does not have byte ordering issues."
ISO-10646-UCS-2
  "the 2-octet Basic Multilingual Plane, aka Unicode this needs to
   specify network byte order: the standard does not specify (it is a
   16-bit integer space)"
ISO-10646-UCS-4
  "the full code space. (same comment about byte order, these are
   31-bit numbers."
JIS_Encoding
  "JIS X 0202-1991. Uses ISO 2022 escape sequences to shift code sets
   as documented in JIS X 0202-1991."
Extended_UNIX_Code_Fixed_Width_for_Japanese
  "Used in Japan.  Each character is 2 octets.
     code set 0: US-ASCII (a single 7-bit byte set)
                   1st byte = 00
                   2nd byte = 20-7E
     code set 1: JIS X0208-1990 (a double 7-bit byte set)
                 restricted to A0-FF in both bytes
     code set 2: Half Width Katakana (a single 7-bit byte set)
                   1st byte = 00
                   2nd byte = A0-FF
     code set 3: JIS X0212-1990 (a double 7-bit byte set)
                 restricted to A0-FF in the first byte
                 and 21-7E in the second byte"
ISO-Unicode-IBM-1261
  "IBM Latin-2, -3, -5, Extended Presentation Set, GCSGID: 1261"
ISO-Unicode-IBM-1268
  "IBM Latin-4 Extended Presentation Set, GCSGID: 1268"
ISO-Unicode-IBM-1276
  "IBM Cyrillic Greek Extended Presentation Set, GCSGID: 1276"
ISO-Unicode-IBM-1264
  "IBM Arabic Presentation Set, GCSGID: 1264"
ISO-Unicode-IBM-1265
  "IBM Hebrew Presentation Set, GCSGID: 1265"
Windows-31J
  "Windows Japanese. A further extension of Shift_JIS to include NEC
   special characters (Row 13), NEC selection of IBM extensions (Rows
   89 to 92), and IBM extensions (Rows 115 to 119). The CCS's are JIS
   X0201:1997, JIS X0208:1997, and these extensions. This charset can
   be used for the top-level media type "text", but it is of limited
   or specialized use (see RFC2278). PCL Symbol Set id: 19K"
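Even without a standard reference, some of these registry descriptions are concrete enough to code against. As a sketch, the byte ranges quoted above for Extended_UNIX_Code_Fixed_Width_for_Japanese can be turned directly into a classifier for a two-octet unit. This is Python, the function name is made up, and the ranges are taken from the quoted IANA text only, not checked against any other specification.

```python
def euc_fixed_code_set(first: int, second: int) -> int:
    """Return the code set (0-3) for one two-octet unit.

    Ranges follow the quoted IANA description; sketch only.
    """
    if first == 0x00:
        if 0x20 <= second <= 0x7E:
            return 0          # US-ASCII
        if 0xA0 <= second <= 0xFF:
            return 2          # Half Width Katakana
    elif 0xA0 <= first <= 0xFF:
        if 0xA0 <= second <= 0xFF:
            return 1          # JIS X0208-1990
        if 0x21 <= second <= 0x7E:
            return 3          # JIS X0212-1990
    raise ValueError(f"not a valid unit: {first:02X} {second:02X}")
```

A real decoder would then strip or keep the high bits as appropriate and look the unit up in the matching JIS table.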
The charset IBM1047 (EBCDIC Latin 1/Open Systems) has a standard
reference to a PDF on an IBM server, but that URL now returns a 404.
Finally, for the interested: of the 802 registered IANA encodings and
aliases, Pike supports 626, assuming my Hilfe commands were bug-free.
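The count came from a few lines typed into Hilfe, which are not reproduced here. The same kind of check can be sketched in Python: pull every `Name:` and `Alias:` value out of the registry text and try to look each one up. The sample text is a tiny hand-made excerpt in the registry's layout, the helper names are made up, and `codecs.lookup` tests what Python supports, not Pike, so the numbers would differ.

```python
import codecs

# Tiny hand-made excerpt in the layout of the IANA registry text.
SAMPLE = """\
Name: UTF-8
Alias: None

Name: UTF-16
Alias: None

Name: HP-Pi-font
Alias: csHPPiMath
"""

def registry_names(text):
    """Collect every Name: and Alias: value, skipping 'Alias: None'."""
    names = []
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() in ("Name", "Alias"):
            fields = value.split()   # drops any trailing reference fields
            if fields and fields[0] != "None":
                names.append(fields[0])
    return names

def is_supported(name):
    """True if this Python installation has a codec for the name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False

def missing(text):
    """Return the registered names that have no codec."""
    return [n for n in registry_names(text) if not is_supported(n)]
```

Running `missing()` over the full registry text gives the todo list, and the difference in length between `registry_names()` and `missing()` gives the supported count.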
