Thursday, October 18, 2018

Shapefiles: Specification of character encoding

CPG files are part of the multi-file shapefile (sic!) format identifying the encoding used by the DBF. The issue with this file is that it is incredibly hard to find any document on how the encoding is actually identified.At the very best you find that it contains a single string and that you use UTF-8 to identify utf-8 encoded Unicode. That doesn't help a lot if you use codepage 1252. It took my quite some effort to find a blog entry that has a translation of what originally was found on a Belarus website (which seems to have crossed the river Styx).

Encoding ID Encoding name Additional ID Other names
1252 Western iso-8859-1 (*) iso8859-1, iso_8859-1, iso-8859-1, ANSI_X3.4-1968, iso-ir-6, ANSI_X3.4-1986, ISO_646, irv:1991, ISO646-US, us, IBM367, cp367, csASCII, latin1, iso_8859-1:1987, iso-ir-100, ibm819, cp819, Windows-1252
20105 ASCII us-ascii us-acii, ascii
28592 Central European (ISO) iso-8859-2 iso8859-2, iso-8859-2, iso_8859-2, latin2, iso_8859-2:1987, iso-ir-101, l2, csISOLatin2
1250 Central European (Windows) Windows-1250 Windows-1250, x-cp1250
1251 Cyrillic (Windows) Windows-1251 Windows-1251, x-cp1251
1253 Greek (Windows) Windows-1253 Windows-1253
1254 Turkish (Windows) Windows-1254 Windows-1254
932 Japanese (Shift-JIS) shift_jis shift_jis, x-sjis, ms_Kanji, csShiftJIS, x-ms-cp932
51932 Japanese (EUC) x-euc-jp Extended_UNIX_Code_Packed_Format_for_Japanese, csEUCPkdFmtJapanese, x-euc-jp, x-euc
50220 Japanese (JIS) iso-2022-jp csISO2022JP, iso-2022-jp
1257 Baltic (Windows) Windows-1257 windows-1257
950 Traditional Chinese (BIG5) big5 big5, csbig5, x-x-big5
936 Simplified Chinese (GB2312) gb2312 GB_2312-80, iso-ir-58, chinese, csISO58GB231280, csGB2312, gb2312
20866 Cyrillic (KOI8-R) koi8-r csKOI8R, koi8-r
949 Korean (KSC5601) ks_c_5601 ks_c_5601, ks_c_5601-1987, korean, csKSC56011987
1255 (logical) Hebrew (ISO-logical) Windows-1255 iso-8859-8i
1255 (visual) Hebrew (ISO-Visual) iso-8859-8 ISO-8859-8 Visual, ISO-8859-8 , ISO_8859-8, visual
862 Hebrew (DOS) dos-862 dos-862
1256 Arabic (Windows) Windows-1256 Windows-1256
720 Arabic (DOS) dos-720 dos-720
874 Thai Windows-874 Windows-874
1258 Vietnamese Windows-1258 Windows-1258
65001 Unicode UTF-8 UTF-8 UTF-8, unicode-1-1-utf-8, unicode-2-0-utf-8
65000 Unicode UTF-7 UNICODE-1-1-UTF-7 utf-7, UNICODE-1-1-UTF-7, csUnicode11UTF7, utf-7
50225 Korean (ISO) ISO-2022-KR ISO-2022-KR, csISO2022KR
52936 Simplified Chinese (HZ) HZ-GB-2312 HZ-GB-2312
28594 Baltic (ISO) iso-8869-4 ISO_8859-4:1988, iso-ir-110, ISO_8859-4, ISO-8859-4, latin4, l4, csISOLatin4
28585 Cyrillic (ISO) iso_8859-5 ISO_8859-5:1988, iso-ir-144, ISO_8859-5, ISO-8859-5, cyrillic, csISOLatinCyrillic, csISOLatin5
28597 Greek (ISO) iso-8859-7 ISO_8859-7:1987, iso-ir-126, ISO_8859-7, ISO-8859-7, ELOT_928, ECMA-118, greek, greek8, csISOLatinGreek
28599 Turkish (ISO) iso-8859-9 ISO_8859-9:1989, iso-ir-148, ISO_8859-9, ISO-8859-9, latin5, l5, csISOLatin5

(*) except when 128-159 is used, use Windows-1252

No comments:

Post a Comment