Fridge Dweller: 2018

Thursday, October 18, 2018

Shapefiles: Long attribute names

Attribute names used in shapefiles are limited to a ridiculous 10 (in words: ten) characters, and there is no official means to map these shortened names to longer, more telling ones. YAML to the rescue.

In order to provide a mapping for our customers we ship an additional, dead-simple YAML file that makes it possible to at least look up what the original, unabridged name was. The format is as follows:
shortName0: "A_rather_lengthy_attribute_name"
shortName1: "Another_rather_lengthy_attribute_name"
shortName2: "An_attribute_name_no_sane_person_would_come_up_with"

So essentially it is a table, but since it is valid YAML there is no reason why it should not become a common sight and be supported by GIS software. Here's an example of what the generated shapefiles may look like:
linestring.cpg        point.cpg        polygon.cpg
linestring.dbf        point.dbf        polygon.dbf
linestring.prj        point.prj        polygon.prj
linestring.shp        point.shp        polygon.shp
linestring.shx        point.shx        polygon.shx
linestring.yml        point.yml        polygon.yml

Mandatory files are:

.shp — shape format; the feature geometry itself
.shx — shape index format; a positional index of the feature geometry to allow seeking forwards and backwards quickly
.dbf — attribute format; columnar attributes for each shape, in dBase IV format

Additional standard files:

.prj — projection format; the coordinate system and projection information, a plain text file describing the projection using well-known text format
.cpg — used to specify the code page (only for .dbf) for identifying the character encoding to be used

Additional non-standard file:

.yml — attribute name map (for .dbf); maps each shortened 10-character attribute name to a long name that can have an (in principle) arbitrary number of characters. The file is in YAML format as described above.
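
Until GIS software reads the .yml file itself, the mapping can be consulted from the shell. What follows is only a minimal sketch: it assumes the flat one-entry-per-line layout shown above, it is not a full YAML parser, and the function name long_attr_name is made up for this illustration.

```shell
# Hypothetical helper: print the long attribute name stored for a short one.
# Assumes every entry sits on one line in the form  shortName: "long name".
# Returns non-zero if the short name is not found.
long_attr_name () {
  local map_file=$1 short_name=$2
  awk -v key="$short_name" '
    {
      sep = index($0, ":")              # position of the first colon
      if (sep == 0) next                # not a key/value line
      k = substr($0, 1, sep - 1)
      gsub(/^[ \t]+|[ \t]+$/, "", k)    # trim whitespace around the key
      if (k != key) next
      v = substr($0, sep + 1)
      sub(/^[ \t]*"?/, "", v)           # drop leading blanks and opening quote
      sub(/"?[ \t\r]*$/, "", v)         # drop closing quote and trailing blanks
      print v
      found = 1
      exit
    }
    END { exit !found }
  ' "$map_file"
}
```

With the example file from above, long_attr_name polygon.yml shortName2 prints An_attribute_name_no_sane_person_would_come_up_with.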

Shapefiles: Specification of character encoding

CPG files are part of the multi-file shapefile (sic!) format and identify the encoding used by the DBF. The issue with this file is that it is incredibly hard to find any documentation on how the encoding is actually identified. At the very best you find that it contains a single string and that you use UTF-8 to identify UTF-8 encoded Unicode. That doesn't help a lot if you use code page 1252. It took me quite some effort to find a blog entry with a translation of what was originally found on a Belarusian website (which seems to have crossed the river Styx).

Encoding ID	Encoding name	Additional ID	Other names
1252	Western	iso-8859-1 (*)	iso8859-1, iso_8859-1, iso-8859-1, ANSI_X3.4-1968, iso-ir-6, ANSI_X3.4-1986, ISO_646, irv:1991, ISO646-US, us, IBM367, cp367, csASCII, latin1, iso_8859-1:1987, iso-ir-100, ibm819, cp819, Windows-1252
20105	ASCII	us-ascii	us-ascii, ascii
28592	Central European (ISO)	iso-8859-2	iso8859-2, iso-8859-2, iso_8859-2, latin2, iso_8859-2:1987, iso-ir-101, l2, csISOLatin2
1250	Central European (Windows)	Windows-1250	Windows-1250, x-cp1250
1251	Cyrillic (Windows)	Windows-1251	Windows-1251, x-cp1251
1253	Greek (Windows)	Windows-1253	Windows-1253
1254	Turkish (Windows)	Windows-1254	Windows-1254
932	Japanese (Shift-JIS)	shift_jis	shift_jis, x-sjis, ms_Kanji, csShiftJIS, x-ms-cp932
51932	Japanese (EUC)	x-euc-jp	Extended_UNIX_Code_Packed_Format_for_Japanese, csEUCPkdFmtJapanese, x-euc-jp, x-euc
50220	Japanese (JIS)	iso-2022-jp	csISO2022JP, iso-2022-jp
1257	Baltic (Windows)	Windows-1257	windows-1257
950	Traditional Chinese (BIG5)	big5	big5, csbig5, x-x-big5
936	Simplified Chinese (GB2312)	gb2312	GB_2312-80, iso-ir-58, chinese, csISO58GB231280, csGB2312, gb2312
20866	Cyrillic (KOI8-R)	koi8-r	csKOI8R, koi8-r
949	Korean (KSC5601)	ks_c_5601	ks_c_5601, ks_c_5601-1987, korean, csKSC56011987
1255 (logical)	Hebrew (ISO-logical)	Windows-1255	iso-8859-8i
1255 (visual)	Hebrew (ISO-Visual)	iso-8859-8	ISO-8859-8 Visual, ISO-8859-8 , ISO_8859-8, visual
862	Hebrew (DOS)	dos-862	dos-862
1256	Arabic (Windows)	Windows-1256	Windows-1256
720	Arabic (DOS)	dos-720	dos-720
874	Thai	Windows-874	Windows-874
1258	Vietnamese	Windows-1258	Windows-1258
65001	Unicode UTF-8	UTF-8	UTF-8, unicode-1-1-utf-8, unicode-2-0-utf-8
65000	Unicode UTF-7	UNICODE-1-1-UTF-7	utf-7, UNICODE-1-1-UTF-7, csUnicode11UTF7, utf-7
50225	Korean (ISO)	ISO-2022-KR	ISO-2022-KR, csISO2022KR
52936	Simplified Chinese (HZ)	HZ-GB-2312	HZ-GB-2312
28594	Baltic (ISO)	iso-8859-4	ISO_8859-4:1988, iso-ir-110, ISO_8859-4, ISO-8859-4, latin4, l4, csISOLatin4
28595	Cyrillic (ISO)	iso_8859-5	ISO_8859-5:1988, iso-ir-144, ISO_8859-5, ISO-8859-5, cyrillic, csISOLatinCyrillic, csISOLatin5
28597	Greek (ISO)	iso-8859-7	ISO_8859-7:1987, iso-ir-126, ISO_8859-7, ISO-8859-7, ELOT_928, ECMA-118, greek, greek8, csISOLatinGreek
28599	Turkish (ISO)	iso-8859-9	ISO_8859-9:1989, iso-ir-148, ISO_8859-9, ISO-8859-9, latin5, l5, csISOLatin5

(*) If byte values 128-159 are used, use Windows-1252 instead.
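
Since the .cpg file contains just a single string, reading it back from the shell is simple. The following is a rough sketch with a made-up function name (cpg_encoding); it only prints what is stored in the file and makes no claim about which of the identifiers from the table a particular GIS package accepts.

```shell
# Hypothetical helper: print the encoding identifier stored in a .cpg file.
# Only the first line is used; blanks and line endings (including a Windows
# carriage return) are stripped.
cpg_encoding () {
  local cpg_file=$1
  [ -r "$cpg_file" ] || return 1
  head -n 1 "$cpg_file" | tr -d ' \t\r\n'
  echo
}
```

Writing such a file works the other way round, for example printf '%s' 'UTF-8' > polygon.cpg for a UTF-8 encoded polygon.dbf.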

Monday, September 24, 2018

Debian under Windows 10: Mount/Umount

Debian GNU/Linux is available for Windows users through the Windows store as an app for the Windows Subsystem for Linux (WSL). That works quite nicely, but USB drives can be a bit annoying to use: there is no automatic mounting.

To facilitate mounting/unmounting these drives I wrote two Bash functions that take a drive letter (it does not matter if you use uppercase or lowercase) as an argument.

wmount mounts the drive (and generates a mount point if necessary).
wumount unmounts the drive (and keeps the mount point for later use).

Maybe you have some use for these functions:

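wmount () {
  # Expect exactly one argument: a single drive letter.
  if [[ $# -ne 1 || ! ($1 =~ ^[a-zA-Z]$) ]]; then
    echo "Usage: ${FUNCNAME[0]} drive_letter" >&2
    return 1
  fi

  # drvfs wants the Windows drive name (e.g. E:); the mount point is lowercase.
  local drive_letter mount_dir
  drive_letter=$(echo "$1" | tr '[:lower:]' '[:upper:]')
  mount_dir="/mnt/$(echo "$1" | tr '[:upper:]' '[:lower:]')"

  # Create the mount point on first use.
  if [[ ! -d $mount_dir ]]; then
    sudo mkdir "$mount_dir"
  fi

  sudo mount -t drvfs "$drive_letter:" "$mount_dir"
}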

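wumount () {
  # Expect exactly one argument: a single drive letter.
  if [[ $# -ne 1 || ! ($1 =~ ^[a-zA-Z]$) ]]; then
    echo "Usage: ${FUNCNAME[0]} drive_letter" >&2
    return 1
  fi

  local mount_dir
  mount_dir="/mnt/$(echo "$1" | tr '[:upper:]' '[:lower:]')"

  # Only unmount if the mount point exists and is listed in /proc/mounts.
  if [[ -d $mount_dir ]] && grep -qs " $mount_dir " /proc/mounts; then
    sudo umount "$mount_dir"
  fi
}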
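
A companion check can tell whether a drive is currently mounted before you pull the USB stick. This is only a sketch: the name wmounted and the optional second argument (an alternative mounts table, handy for trying the function without a real drive) are assumptions of this example and not part of the two functions above.

```shell
# Hypothetical helper: succeed if /mnt/<drive letter> is listed as a mount point.
# $1: drive letter (upper or lower case)
# $2: mounts table to inspect (optional, defaults to /proc/mounts)
wmounted () {
  local letter mounts
  letter=$(echo "$1" | tr '[:upper:]' '[:lower:]')
  mounts=${2:-/proc/mounts}

  # Accept exactly one letter; anything else is a usage error.
  case $letter in
    [a-z]) ;;
    *) echo "Usage: wmounted drive_letter" >&2; return 2 ;;
  esac

  # The mount point is the second, space-delimited field of each line.
  grep -qs " /mnt/$letter " "$mounts"
}
```

For example, wmounted e && wumount e unmounts the drive only if it is actually mounted.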
