Dynamic Web Development with Seaside

17.6.2 Encodings

An encoding is a mapping between a character (or its code point) and a sequence of bytes, and vice versa.

Simple Mappings. The mapping can be one-to-one between a character and the byte that represents it. If and only if your character set has 256 or fewer entries, you can map each character directly, by its index, to a single byte. This is the case for ASCII (128 characters) and ISO-8859-1 (256 characters).

In the latest version of Pharo, the Character class represents a character by storing its Unicode code point. Since Unicode is a superset of Latin-1 (ISO-8859-1), you can create Latin-1 strings by specifying the code points directly. When a String is composed only of ASCII or Latin-1 characters, it is stored as a ByteString (a collection of bytes, each one representing a character).

(String with: (Character value: 65) with: (Character value: 66)). 
"-> 'AB'"
'AB' class.
"-> ByteString"
(String with: (Character value: 16r5B) with: (Character value: 16r5D)).
"-> '[]'"
(String with: (Character value: 16rA9)).
"-> the copyright character ©"
Character value: 16rFC.
"-> the u-umlaut character ü"

The characters Character value: 16r5B ([) and Character value: 65 (A) are available in both ASCII and ISO-8859-1. In contrast, Character value: 16rA9 displays ©, the copyright sign, which is available in ISO-8859-1 but not in ASCII. Similarly, Character value: 16rFC displays ü.
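As a rough illustration of the one-to-one mapping, the following sketch converts a Latin-1 string to its bytes and back. It assumes the standard Pharo conversion messages asByteArray and asString. The last expression uses the euro sign (U+20AC) as an arbitrary example of a code point above 255, which no longer fits in a single byte.

```smalltalk
"Each Latin-1 character maps to the byte holding its code point."
(String with: (Character value: 65) with: (Character value: 16rFC)) asByteArray.
"-> #[65 252]"
"The reverse mapping turns each byte back into a character."
#[65 66] asString.
"-> 'AB'"
"A code point above 255 cannot be stored in a ByteString."
(String with: (Character value: 16r20AC)) class.
"-> WideString"
```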

Other Mappings. As we already mentioned, Unicode is a large superset of Latin-1 with more than one hundred thousand characters, so a Unicode character cannot simply be encoded in a single byte. There are several character encodings for Unicode: the Unicode Transformation Format (UTF) encodings and the Universal Character Set (UCS) encodings.

The number in the encoding's name indicates the number of bits in one code unit (for UTF encodings) or the number of bytes per code point (for UCS encodings). UTF-8 and UTF-16 are probably the most commonly used encodings. UCS-2 is an obsolete subset of UTF-16; UCS-4 and UTF-32 are functionally equivalent.

  • UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. It can represent any character of the Unicode character set and is backwards compatible with ASCII: it uses 1 byte for every ASCII character, with the same value as in the standard ASCII encoding, and up to 4 bytes for other characters. For example, the dollar sign ($) is Unicode U+0024 and is encoded as the single byte 16r24.
  • UCS-2, which is now obsolete, used 2 bytes for every character, but it could not encode the whole Unicode standard.
  • UTF-16 extends UCS-2 to encode the characters missing from UCS-2. It is a variable-size encoding that uses two bytes in most cases and four bytes otherwise. There are two variants, little endian and big endian: 16rFC 16r00 (little endian) and 16r00 16rFC (big endian) are variant representations of the same encoded character, ü.
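The variable length of UTF-8 can be observed with a small sketch. It assumes a Pharo image in which the Zinc extensions String>>utf8Encoded and ByteArray>>utf8Decoded are loaded, as is usual in recent releases. The euro sign (U+20AC) is chosen here only as an example of a character outside Latin-1.

```smalltalk
"ASCII characters keep their one-byte value."
'$' utf8Encoded.
"-> #[36]"
"The u-umlaut (U+00FC) is outside ASCII and takes two bytes."
(String with: (Character value: 16rFC)) utf8Encoded.
"-> #[195 188]"
"The euro sign (U+20AC) takes three bytes."
(String with: (Character value: 16r20AC)) utf8Encoded.
"-> #[226 130 172]"
"Decoding reverses the mapping."
#[195 188] utf8Decoded.
"-> the u-umlaut character ü"
```

Note that the two-byte UTF-8 form of ü differs from its single Latin-1 byte (16rFC), which is why a reader must know which encoding was used to produce the bytes.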

If you want to know more about character sets and character encodings, we suggest you read The Unicode Standard, which at the time of writing describes version 5.0.

Copyright © 19 March 2024 Stéphane Ducasse, Lukas Renggli, C. David Shaffer, Rick Zaccone
This book is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 license.

This book is published using Seaside, Magritte and the Pier book publishing engine.