Dynamic Web Development with Seaside

17.6.1 Character sets

A character set is really just that: a set of characters, the characters of your alphabet. For practical reasons each character is identified by a code point; for example, $A is identified by the code point 65.
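The mapping between a character and its code point can be tried in a Pharo playground. This is a minimal sketch, assuming the standard Character protocol (asInteger and value:); the comments show the results expected when each line is printed.

```smalltalk
"Ask a character for its code point."
$A asInteger.                    "65"

"Build a character back from a code point."
Character value: 65.             "$A"

"Code points are plain integers, so they can be compared."
$A asInteger < $a asInteger.     "true, since 65 < 97"
```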

Examples of character sets are ASCII, ISO-8859-1, Unicode or UCS (Universal Character Set).

  • ASCII (American Standard Code for Information Interchange) contains 128 characters. Its layout was designed under several constraints, one of which makes it easy to go from a lowercase character to its uppercase equivalent. You can get the list of characters at http://en.wikipedia.org/wiki/Ascii. ASCII was designed with the idea that other countries could plug in their own specific characters, but this did not work out in practice. ASCII was later extended into Extended ASCII, which offers 256 characters.
  • ISO-8859-1 (ISO/IEC 8859-1) is a superset of ASCII, to which it adds 128 new characters. Also called Latin-1 or latin1, it is the standard character set for the Latin alphabet and is well suited to Western Europe, the Americas, and parts of Africa. Since ISO-8859-1 lacks certain characters such as the Euro sign, it was revised as ISO-8859-15. However, ISO-8859-1 remained the default encoding of documents delivered via HTTP with a MIME type beginning with "text/" under HTTP/1.1 as specified in RFC 2616; the later RFC 7231 removed that default. The table at http://www.utoronto.ca/webdocs/HTMLdocs/NewHTML/iso_table.html shows ISO-8859-1 in particular.
  • Unicode is a superset of Latin-1. To accelerate the early adoption of Unicode, the first 256 code points are identical to ISO-8859-1. A character is not described by its glyph but identified by its code point, which is usually written as "U+" followed by its hexadecimal value. Note that Unicode also specifies rules for normalization, collation, bi-directional display order, and much more.
  • UCS, the ‘Universal Character Set’ specified by the ISO/IEC 10646 International Standard, contains more than a hundred thousand characters. Each character is unambiguously identified by a name and an integer, also called its code point.
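A few of the properties listed above can be checked directly on code points. The following is a small sketch for a Pharo playground, assuming the standard Character and Integer protocol (asInteger, value:, asUppercase, printStringBase:); the characters chosen (a, é, the Euro sign) are only illustrations.

```smalltalk
"ASCII: a lowercase letter sits 32 code points after its uppercase form."
$a asInteger - $A asInteger.                            "32"
Character value: $a asInteger - 32.                     "$A"
$a asUppercase.                                         "$A"

"Code point 233 (é in ISO-8859-1) printed in hexadecimal,
 the form used in the Unicode notation U+00E9."
(Character value: 233) asInteger printStringBase: 16.   "'E9'"

"The Euro sign, U+20AC, has a code point outside the 0..255 range of ISO-8859-1."
(Character value: 16r20AC) asInteger > 255.             "true"
```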

http://www.fileformat.info/info/charset/index.htm shows several character sets.

The Pharo String Hierarchy

Now let us see how these concepts exist in Pharo. The String, ByteString, WideString class hierarchy is roughly equivalent to the Integer, SmallInteger, LargeInteger hierarchy. The class Integer is the abstract superclass of SmallInteger, which represents numbers in the range -1073741824 to 1073741823 (in a 32-bit image), and LargeInteger, which represents all the other integers. Likewise, the class String is the abstract superclass of the classes ByteString (ISO-8859-1) and WideString (Unicode minus ISO-8859-1). These classes are about character sets, not encodings.
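The parallel between the two hierarchies can be observed in a playground. This is a minimal sketch, assuming a current Pharo image in which String with: picks the concrete class from the character's code point; the sample string and the Euro sign are arbitrary choices.

```smalltalk
"A literal whose characters all have code points below 256 is a ByteString."
'Seaside' class.                                        "ByteString"
'Seaside' isByteString.                                 "true"

"A string holding a character beyond ISO-8859-1, here the Euro sign U+20AC,
 is a WideString."
(String with: (Character value: 16r20AC)) class.        "WideString"
(String with: (Character value: 16r20AC)) isWideString. "true"

"Both concrete classes inherit from String."
ByteString superclass.                                  "String"
WideString superclass.                                  "String"

"The integer analogy: going past the largest SmallInteger
 yields an instance of another Integer subclass."
(SmallInteger maxVal + 1) class == SmallInteger.        "false"
```

In both cases the concrete class is chosen by the system, so client code normally deals only with String and Integer.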

Copyright © 19 March 2024 Stéphane Ducasse, Lukas Renggli, C. David Shaffer, Rick Zaccone
This book is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 license.

This book is published using Seaside, Magritte and the Pier book publishing engine.