Latin-1 covers most Western European languages such as Albanian, Catalan, Danish, Dutch, English, Faroese, Finnish, French, German, Galician, Irish, Icelandic, Italian, Norwegian, Portuguese, Spanish, and Swedish. The lack of the ligatures Dutch ij, French oe and old-style „German“ quotation marks is considered tolerable.
Latin-2 supports most Latin-written Slavic and Central European languages: Croatian, Czech, German, Hungarian, Polish, Romanian, Slovak, and Slovene.
Latin-3 is popular with authors of Esperanto, Galician, and Maltese. (Turkish is now written with 8859-9 instead.)
Latin-4 introduced letters for Estonian, Latvian, and Lithuanian. It is essentially obsolete; see 8859-10 (Latin-6) and 8859-13 (Latin-7).
Cyrillic letters supporting Bulgarian, Byelorussian, Macedonian, Russian, Serbian and Ukrainian. Ukrainians read the letter ghe with downstroke as heh and would need a ghe with upstroke to write a correct ghe. See the discussion of KOI8-R below.
Supports Arabic. The 8859-6 glyph table is a fixed font of separate letter forms, but a proper display engine should combine these using the proper initial, medial, and final forms.
Supports Modern Greek.
Supports modern Hebrew without niqud (vowel-pointing signs). Niqud and full-fledged Biblical Hebrew are outside the scope of this character set; under Linux, UTF-8 is the preferred encoding for these.
This is a variant of Latin-1 that replaces Icelandic letters with Turkish ones.
Latin-6 adds the last Inuit (Greenlandic) and Sami (Lappish) letters that were missing in Latin-4 to cover the entire Nordic area. RFC 1345 listed a preliminary and different latin6. Skolt Sami still needs a few more accents than these.
This only exists as a rejected draft standard. The draft standard was identical to TIS-620, which is used under Linux for Thai.
This set does not exist. While Vietnamese has been suggested for this space, it does not fit within the 96 (non-combining) characters ISO 8859 offers. UTF-8 is the preferred character set for Vietnamese use under Linux.
Supports the Baltic Rim languages; in particular, it includes Latvian characters not found in Latin-4.
This is the Celtic character set, covering Gaelic and Welsh. This charset also contains the dotted characters needed for Old Irish.
This adds the Euro sign and French and Finnish letters that were missing in Latin-1.
This set covers many of the languages covered by 8859-2, and supports Romanian more completely than that set does.
Linux represents Unicode using the 8-bit Unicode Transformation Format (UTF-8). UTF-8 is a variable length encoding of Unicode. It uses 1 byte to code 7 bits, 2 bytes for 11 bits, 3 bytes for 16 bits, 4 bytes for 21 bits, 5 bytes for 26 bits, 6 bytes for 31 bits.
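The byte counts above can be written out as a small lookup. The following is a minimal sketch in Python, not part of any Linux interface; the function name utf8_length is made up for illustration. It follows the 1-to-6-byte scheme described here, whereas current UTF-8 as defined in RFC 3629 stops at 4 bytes.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes the 1..6-byte UTF-8 scheme needs for a value.

    Hypothetical helper for illustration only. Each pair below is
    (payload bits available, number of bytes used).
    """
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    for bits, nbytes in ((7, 1), (11, 2), (16, 3), (21, 4), (26, 5), (31, 6)):
        if code_point < (1 << bits):
            return nbytes
    raise ValueError("value needs more than 31 bits")
```

For values that Python can encode, the result should agree with len(chr(code_point).encode("utf-8")).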
Let 0,1,x stand for a zero, one, or arbitrary bit. A byte 0xxxxxxx stands for the Unicode 00000000 0xxxxxxx which codes the same symbol as the ASCII 0xxxxxxx. Thus, ASCII goes unchanged into UTF-8, and people using only ASCII do not notice any change: not in code, and not in file size.
A byte 110xxxxx is the start of a 2-byte code, and 110xxxxx 10yyyyyy is assembled into 00000xxx xxyyyyyy. A byte 1110xxxx is the start of a 3-byte code, and 1110xxxx 10yyyyyy 10zzzzzz is assembled into xxxxyyyy yyzzzzzz. (When UTF-8 is used to code the 31-bit ISO 10646 then this progression continues up to 6-byte codes.)
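The bit assembly described above can be sketched as follows. This is an illustrative Python fragment, not the kernel's or any library's decoder; the name decode_one is an assumption, and the sketch handles only 1-, 2-, and 3-byte codes and does not reject overlong forms.

```python
def decode_one(data: bytes) -> int:
    """Assemble one code point from a 1-, 2-, or 3-byte UTF-8 code.

    Hypothetical helper for illustration; longer codes are not handled.
    """
    if not data:
        raise ValueError("empty input")
    b0 = data[0]
    if (b0 & 0x80) == 0x00:        # 0xxxxxxx: plain ASCII
        return b0
    if (b0 & 0xE0) == 0xC0:        # 110xxxxx 10yyyyyy
        tails, value = 1, b0 & 0x1F
    elif (b0 & 0xF0) == 0xE0:      # 1110xxxx 10yyyyyy 10zzzzzz
        tails, value = 2, b0 & 0x0F
    else:
        raise ValueError("tail byte or longer code; not handled in this sketch")
    if len(data) < 1 + tails:
        raise ValueError("truncated code")
    for b in data[1:1 + tails]:
        if (b & 0xC0) != 0x80:     # every following byte must be 10xxxxxx
            raise ValueError("expected a tail byte")
        value = (value << 6) | (b & 0x3F)   # append 6 payload bits
    return value
```

For example, decode_one(b"\xc3\xa9") gives 0xE9 (e with acute accent) and decode_one(b"\xe2\x82\xac") gives 0x20AC (the Euro sign).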
For most people who use ISO-8859 character sets, this means that the characters outside of ASCII are now coded with two bytes. This tends to expand ordinary text files by only one or two percent. For Russian or Greek users, this expands ordinary text files by 100%, since text in those languages is mostly outside of ASCII. For Japanese users this means that the 16-bit codes now in common use will take three bytes. While there are algorithmic conversions from some character sets (esp. ISO-8859-1) to Unicode, general conversion requires carrying around conversion tables, which can be quite large for 16-bit codes.
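The size effect is easy to measure. The sketch below uses Python's standard codecs; the function name expansion and the sample strings are made up for illustration, and real documents will of course vary.

```python
def expansion(text: str, legacy_codec: str) -> float:
    """Return the percentage growth when text moves from a legacy
    8-bit (or 16-bit) encoding to UTF-8. Hypothetical helper."""
    legacy_size = len(text.encode(legacy_codec))
    utf8_size = len(text.encode("utf-8"))
    return (utf8_size - legacy_size) / legacy_size * 100.0
```

With this helper, expansion("hello", "iso8859_1") is 0.0 because the text is pure ASCII, while a Greek word such as "καλημέρα" encoded with "iso8859_7" doubles in size, since every letter moves from one byte to two.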
Note that UTF-8 is self-synchronizing: 10xxxxxx is a tail, any other byte is the head of a code. Note also that the only way ASCII bytes occur in a UTF-8 stream is as themselves. In particular, there are no embedded NULs or /s that form part of some larger code.
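A reader that lands at an arbitrary byte offset can therefore recover by skipping tail bytes. The following Python sketch illustrates the idea; the names is_tail and next_head are hypothetical and not taken from any library.

```python
def is_tail(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def next_head(data: bytes, offset: int) -> int:
    """Return the index of the first head byte at or after offset,
    or len(data) if only tail bytes remain. Hypothetical helper."""
    while offset < len(data) and is_tail(data[offset]):
        offset += 1
    return offset
```

Starting in the middle of the three-byte Euro sign in b"a\xe2\x82\xacb", for instance at offset 2, next_head returns 4, the position of the following "b".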
Since ASCII, and, in particular, NUL and /, are unchanged, the kernel does not notice that UTF-8 is being used. It does not care at all what the bytes it is handling stand for.
Rendering of Unicode data streams is typically handled through subfont tables which map a subset of Unicode to glyphs. Internally the kernel uses Unicode to describe the subfont loaded in video RAM. This means that in UTF-8 mode one can use a character set with 512 different symbols. This is not enough for Japanese, Chinese and Korean, but it is enough for most other purposes.
At the current time, the console driver does not handle combining characters. So Thai, Sioux and any other script needing combining characters cannot be handled on the console.