Revision as of 22:44, 9 July 2012

This is the start of a few pages of largely version independent reference

In the spirit of wiki and in response to the post here: http://moodle.org/mod/forum/discuss.php?d=205117

Go for it Guillermo.

Maybe start here, and later on add a category:Reference and see what happens. I've set a watch on, and I may do a page on JSON errors.

Database

I'll start writing here about database and utf8 issues, but I guess that the idea would be for this to be the main reference table of contents page so it then links to each main topic page.

Basic concepts

Byte, character, character set, character encoding and collation

A byte (or octet) is the smallest unit of storage that can be allocated and, in almost all modern computers, it consists of 8 bits, where a bit can only have two values: 0 or 1.

A character is a symbol used in a writing system, such as a letter of the alphabet.

A character set (or a coded character set), refers to the set of characters and numbers, or code points, used to represent them. A code point can be understood as the position the character occupies in the set.

A character encoding (or a character encoding form), refers to how the code points are converted (encoded) into code values to be saved or stored in a computer.

Unicode is a character set developed to consistently encode, represent and handle text in most of the world's writing systems in use today. Unicode can be implemented by different character encodings, where the two most commonly used are UTF-8 and UTF-16:

UTF-8 is a variable-width character encoding that can represent every character in the Unicode character set. UTF-8 encodes each of the 1,112,064 code points in the Unicode character set using one to four bytes.
UTF-16 is a variable-length character encoding for Unicode that uses one or two 16-bit code units, where each unit takes two 8-bit bytes, and the order of the bytes may depend on the byte order (endianness) of the computer architecture. To assist in recognizing the byte order of code units, UTF-16 allows a Byte Order Mark (BOM), a code unit with the hexadecimal value U+FEFF, to precede the first actual coded value.

Eventhough UTF-8 is the preferred character encoding nowadays, many others have been used, for example:

ISO-8859-1 (or Latin-1). A character encoding for the Latin script, where each character is encoded as a single 8-bit code value. This character-encoding scheme consists of 191 characters and it is used throughout The Americas, Western Europe, Oceania, and much of Africa. It is also commonly used in most standard romanizations of East-Asian languages.
ISO-8859-15 (or Latin-9). A character encoding similar to Latin-1, except that it substituted some rarely used characters with the euro sign and a few other letters.
Shift JIS. A character encoding for the Japanese language.

To illustrate the concepts given above, the € (euro) character has a code point equal to 8364 (or U+20AC, in hexadecimal notation) in the Unicode character set. This code point can be stored in different ways, depending on the character encoding used:

Character encoding	Code value
ISO-8859-15 (hexadecimal)	A4
UTF-8 (hex)	E2 82 AC
UTF-16 (hex)	20AC

Finally, a Collation determines how data is sorted and how strings are compared to each other.

Encoding and decoding

One of the most important things one must keep in mind when working with databases (or files, for that matter) is to be aware of how characters (letters, numbers and non-alphanumeric characters) are actually being saved or encoded and how bytes (represented with hexadecimal values) are actually being interpreted or decoded.

For example, it is not enough to say that character “é” (small letter e with acute) has been saved in a text file, one also has to be aware of how was it encoded. Why? Because if it was encoded as Latin-1, the file will have one byte of hex value E9, but if it was encoded as UTF-8, the file would then have two bytes of hex value C3 and A9:

Character	Latin-1 hex value	UTF-8 hex values
é	E9	C3 A9

In the same way, it is not enough to say that an UTF-8 encoded character “é” is going to be read from a text file, because eventhough we might know that two bytes of hex value “C3 A9” will be read, if they are decoded (interpreted) as Latin-1, we will get characters Ã©, but if they are decoded as UTF-8, we would then get character “é”:

Hex value	Latin-1 characters	UTF-8 character
C3 A9	Ã©	é

Being aware of these two concepts will become relevant when trying to understand how data is encoded and decoded through the many stages (source, client, server) in order to be saved or restored.

MySQL

Character sets and collations

Until version 4.1, MySQL tables used the latin1 character set by default. From version 4.1, not only the concepts of "character set" and "collation" were introduced, but the character set by default (at the configuration file) was changed to utf8.

In MySQL, a character set is a set of symbols and encodings, and a collation is a set of rules for comparing characters in a character set.

Note. Eventhough in MySQL the term "character set" is used, actually it should be "character encoding" ("charset").

The MySQL server supports multiple character sets, like ascii, koi8r, latin1, sjis and utf8. Each character set has at least one collation, however, as it may have many more, a default one is always defined for each character set.

Character set	Default collation
ascii	ascii_general_ci
koi8r	koi8r_general_ci
latin1	latin1_swedish_ci
sjis	sjis_japanese_ci
utf8	utf8_general_ci

Collation names follow a convention: they start with the name of the character set with which they are associated, they usually include a language name, and they end with _ci (case insensitive), _cs (case sensitive), or _bin (binary). From this, is should be evident that two different character sets cannot have the same collation.

Default settings

The MySQL database server uses a very flexible scheme where default settings for character sets and collations can be set at four levels: Server, Database, Table and Column.

Server: The server charset and collation are used as default values when the database charset and collation are not specified in CREATE DATABASE statements.; The current server character set and collation can be determined from the values of the character_set_server and collation_server system variables.

Database: The database charset and collation are used as default values for table definitions when the table charset and collation are not specified in CREATE TABLE statements.; The character set and collation for the default database can be determined from the values of the character_set_database and collation_database system variables.

Table: The table charset and collation are used as default values for column definitions when the column charset and collation are not specified in individual column definitions.; The table character set and collation are MySQL extensions; there do not exist in standard SQL.

Column: A character set and a collation may be specified for every "character" column (that is, columns of type CHAR, VARCHAR, or TEXT).; The column character set and collate clauses are standard SQL.

@@ Line 52: / Line 52: @@
 Finally, a ''Collation'' determines how data is sorted and how strings are compared to each other.
-=== Character sets and collations in MySQL ===
+=== Encoding and decoding ===
-Until version 4.1, MySQL tables used the latin1 character set by default. From version 4.1, not only the concepts of "character set" and "collation" were introduced, but the character set by default was changed to utf8.
+One of the most important things one must keep in mind when working with databases (or files, for that matter) is to be aware of how characters (letters, numbers and non-alphanumeric characters) are actually being saved or encoded and how bytes (represented with hexadecimal values) are actually being interpreted or decoded.
+For example, it is not enough to say that character “é” (small letter e with acute) has been saved in a text file, one also has to be aware of how was it encoded. Why? Because if it was encoded as Latin-1, the file will have one byte of hex value E9, but if it was encoded as UTF-8, the file would then have two bytes of hex value C3 and A9:
+{| class="nicetable"
+! Character
+! Latin-1 hex value
+! UTF-8 hex values
+|-
+| é
+| E9
+| C3 A9
+|}
+In the same way, it is not enough to say that an UTF-8 encoded character “é” is going to be read from a text file, because eventhough we might know that two bytes of hex value “C3 A9” will be read, if they are decoded (interpreted) as Latin-1, we will get characters Ã©, but if they are decoded as UTF-8, we would then get character “é”:
+{| class="nicetable"
+! Hex value
+! Latin-1 characters
+! UTF-8 character
+|-
+| C3 A9
+| Ã©
+| é
+|}
+Being aware of these two concepts will become relevant when trying to understand how data is encoded and decoded through the many stages (source, client, server) in order to be saved or restored.
+== MySQL ==
+=== Character sets and collations ===
+Until version 4.1, MySQL tables used the latin1 character set by default. From version 4.1, not only the concepts of "character set" and "collation" were introduced, but the character set by default (at the configuration file) was changed to utf8.
 In MySQL, a ''character set'' is a set of symbols and encodings, and a ''collation'' is a set of rules for comparing characters in a character set.
@@ Line 84: / Line 116: @@
 Collation names follow a convention: they start with the name of the character set with which they are associated, they usually include a language name, and they end with _ci (case insensitive), _cs (case sensitive), or _bin (binary). From this, is should be evident that two different character sets cannot have the same collation.
-==== Default settings ====
+=== Default settings ===
 The MySQL database server uses a very flexible scheme where default settings for character sets and collations can be set at four levels: Server, Database, Table and Column.
@@ Line 98: / Line 130: @@
 ; Table
 : The table charset and collation are used as default values for column definitions when the column charset and collation are not specified in individual column definitions.
-: The table character set and collation are MySQL extensions; there are no such things in standard SQL.
+: The table character set and collation are MySQL extensions; there do not exist in standard SQL.
 ; Column
 : A character set and a collation may be specified for every "character" column (that is, columns of type CHAR, VARCHAR, or TEXT).
 : The column character set and collate clauses are standard SQL.

Documentation

Reference: Difference between revisions