This is the start of a few pages of largely version independent reference

In the spirit of wiki and in response to the post here: http://moodle.org/mod/forum/discuss.php?d=205117

Go for it Guillermo.

Maybe start here, and later on add a category:Reference and see what happens. I've set a watch on, and I may do a page on JSON errors.

Database

I'll start writing here about database and utf8 issues, but I guess that the idea would be for this to be the main reference table of contents page so it then links to each main topic page.

Basic concepts

Byte, character, character set, character encoding and collation

A byte (or octet) is the smallest unit of storage that can be allocated and, in almost all modern computers, it consists of 8 bits, where a bit can only have two values: 0 or 1.

A character is a symbol used in a writing system, such as a letter of the alphabet.

A character set, or a coded character set, refers to the set of characters and numbers, or code points, used to represent them. A code point can be understood as the position the character occupies in the set.

A character encoding, or a character encoding form, refers to how the code points are converted (encoded) into code values to be saved or stored in a computer.

Unicode is a character set developed to consistently encode, represent and handle text in most of the world's writing systems in use today. Unicode can be implemented by different character encodings, where the two most commonly used are UTF-8 and UTF-16:

UTF-8 is a variable-width character encoding that can represent every character in the Unicode character set. UTF-8 encodes each of the 1,112,064 code points in the Unicode character set using one to four bytes.

Note. MySQL utf8 Unicode character set uses a maximum of three bytes per multi-byte character (http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8.html).

UTF-16 is a variable-length character encoding for Unicode that uses one or two 16-bit code units, where each unit takes two 8-bit bytes, and the order of the bytes may depend on the byte order (endianness) of the computer architecture. To assist in recognizing the byte order of code units, UTF-16 allows a Byte Order Mark (BOM), a code unit with the hexadecimal value U+FEFF, to precede the first actual coded value.

Eventhough UTF-8 is the preferred character encoding nowadays, many others have been used, for example:

ISO-8859-1 (or Latin-1). A character encoding for the Latin script, where each character is encoded as a single 8-bit code value. This character-encoding scheme consists of 191 characters and it is used throughout The Americas, Western Europe, Oceania, and much of Africa. It is also commonly used in most standard romanizations of East-Asian languages.
ISO-8859-15 (or Latin-9). A character encoding similar to Latin-1, except that it substituted some rarely used characters with the euro sign and a few other letters.
Shift JIS. A character encoding for the Japanese language.

To illustrate the concepts given above, the € (euro) character has a code point equal to 8364 (or U+20AC, in hexadecimal notation) in the Unicode character set. This code point can be stored in different ways, depending on the character encoding used:

Character encoding	Code value
ISO-8859-15 (hexadecimal)	A4
UTF-8 (hex)	E2 82 AC
UTF-16 (hex)	20AC

Finally, a Collation determines how data is sorted and how strings are compared to each other.

Encoding and decoding

One of the most important things one must keep in mind when working with databases (or files, for that matter) is to be aware of how characters (letters, numbers and non-alphanumeric characters) are actually being saved or encoded and how bytes (represented with hexadecimal values) are actually being interpreted or decoded.

For example, it is not enough to say that character “é” (small letter e with acute) has been saved in a text file, one also has to be aware of how was it encoded. Why? Because if it was encoded as Latin-1, the file will have one byte of hex value E9, but if it was encoded as UTF-8, the file would then have two bytes of hex value C3 and A9:

Character	Latin-1 hex value	UTF-8 hex values
é	E9	C3 A9

In the same way, it is not enough to say that an UTF-8 encoded character “é” is going to be read from a text file, because eventhough we might know that two bytes of hex value “C3 A9” will be read, if they are decoded (interpreted) as Latin-1, we will get characters Ã©, but if they are decoded as UTF-8, we would then get character “é”:

Hex value	Latin-1 characters	UTF-8 character
C3 A9	Ã©	é

Being aware of these two concepts will become relevant when trying to understand how data is encoded and decoded through the many stages (source, client, server) in order to be saved or restored.

MySQL

Character sets and collations

Until version 4.1, MySQL tables used the latin1 character set by default. From version 4.1, not only the concepts of "character set" and "collation" were introduced, but the character set by default (at the configuration file) was changed to utf8.

In MySQL, a character set is a set of symbols and encodings, and a collation is a set of rules for comparing characters in a character set.

Note. Eventhough in MySQL the term "character set" is used, actually it should be "character encoding" ("charset").

The MySQL server supports multiple character sets, like ascii, koi8r, latin1, sjis and utf8. Each character set has at least one collation, however, as it may have many more, a default one is always defined for each character set.

Character set	Default collation	Other collations
ascii	ascii_general_ci	ascii_bin
koi8r	koi8r_general_ci	koi8r_bin
latin1	latin1_swedish_ci	latin1_bin, latin1_general_ci, latin1_spanish_ci
sjis	sjis_japanese_ci	sjis_bin
utf8	utf8_general_ci	utf8_bin, utf8_unicode_ci, utf8_turkish_ci

Collation names follow a convention: they start with the name of the character set with which they are associated, they usually include a language name, and they end with _ci (case insensitive), _cs (case sensitive), or _bin (binary). From this, is should be evident that two different character sets cannot have the same collation.

Default settings

The MySQL database server uses a very flexible scheme where default settings for character sets and collations can be set at four levels: Server, Database, Table and Column.

Server: The server charset and collation are used as default values when the database charset and collation are not specified in CREATE DATABASE statements.; The current server character set and collation can be determined from the values of the character_set_server and collation_server system variables.

Database: The database charset and collation are used as default values for table definitions when the table charset and collation are not specified in CREATE TABLE statements.; The character set and collation for the default database can be determined from the values of the character_set_database and collation_database system variables.

Table: The table charset and collation are used as default values for column definitions when the column charset and collation are not specified in individual column definitions.; The table character set and collation are MySQL extensions; there do not exist in standard SQL.

Column: A character set and a collation may be specified for every "character" column (that is, columns of type CHAR, VARCHAR, or TEXT).; The column character set and collate clauses are standard SQL.

Configuration files, settings and default values

To run the commands given in this section, one has to have direct access to the MySQL server.

From here on it will be assumed that the server is already running. Also, for Windows users, commands must be entered from a Windows command (system) window.

Server and client

All MySQL settings, both for the server and the client components, are defined under one configuration or option file: my.ini (Windows) or my.cnf (Linux), although many copies of this file might be read at startup.

Settings in the configuration file are divided in option groups, where each one contains the settings for: the server, all clients and each particular client:

Settings for	Option group
The server	[mysqld], [server], etc.
All clients	[client]
Each particular client	[mysql], [mysqldump], [myisamchk], [mysqlhotcopy], etc.

Running a particular program with the parameters --verbose --help:

mysqld --verbose --help
mysql --verbose --help
mysqldump --verbose --help

will list four pieces of information:

The location of the configuration files.
The option groups read by the program.
The allowed options or variables (parameters) with their abreviations and descriptions.
The default values of variables that can be set from the command line.

Location of the configuration files

Default options are read from the following files in the given order:

Windows: WINDIR\my.ini WINDIR\my.cnf C:\my.ini C:\my.cnf INSTALLDIR\my.ini INSTALLDIR\my.cnf; Where WINDIR is the Windows directory (e.g. C:\Windows), and INSTALLDIR is the MySQL installation directory (e.g. C:\MySqlServer5.5).

Linux: /etc/my.cnf /etc/mysql/my.cnf SYSCONFDIR/my.cnf $MYSQL_HOME/my.cnf

The option groups read

The option groups (in the configuration file) read by each program; for example:

Program	The following groups are read
mysqld	mysqld server mysqld-5.5
mysql	mysql client
mysqldump	mysqldump client

Allowed options (variables)

The variables accepted (in the command line) by the program, with their abreviations and descriptions; for example, the following are some of those listed for the mysql client program:

Variable (option)	Description
-D, --database=name	Database to use.
--default-character-set=name	Set the default character set.

Variables and their values

The default values of variables that can be set from the command line for each program; for example, the following are some of those listed by the mysqld server program:

Variable	Default value
basedir	C:/MySqlServer5.5/
character-set-server	latin1
datadir	C:\ MySqlServer5.5\Data\
date-format	%Y-%m-%d
innodb	ON
tmpdir	C:\Tmp

System values used by a running server

While the --help --verbose options of the MySQL server (mysqld) show command line options and their default values, there are also other variables that can only be seen after the server has been started and that provide information about its operation.

To see what system values is using a running server, the mysqladmin program can be used with the variables option:

mysqladmin [-u root -pyour_password] variables

The following are some of the variables and values listed:

Variable_name	Value
character_set_client	latin1
character_set_connection	latin1
collation_connection	latin1_swedish_ci
connect_timeout	10

Server status information

The server status information provided by the mysqladmin extended-status option is rather technical and so it will only be useful for DBA’s, however it is worth knowing how to display it:

mysqladmin [-u root -pyour_password] extended-status

Database

For each database created, its character set and collation values are saved in a text file named db.opt, located in the corresponding database folder (dbname) under the datadir directory.

more C:\MySqlServer5.5\data\dbname\db.opt

Outputs:

default-character-set=latin1
default-collation=latin1_swedish_ci

When a database is first created, if the character set and collation options are not defined in the CREATE DATABASE statement, then they are taken from the server charset and collation values. These two values can be modified via the ALTER DATABASE statement.

Gathering useful information

The first step before trying to fix a character set or a collation problem in a database, table or column is to determine how are they actually defined. The aim of this section then, is to show how to get specific pieces of information about these components.

To run all the SQL queries shown in this section, one can work from:

A mysql command window, in which case the mysql client program must be running:

mysql -u root -pyour_password

Starts a mysql session:

mysql>

The phpMyAdmin tool, through the “Run SQL query/queries on database dbname” window.

Saving the output of a query into a file

As results from many commands might be a bit long to be analyzed from a window, it is useful to know how to send them into a file.

The following commands can only be run from a mysql command window.

Windows

Use command tee to save the output of a query into a file, and command notee to stop writing into a file. For example:

mysql> tee e:/tmp/vars.out;
mysql> show variables;
mysql> notee;
Outfile disabled.

To determine into which file output is being written, just type tee by itself:

mysql> tee;
Currently logging to file 'e:/tmp/vars.out'

Unix

Use command pager to save the output of a query into a file, and command nopager to stop writing into a file. For example:

mysql> pager cat > /tmp/vars.out
mysql> show variables;
mysql> nopager;

phpMyAdmin

Once a query is run and the results are displayed, the Print view or the Print view (with full texts) links from the Query results operations section can be used to save them.

Basic server status information

The STATUS command will get basic status information from the server, like the current database in use, the server version, the TCP port or the characterset defined for the server, the database, the client and the connection. This command can only be run from a mysql command window.

STATUS;

Outputs:

Current database:
mysql  Ver 14.14 Distrib 5.5.8, for Win32 (x86)
Connection id:          232
Current database:
Current user:           root@localhost
SSL:                    Not in use
Using delimiter:        ;
Server version:         5.5.8 MySQL Community Server (GPL)
Protocol version:       10
Connection:             localhost via TCP/IP
Server characterset:    latin1
Db     characterset:    latin1
Client characterset:    latin1
Conn.  characterset:    latin1
TCP port:               3306
Uptime:                 18 hours 25 min 38 sec

If a database hasn’t been set active, the Db characterset variable will contain the global default, otherwise its value will reflect that of the selected database.

Server status information

The following will list the same variables as the mysqladmin extended-status command:

SHOW STATUS;

As the list of variables is quite long, the LIKE clause can be used to list only those that match.

System variables with their values

The following will list the same system variables as the mysqladmin variables command, plus a few more:

SHOW VARIABLES;

As the list of variables is quite long, the LIKE clause can be used to list only those that match.

To list all the character_set variables:

SHOW VARIABLES LIKE 'character_set\_%';

Outputs:

+--------------------------+-----------------------------------+
| Variable_name            | Value                             |
+--------------------------+-----------------------------------+
| character_set_client     | latin1                            |
| character_set_connection | latin1                            |
| character_set_database   | latin1                            |
| character_set_filesystem | binary                            |
| character_set_results    | latin1                            |
| character_set_server     | latin1                            |
| character_set_system     | utf8                              |
+--------------------------+-----------------------------------+

If a database hasn’t been set active, the character_set_database variable will contain the global default, otherwise its value will reflect that of the selected database.

To list all the collations defined:

SHOW VARIABLES LIKE 'coll%';

Outputs:

+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | latin1_swedish_ci |
| collation_database   | latin1_swedish_ci |
| collation_server     | latin1_swedish_ci |
+----------------------+-------------------+

Database

How was it created

The following command shows how a database (schema) was created (i.e., the complete CREATE DATABASE statement used to create it).

show create database db_name;

Outputs:

+----------+-------------------------------------------------------------------+
| Database | Create Database                                                   |
+----------+-------------------------------------------------------------------+
| test02   | CREATE DATABASE `test02` /*!40100 DEFAULT CHARACTER SET latin1 */ |
+----------+-------------------------------------------------------------------+

As mentioned before, when a database is created or when its characterset and/or collation are modified, these two values are saved in the corresponding db.opt file.

Character set and collation

The following SQL statement gives both, the character set and collation of a schema.

SELECT schema_name, default_character_set_name "character_set", default_collation_name "collation"
FROM information_schema.schemata
WHERE schema_name = 'db_name';

Outputs:

+-------------+---------------+-------------------+
| schema_name | character_set | collation         |
+-------------+---------------+-------------------+
| test02      | latin1        | latin1_swedish_ci |
+-------------+---------------+-------------------+

It is important to remember that these values only act as defaults to define the character set and the collation of newly created tables.

Tables

A database must first be selected before trying to use the table queries.

From a mysql command window:

use db_name;

From phpMyAdmin, just select the database to be used.

How was it created

SHOW CREATE TABLE tbl_name;

Outputs:

CREATE TABLE `t1_latin` (
  `userid` smallint(5) unsigned DEFAULT NULL,
  `coll_swd` char(16) NOT NULL,
  PRIMARY KEY (`coll_swd`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

Character set and collation

The following SQL query outputs the character set and collation of a table in a particular schema.

SELECT t.table_schema, t.table_name, ccsa.character_set_name "table_character_set", t.table_collation
FROM information_schema.`tables` t,
     information_schema.`collation_character_set_applicability` ccsa
WHERE ccsa.collation_name = t.table_collation
AND t.table_schema = 'db_name'
AND t.table_name = 'tbl_name';

Outputs:

+--------------+------------+---------------------+-------------------+
| table_schema | table_name | table_character_set | table_collation   |
+--------------+------------+---------------------+-------------------+
| test02       | t1_latin   | latin1              | latin1_swedish_ci |
+--------------+------------+---------------------+-------------------+

To list the same information for all the tables in a particular schema, one only needs to remove the last AND clause from the previous query.

SELECT t.table_schema, t.table_name, ccsa.character_set_name "table_character_set", t.table_collation
FROM information_schema.`tables` t,
     information_schema.`collation_character_set_applicability` ccsa
WHERE ccsa.collation_name = t.table_collation
AND t.table_schema = 'db_name';

Outputs:

+--------------+------------+---------------------+-------------------+
| table_schema | table_name | table_character_set | table_collation   |
+--------------+------------+---------------------+-------------------+
| test02       | t1_latin   | latin1              | latin1_swedish_ci |
| test02       | t2_latin   | latin1              | latin1_swedish_ci |
+--------------+------------+---------------------+-------------------+

Columns

The following SQL query outputs the character set and collation of all columns in a particular table.

SELECT table_schema, table_name, column_name, data_type, character_set_name, collation_name, column_type
FROM information_schema.columns
WHERE table_schema = 'dbname'
AND table_name = 'tblname'
AND data_type IN ('char', 'varchar', 'tinytext', 'text', 'mediumtext', 'longtext', 'enum', 'set');

Outputs:

+--------------+------------+-------------+-----------+--------------------+-------------------+-------------+
| table_schema | table_name | column_name | data_type | character_set_name | collation_name    | column_type |
+--------------+------------+-------------+-----------+--------------------+-------------------+-------------+
| test02       | t1_latin   | coll_swd    | char      | latin1             | latin1_swedish_ci | char(16)    |
+--------------+------------+-------------+-----------+--------------------+-------------------+-------------+

Documentation

Reference

Contents

This is the start of a few pages of largely version independent reference

Database

Basic concepts

Byte, character, character set, character encoding and collation

Encoding and decoding

MySQL

Character sets and collations

Default settings

Configuration files, settings and default values

Server and client

Location of the configuration files

The option groups read

Allowed options (variables)

Variables and their values

System values used by a running server

Server status information

Database

Gathering useful information

Saving the output of a query into a file

Windows

Unix

phpMyAdmin

Basic server status information

Server status information

System variables with their values

Database

How was it created

Character set and collation

Tables

How was it created

Character set and collation

Columns