
Server clustering improvements proposal

From MoodleDocs

Revision as of 09:16, 1 August 2013

Moodle 2.6


Author: Petr Škoda (škoďák) 4 July 2013 (WST) and others

MDL-40979 was raised to group together the issues raised out of this document and the discussion. All issues should be linked to that EPIC.

Cluster configuration requirements

  1. Cluster node clocks must be synchronised with difference <1s.
  2. Database setup must be ACID compliant.
  3. $CFG->dataroot must be a shared directory.
  4. $CFG->cachedir must be a shared directory.

Note that these last two often cause confusion. These directories must be shared because a lot of Moodle relies on the data in there, and it must be perfectly synchronised for all nodes in a cluster. Cluster headaches do occur when there is a lot of data in here, so our aim is to REDUCE dependence on these directories as much as possible (i.e. down to zero, eventually).

config.php settings

Each node in the cluster may use a local set of PHP files, including config.php; these may be synchronised via git, rsync, etc.
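As an illustration, a per-node config.php might look like the following. This is a hedged sketch: the paths, hostnames and password are placeholders, and only the $CFG keys themselves are real Moodle settings (a real config.php may need further entries such as $CFG->prefix).

```php
<?php
// Sample cluster node config.php (paths and URLs are illustrative).
$CFG = new stdClass();

$CFG->wwwroot  = 'https://moodle.example.com';    // same public URL on every node
$CFG->dataroot = '/mnt/shared/moodledata';        // shared by all nodes
$CFG->cachedir = '/mnt/shared/moodledata/cache';  // shared by all nodes

$CFG->dbtype = 'pgsql';
$CFG->dbhost = 'db.cluster.internal';             // ACID compliant database server
$CFG->dbname = 'moodle';
$CFG->dbuser = 'moodleuser';
$CFG->dbpass = 'secret';

require_once(__DIR__ . '/lib/setup.php');
```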

$CFG->wwwroot

It must be the same on all nodes, and it must be the public facing URL. It cannot be dynamic.

$CFG->sslproxy

Enable this if your wwwroot uses https:// but SSL is not terminated by Apache.

Do not forget to enable "Secure cookies only" to make your site really secure; without it, cookie stealing is still trivial.
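A minimal config.php fragment for this setup might look as follows (a sketch; the URL is a placeholder, and $CFG->cookiesecure corresponds to the "Secure cookies only" setting):

```php
// config.php fragment (illustrative): SSL terminated at the load balancer.
$CFG->wwwroot  = 'https://moodle.example.com';
$CFG->sslproxy = true;        // Apache itself only sees plain http

// "Secure cookies only" - session cookies are never sent over plain http.
$CFG->cookiesecure = true;
```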

$CFG->reverseproxy

Enable this if your nodes are accessed via different URLs. Please note that it is not compatible with $CFG->loginhttps.

$CFG->dirroot

It is strongly recommended that $CFG->dirroot (which is automatically set via the real path of config.php) is the same path on all nodes. It does not need to point to the same shared directory though. The reason is that some low level code may use the dirroot value for cache invalidation.

The simplest solution is to have the same directory structure on each cluster node and synchronise these during each upgrade.

The dirroot should always be read only for the apache process, because otherwise the built-in add-on installation and plugin uninstallation would get the nodes out of sync.

$CFG->dataroot

This MUST be a shared directory that each cluster node accesses directly. It must be very reliable, and data must not be manipulated in it directly.

Locking support is not required; if any code tries to use file locks in dataroot outside of cachedir or tempdir, it is a bug.

$CFG->tempdir

It is recommended to use separate RAM disks on each node. Scripts may use this directory during one request only. The contents of this directory may be deleted when there is no pending HTTP request; otherwise delete only files with older timestamps.

Always purge this directory when starting a cluster node; you may get some notices or temporary errors if you purge it while other requests are active.

If any script tries to use temp files that were not created during the current request, it is a bug that needs to be fixed.
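A config.php fragment for a node-local temp directory might look like this (a sketch; the mount point is a placeholder):

```php
// config.php fragment (illustrative): node-local RAM disk for temporary files.
// Each node has its own tmpfs mount; the directory is purged on node start.
$CFG->tempdir = '/mnt/ramdisk/moodletemp';
```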

$CFG->cachedir

Requirement: $CFG->cachedir is a directory shared by all cluster nodes.

Cache invalidation is done by deleting or overwriting existing files; this needs to be automatically replicated to all cluster nodes, otherwise the cache may return stale or invalid data, which often leads to fatal errors or data loss.

The existing documentation clearly states that this directory must be shared, and all existing code including add-ons may expect this; it is therefore not possible to change this requirement without breaking backwards compatibility.

Please note that MUC caches have a similar problem: the current design expects that each node contacts the same caches. It should be possible to create new MUC backends that do not require shared cache backends, in a way similar to the proposed $CFG->localcachedir.

You can safely purge cachedir when restarting the whole cluster; you may get some notices or temporary errors if you purge this directory while other requests are active.

Netspot admins came up with a hypothesis that "cachedir does not have to be shared". It is easy to disprove: we only need to find a single case where sharing is required. Consider the caching of data from the config table in the $CFG global variable. If one node modifies a config value, the others would not be notified about the change and would continue using the stale value. A workaround seems to be to create a revision number that is incremented after any change in $CFG and shared with all nodes. The question is where to put it so that it is shared by all nodes. Storing it in the database would require one extra read on each page and one extra write on each update, which is not acceptable. We cannot put the revision into a shared filesystem, because by hypothesis we would not have one. We cannot rely on the presence of memcache or any other shared data storage in a default installation either. In most other cases we store revision flags in $CFG itself; that is clearly not possible here, because $CFG would not be shared by the nodes. Somebody said "but it works without sharing for all our clients, for at least a year" - that does not prove anything; you would have to make sure it also works for everybody else, with all add-ons, custom modifications and only the minimal number of PHP extensions required by Moodle.

In any case, cachedir has been shared by definition since the beginning of its existence; that is a fact that developers writing code may depend on. If you do not like it, use the local cache dir instead.

New $CFG->localcachedir

The difference from the normal $CFG->cachedir is that this directory does not have to be shared by all cluster nodes; in exchange, the files stored in $CFG->localcachedir MUST NOT change! The default directory is "$CFG->dataroot/localcache".

All files there must use unique revision numbers or hashes, because the cached data cannot be invalidated by any other means - only adding of new files is supported, with no file deletes or overwrites. $CFG->localcachedir will therefore grow over time; all local cache files are deleted during purge_all_caches() (which runs automatically during upgrades).

When using hashes, please make sure the number of files in one directory is kept reasonable, because some filesystems have problems with a large number of files in one directory.

Note that some caches cannot be easily moved to localcachedir, because a reliable revision number is needed there; producing one is sometimes either not possible or very slow, since the revision number must be stored somewhere (typically the database) and bumped after any change. Caching is not free; each usage must be carefully evaluated.
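The write-once, hash-named scheme can be sketched as follows. This is illustrative only, not real Moodle API; the helper name localcache_put is hypothetical.

```php
<?php
// Minimal sketch of a write-once local cache keyed by content hash
// (localcache_put is a hypothetical helper, not real Moodle API).
function localcache_put(string $dir, string $content): string {
    // The file name is derived from the content hash, so a file,
    // once written, never needs to be modified or overwritten.
    $file = $dir . '/' . sha1($content) . '.cache';
    if (!file_exists($file)) {
        // Write to a temporary file first, then rename atomically, so
        // concurrent readers never see a partially written cache file.
        $tmp = $file . '.tmp.' . getmypid();
        file_put_contents($tmp, $content);
        rename($tmp, $file);
    }
    return $file;
}

$dir = sys_get_temp_dir() . '/localcache_demo';
@mkdir($dir);

$a = localcache_put($dir, 'first revision');
$b = localcache_put($dir, 'first revision');   // same content -> same file
$c = localcache_put($dir, 'second revision');  // new content -> new file

var_dump($a === $b); // bool(true)
var_dump($a === $c); // bool(false)
```

Because files are never overwritten, opcache can safely keep them cached without timestamp checks; invalidation simply means writing a new file under a new hash.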

Design requirements:

  • Everything must be automatic without extra administrative overhead.
  • purge_all_caches() on any node triggers local cache purging on all other nodes before adding any new files there.
  • Performance on standalone servers must not be worse.

Usage restrictions:

  • No file deletes
  • No file modifications
  • No file overwriting
  • Soft limit on the maximum number of files in one directory (a few thousand)

Benefits:

  • Clustered servers may use a fast local filesystem such as tmpfs.
  • No opcache invalidation problems if PHP files are included from localcachedir (because the files must not change).

Candidates for migration from $CFG->cachedir:

Implemented in Moodle 2.6, see MDL-40545.

Component cache

Class core_component uses a $CFG->cachedir/core_component.php cache that contains a complete list of all plugins and all classes present in $CFG->dirroot. The implementation must be as fast as possible and the results must be extremely reliable.

The cache is automatically invalidated on the admin/index.php page and during installation and every upgrade. It is also cleared during purge_all_caches(), but that is only a side effect of storing it in cachedir and is not required.

The core_component class cannot depend on the database, MUC or any core libraries - that is the reason why there cannot be any revision flag: there is nowhere to store it. The sha1() of the file content would be the revision, but we would have to go through all plugin directories to find out whether it has changed or which revision to use. There are only two options: either we store this file in a shared cache, or the admin invalidates the cache manually; the second option is clearly not acceptable for standard installations.

Alternative component cache location

This is useful for clustered installations only. Typically $CFG->alternative_component_cache = '/local/cache/dir/core_component.php' would point to a local node cache directory. Before an upgrade, the administrator would have to manually execute the following on each node:

$ php admin/cli/alternative_component_cache.php --rebuild

Alternatively, you could put the cache file directly into dirroot and distribute it together with the new PHP source files. Yet another possibility would be to purge all local caches on all nodes before the upgrade, or simply to change the script name in config.php on all nodes before the upgrade.
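A config.php fragment for the node-local variant might look like this (a sketch; the path is a placeholder):

```php
// config.php fragment (illustrative): node-local component cache, rebuilt
// manually on each node before an upgrade via
//   php admin/cli/alternative_component_cache.php --rebuild
$CFG->alternative_component_cache = '/var/cache/moodle/core_component.php';
```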

Implemented in Moodle 2.6, see MDL-40475.

MUC and clustering

The requirement is to make MUC stores aware of revision numbers; if we do that, we can store the data in multiple backends without the caches getting stale. Another benefit is that we would not have to purge all existing data, which would help on shared servers.

The default MUC file stores could decide to use either $CFG->cachedir or $CFG->localcachedir, depending on the availability of a revision number. The current workaround is to include the revision number in the cache key, but that does not solve the problem with $CFG->cachedir, which must be shared by all nodes.

There is already $CFG->altcacheconfigpath, which may be used to set up a different caching config on each node.
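For example, a per-node MUC configuration could be selected like this (a sketch; the path is a placeholder):

```php
// config.php fragment (illustrative): per-node MUC configuration file,
// allowing each node to define its own cache stores.
$CFG->altcacheconfigpath = '/etc/moodle/muc_config.php';
```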

Not implemented yet.

Theme caching

All large installs and clusters should use transparent proxies (such as Cloudflare or custom nginx).

Small installs without transparent caches (or sites in theme designer mode) depend on the performance of the theme serving scripts (theme/image.php, theme/styles.php, etc.).

The serving scripts receive a revision number in the URL; this solves the problem of cache invalidation. We intentionally do not use MUC there, because we are not in an already open HTTP request with a database connection and all standard libraries loaded; the data must be served quickly, with as little memory and as few CPU cycles as possible.

The standard theme caches may be stored in local node caches; we could store them in the $CFG->localcachedir/theme/ directory.

Theme designer mode does not use any revisions. For performance reasons we must use some cache, and the only option seems to be the standard shared $CFG->cachedir/theme/; using localcachedir is not an option here. We might try to improve the general performance and, if possible, skip all caching in theme designer mode.

Theme caching was improved in 2.6; it is using localcachedir, see MDL-40563.

Javascript caching

We used JS caching in cachedir for performance reasons because we were doing on-the-fly JS compression. Ideally we should switch to a build system for all JS files, store the minimised files alongside the source files in git, and just add proper headers in the lib/javascript.php script.

JavaScript caching was improved in 2.6; it is using localcachedir, see MDL-40546.

Language installation and customisations

The key issue in the language pack space is access to the shared storage: loading, or even timestamp checking, of those files can be slow. The proposal below addresses that and would make a difference to that issue.

Ideally all non-English language strings, including site customisations, should be stored in the database. The problem is that sometimes we need these strings before contacting the database.

The fastest solution was to create caches in PHP format, because that is the fastest cache type, especially when using opcode caches (MUC caches are always going to be slower).

The ideal solution for performance would be:

  • keep downloaded language packs in $CFG->dataroot/lang
  • automatically invalidate opcache after lang pack import
  • invent string revision number
  • use MUC with revision number
  • if MUC is slow let's revert back to versioned php cache files in localcachedir

On cluster nodes we would need to use file timestamps or blacklist $CFG->dataroot/lang in opcache.

Not implemented.

PHP accelerators

Potential problems:

  • file time stamps must be verified in Moodle 2.5 and below

opcache extension

The standard opcache extension is strongly recommended for Moodle 2.6 onwards; it is the only solution officially supported by the PHP developers. Starting with Moodle 2.6 we should recommend the opcache extension during the environment test. All other PHP accelerators should be actively discouraged (warning on the environment page).

Please note that HQ developers are not required to test current stable branches with any PHP accelerators.

The goal is to use opcaching without checking of file modification times. The problem here is dynamically generated PHP files stored in dataroot (component cache, lang packs, local lang modifications, MUC config, etc.); we need to invalidate the opcode cache for those explicitly.

OPcache extension is fully supported in Moodle 2.6, see MDL-40415.

Browser sessions

User session data should be stored in some reliable shared storage; all cluster nodes need to operate on the same session data, and the data must not disappear randomly.

All session drivers should update the list of active sessions in the session table; the serialised session data can be stored somewhere else.

Moodle session drivers need to support some special features:

  • Session timeout is controlled from Moodle code - it is required for reliable SSO implementations.
  • We need to be able to query the list of all sessions belonging to one user - this allows us to log out deleted users (for example spammers).
  • Locking is strongly recommended, because race conditions may create hard to diagnose problems - users need to log out/log in to recover from them.
  • Sometimes the structure of session data changes and then upgrade requires logout of all active users.
  • Some add-ons/patches use the information from sessions table to implement limits for concurrent user logins.

By default, new installations use a database session driver with locking support. There is a legacy file based session driver fallback for sites that cannot use db sessions, but it does not support any of the advanced features above and is not recommended.

The biggest problem here is locking: without locking, changes in the user session may be lost, resulting in hard to diagnose problems. Locking is the slowest part, and PHP locking is unreliable.
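The lost-update race that locking prevents can be sketched as follows. This is purely illustrative, not Moodle code; the arrays simulate two requests sharing one session record.

```php
<?php
// Illustrative sketch of the lost-update race that session locking prevents:
// two requests read the session, both modify it, and the second write
// silently discards the first one's changes.
$storage = ['cart' => []];       // simulated shared session storage

$requestA = $storage;            // request A reads the session
$requestB = $storage;            // request B reads the same snapshot

$requestA['cart'][] = 'item-from-A';
$requestB['cart'][] = 'item-from-B';

$storage = $requestA;            // A writes back first...
$storage = $requestB;            // ...then B overwrites, losing A's change

var_dump($storage['cart']);      // only 'item-from-B' survives
```

With locking, request B would block until A had written back, then read A's updated data before applying its own change.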

  • MDL-31501 is the ticket for implementing a workable memcached session storage backend.
  • MDL-25500 is the ticket to implement a general locking mechanism.

Not implemented yet.

File pool

At present there is a filedir in dataroot where we store the contents of all files in Moodle. Theoretically the code could be abstracted to allow storage of file contents in non-filesystem storage.

The default location is $CFG->dataroot/filedir; an alternative value can be specified in the $CFG->filedir setting. $CFG->filedir does not require locking, and it may be outside of $CFG->dataroot. The file system/sharing may use very aggressive caching techniques because the files never change; a node only needs to contact the master when writing or on a cache miss.

The original idea was to have a setting preventing file removals from the $CFG->filedir directory; this would allow you to put files from multiple installations into one giant file pool that could be replicated to all nodes - files would only ever be added there and never deleted automatically. The only potential problem is the size of this huge directory, and there would have to be a special new tool that exports used files from one site to another site that does not use the shared file pool. The setting was supposed to be in the file_storage::deleted_file_cleanup() method (an if that simply exits the method).
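The content-addressed layout behind such a pool can be sketched as follows. This is a hedged illustration in the style of Moodle's filedir (hash-derived subdirectories to keep per-directory file counts manageable); filepool_path is a hypothetical helper, not real Moodle API.

```php
<?php
// Sketch of a content-addressed file pool path: the first two character
// pairs of the content hash become subdirectory names, so files fan out
// across many small directories instead of one huge one.
function filepool_path(string $contenthash): string {
    return substr($contenthash, 0, 2) . '/'
         . substr($contenthash, 2, 2) . '/'
         . $contenthash;
}

$hash = sha1('example file contents');
echo filepool_path($hash), "\n";
// Because the path depends only on the content, identical files are stored
// once, and a replicated pool only ever gains new files.
```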

MDL-40528 contains a refactoring of trashdir operations that could help anybody who decides to reimplement the file_storage class. The sample backup recovery class might also be useful for a simple local filedir cache on cluster nodes.

Installation procedure

1. Build a 1 node cluster
  - NFS backing storage
  - Configuration parameters set as if there were lots of nodes
  - Database set up on its own host, and so on
  - Build the load balancer configuration
  - Run the Moodle install CLI
  - Add appropriate configuration items to get the system working behind the load balancer
2. Add more nodes to the cluster by replicating the first node.

Upgrade procedure

Sample estimation

  1. tweak load balancer to redirect all users to some static non-moodle page
  2. shutdown all nodes except one
  3. optionally enable CLI maintenance mode
  4. upload new PHP files to all nodes
  5. rebuild component cache if alternative read only location used
  6. upgrade the node from CLI
  7. disable CLI maintenance mode
  8. start other nodes (this guarantees localcachedir and tempdir purging)
  9. enable normal operation


Large in-house NFS backend VM upgrade

CLI maintenance mode may be useful, but either way we stop all web nodes and also ensure that the current cron run has finished, otherwise you can get into all sorts of strife. Trying to do an upgrade during the forum digest time or a stats run is not possible; both can run for more than an hour.

Opcode cache resets are a problem when moving from CLI to Apache or FCGI. As I understand it, each has its own opcache and that needs to be managed properly; hence we simply shut down the web nodes to force resets.

1. Disable all users except administrator who is upgrading. (This is done via IP at our load balancer)
2. Stop apache on all nodes
3. Stop cron from running and ensure it's finished.
We then ignore the fact we are in a cluster for a few instructions and operate on one node.
3.1 Backup/snapshot database and dataroot
4. Upload new php files to 'master' node.
5. Upgrade 'master' node.  admin/cli/upgrade.php
6. Start up 'master' node apache, and test with admin login.
7. If all successful, copy code to all other nodes
8. Turn on other nodes.
9. Browse around a bit to ensure you have warmed the opcache (cold cache can cause floods with 500+ users logging in within 10 seconds of go-live)
10. Turn the system back on for all users.

Database clustering

Anything ACID compliant should be ok. For example, even-odd master-master MySQL replication is not ACID compliant, so it is not ok.

TODO: add descriptions of ACID compliant DB cluster setups

More information