Server clustering improvements proposal
Author: Petr Škoda (škoďák) 4 July 2013 (WST) and others
MDL-40979 was raised to group together the issues arising from this document and the discussion. All issues should be linked to that EPIC.
Cluster configuration requirements
- Cluster node clocks must be synchronised with difference <1s.
- Database setup must be ACID compliant.
- $CFG->dataroot must be a shared directory.
- $CFG->cachedir must be a shared directory.
Note that these last two often cause confusion. These directories must be shared because Moodle code relies on that: all cluster nodes must see the same data there. Cluster headaches do occur when there is a lot of data here, so our aim is to REDUCE dependence on shared caches as much as possible; technically it is not possible to have local caches only.
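As a sketch, the shared-directory requirements above might look like this in config.php; the mount points are hypothetical examples, any reliable shared filesystem will do:

```php
<?php
// Hypothetical example paths on a shared mount visible to every node.
$CFG->dataroot = '/mnt/shared/moodledata';        // must be the same shared directory on all nodes
$CFG->cachedir = '/mnt/shared/moodledata/cache';  // must also be shared; this is the default location
```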
Each node in the cluster may use a local set of PHP files, including config.php; these may be synchronised via git, rsync, etc.
$CFG->wwwroot must be the same on all nodes and it must be the public facing URL. It cannot be dynamic.
Enable $CFG->sslproxy if you have an https:// wwwroot but the SSL termination is not done by Apache.
Do not forget to enable "Secure cookies only" to make your site really secure; without it cookie stealing is still trivial.
Enable $CFG->reverseproxy if your nodes are accessed via different URLs. Please note that it is not compatible with $CFG->loginhttps.
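A hedged config.php sketch of the settings discussed above; the wwwroot URL is a placeholder:

```php
<?php
$CFG->wwwroot = 'https://moodle.example.com'; // identical public-facing URL on all nodes
$CFG->sslproxy = true;       // only if SSL is terminated by the load balancer, not Apache
$CFG->cookiesecure = true;   // "Secure cookies only" - prevents trivial cookie stealing
// $CFG->reverseproxy = true; // only if nodes are accessed via different URLs;
//                            // incompatible with $CFG->loginhttps
```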
It is strongly recommended that $CFG->dirroot (which is automatically set via realpath(config.php)) contains the same path on all nodes. It does not need to point to the same shared directory though. The reason is that some low level code may use the dirroot value for cache invalidation.
The simplest solution is to have the same directory structure on each cluster node and synchronise these during each upgrade.
The dirroot should always be read-only for the apache process, because otherwise the built-in add-on installation and plugin uninstallation would get the nodes out of sync.
This MUST be a shared directory where each cluster node is accessing the files directly. It must be very reliable, administrators cannot manipulate files directly there.
Locking support is not required; if any code tries to use file locks in dataroot outside of the cachedir or muc directories, it is a bug.
It is recommended to use a separate ram disk on each node. Scripts may use this directory during one request only. The contents of this directory may be deleted if there are no pending HTTP requests; otherwise delete only files with older timestamps.
Always purge this directory when starting a cluster node; you may get some notices or temporary errors if you purge it while other requests are active.
If any script tries to use temp files that were not created during the current request, it is a bug that needs to be fixed.
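For example, a per-node temp directory on a local ram disk could be configured like this; the mount point is a hypothetical example:

```php
<?php
// Local, per-node temp directory; safe because temp files may only be
// used within the request that created them.
$CFG->tempdir = '/mnt/ramdisk/moodletemp';
// Purge it when the node starts, e.g. from an init script:
//   rm -rf /mnt/ramdisk/moodletemp/*
```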
Requirement: $CFG->cachedir is a directory shared by all cluster nodes.
Cache invalidation is done by deleting or overwriting existing files; this needs to be automatically replicated to all cluster nodes, otherwise the cache may return stale or invalid data, which often leads to fatal errors or data loss.
The existing documentation clearly states that this directory must be shared, and all existing code including add-ons may expect this; it is therefore not possible to change this requirement without breaking backwards compatibility.
Please note that MUC caches have a similar problem: the current design expects that each node contacts the same caches. It should be possible to create new MUC backends that do not require shared cache backends, in a way similar to the proposed $CFG->localcachedir.
You can safely purge cachedir when restarting the whole cluster; you may get some notices or temporary errors if you purge this directory while other requests are active.
Netspot admins came up with the hypothesis that "cachedir does not have to be shared". It is easy to disprove: we only need to find a single case where sharing is required. Let's look at the caching of data from the config table in the $CFG global variable. If one node modifies a CFG value, the other nodes would not be notified about the change and would continue using the stale value. A workaround seems to be to create a new revision number that is incremented after any change in $CFG and shared with all nodes. The question is where to put it so that it is shared by all nodes. Storing it in the database would require one extra read on each page and one extra write on each update; that is not acceptable. We cannot put the revision into a shared filesystem, because under this hypothesis we would not have one. We cannot rely on the presence of memcache or any other shared data storage in a default installation either. In most other cases we store revision flags in CFG itself; that is clearly not possible in this case because it would not be shared by the nodes. Somebody said "but it works without sharing for all our clients for at least a year" - that does not prove anything; you would have to make sure it also works for everybody else, with all add-ons, custom modifications and only the minimal set of PHP extensions required by Moodle.
In any case, cachedir has been shared by definition since the beginning of its existence; it is a fact that developers writing code may depend on. If you do not like it, use the local cache dir instead.
The difference from normal $CFG->cachedir is that the directory does not have to be shared by all cluster nodes, the files stored in $CFG->localcachedir MUST NOT change! Default directory is "$CFG->dataroot/localcache".
All files there must use unique revision numbers or hashes, because the cached data cannot be invalidated by any other mechanism; only adding new files is supported, no file deletions or overwrites. The $CFG->localcachedir will grow over time; all local cache files will be deleted during purge_all_caches() (automatically during upgrades).
When using hashes, please make sure the number of files in one directory is kept to some reasonable number, because some filesystems have problems with a large number of files in one directory.
Note that some caches cannot easily be moved to localcachedir, because a reliable revision number is needed; sometimes that is either not possible or very slow, since the revision number has to be stored somewhere (most likely the database) and bumped after every change. Caching is not free; each usage must be carefully evaluated.
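A config.php sketch of the proposed setting; the ram disk path is a hypothetical example:

```php
<?php
// Per-node local cache; files here are immutable, so only additions happen.
// The default would be $CFG->dataroot . '/localcache' if not set.
$CFG->localcachedir = '/mnt/ramdisk/moodlelocalcache';
```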
- Everything must be automatic without extra administrative overhead.
- purge_all_caches() on any node triggers local cache purging on all other nodes before adding any new files there.
- Performance on standalone servers must not be worse.
- No file deletes
- No file modifications
- No file overwriting
- Soft limit on the maximum number of files in one directory (a few thousand)
- Clustered servers may use a fast local filesystem such as tmpfs.
- No opcache invalidation problems if PHP files are included from localcachedir (because the files must not change).
Candidates for migration from $CFG->cachedir:
Implemented in Moodle 2.6, see MDL-40545.
Class core_component is using a $CFG->cachedir/core_component.php cache that contains a complete list of all plugins and all classes present in $CFG->dirroot. The implementation must be as fast as possible and the results must be extremely reliable.
The cache is automatically invalidated on admin/index.php page and during installation and every upgrade. It is also cleared during purge_all_caches(), but that is only a side effect of storing it in cachedir and it is not required.
The core_component class cannot depend on the database, MUC or any core libraries; that is the reason why there cannot be any revision flag, as there is nowhere to store it. The sha1() of the file content would effectively be the revision, but we would have to go through all plugin directories to find out whether anything had changed or what revision to use. There are only two options: either we store this file in a shared cache, or the admin invalidates the cache manually; the second option is clearly not acceptable for standard installations.
Alternative component cache location
This is useful for clustered installations only. Typically $CFG->alternative_component_cache = '/local/cache/dir/core_component.php' would point to a local node cache directory. Before an upgrade the administrator would have to manually execute the following on each node:
$ php admin/cli/alternative_component_cache.php --rebuild
Alternatively, you could put the cache file directly into dirroot and distribute it together with the new PHP source files. Yet another possibility would be to purge all local caches on all nodes before the upgrade, or simply change the script name in config.php on all nodes before the upgrade.
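In config.php the setting would look roughly like this, reusing the example path from above:

```php
<?php
// Points at a file on the node's local disk; rebuilt manually before
// each upgrade with: php admin/cli/alternative_component_cache.php --rebuild
$CFG->alternative_component_cache = '/local/cache/dir/core_component.php';
```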
Implemented in Moodle 2.6, see MDL-40475.
MUC and clustering
At present administrators are free to create new stores and assign them to the existing definitions of each cache. Unfortunately neither the documentation nor the admin UI mentions that by default all those definitions require cache stores to be shared by all cluster nodes! Some cache keys may include revision numbers and all settings, which makes them compatible with local caches; we need to document this somewhere.
I would like to propose a new localcache flag in cache definitions which would indicate that it is safe to use local cache stores on cluster nodes. For now we can at least document it in db/caches.php as inline comments. This information should be clearly visible on the cache store/definition settings page. Maybe we could add a setting to cache stores which indicates whether a store is shared or local. The only potential problem is that it might confuse admins who know little about clustering or cache invalidation.
In theory the installation could create two file stores by default, one using the current $CFG->cachedir and a second using $CFG->localcachedir, and assign them to definitions based on the localcache flag. This could help new admins when setting up clusters; all they would have to do is move localcachedir to the cluster node ram disks.
Note that MODE_REQUEST caches are local by definition and MODE_SESSION caches are always shared (because the session is shared), so we need to consider only MODE_APPLICATION definitions here.
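The proposed flag might look like this in a plugin's db/caches.php; note that 'localcache' is only the proposed name here and does not exist yet:

```php
<?php
// db/caches.php - hypothetical sketch of the proposed definition flag.
$definitions = array(
    'somecache' => array(
        'mode' => cache_store::MODE_APPLICATION,
        'localcache' => true, // PROPOSED: keys include revisions, so a
                              // node-local cache store is safe here.
    ),
);
```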
List of caches/definitions that are compatible with local caching on cluster nodes:
All large installs and clusters should use transparent proxies (such as Cloudflare or custom nginx).
Small installs without transparent caches (or sites in theme designer mode) depend on the performance of the theme serving scripts (theme/image.php, theme/styles.php, etc.).
The serving scripts receive a revision number in the URL, which solves the problem of cache invalidation. We intentionally do not use MUC there, because these scripts do not run inside an already open HTTP request with a database connection and all standard libraries loaded; the data must be served quickly with as little memory and as few CPU cycles as possible.
The standard theme caches may be stored in local node caches, we could store the theme caches in $CFG->localcachedir/theme/ directory.
Theme designer mode does not use any revisions. For performance reasons we must use some cache, and the only option seems to be the standard shared $CFG->cachedir/theme/. Using localcachedir is not an option here; we might try to improve the general performance and, if possible, skip all caching in theme designer mode.
Theme caching was improved in 2.6, it is using localcachedir, see MDL-40563.
Language installation and customisations
Each string is constructed from the following sources:
- en pack in plugin/core - always distributed with plugin
- current language pack, including all its parents - downloaded into $CFG->langotherroot (usually $CFG->dataroot/lang)
- local string tweaks - created with tool_customlang, stored in $CFG->langlocalroot in directories with "_local" suffix (usually $CFG->dataroot/lang)
We can use existing $CFG->langrev in cache keys to add support for local lang caches on cluster nodes.
Please note that language packs and local modifications are PHP files; you need to verify file timestamps or blacklist these locations in opcode caches on cluster nodes.
There is another cache for the list of available translations, stored in $CFG->langmenucachefile; we can move this to MUC and allow local node caches too.
Lang caching is improved in 2.6, see MDL-41019.
- file timestamps must be verified in Moodle 2.5 and below
The standard opcache extension is strongly recommended from Moodle 2.6 onwards; it is the only solution officially supported by the PHP developers. Starting with Moodle 2.6 we should recommend the opcache extension during the environment test. All other PHP accelerators should be actively discouraged (warning on the environment page).
Please note that HQ developers are not required to test current stable branches with any PHP accelerators.
The goal is to use opcaching without checking file modification times. The problem here is the dynamically generated PHP files that are stored in dataroot (component cache, lang packs, local lang modifications, MUC config, etc.); we need to invalidate the opcode cache for these explicitly.
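A sketch of the relevant php.ini opcache directives under this strategy; the values are illustrative, not tuned recommendations:

```ini
; Illustrative opcache settings for cluster nodes.
opcache.enable = 1
; The goal above: do not stat files on every request.
opcache.validate_timestamps = 0
; Generated PHP files in dataroot must then be invalidated explicitly,
; e.g. via opcache_invalidate() or opcache_reset(), or excluded with an
; opcache.blacklist_filename entry covering the dataroot locations.
```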
OPcache extension is fully supported in Moodle 2.6, see MDL-40415.
User session data should be stored in some reliable shared storage; all cluster nodes need to operate on the same session data, and the data must not disappear randomly.
All session drivers should update the list of active sessions in the session table; the serialised session data can be stored somewhere else.
Moodle session drivers need to support some special features:
- Session timeout is controlled from Moodle code - it is required for reliable SSO implementations.
- We need to be able to query the list of all sessions belonging to one user; this allows us to log out deleted users (for example spammers).
- Locking is strongly recommended, because race conditions may create hard-to-diagnose problems; users need to log out and log back in to recover from them.
- Sometimes the structure of session data changes, and then the upgrade requires logging out all active users.
- Some add-ons/patches use the information from sessions table to implement limits for concurrent user logins.
By default new installations use a database session driver with locking support. There is a legacy file-based session driver for sites that cannot use db sessions, but it does not support any of the advanced features above and is therefore not recommended.
PHP lacks reliable cross-platform locking support; this is one of the major problems when implementing new session drivers. Without locking, changes in a user session may be lost, resulting in hard-to-diagnose problems. Another problem when adding a new session driver is that the sessions table needs to be updated at least once per minute.
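As a config.php sketch, the default database sessions and the legacy file driver are toggled roughly like this on the 2.x branches (exact setting names may differ between versions):

```php
<?php
// Database sessions with locking - the default for new installs.
$CFG->dbsessions = true;
// Legacy file-based sessions - no locking or advanced features, not recommended:
// $CFG->dbsessions = false;
```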
- MDL-31501 is the ticket for implementing a workable memcached session storage backend.
- MDL-25500 is the ticket to implement a general locking mechanism.
Not implemented yet.
At present there is a filedir in dataroot where we store the contents of all files in Moodle. Theoretically the code could be abstracted to allow storage of file contents in non-filesystem storage.
The default location is $CFG->dataroot/filedir; an alternative value can be specified in the $CFG->filedir setting. $CFG->filedir does not require locking and may be outside of $CFG->dataroot. The file system/sharing may use very aggressive caching techniques, because the files never change; a node needs to contact the master only when writing or on a cache miss.
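A config.php sketch of an alternative filedir location; the path is a hypothetical example:

```php
<?php
// File contents pool outside dataroot; no locking is needed, and since
// files never change once written, aggressive read caching on the nodes is safe.
$CFG->filedir = '/mnt/sharedpool/filedir';
```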
The original idea was to have a setting preventing file removals from the $CFG->filedir directory; this would allow you to put files from multiple installations into one giant file pool that could be replicated to all nodes: files would only ever be added there and never deleted automatically. The only potential problems are the size of this huge directory, and that there would have to be a special new tool that exports used files from one site to another site that is not using the shared file pool. The setting was supposed to be in the file_storage::deleted_file_cleanup() method (one if that simply exits the method).
MDL-40528 contains a refactoring of trashdir operations that could help anybody who decides to reimplement the file_storage class. The sample backup recovery class might also be used for a simple local filedir cache on cluster nodes.
1. Build a one-node cluster:
- NFS backing storage
- Configuration parameters as if there were lots of nodes, database setup on its own host, and such
- Build the load balancer configuration
- Run the Moodle install CLI
- Add appropriate configuration items to get the system working behind the load balancer
2. Add more nodes to the cluster by replicating the first node.
- tweak load balancer to redirect all users to some static non-moodle page
- shutdown all nodes except one
- optionally enable CLI maintenance mode
- upload new PHP files to all nodes
- rebuild component cache if alternative read only location used
- upgrade the node from CLI
- disable CLI maintenance mode
- start other nodes (this guarantees localcachedir and tempdir purging)
- enable normal operation
Large in-house NFS backend VM upgrade
CLI maintenance mode may be useful, but either way we stop all web nodes and also ensure that the current cron run has finished, otherwise you can get into all sorts of strife. Trying to do an upgrade during the forum digest time or the stats run is not possible; both can run for more than an hour.
Opcode cache resets are a problem when moving from CLI to apache or fcgi. As I understand it, each has its own opcache and that needs to be managed properly; hence we just shut down the web nodes to force resets.
1. Disable all users except the administrator who is upgrading (this is done via IP at our load balancer).
2. Stop apache on all nodes.
3. Stop cron from running and ensure it has finished.
We then ignore the fact we are in a cluster for a few instructions and operate on one node.
3.1 Backup/snapshot database and dataroot
4. Upload new PHP files to the 'master' node.
5. Upgrade the 'master' node: admin/cli/upgrade.php
6. Start up the 'master' node apache and test with an admin login.
7. If all successful, copy the code to all other nodes.
8. Turn on the other nodes.
9. Browse around a bit to ensure you have warmed the opcache (a cold cache can cause floods with 500+ users logging in within 10 seconds of go-live).
10. Turn the system back on for all users.
Anything ACID compliant should be OK. For example, even-odd master-master MySQL replication is not ACID compliant, so it is not OK.
TODO: add descriptions of ACID compliant DB cluster setups