File API internals: Difference between revisions

From MoodleDocs
m (fixing typos)
m (Protected "File API internals": Developer Docs Migration ([Edit=Allow only administrators] (indefinite)))
 
(132 intermediate revisions by 18 users not shown)
This page outlines the current thinking about implementing file storage and access in Moodle 2.0.  It's a SPECIFICATION UNDER CONSTRUCTION!

The page is open for everyone so everyone can help correct mistakes and help with the evolution of this document.  However, if you have questions, problems to report or major changes to suggest please add them to the [[Development_talk:File_API|page comments]], or start a discussion in the [http://moodle.org/mod/forum/view.php?id=1807 Repositories forum]. We'll endeavour to merge all such suggestions into the main spec before we start development.

{{Template:Migrated|newDocId=/docs/apis/subsystems/files/internals}}
{{Infobox Project
|name = File API
|state = Implemented
|tracker = MDL-14589
|discussion = n/a
|assignee = [[User:Petr Škoda (škoďák)|Petr Škoda (škoďák)]]
}}
{{Moodle 2.0}}


==Objectives==


# Allow files to be added directly into Moodle (as we do now)
# Remember where files came from
# Give modules control over the access to files using capabilities and other local rules
# Consistent and simple approach for ALL file handling throughout Moodle

The goals of the new File API are:

* allow files to be stored within Moodle, as part of the content (as we do now).
* use a consistent and flexible approach for all file handling throughout Moodle.
* give modules control over which users can access a file, using capabilities and other local rules.
* make it easy to determine which parts of Moodle use which files, to simplify operations like backup and restore.
* track where files originally came from.
* avoid redundant storage, when the same file is used twice.
* fully support Unicode file names, irrespective of the capabilities of the underlying file system.


==Overview==


The File API is a set of core interfaces to allow the rest of Moodle to store, serve and manage files. It applies only to files that are part of the Moodle site's content. It is not used for internal files, such as those in the following subdirectories of dataroot: temp, lang, cache, environment, filter, search, sessions, upgradelogs, ...

The File API is a core set of interfaces that all Moodle code will use to:
# store files within Moodle
# display files to Moodle users
 
It applies only to "user" files. It will NOT apply to local files and caches created by Moodle such as these directories in dataroot: temp, lang, cache, environment, filter, rss, search, sessions, upgradelogs etc
 
The API will be split into several independent parts:
# File serving API
## file.php
## pluginfile.php
## userfile.php
## rssfile.php
# File storage API
## optional access control
## optional repo sync
# File management API
## File browsing
## File linking (editor integration)
## Upload from repository
 
==File serving API==
Deals with serving of files: the browser requests a file and Moodle sends it back. We have three main files. It is important to set up slasharguments on the server (file.php/some/thing/xxx.jpg); any content that relies on relative links cannot work without it (scorm, uploaded html pages, etc.).
 
===file.php===
Serves course files.


Implements basic file access. Ideally only images and files linked from course sections should be there, no XSS protection required - we expect javascript, sw, etc. there, no way to make it "secure". The access control is not critical any more if we move most of the files into modules
To learn how to use File API, please visit [[Using the File API]].


The file name and parameter structure is critical for backwards compatibility of existing course content.
The API can be subdivided into the following parts:
; File storage
: Low level file storage without access control information. Stores the content of files on disc, with metadata in associated database tables.
; File serving
: Lets users accessing a Moodle site get the files (file.php, draftfile.php, pluginfile.php, userfile.php, etc.)
:* Serve the files on request
:* with appropriate security checks
; File related user interfaces
: Provides the interface for (lib/form/file.php, filemanager.php, filepicker.php and files/index.php, draftfiles.php)
:* Form elements allowing users to select a file using the file picker, and have it stored within Moodle.
:* UI for users to manage their files, replacing the old course files UI
; File browsing API
: Allows code to browse and optionally manipulate the file areas
:* find information about available files in each area.
:* print links to files.
:* optionally move/rename/copy/delete/etc.


 /file.php/courseid/dir/dir/filename.ext

Internally the files would be stored in <code>array('contextid'=>$coursecontextid, 'filearea'=>'content', 'itemid'=>0)</code>

===pluginfile.php===
(aka modfile.php)
Sends module, block, question files.
* modules decide about access control
* optional XSS protection - student submitted files must not be served with normal headers, we have to force download instead; ideally there should be second wwwroot for serving of untrusted files
* only internal links to selected areas are supported - you can link images in summary area, but not the assignment submissions

Absolute file links need to be rewritten if HTML editing is allowed in the module. The links are stored internally as relative links. Before editing or display, the internal link representation is converted to absolute links using a simple str_replace(): @@thipluginlink/summary@@/image.jpg --> /pluginfile.php/assignmentcontextid/summary/image.jpg. It is converted back to internal links before saving.

== File API internals ==

=== File storage on disk ===

Files are stored in $CFG->dataroot (also known as moodledata) in the filedir subfolder.

Files are stored according to the SHA1 hash of their content. This means each file with particular contents is stored once, irrespective of how many times it is included in different places, even if it is referred to by different names. (This idea comes from the git version control system.) To relate a file on disc to a user-comprehensible path or filename, you need to use the ''files'' database table. See the next section.


Suppose a file has SHA1 hash 081371cb102fa559e81993fddc230c79205232ce in the files table contenthash field. Then it will be stored on disc as moodledata/filedir/08/13/081371cb102fa559e81993fddc230c79205232ce.

::Can the distinct file areas supported by one plugin be declared somehow in order add some information about them? For example, I think it can be interesting to declare:
::* assignment_summary:
::** relpath='summary'
::** userdata=false
::** anotherproperty=anothervalue
::* assignment_submission:
::** relpath='submission/@@USERID@@'
::** userdata=false
::** anotherproperty=anothervalue
::* and so on...
::And then, when the editor "receives" one "assignment_summary" areaname, if knows what to show and so on? Also that info could be useful to know, in backup & restore if some fileareas have to be processed or no (userdata=false). Or also, when reconstructing the links (str_replace() above). And will cause to have a well defined list of fileareas by module, instead of coding them in a free way (prone to errors). [[User:Eloy Lafuente (stronk7)|Eloy Lafuente (stronk7)]] 16:35, 28 June 2008 (CDT)


::Something like this will be part of file management API, hardcoding this in file storage would make it less flexible imo [[User:Skodak|Skodak]]
This means Moodle cannot store two different files with the same SHA1 hash. Luckily, it is extremely unlikely that this would ever happen. Technically it is also possible to implement reliable collision tests (with some performance cost); for now we just test file lengths in addition to the SHA1 hash.
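
The on-disc layout described in this section (two directory levels taken from the first four hex characters of the content hash) can be sketched as a small shell helper. The function name <code>content_path</code> and the relative <code>filedir</code> prefix are illustrative assumptions, not part of Moodle.

```shell
# Hypothetical helper: print the filedir-relative path for a given content hash.
content_path() {
    hash=$1
    level1=$(printf '%s' "$hash" | cut -c1-2)   # first two hex characters
    level2=$(printf '%s' "$hash" | cut -c3-4)   # next two hex characters
    printf 'filedir/%s/%s/%s\n' "$level1" "$level2" "$hash"
}

content_path 081371cb102fa559e81993fddc230c79205232ce
```

Prefixing the result with the moodledata directory of a given site would give the full path of the content file.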


::Yup, yup. Storage doesn't know anything but get/put files (nothing else). It's part of management, absolutely. [[User:Eloy Lafuente (stronk7)|Eloy Lafuente (stronk7)]] 11:21, 29 June 2008 (CDT)
As files on-disk are named with their sha1 hash, there is a simple way of validating files have not been corrupted since upload by using the 'sha1sum' command available in most GNU/Linux distributions.
In the below example, the user changed directory to the known location of a file on disk (sourced by finding the contenthash of the relevant file in the mdl_files database table).  
Then, in this directory the 'sha1sum' command is issued with the file that you wish to hash the content of. Returned from the command is the hash of the content (on the left) and the file name (on the right).


 /pluginfile.php/contextid/areaname/arbitrary/params/or/dirs/filename.ext

pluginfile.php detects the type of plugin from context table, fetches basic info (like $course or $cm if appropriate) and calls plugin function (or later method) which does the access control and finally sends the file to user. ''areaname'' separates files by type and divides the context into several subtrees - for example ''summary'' files (images used in module intros), post attachments, etc.

====assignment example====

 /pluginfile.php/assignmentcontextid/summary/someimage.jpg
 /pluginfile.php/assignmentcontextid/submission/submissionid/attachmentname.ext
 /pluginfile.php/assignmentcontextid/extra/allsubmissionfiles.zip

Where a file is NOT corrupted after upload, these two strings will match.

  $ cd /moodlepath/moodledata/filedir/1d/df/
  $ sha1sum 1ddf5b375fcb74929cdd7efda4f47efc61414edf
  1ddf5b375fcb74929cdd7efda4f47efc61414edf  1ddf5b375fcb74929cdd7efda4f47efc61414edf

Where a file IS corrupted after upload, these will differ:

  $ cd /moodlepath/moodledata/filedir/42/32/
  $ sha1sum 42327aac8ce5741f51f42be298fa63686fe81b7a
  9442188152c02f65267103d78167d122c87002cd  42327aac8ce5741f51f42be298fa63686fe81b7a

This is a very handy trick as in the case of any disk corruption (shared storage issues, hard drive issues, disk sector issues etc) the corrupted files can be detected without resorting to manually comparing to previous backups.
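
The same sha1sum comparison can be repeated over a whole filedir tree. The following is only a sketch: <code>check_filedir</code> is a hypothetical helper name, and files whose names do not look like a 40-character SHA1 hash are skipped rather than reported.

```shell
# Sketch: report every file under a filedir tree whose SHA1 differs from its name.
check_filedir() {
    find "$1" -type f | while IFS= read -r path; do
        name=$(basename "$path")
        # Skip anything that is not named like a 40-character SHA1 hash.
        [ "${#name}" -eq 40 ] || continue
        case $name in *[!0-9a-f]*) continue ;; esac
        actual=$(sha1sum "$path" | cut -d' ' -f1)
        [ "$name" = "$actual" ] || printf 'CORRUPT %s\n' "$path"
    done
}
```

It would be called with the filedir location, for example <code>check_filedir /moodlepath/moodledata/filedir</code>. Every content file is read in full, so a run over a large site can take a long time.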


=== Files table ===

This table contains one entry for each usage of a file. Enough information is kept here so that the file can be fully identified and retrieved again if necessary. It is necessary because some databases have hard limit on index size.

If, for example, the same image is used in a user's profile, and a forum post, then there will be two rows in this table, one for each use of the file, and Moodle will treat the two as separate files, even though the file is only stored once on disc.

::Uhm... all those files together? What's going to differentiate the "submission" path in the example above from the "summary" path? Is it supposed that the editor, or the filemanager won't allow , for example to pick-up one file from the "submission" area to be used in the summary of one assignment and only the "summary" area will be showed? That means multiple file managers by context and it's against the clean "one file manager per context" agreed below [[User:Eloy Lafuente (stronk7)|Eloy Lafuente (stronk7)]] 21:28, 26 June 2008 (CDT)
::Yes Eloy, the different areas (summary, submission) etc. have different uses, different access control. There are two types of file manager - the two pane file manager which lists all contexts+areas user may access, and minimalistic manager in html editor which shows only subset of areas from current plugin (because you can not link anything else).

====scorm example====

 /pluginfile.php/scormcontextid/summary/someimage.jpg
 /pluginfile.php/scormcontextid/content/revisionnumber/dir/somescormfile.js
 
The revision counter is incremented when any file changes in order to prevent caching problems. The lifetime should be adjustable in module settings.
 
====quiz example====
 
pluginfile.php/quizcontextid/summary/niceimage.jpg
pluginfile.php/quizcontextid/report/type/export.ods
 
====questions example====
 
pluginfile.php/SYSCONTEXTID/question/questionid/file.jpg
 
====blog example====
Blog entries or notes in general do not have context id (because they live in system context, SYSCONTEXTID below is the id of system context).
The note attachments are always served with XSS protection on, ideally we should use separate wwwroot for this. Access control can be hardcoded.
 
/pluginfile.php/SYSCONTEXTID/blog/blogentryid/attachmentname.ext
 
Internally stored in <code>array('contextid'=>SYSCONTEXTID, 'filearea'=>'blog', 'itemid'=>$blogentryid)</code>
 
====backup example====
It would be nice to have some special protection of backup files - new capabilities for backup file download, upload. Backups contain a lot of personal info, we could block restoring of backups from other sites too.
 
/pluginfile.php/coursecontextid/backup/backupfile.zip
 
Internally stored in <code>array('contextid'=>$coursecontextid, 'filearea'=>'backup', 'itemid'=>0)</code>
 
===userfile.php===
Personal file storage, intended as an online storage of work in progress like assignments before the submission.
* read/write own files only for now
* option to share with others later
* personal "websites" will not be supported (security)
 
/userfile.php/userid/dir/dir/filename.ext
 
===rssfile.php===
Replaces rss/file.php which is kept only for backwards compatibility.
RSS files should not require sessions/cookies, URLs should contain some sort of security token/key.
Internally the files may be stored in database or together with other files.
Performance improvements - we should support both Etag (cool) and Last-Modified (more used), when we receive If-None-Match/If-Modified-Since => 304
 
/rssfile.php/contextid/any/parameters/module/wants/rss.xml
/rssfile.php/SYSCONTEXTID/blog/userid/rss.xml
 
Again modules and plugins decide what gets sent to user.
 
===Temporary files===
Temporary files are usually used during the lifetime of one script only.
uses:
* exports
* imports
* zipping/unzipping
* processing by executable files (latex, mimetex)
 
Ideally these files should never use utf-8 (which is a major problem for zipping at the moment).
Proposed new sha1 based file storage is not suitable both for performance and technical reasons.
 
===Legacy file storage and serving===
Going to use good-old separate directories in $CFG->dataroot.
 
file serving and storage:
# user avatars - user/pix.php
# group avatars - user/pixgroup.php
# tex, algebra - filter/tex/* and filter/algebra/*
# rss cache (?full rss rewrite soon?) - backwards compatibility only rss/file.php
 
only storage:
#sessions
 
==File storage API==
Modules in general work only with local Moodle files. One of the major reasons is performance when accessing external repository files. It will be possible to use repositories instead of file uploading and also to keep local files synced with an external repository.
 
File contents are stored in moodledata/filepool indexed using SHA1 hashes instead of file names; file names, relative paths and other metadata will be stored in file(_xxx) database tables. This should be fully abstracted so that modules do not actually know where the files are located. When storing files the content is sent as string or file handle, when reading content it is returned as file handle.
 
===files table===
 
This table contains one entry for every file.  Enough information is kept here so that the file can be fully identified and retrieved again if necessary.
 
note: plural used because file is a reserved word
 
{| border="1" cellpadding="2" cellspacing="0"
|'''Field'''
|'''Type'''
|'''Default'''
|'''Info'''
|-
|'''id'''
|int(10)
|
|autoincrementing
|-
|sha1hash
|varchar(40)
|
|The sha1 hash of content.
|-
|'''contextid'''
|int(10)
|
|The context id defined in context table - identifies the instance of plugin owning the file.
|-
|filearea
|varchar(50)
|
|Like "submissions", "intro" and "content" (images and swf linked from summaries), etc.; "blogs" and "userfiles" are special case that live at the system context.
|-
|itemid
|int(10)
|
|Some plugin specific item id (eg. forum post, blog entry or assignment submission or user id for user files)
|-
|filepath
|text
|
|relative path to file from module content root, useful in Scorm and Resource mod - most of the mods do not need this
|-
|filename
|varchar(255)
|
|The full Unicode name of this file (case sensitive)
|-
|filesize
|int(10)
|
|size of file - bytes
|-
|mimetype
|varchar(100)
|NULL
|type of file
|-
|'''userid'''
|int(10)
|NULL
|Optional - general user id field - meaning depending on plugin
|-
|timecreated
|int(10)
|
|The time this file was created
|-
|timemodified
|int(10)
|
|The last time the file was modified
|}

{| class="wikitable"
! Field
! Type
! Default
! Info
|-
| '''id'''
| int(10)
| auto-incrementing
| The unique ID for this file.
|-
| '''contenthash'''
| varchar(40)
|
| The sha1 hash of content.
|-
| '''pathnamehash'''
| varchar(40)
|
| The sha1 hash of "/contextid/component/filearea/itemid/filepath/filename.ext" - prevents file duplicates and allows fast lookup.  It is necessary because some databases have hard limit on index size.
|-
| '''contextid'''
| int(10)
|
| The context id defined in context table - identifies the instance of plugin owning the file.
|-
| '''component'''
| varchar(50)
|
| Like "mod_forum", "course", "mod_assignment", "backup"
|-
| '''filearea'''
| varchar(50)
|
| Like "submissions", "intro" and "content" (images and swf linked from summaries), etc.; "blogs" and "userfiles" are special case that live at the system context.
|-
| '''itemid'''
| int(10)
|
| Some plugin specific item id (eg. forum post, blog entry or assignment submission or user id for user files)
|-
| filepath
| text
|
| relative path to file from module content root, useful in Scorm and Resource mod - most of the mods do not need this
|-
| filename
| varchar(255)
|
| The full Unicode name of this file (case sensitive)
|-
| '''userid'''
| int(10)
| NULL
| (optional)  Almost always this is the user that created the file, although some modules may choose to use this field for other purposes.
|-
| filesize
| int(10)
|
| size of file - bytes
|-
| mimetype
| varchar(100)
| NULL
| type of file
|-
| status
| int(10)
|
| general file status flag - will be used for lost or infected files
|-
| source
| text
|
| file source - usually url
|-
| author
| varchar(255)
|
| original author of file, used when importing from other systems
|-
| license
| varchar(255)
|
| license type, empty means site default
|-
| timecreated
| int(10)
|
| The time this file was created
|-
| timemodified
| int(10)
|
| The last time the file was modified
|}


index on "contextid, filearea, itemid" and "sha1hash"
Indexes:
* non-unique index on (contextid, component, filearea, itemid)
* non-unique index on (contenthash)
* unique index on (pathnamehash).
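
As a rough illustration of the pathnamehash field, the hash of the full virtual path can be computed with sha1sum. This is a sketch only: the helper name <code>pathname_hash</code> and all argument values are placeholders, and it assumes the filepath argument starts and ends with a slash (for example "/" or "/dir/").

```shell
# Sketch: SHA1 of the full virtual path, as described for the pathnamehash field.
# All argument values below are placeholders chosen for illustration.
pathname_hash() {
    contextid=$1 component=$2 filearea=$3 itemid=$4 filepath=$5 filename=$6
    # Assumption: filepath starts and ends with "/", e.g. "/" or "/dir/".
    printf '/%s/%s/%s/%s%s%s' "$contextid" "$component" "$filearea" \
        "$itemid" "$filepath" "$filename" | sha1sum | cut -d' ' -f1
}

pathname_hash 42 mod_forum attachment 7 / photo.jpg
```

Because the hash has a fixed length of 40 characters, it can carry a unique index even when the path itself is too long to index directly.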
 
The plugin type does not need to be specified because it can be derived from the context. Items like blog that do not have their own context will use their own file area inside a suitable context. In this case, the user context.
 
Entries with filename = '.' represent directories. Directory entries like this are created automatically when a file is added within them.
 
Note: 'files' plural is used even though that goes against the [[Database|coding guidelines]] because 'file' is a reserved word in some SQL dialects.
 
===Implementation of basic operations===
 
'''Each plugin may directly access only files in own context and areas!'''
 
Low level access API is defined in ''file_storage'' class which is obtained from <syntaxhighlight lang="php">get_file_storage()</syntaxhighlight>.
 
====Storing a file====
 
# Calculate the SHA1 hash of the file contents.
# Check if a file with this SHA1 hash already exists on disc in file directory or file trash. If not, store the file there.
# Add the record for this file to the files table using the low level address
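
The first two steps can be sketched in shell as content-addressed storage. The names <code>store_file</code> and the pool directory argument are hypothetical; the file trash check and the insert into the files table are deliberately left out.

```shell
# Sketch of content-addressed storage: copy a file into the pool only if its
# SHA1 is not already there, then print the hash for the database record.
# store_file and the pool argument are illustrative names; the trash check
# and the files table insert are not covered here.
store_file() {
    pool=$1 src=$2
    hash=$(sha1sum "$src" | cut -d' ' -f1)
    dir=$pool/$(printf '%s' "$hash" | cut -c1-2)/$(printf '%s' "$hash" | cut -c3-4)
    if [ ! -f "$dir/$hash" ]; then
        mkdir -p "$dir"
        cp "$src" "$dir/$hash"
    fi
    printf '%s\n' "$hash"
}
```

Storing two files with identical content through this helper leaves a single copy in the pool, which is the deduplication the steps describe.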
 
====Reading a file====
 
# Fetch the record (which includes the SHA1 hash) for the file you want from the files table. You can fetch either all area files or quickly get one file with a specific contenthash.
# Retrieve the contents using the SHA1 hash from the file directory.
 
====Deleting a file====
 
# Delete the record from the files table.
# Check whether any other file record still needs the content; if not, move the content file into the file trash.
# Later, admin/cron.php deletes content files from trash directory
 
== File serving ==
 
Deals with serving of files: the browser requests a file and Moodle sends it back. We have three main files. It is important to set up slasharguments on the server properly (file.php/some/thing/xxx.jpg); any content that relies on relative links cannot work without it (scorm, uploaded html pages, etc.).
 
=== legacy file.php ===


Plugin type is not specified because it is derived from contextid, items like blog that do not have own context will use own filearea usually from systemcontextid.
Serves legacy course files, the file name and parameter structure is critical for backwards compatibility of existing course content.


::Perhaps we could also hash filepath and filename and index by them, to save some text limitations in the DB side (length limits of indexes, not indexable, complex retrieval...). [[User:Eloy Lafuente (stronk7)|Eloy Lafuente (stronk7)]] 11:54, 29 June 2008 (CDT)
/file.php/courseid/dir/dir/filename.ext


::Also, perhaps we should store finally the plugin type there to save some queries per request, using it to drive to the correct file handling of each plugin. [[User:Eloy Lafuente (stronk7)|Eloy Lafuente (stronk7)]] 18:54, 29 June 2008 (CDT)
Internally the files are stored in <syntaxhighlight lang="php">array('contextid'=>$coursecontextid, 'component'=>'course', 'filearea'=>'legacy', 'itemid'=>0)</syntaxhighlight>


=== files_cleanup table ===
The legacy course files are completely disabled in all new courses created in 2.0. The major problem here is to how to educate our users that they can not make huge piles of files in each course any more.


This table contains candidates for deletion from the file pool. Files are not deleted immediately, cron uses the files_cleanup table, verifies the file is not used any more and deletes it from pool. Reasons for cron clean-up are performance and prevention of collision - there could be a problem with concurrent uploads and deletes, we will probably need to add some table-based locking during the clean-up.
=== pluginfile.php ===
All plugins should use this script to serve all files.
* plugins decide about access control
* optional XSS protection - student submitted files must not be served with normal headers, we have to force download instead; ideally there should be second wwwroot for serving of untrusted files
* links to these files are constructed on the fly from the relative links stored in database, this means that plugin may link only own files


We might add an extra script that does deep validation of pool area - report missing files, report orphaned files, content not matching the sha1 filename, etc. - this would very very time consuming.
Absolute file links need to be rewritten if HTML editing is allowed in the plugin. The links are stored internally as relative links. Before editing or display, the internal link representation is converted to absolute links using a simple str_replace(): @@thipluginlink/summary@@/image.jpg --> /pluginfile.php/assignmentcontextid/intro/image.jpg. It is converted back to internal links before saving.
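
The round trip between internal and absolute links can be sketched with sed standing in for PHP's str_replace(). The placeholder text is copied from the example in the paragraph above; the function names, the context id and the "intro" area are example values only.

```shell
# Sketch of the placeholder round trip using sed instead of PHP's str_replace().
# The placeholder is copied from the text; context id and area are examples.
PLACEHOLDER='@@thipluginlink/summary@@'

to_absolute() {   # usage: to_absolute CONTEXTID < stored.html
    sed "s|$PLACEHOLDER|/pluginfile.php/$1/intro|g"
}

to_internal() {   # usage: to_internal CONTEXTID < edited.html
    sed "s|/pluginfile\.php/$1/intro|$PLACEHOLDER|g"
}
```

Applying <code>to_absolute</code> and then <code>to_internal</code> with the same context id should return the stored text unchanged.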


{| border="1" cellpadding="2" cellspacing="0"
|'''Field'''
|'''Type'''
|'''Default'''
|'''Info'''
|-
|'''id'''
|int(10)
|
|autoincrementing
|-
|sha1hash
|varchar(40)
|
|}

Script parameters are virtual file names, in most cases the parameters match the low level file storage, but they do not have to:

 /pluginfile.php/contextid/areaname/arbitrary/params/or/dirs/filename.ext

pluginfile.php detects the type of plugin from context table, fetches basic info (like $course or $cm if appropriate) and calls plugin function (or later method) which does the access control and finally sends the file to user. ''areaname'' separates files by type and divides the context into several subtrees - for example ''summary'' files (images used in module intros), post attachments, etc.
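
How a script path of the form /pluginfile.php/contextid/areaname/... might be split into its parts can be sketched as follows. The helper <code>parse_pluginfile_path</code> is hypothetical and does not reflect how pluginfile.php itself is written; it only illustrates the shape of the virtual path.

```shell
# Sketch: split a pluginfile.php path into context id, area name and the rest.
parse_pluginfile_path() {
    rest=${1#/pluginfile.php/}                 # drop the script name
    contextid=${rest%%/*}; rest=${rest#*/}     # first segment
    areaname=${rest%%/*};  rest=${rest#*/}     # second segment
    printf 'contextid=%s areaname=%s relpath=%s\n' "$contextid" "$areaname" "$rest"
}

parse_pluginfile_path /pluginfile.php/42/intro/dir/image.jpg
```

Everything after the area name is handed over as an opaque relative path, since its meaning is decided by the plugin that owns the context.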
==== Assignment example ====


 /pluginfile.php/assignmentcontextid/mod_assignment/intro/someimage.jpg
 /pluginfile.php/assignmentcontextid/mod_assignment/submission/submissionid/attachmentname.ext
 /pluginfile.php/assignmentcontextid/mod_assignment/allsubmissions/groupid/allsubmissionfiles.zip

The last line is an example of a virtual file that should be created on the fly; it is not implemented yet.

====scorm example====

 /pluginfile.php/scormcontextid/mod_scorm/intro/someimage.jpg
 /pluginfile.php/scormcontextid/mod_scorm/content/revisionnumber/dir/somescormfile.js

The revision counter is incremented when any file changes in order to prevent caching problems.

====quiz example====

 pluginfile.php/quizcontextid/mod_quiz/intro/niceimage.jpg

====questions example====

This section was out of date. See [[File_storage_conversion_Quiz_and_Questions]] for the latest thinking.

====blog example====
Blog entries or notes in general do not have context id (because they live in system context, SYSCONTEXTID below is the id of system context).
The note attachments are always served with XSS protection on, ideally we should use separate wwwroot for this. Access control can be hardcoded.

 /pluginfile.php/SYSCONTEXTID/blog/attachment/blogentryid/attachmentname.ext

Internally stored in <syntaxhighlight lang="php">array('contextid'=>SYSCONTEXTID, 'component'=>'blog', 'filearea'=>'attachment', 'itemid'=>$blogentryid)</syntaxhighlight>

 /pluginfile.php/SYSCONTEXTID/blog/post/blogentryid/embeddedimage.ext

Internally stored in <syntaxhighlight lang="php">array('contextid'=>SYSCONTEXTID, 'component'=>'blog', 'filearea'=>'post', 'itemid'=>$blogentryid)</syntaxhighlight>

=== files_metadata table ===

This table contains extra metadata about files.  Repositories could provide this, or it could be manually edited in the local copy.

{| border="1" cellpadding="2" cellspacing="0"
|'''Field'''
|'''Type'''
|'''Default'''
|'''Info'''
|-
|'''id'''
|int(10)
|
|autoincrementing
|-
|'''fileid'''
|int(10)
|
|Id of file.
|-
|'''name'''
|varchar(255)
|
|The name of extra metadata
|-
|value
|text
|
|Value
|}

===files_acl table===

This table describes optional ACL for file. This is not required in majority of cases, modules usually hardcode the file access logic, course files should not be used much any more.

{| border="1" cellpadding="2" cellspacing="0"
|'''Field'''
|'''Type'''
|'''Default'''
|'''Info'''
|-
|'''id'''
|int(10)
|
|autoincrementing
|-
|'''fileid'''
|int(10)
|
|The file we are defining access for
|-
|'''contextid'''
|int(10)
|
|The context where this file is being published
|-
|'''capability'''
|text
|
|The capability that is required to see this file.
|}


====acl notes====
* this is missing some concept similar to '''user/group/others''', for example in case of user files typical user can not assign permissions or view them - this becomes useless there
* it is more important to synchronise the availability of file link and the file itself - having link pointing to inaccessible file or file which is accessible when not wanted are both problems
* browser/proxy caching works against us here - "secret" files should not be cached


===files_sync table===
=== Temporary files ===
Temporary files are usually used during the lifetime of one script only.
uses:
* exports
* imports
* processing by executable files (latex, mimetex)


This table contains information how to synchronise data with repositories. Data would be synchronised from cron.php or on demand from file manager. The sync would be one way only (repository-->local file).
These files should never use utf-8 file names.


{| border="1" cellpadding="2" cellspacing="0"
=== Legacy file storage and serving ===
|'''Field'''
Going to use good-old separate directories in $CFG->dataroot.
|'''Type'''
|'''Default'''
|'''Info'''


|-
file serving and storage:
|'''id'''
# user avatars - user/pix.php
|int(10)
# group avatars - user/pixgroup.php
|
# tex, algebra - filter/tex/* and filter/algebra/*
|autoincrementing
# rss cache (?full rss rewrite soon?) - backwards compatibility only rss/file.php


|-
only storage:
|'''fileid'''
#sessions
|int(10)
|
|Id of file.


|-
== File browsing API ==
|'''repositoryid'''
|int(10)
|
|The repository instance this is associated with, see [[Repository_API]]


|-
This is what other parts of Moodle use to access files that they do not own.
|updates
|int(10)
|
|Specifies the update schedule (0 = none, 1 = on demand, other = some period in seconds)


|-
|repositorypath
|text
|
|The full path to the original file on the repository


|-
=== Class: file_browser ===
|timeimportfirst
|int(10)
|
|The first time this file was imported into Moodle


|-
=== Class: file_info and subclasses ===
|timeimportlast
|int(10)
|
|The most recent time that this file was imported into Moodle
|}


===File content storage===
== File related user interfaces ==
Originally the file storage hierarchy contained a lot of metadata including userids, entry ids, filenames, etc. The file content will now be stored separately from file metadata. It must supports utf8 on all platforms.


File storing:
All files are obtained through from the file repositories.
# calculate SHA1 hash of content
# check if file with SHA1 name exists, if not add the file to file pool
# remove SHA1 from list of deleted files if found there
# store file in ''file'' table, use SHA1 as file pool identifier


File reading:
=== Formslib fields ===
#fetch file record from 'file' table - probably using file id or combination of contextid+instanceid
* file picker
#fetch content of file  
* file manager
* file upload (obsolete, do not use)


File deleting:
=== Integration with the HTML editor ===
#delete record from ''file'' table, remember file SHA1
#store the deleted SHA1 in deleted files table, do not remove the physical file yet
#wait for cron cleanup script to actually delete the file named SHA1 (proper table locking needed to prevent race conditions when adding/deleting files)


====File pool details====
Each instance of the HTML editor can be told to store related files in a particular file area.
located in $CFG->dataroot/filepool/, all files can not be stored in one directory due to OS limitations, it uses 3 levels based on first three characters of sha1 hash. It is unlikely that there will be thousands of files with the same first 3 chars in sha1 hash of their content.


This type of storage saves a lot of disk space when storing multiple copies of the same large file. It can also help substantially when synchronising data with external repositories. Another benefit is we can detect inconsistencies in file content.
During editing, files are stored in a draft files area. Then when the form is submitted they are moved into the real file area.


File read performance is similar to previous code, file write performance will be slower - due to hashing and extra database access.
Files are selected using the repository file picker.


dataroot
=== Legacy file manager ===
    /filepool
      /00
      /01
      ...
      /'''23'''
        /00
        /01
        ...
        /'''1e'''
            /00
            /01
            ...
            /'''2d'''
              /231e2dc421be4fcd0172e5afceea3970e2f3d940.jpg
      ...
      /fe
      /ff


==File management API==
Available only for legacy reasons. It is not supposed to be used.


This section describes following:
All the contexts, file areas and files now form a single huge tree structure, although each user only has access to certain parts of that tree. The file manager (files/index.php) allow users to browse this tree, and manage files within it, according to the level of permissions they have.
#file manager
#integration with html editor
#interactions with repos


===File manager===
Single pane file manager is hard to implement without drag & drop which is notoriously problematic in web based applications. I propose to implement a two pane commander-style file manager. Two pane manager allows you to easily copy/move files between two different contexts (ex: courses).

File manager must not interact directly with filesystem API, instead each module should return traversable tree of files and directories with both real and localised names (localised names are needed for dirs like backupdata).


Originally there was a single file tree for each course. We need to fully separate each module/block from the course files and there might be also independent file areas in modules (ex: module introduction, content files, submissions, post attachments). File area may be defined as a small tree where we can use relative paths. These file areas are hanging from the branches of the context tree (this needs a picture).
== Backwards compatibility ==
 
=== Content backwards compatibility ===
 
This should be preserved as much as possible. This will involve rewriting links in content during the upgrade to 2.0.
 
Some new features (like resource sharing - if implemented) may not work with existing data that still uses files from course files area.
 
There might be a breakage of links due to special characters stripping in uploaded files which will not match the links in uploaded html files any more. This should not be very common I hope.


===Integration with htmleditor===
===Code backwards compatibility===
Html editor should be able to browse only relevant files - for example when editing resource introduction only images from into file area of that resource should be available, when editing html resource page only the content area images should be listed.


There are several problem here:
Other Moodle code (for example plugins) will have to be converted to the new APIs. See [[Using_the_file_API]] for guidance.
# when adding new resource its context does not exist yet, we will have to create some table to handle temporary file storage for adding of new stuff, not easy but should be solvable - maybe we could abuse the course context id or store it temporarily in some special user file area
# we can not use absolute address relinking for pluginfile.php links, instead we can use the absolute links only when editing and before storage convert them to something like @@thispluginfile/intro@@/2112/112/image.jpg before storage. the local links would be converted to full absolute links before display or editing. Not all file areas will support this (ex: linking to assignment submission does not make sense because nobody else may access it anyway). This would allow us to implement image preview in html editor.


Html editor should contain simplified single pane file manager with basic operations only - select file area, browse file area, upload file/copy user file/use repo file, delete. The editor will communicate with modules and core through ajax call to some script specified by module embedding the editor. The callback script would use different logic to construct the tree of files than the File manager, it needs to know only about files that other people viewing the resulting html may access.
It is not possible to provide backwards-compatibility here. For example, the old $CFG->dataroot/$courseid/ will no longer exist, and there is no way to emulate that, so we won't try.


===Interactions with repos===
Repositories may serve as a replacement for file uploading. They may be also used to synchronise files between courses. The repo option should be available whenever there is a file upload field, sometimes with extra "keep synchronised" option (this would not make sense for stuff like assignment submissions).


==Upgrade, migration and backwards compatibility==
== Upgrade and migration ==
It is going to be a pain again like DML/DDL ;-)


===Code backwards compatibility===
When a site is upgraded to Moodle 2.0, all the files in moodledata will have to be migrated. This is going to be a pain, like DML/DDL was :-(
0% backwards compatibility related to file storage. New objects will be mandatory to use. Old $CFG->dataroot/$courseid/ will be empty, $CFG->dataroot/blog/ too, etc.


===Content backwards compatibility===
The upgrade process should be interruptible (like the Unicode upgrade was) so it can be stopped/restarted any time.
This means existing courses should not lose images, flash, etc. Though some new features (like resource sharing - if implemented) may not work with existing data that still uses files from course files area.


There might be a breakage of links due to special characters stripping in uploaded files which will not match the links in uploaded html files any more. This should not be very common I hope.
=== Migration of content ===


===Migration of content===
* resources - move files to new resource content file area; can be done automatically for pdf, image resources; definitely not accurate for uploaded web pages
* questions - image file moved to new area, image tag appended to questions
* moddata files - the easiest part, just move to new storage
* coursefiles - there might be many outdated files :-( :-(
* rss feeds links in readers - will be broken, the new security related code would break it anyway


===Moving files to files table and file pool===
=== Moving files to files table and file pool ===
 
The migration process must be interruptible because it might take a very long time. The files would be moved from old location, the restarting would be straightforward.
Proposed stages:
#migration of all course files except moddata - finish marked by some $CFG->files_migrated=true; - this step breaks the old file manager and html editor integration
#migration of moddata files - each module is responsible to copy data from converted coursefiles or directly from moddata which is not converted automatically


Some people use symbolic links in coursefiles - we must make sure that those will be copied to new storage in both places, though they can not be linked any more - anybody wanting to have content synced will need to move the files to some repository and set up the sync again.


::Talked about a double task here, when migrating course files to module areas:
::[[User:Eloy Lafuente (stronk7)|Eloy Lafuente (stronk7)]] 19:00, 29 June 2008 (CDT)


==Backup/restore changes==
File handling in backups needs to be fully rewritten - list of files in xml + pool of sha1 named files with contents. This solves the utf-8 trouble here, yay!!
==Quotas==
File size will be stored in files table, we can use simple queries to find out how much space is used, however this may not be accurate because the sha1 hash based storage eliminates duplicate files.
*total course files - find out all contexts used in course, query files table with contextid IN ($listofcontexts)
*module files - find module context and calculate space per file area
*user files quota - inside the personal area only, counting all attachments in all mods might take a while
We could also divide the file size by number of instances that are using it, this might be considered more accurate in some scenarios.
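The two ways of counting space described above can be compared with a small sketch. It is written in Python purely for illustration (Moodle itself is PHP), and the function name and the row layout `(contextid, contenthash, filesize)` are assumptions made for this example, not part of the proposal.

```python
from collections import Counter

def usage_by_context(rows, shared=False):
    """Sum file sizes per context.

    rows   -- iterable of (contextid, contenthash, filesize) tuples,
              a hypothetical projection of the files table.
    shared -- when True, each file's size is divided by the number of
              records that reference the same content hash.
    """
    rows = list(rows)
    # How many records point at each piece of stored content.
    uses = Counter(contenthash for _, contenthash, _ in rows)
    totals = {}
    for contextid, contenthash, size in rows:
        share = size / uses[contenthash] if shared else size
        totals[contextid] = totals.get(contextid, 0) + share
    return totals
```

With `shared=False` the totals match what a plain sum over the files table would report; with `shared=True` the totals add up to the space actually occupied in the pool.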
==Other==
*antivirus scanning + upload manager rewrite/integration with forms lib
*zip compression and extraction


==Major problems==
List of hard to solve problems


===unicode zip support===
Unicode chars in zip files uploaded by teachers - unfortunately there is no 100% solution that will work for anybody because most zip programs do not support unicode, it is usually ''garbage in/garbage out'' which works in some cases only


Latest WinZIP 11.2 and Total Commander seem to support some very limited form of utf-8 encodings of file names. I managed to create a zip file in Windows (Czech locale) and extract them in linux with native PHP zip functions, the only step I needed to add was conversion cp852(DOS charset for Czech locale) -> UTF-8. The native windows zipping did not work for me though, because it does some different borking of charsets.
== Other issues ==


In any case it seems likely that native PHP support in PHP 5.2.x should be better than current pclzip or infozip binary.
=== Unicode support in zip format ===


===empty directories===
Zip format is an old standard for compressing files. It was created long before Unicode existed, and Unicode support was only recently added. There are several ways used for encoding of non-ASCII characters in path names, but unfortunately it is not very standardised. Most Windows packers use DOS encoding.
Hmm, thinking a bit more about Justin's comment I realised there is no support for empty directories in this proposal. This will require either new table or some hack in files table - maybe we could add files with "." as name and just skip them when iterating directory content.


===file overwriting===
Client software:
Concept of file overwriting does not exist anymore here, the path+filename are not enforced to be unique - we can not make index because sloppy mssql does not allow indexes larger than 900 bytes :-( We will have to emulate it somehow and deal with collisions if found.
* Windows built-in compression - bundled with Windows, non-standard DOS encoding only
* WinZip - shareware, Unicode option (since v11.2)
* TotalCommander - shareware, single byte(DOS) encoding only
* 7-Zip - free, Unicode or DOS encoding depending on characters used in file name (since v4.58beta)
* Info-ZIP - free, uses some weird character set conversions


== Some little comments to be considered (to avoid forgetting them) ==
PHP extraction:
* Info-ZIP binary execution - no Unicode support at all, mangles character sets in file names (depends on OS, see docs), files must be copied to temp directory before compression and after extraction
* PclZip PHP library - reads single byte encoded names only, problems with random problems and higher memory usage.
* Zip PHP extension - kind of works in latest PHP versions


* each context will have its own "file manager"
Large file support:
* separate "file manager context" files (FMF) and "internal context" (ICF) files (current modedit files, submissions, attachments...)
PHP running under 32bit operating systems does not support files >2GB (do not expect fix before PHP 6). This might be a potential problem for larger backups.
* /pluginfile.php/SYSCONTEXTID/{blog|question} and so... will have own FMF too? Or only ICF ?
* Way to copy between contexts
* Links = -1 for them
* Deletion strategy (locks, quarantine status...)
* include support for quotas per user, per course, etc
* upgrade process should be interruptable (like the unicode upgrade) so it can be stopped/restarted any time


==Justin's thinking out loud==
Tar Alternative:
* tar with gzip compression - easy to implement in PHP + zlib extension (PclTar, Tar from PEAR or custom code)
* no problem with unicode in *nix, Windows again expects DOS encoding :-(
* seems suitable for backup/restore - yay!


I'm actually working on implementing this along with extending an existing Alfresco integration to work together with the whole File / Repository system and I wanted to get some of my comments and thoughts in here for feedback.  Go easy on me.  =)
Summary:
# added zip processing class that fully hides the underlying library
# using single byte encoding "garbage in/garbage out" approach for encoding of files in zip archives; add new 'zipencoding' string into lang packs (ex: cp852 DOS charset for Czech locale) and use it during extraction (we might support true unicode later when PHP Zip extension does that)
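The "garbage in/garbage out" re-decoding step from the summary can be sketched as follows. This is an illustration in Python rather than Moodle's PHP, and the function name is hypothetical. It relies on the fact that Python's `zipfile` module decodes entry names as cp437 when the archive does not flag them as UTF-8, so the raw bytes can be recovered and decoded again with the encoding named by the language pack.

```python
def recode_zip_name(name_as_cp437: str, zipencoding: str) -> str:
    """Re-decode an archive entry name that was read as cp437.

    name_as_cp437 -- the name as returned for an entry without the UTF-8 flag
    zipencoding   -- the single byte encoding configured for the locale,
                     for example "cp852" for Czech
    """
    # cp437 maps every byte value, so this round trip restores the raw bytes.
    raw = name_as_cp437.encode("cp437")
    return raw.decode(zipencoding)
```

A caller would apply this to each entry name before extracting. If the wrong `zipencoding` is configured the result is still garbage, which is the accepted limitation of this approach.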


So far I've only got one that I'd like to solicit some feedback on (BTW, if this would be better suited to a forum discussion, let me know):
=== Tar packer ===


===Not storing the full ''filepath'' with each entry in the '''file''' table===
A .tar.gz format packer is available from Moodle 2.6 (requires zlib extension) and can be selected for use in Moodle backup via an experimental option. MDL-41838.
*For browsing a directory structure, determining things like child directories or a parent directory given a filepath requires a lot of extraneous coding in PHP. I think it might be better served to create a new '''file_directory''' table, storing only a directory name, and reference to a parent directory record. The benefits here are that we're storing a lot of duplicate text field values in the '''file''' table and browsing through the file picker for local files doesn't require a lot of PHP overhead to calculate links to parent / child directories.
*Given that file permissions are no longer calculated using structured file paths, using the complete, full, path to a given file would most likely never be needed.


*The '''repositorypath''' field in the '''repository_sync''' table still makes sense, though.
The packer is currently limited to ASCII filenames and individual files are limited to 8GB each, but unlike zip there is no limit on the total filesize. It uses the old POSIX format and is compatible with GNU tar using default options.


*The '''file_directory''' table:
== Not implemented yet ==
* antivirus scanning - this needs a different api because the upload of files is now handled via repository plugins


{| border="1" cellpadding="2" cellspacing="0"
== See also ==
|'''Field'''
|'''Type'''
|'''Default'''
|'''Info'''
 
|-
|'''id'''
|int(10) 
|
|autoincrementing
 
|-
|'''parent'''
|int(10)
|
|ID of directory that this record is a child of.
 
|-
|'''directoryname'''
|varchar(255)
|
|The actual name of this directory.
|}
 
skodak: filepath is stored in files table - its root is the corresponding filearea, the file manager will use the context tree to find all plugins/courses and ask them to return the list of areas with all those small branches inside it
 
==See also==


* [[Using the File API]]
* [[Repository API]]
* [[Portfolio API]]
* [[Resource module file API migration]]
* MDL-14589 - File API Meta issue
[[Category:Files]]
[[Category:Interfaces]]

Latest revision as of 07:27, 6 May 2022

Important:

The content of this page has been updated and migrated to the new Moodle Developer Resources. The information contained on this page should no longer be considered up to date.

Why not view this page on the new site and help us to migrate more content to the new site!

File API
Project state Implemented
Tracker issue MDL-14589
Discussion n/a
Assignee Petr Škoda (škoďák)

Moodle 2.0


Objectives

The goals of the new File API are:

  • allow files to be stored within Moodle, as part of the content (as we do now).
  • use a consistent and flexible approach for all file handling throughout Moodle.
  • give modules control over which users can access a file, using capabilities and other local rules.
  • make it easy to determine which parts of Moodle use which files, to simplify operations like backup and restore.
  • track where files originally came from.
  • avoid redundant storage, when the same file is used twice.
  • fully support Unicode file names, irrespective of the capabilities of the underlying file system.

Overview

The File API is a set of core interfaces to allow the rest of Moodle to store, serve and manage files. It applies only to files that are part of the Moodle site's content. It is not used for internal files, such as those in the following subdirectories of dataroot: temp, lang, cache, environment, filter, search, sessions, upgradelogs, ...

To learn how to use File API, please visit Using the File API.

The API can be subdivided into the following parts:

File storage
Low level file storage without access control information. Stores the content of files on disc, with metadata in associated database tables.
File serving
Lets users accessing a Moodle site get the files (file.php, draftfile.php, pluginfile.php, userfile.php, etc.)
  • Serve the files on request
  • with appropriate security checks
File related user interfaces
Provides the interface for (lib/form/file.php, filemanager.php, filepicker.php and files/index.php, draftfiles.php)
  • Form elements allowing users to select a file using the file picker, and have it stored within Moodle.
  • UI for users to manage their files, replacing the old course files UI
File browsing API
Allows code to browse and optionally manipulate the file areas
  • find information about available files in each area.
  • print links to files.
  • optionally move/rename/copy/delete/etc.

File API internals

File storage on disk

Files are stored in $CFG->dataroot (also known as moodledata) in the filedir subfolder.

Files are stored according to the SHA1 hash of their content. This means each file with particular contents is stored once, irrespective of how many times it is included in different places, even if it is referred to by different names. (This idea comes from the git version control system.) To relate a file on disc to a user-comprehensible path or filename, you need to use the files database table. See the next section.

Suppose a file has SHA1 hash 081371cb102fa559e81993fddc230c79205232ce in the files table contenthash field. Then it will be stored on disc as moodledata/filedir/08/13/081371cb102fa559e81993fddc230c79205232ce.
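The mapping from content hash to location can be written down in a few lines. The sketch below is in Python for illustration only (Moodle is PHP), and the function name and default root are assumptions chosen to match the example path above.

```python
from pathlib import PurePosixPath

HEX_DIGITS = set("0123456789abcdef")

def filedir_path(contenthash: str, root: str = "moodledata/filedir") -> str:
    """Return the on-disc location for a file with the given SHA1 content hash."""
    if len(contenthash) != 40 or not set(contenthash) <= HEX_DIGITS:
        raise ValueError("expected a 40-character lowercase hex SHA1 hash")
    # Two directory levels, named after the first and second pair of hex digits.
    return str(PurePosixPath(root) / contenthash[0:2] / contenthash[2:4] / contenthash)
```

Because the location depends only on the content, two uploads of the same bytes always resolve to the same path.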

This means Moodle cannot store two different files with the same SHA1 hash. Luckily, it is extremely unlikely that this would ever happen. Technically it is also possible to implement reliable collision tests (with some performance cost); for now we just test file lengths in addition to the SHA1 hash.

As files on-disk are named with their sha1 hash, there is a simple way of validating files have not been corrupted since upload by using the 'sha1sum' command available in most GNU/Linux distributions. In the below example, the user changed directory to the known location of a file on disk (sourced by finding the contenthash of the relevant file in the mdl_files database table). Then, in this directory the 'sha1sum' command is issued with the file that you wish to hash the content of. Returned from the command is the hash of the content (on the left) and the file name (on the right).

Where a file is NOT corrupted after upload, these two strings will match.

 $ cd /moodlepath/moodledata/filedir/1d/df/
 $ sha1sum 1ddf5b375fcb74929cdd7efda4f47efc61414edf
 1ddf5b375fcb74929cdd7efda4f47efc61414edf  1ddf5b375fcb74929cdd7efda4f47efc61414edf

Where a file IS corrupted after upload, these will differ:

 $ cd /moodlepath/moodledata/filedir/42/32/
 $ sha1sum 42327aac8ce5741f51f42be298fa63686fe81b7a
 9442188152c02f65267103d78167d122c87002cd  42327aac8ce5741f51f42be298fa63686fe81b7a

This is a very handy trick as in the case of any disk corruption (shared storage issues, hard drive issues, disk sector issues etc) the corrupted files can be detected without resorting to manually comparing to previous backups.
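The same check can be run over a whole directory tree instead of one file at a time. The following is a minimal sketch in Python, not a Moodle tool; the function names are made up for this example, and it assumes that every stored file is named with the 40-character hex hash of its content, as described above. Files with other names are skipped.

```python
import hashlib
from pathlib import Path

HEX_DIGITS = set("0123456789abcdef")

def is_intact(path) -> bool:
    """Return True when the SHA1 of the file content equals the file name."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == Path(path).name

def find_corrupted(filedir):
    """Return the sorted paths of hash-named files whose content no longer matches."""
    corrupted = []
    for path in Path(filedir).rglob("*"):
        looks_like_hash = len(path.name) == 40 and set(path.name) <= HEX_DIGITS
        if path.is_file() and looks_like_hash and not is_intact(path):
            corrupted.append(str(path))
    return sorted(corrupted)
```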

Files table

This table contains one entry for each usage of a file. Enough information is kept here so that the file can be fully identified and retrieved again if necessary.

If, for example, the same image is used in a user's profile, and a forum post, then there will be two rows in this table, one for each use of the file, and Moodle will treat the two as separate files, even though the file is only stored once on disc.

{| border="1" cellpadding="2" cellspacing="0"
! Field !! Type !! Default !! Info
|-
| id || int(10) || auto-incrementing || The unique ID for this file.
|-
| contenthash || varchar(40) || || The sha1 hash of content.
|-
| pathnamehash || varchar(40) || || The sha1 hash of "/contextid/component/filearea/itemid/filepath/filename.ext" - prevents file duplicates and allows fast lookup. It is necessary because some databases have hard limit on index size.
|-
| contextid || int(10) || || The context id defined in context table - identifies the instance of plugin owning the file.
|-
| component || varchar(50) || || Like "mod_forum", "course", "mod_assignment", "backup"
|-
| filearea || varchar(50) || || Like "submissions", "intro" and "content" (images and swf linked from summaries), etc.; "blogs" and "userfiles" are special cases that live at the system context.
|-
| itemid || int(10) || || Some plugin specific item id (eg. forum post, blog entry or assignment submission or user id for user files)
|-
| filepath || text || || relative path to file from module content root, useful in Scorm and Resource mod - most of the mods do not need this
|-
| filename || varchar(255) || || The full Unicode name of this file (case sensitive)
|-
| userid || int(10) || NULL || (optional) Almost always this is the user that created the file, although some modules may choose to use this field for other purposes.
|-
| filesize || int(10) || || size of file in bytes
|-
| mimetype || varchar(100) || NULL || type of file
|-
| status || int(10) || || general file status flag - will be used for lost or infected files
|-
| source || text || || file source - usually url
|-
| author || varchar(255) || || original author of file, used when importing from other systems
|-
| license || varchar(255) || || license type, empty means site default
|-
| timecreated || int(10) || || The time this file was created
|-
| timemodified || int(10) || || The time this file was last modified
|}

Indexes:

  • non-unique index on (contextid, component, filearea, itemid)
  • non-unique index on (contenthash)
  • unique index on (pathnamehash).
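The pathnamehash value that backs the unique index can be illustrated with a short sketch. It is written in Python for illustration and is not Moodle's implementation. It assumes that filepath always starts and ends with "/" (with "/" alone meaning the root of the file area), so that concatenating the parts gives the "/contextid/component/filearea/itemid/filepath/filename.ext" string from the table.

```python
import hashlib

def pathname_hash(contextid, component, filearea, itemid, filepath, filename):
    """Return the SHA1 hex digest of the full logical path of a file record."""
    # Assumption for this sketch: filepath is "/" or "/dir/subdir/".
    if not (filepath.startswith("/") and filepath.endswith("/")):
        raise ValueError("filepath must start and end with '/'")
    full = f"/{contextid}/{component}/{filearea}/{itemid}{filepath}{filename}"
    return hashlib.sha1(full.encode("utf-8")).hexdigest()
```

The digest is always 40 characters long, however long the path and file name are, which is what lets a unique index work on databases with a limited index size.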

The plugin type does not need to be specified because it can be derived from the context. Items like blog that do not have their own context will use their own file area inside a suitable context. In this case, the user context.

Entries with filename = '.' represent directories. Directory entries like this are created automatically when a file is added within them.
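As an illustration of the automatic directory entries, the sketch below lists the records with filename '.' that would be needed for a file stored under a given filepath. It is Python written for this page, not Moodle code; the function name is hypothetical, and including an entry for the area root "/" is an assumption of the sketch.

```python
def directory_entries(filepath):
    """Return the (filepath, filename) directory records implied by filepath."""
    parts = [part for part in filepath.strip("/").split("/") if part]
    current = "/"
    # Assumption: the root of the file area gets a directory record as well.
    entries = [(current, ".")]
    for part in parts:
        current += part + "/"
        entries.append((current, "."))
    return entries
```

For a file stored under "/a/b/" this yields entries for "/", "/a/" and "/a/b/", each with '.' as the file name.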

Note: 'files' plural is used even though that goes against the coding guidelines, because 'file' is a reserved word in some SQL dialects.

Implementation of basic operations

Each plugin may directly access only files in its own context and areas!

Low level access API is defined in the file_storage class, which is obtained from get_file_storage().

Storing a file

  1. Calculate the SHA1 hash of the file contents.
  2. Check if a file with this SHA1 hash already exists on disc in file directory or file trash. If not, store the file there.
  3. Add the record for this file to the files table using the low level address

Reading a file

  1. Fetch the record (which includes the SHA1 hash) for the file you want from the files table. You can fetch either all area files or quickly get one file with a specific pathnamehash.
  2. Retrieve the contents using the SHA1 hash from the file directory.

Deleting a file

  1. Delete the record from the files table.
  2. Check whether any other file still needs the content; if not, move the content file into the file trash.
  3. Later, admin/cron.php deletes content files from trash directory
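The store, read and delete steps above can be sketched together with an in-memory model. This is an illustration in Python, not the file_storage class: the class and method names are invented, dictionaries stand in for the file directory, the trash directory and the files table, and recovering content from the trash when the same file is stored again is an assumption of the sketch.

```python
import hashlib

class ContentStore:
    """In-memory stand-in for the file directory, file trash and files table."""

    def __init__(self):
        self.filedir = {}  # contenthash -> content bytes
        self.trash = {}    # contenthash -> content bytes awaiting cleanup
        self.files = {}    # logical pathname -> contenthash ("files table")

    def store(self, pathname, content):
        if pathname in self.files:
            raise KeyError("a record for this pathname already exists")
        contenthash = hashlib.sha1(content).hexdigest()
        if contenthash in self.trash:
            # Assumption: content still waiting in the trash is reused.
            self.filedir[contenthash] = self.trash.pop(contenthash)
        elif contenthash not in self.filedir:
            self.filedir[contenthash] = content
        self.files[pathname] = contenthash
        return contenthash

    def read(self, pathname):
        # Look up the record first, then fetch the content by its hash.
        return self.filedir[self.files[pathname]]

    def delete(self, pathname):
        contenthash = self.files.pop(pathname)
        # Only move the content to the trash when no other record uses it.
        if contenthash not in self.files.values():
            self.trash[contenthash] = self.filedir.pop(contenthash)

    def cron(self):
        # Periodic cleanup: drop everything that is still in the trash.
        self.trash.clear()
```

Storing the same bytes under two pathnames leaves one copy in `filedir`, and the content only reaches the trash once the last record that uses it has been deleted.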

File serving

File serving deals with delivering files: the browser requests a file and Moodle sends it back. We have three main files. It is important to set up slasharguments on the server properly (file.php/some/thing/xxx.jpg); any content that relies on relative links (scorm, uploaded html pages, etc.) cannot work without it.

legacy file.php

Serves legacy course files, the file name and parameter structure is critical for backwards compatibility of existing course content.

/file.php/courseid/dir/dir/filename.ext

Internally the files are stored in

array('contextid'=>$coursecontextid, 'component'=>'course', 'filearea'=>'legacy', 'itemid'=>0)

The legacy course files are completely disabled in all new courses created in 2.0. The major problem here is how to educate our users that they can not make huge piles of files in each course any more.

pluginfile.php

All plugins should use this script to serve all files.

  • plugins decide about access control
  • optional XSS protection - student submitted files must not be served with normal headers, we have to force download instead; ideally there should be second wwwroot for serving of untrusted files
  • links to these files are constructed on the fly from the relative links stored in the database; this means that a plugin may link only to its own files

Absolute file links need to be rewritten if html editing is allowed in the plugin. The links are stored internally as relative links. Before editing or display, the internal link representation is converted to absolute links using a simple str_replace(): @@thispluginlink/summary@@/image.jpg --> /pluginfile.php/assignmentcontextid/intro/image.jpg. It is converted back to internal links before saving.
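The two directions of this rewriting amount to a plain string replacement. The sketch below shows the idea in Python; it is not Moodle's implementation, the function names are hypothetical, and the placeholder token and base URL are passed in as parameters rather than assumed.

```python
def to_absolute(html, placeholder, base_url):
    """Expand internal links before display or editing."""
    return html.replace(placeholder, base_url)

def to_internal(html, placeholder, base_url):
    """Collapse absolute links back to the internal form before saving."""
    return html.replace(base_url, placeholder)
```

Because the stored text never contains the context id, the same content keeps working after it is copied or restored into a different context.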

Script parameters are virtual file names, in most cases the parameters match the low level file storage, but they do not have to:

/pluginfile.php/contextid/areaname/arbitrary/params/or/dirs/filename.ext

pluginfile.php detects the type of plugin from context table, fetches basic info (like $course or $cm if appropriate) and calls plugin function (or later method) which does the access control and finally sends the file to user. areaname separates files by type and divides the context into several subtrees - for example summary files (images used in module intros), post attachments, etc.

Assignment example

/pluginfile.php/assignmentcontextid/mod_assignment/intro/someimage.jpg
/pluginfile.php/assignmentcontextid/mod_assignment/submission/submissionid/attachmentname.ext
/pluginfile.php/assignmentcontextid/mod_assignment/allsubmissions/groupid/allsubmissionfiles.zip

The last line is an example of a virtual file that should be created on the fly; it is not implemented yet.
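To make the structure of these virtual paths concrete, the sketch below splits a path of the form shown in the assignment example into its parts. It is an illustration in Python, not how pluginfile.php is written; the type and function names are invented, and it assumes the component name follows the context id as in the examples above.

```python
from typing import NamedTuple

class PluginFileRequest(NamedTuple):
    contextid: int
    component: str
    filearea: str
    args: list      # plugin specific parameters or directories
    filename: str

def parse_pluginfile_path(path):
    """Split /pluginfile.php/contextid/component/filearea/.../filename.ext."""
    prefix = "/pluginfile.php/"
    if not path.startswith(prefix):
        raise ValueError("not a pluginfile.php path")
    parts = path[len(prefix):].split("/")
    if len(parts) < 4:
        raise ValueError("expected contextid, component, filearea and a file name")
    contextid, component, filearea, *rest = parts
    # Everything between the file area and the file name is left to the plugin.
    return PluginFileRequest(int(contextid), component, filearea, rest[:-1], rest[-1])
```

The plugin receives the remaining arguments and decides what they mean, for example a submission id or a group id, before doing its access checks.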

scorm example

/pluginfile.php/scormcontextid/mod_scorm/intro/someimage.jpg
/pluginfile.php/scormcontextid/mod_scorm/content/revisionnumber/dir/somescormfile.js

The revision counter is incremented when any file changes in order to prevent caching problems.

quiz example

/pluginfile.php/quizcontextid/mod_quiz/intro/niceimage.jpg

questions example

This section was out of date. See File_storage_conversion_Quiz_and_Questions for the latest thinking.

blog example

Blog entries or notes in general do not have context id (because they live in system context, SYSCONTEXTID below is the id of system context). The note attachments are always served with XSS protection on, ideally we should use separate wwwroot for this. Access control can be hardcoded.

/pluginfile.php/SYSCONTEXTID/blog/attachment/blogentryid/attachmentname.ext

Internally stored in

array('contextid'=>SYSCONTEXTID, 'component'=>'blog', 'filearea'=>'attachment', 'itemid'=>$blogentryid)

/pluginfile.php/SYSCONTEXTID/blog/post/blogentryid/embeddedimage.ext

Internally stored in

array('contextid'=>SYSCONTEXTID, 'component'=>'blog', 'filearea'=>'post', 'itemid'=>$blogentryid)


Temporary files

Temporary files are usually used during the lifetime of one script only. Uses:

  • exports
  • imports
  • processing by executable files (latex, mimetex)

These files should never use utf-8 file names.

Legacy file storage and serving

Going to use good-old separate directories in $CFG->dataroot.

file serving and storage:

  1. user avatars - user/pix.php
  2. group avatars - user/pixgroup.php
  3. tex, algebra - filter/tex/* and filter/algebra/*
  4. rss cache (?full rss rewrite soon?) - backwards compatibility only rss/file.php

only storage:

  1. sessions

File browsing API

This is what other parts of Moodle use to access files that they do not own.


Class: file_browser

Class: file_info and subclasses

File related user interfaces

All files are obtained through the file repositories.

Formslib fields

  • file picker
  • file manager
  • file upload (obsolete, do not use)

Integration with the HTML editor

Each instance of the HTML editor can be told to store related files in a particular file area.

During editing, files are stored in a draft files area. Then when the form is submitted they are moved into the real file area.

Files are selected using the repository file picker.

Legacy file manager

Available only for legacy reasons. It is not supposed to be used.

All the contexts, file areas and files now form a single huge tree structure, although each user only has access to certain parts of that tree. The file manager (files/index.php) allows users to browse this tree, and manage files within it, according to the level of permissions they have.

Single pane file manager is hard to implement without drag & drop which is notoriously problematic in web based applications. I propose to implement a two pane commander-style file manager. Two pane manager allows you to easily copy/move files between two different contexts (ex: courses).

File manager must not interact directly with filesystem API, instead each module should return traversable tree of files and directories with both real and localised names (localised names are needed for dirs like backupdata).

Backwards compatibility

Content backwards compatibility

This should be preserved as much as possible. This will involve rewriting links in content during the upgrade to 2.0.

Some new features (like resource sharing - if implemented) may not work with existing data that still uses files from course files area.

There might be a breakage of links due to special characters stripping in uploaded files which will not match the links in uploaded html files any more. This should not be very common I hope.

Code backwards compatibility

Other Moodle code (for example plugins) will have to be converted to the new APIs. See Using_the_file_API for guidance.

It is not possible to provide backwards-compatibility here. For example, the old $CFG->dataroot/$courseid/ will no longer exist, and there is no way to emulate that, so we won't try.


Upgrade and migration

When a site is upgraded to Moodle 2.0, all the files in moodledata will have to be migrated. This is going to be a pain, like DML/DDL was :-(

The upgrade process should be interruptible (like the Unicode upgrade was) so it can be stopped/restarted any time.

Migration of content

  • resources - move files to new resource content file area; can be done automatically for pdf, image resources; definitely not accurate for uploaded web pages
  • questions - image file moved to new area, image tag appended to questions
  • moddata files - the easiest part, just move to new storage
  • coursefiles - there might be many outdated files :-( :-(
  • RSS feed links in readers - will be broken; the new security-related code would break them anyway

Moving files to files table and file pool

The migration process must be interruptible because it might take a very long time. Files are moved out of the old location as they are migrated, so restarting is straightforward: anything still in the old location has not been migrated yet.
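The restart behaviour described above can be sketched as follows, in Python rather than Moodle's PHP. This is only an illustration under stated assumptions: the function names, the SHA-1 content hash and the two-level directory layout of the pool are assumptions made for the example, and the matching record in the files table is left out entirely.

```python
import hashlib
import os
import shutil
import tempfile
from typing import Optional


def pool_path(pool_dir: str, content_hash: str) -> str:
    # Two-level fan-out keeps directories small (layout is an assumption).
    return os.path.join(pool_dir, content_hash[:2], content_hash[2:4], content_hash)


def hash_file(path: str) -> str:
    """Hash the file in chunks so large files are not read into memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_one(old_path: str, pool_dir: str) -> str:
    """Move one file into the pool and return its content hash."""
    content_hash = hash_file(old_path)
    target = pool_path(pool_dir, content_hash)
    if os.path.exists(target):
        os.remove(old_path)              # identical content is already pooled
    else:
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.move(old_path, target)
    return content_hash


def migrate_all(old_dir: str, pool_dir: str, limit: Optional[int] = None) -> int:
    """Migrate up to `limit` files; whatever remains is picked up next run."""
    done = 0
    for root, _dirs, files in os.walk(old_dir):
        for name in sorted(files):
            if limit is not None and done >= limit:
                return done
            migrate_one(os.path.join(root, name), pool_dir)
            done += 1
    return done
```

Because each file leaves the old directory the moment it is stored, no separate progress log is needed: calling `migrate_all` again simply continues with what is left.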

Proposed stages:

  1. migration of all course files except moddata - finish marked by some $CFG->files_migrated=true; - this step breaks the old file manager and html editor integration
  2. migration of blog attachments
  3. migration of question files
  4. migration of moddata files - each module is responsible for copying data from the converted coursefiles, or directly from moddata, which is not converted automatically
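One way to picture the staged approach is a runner that sets a completion flag after each stage and skips flagged stages on restart. The sketch below is in Python, not Moodle's PHP; only the `files_migrated` flag name comes from the list above, while the other flag names, the `run_stages` helper and the plain dictionary standing in for `$CFG` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[], None]]


def run_stages(stages: List[Stage], config: Dict[str, bool]) -> List[str]:
    """Run stages in order, skipping any whose completion flag is already set."""
    executed = []
    for flag, action in stages:
        if config.get(flag):
            continue                 # finished in an earlier run
        action()                     # if this raises, the flag stays unset
        config[flag] = True          # stands in for persisting the flag in $CFG
        executed.append(flag)
    return executed


log: List[str] = []
stages: List[Stage] = [
    ("files_migrated", lambda: log.append("course files")),
    ("blog_files_migrated", lambda: log.append("blog attachments")),      # hypothetical flag
    ("question_files_migrated", lambda: log.append("question files")),    # hypothetical flag
    ("moddata_files_migrated", lambda: log.append("moddata files")),      # hypothetical flag
]

# Pretend an earlier, interrupted run already completed stage 1.
config = {"files_migrated": True}
executed = run_stages(stages, config)
```

A stage that fails part-way leaves its flag unset, so the next run repeats it from the start; that is safe here only because the per-file move is itself restartable.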

Some people use symbolic links in coursefiles. We must make sure that linked files are copied to the new storage in both places, although they can no longer be linked. Anybody who wants the content kept in sync will need to move the files to a repository and set up the sync again.

We talked about a double task here when migrating course files to module areas:
  1. Parse HTML files to detect all their dependencies and move them together.
  2. Add a fallback in pluginfile.php so that, if something isn't found in the module file area, it is looked up in the course file area, copied across and then served.
We also talked about the possibility of adding a new setting to resource that defines whether it works against the old coursefiles or the new self-contained file areas. Migrated resources would point to the old coursefiles, while new ones would enforce self-contained file areas.
It seems that only resource files will be really complex (because they allow arbitrary HTML inclusion). The rest (labels, intros...) don't, and should be easier to parse.
Eloy Lafuente (stronk7) 19:00, 29 June 2008 (CDT)
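
The two tasks in the comment above can be sketched roughly as follows, in Python rather than Moodle's PHP. Everything here is an assumption made for illustration: the `LinkCollector` class, the rule that any link without a scheme or host counts as a dependency, and the dictionaries standing in for the module and course file areas. It is not how pluginfile.php is implemented.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect src/href targets that have no scheme and no host."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value:
                parsed = urlparse(value)
                # External URLs and pure #fragments are not file dependencies.
                if not parsed.scheme and not parsed.netloc and parsed.path:
                    self.links.append(parsed.path)


def find_dependencies(html: str) -> List[str]:
    """Task 1: list the files an HTML page depends on."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def serve(path: str, module_area: Dict[str, bytes],
          course_area: Dict[str, bytes]) -> Optional[bytes]:
    """Task 2: look in the module area first, then fall back to course files."""
    if path not in module_area and path in course_area:
        module_area[path] = course_area[path]    # copy on first access
    return module_area.get(path)
```

The fallback catches whatever the parser misses, for example links built by scripts, at the cost of keeping the old course file area around until nothing refers to it.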



Other issues

Unicode support in zip format

Zip is an old format for compressing files. It was created before Unicode came into use, and Unicode support was added only recently. Several methods are used for encoding non-ASCII characters in path names, but unfortunately none of them is well standardised. Most Windows packers use DOS encoding.

Client software:

  • Windows built-in compression - bundled with Windows, non-standard DOS encoding only
  • WinZip - shareware, Unicode option (since v11.2)
  • TotalCommander - shareware, single byte(DOS) encoding only
  • 7-Zip - free, Unicode or DOS encoding depending on characters used in file name (since v4.58beta)
  • Info-ZIP - free, uses some weird character set conversions

PHP extraction:

  • Info-ZIP binary execution - no Unicode support at all, mangles character sets in file names (depends on OS, see docs), files must be copied to temp directory before compression and after extraction
  • PclZip PHP library - reads single byte encoded names only, suffers from random problems and higher memory usage.
  • Zip PHP extension - kind of works in latest PHP versions

Large file support: PHP running under 32-bit operating systems does not support files >2GB (do not expect a fix before PHP 6). This might be a problem for larger backups.

Tar Alternative:

  • tar with gzip compression - easy to implement in PHP + zlib extension (PclTar, Tar from PEAR or custom code)
  • no problem with unicode in *nix, Windows again expects DOS encoding :-(
  • seems suitable for backup/restore - yay!

Summary:

  1. added zip processing class that fully hides the underlying library
  2. using a single byte encoding with a "garbage in/garbage out" approach for file names in zip archives; add a new 'zipencoding' string to lang packs (e.g. cp852 DOS charset for the Czech locale) and use it during extraction (we might support true Unicode later when the PHP Zip extension does)
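
To make the "garbage in/garbage out" idea concrete, here is a small Python sketch (not Moodle's PHP). The two helper functions are hypothetical; the cp852 value is the Czech example from the summary, and cp437 is used only to show what happens when the wrong DOS code page is assumed.

```python
def encode_zip_name(name: str, zipencoding: str) -> bytes:
    """Encode a file name as single bytes, as a DOS-style packer would."""
    return name.encode(zipencoding, errors="replace")


def decode_zip_name(raw: bytes, zipencoding: str) -> str:
    """Decode the raw name bytes using the language pack's 'zipencoding'."""
    return raw.decode(zipencoding, errors="replace")


czech_name = "žluťoučký kůň.txt"
raw = encode_zip_name(czech_name, "cp852")

# The right code page restores the name; the wrong one yields garbage
# of the same length, because every character is still a single byte.
restored = decode_zip_name(raw, "cp852")
garbled = decode_zip_name(raw, "cp437")
print(restored)
print(garbled)
```

The archive itself does not record which code page was used, which is why the hint has to come from somewhere else, here the language pack.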

Tar packer

A .tar.gz format packer is available from Moodle 2.6 (requires zlib extension) and can be selected for use in Moodle backup via an experimental option. MDL-41838.

The packer is currently limited to ASCII filenames and individual files are limited to 8GB each, but unlike zip there is no limit on the total filesize. It uses the old POSIX format and is compatible with GNU tar using default options.
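
The two limits can be expressed as a simple pre-flight check, sketched here in Python rather than Moodle's PHP. The `check_member` helper and its messages are hypothetical and are not part of the Moodle packer. The size constant reflects the old POSIX tar header, whose size field holds 11 octal digits, which is where the 8GB per-file ceiling comes from.

```python
from typing import Optional

# 11 octal digits in the header size field: 8**11 bytes is exactly 8GB.
MAX_MEMBER_SIZE = 8 ** 11 - 1


def check_member(name: str, size: int) -> Optional[str]:
    """Return a reason the file cannot be packed, or None if it is fine."""
    if not name.isascii():
        return "non-ASCII file name"
    if size > MAX_MEMBER_SIZE:
        return "file larger than 8GB"
    return None


for name, size in [("backup.mbz", 10_000), ("kůň.txt", 10), ("huge.bin", 8 ** 11)]:
    print(name, check_member(name, size) or "ok")
```

There is no matching check on the archive as a whole, which mirrors the statement that the total size is unlimited.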

Not implemented yet

  • antivirus scanning - this needs a different api because the upload of files is now handled via repository plugins

See also