Note: You are currently viewing documentation for Moodle 3.3. Up-to-date documentation for the latest stable version of Moodle is probably available here: File API.

Development:File API

Revision as of 07:47, 2 February 2009

This specification has now largely been implemented in Moodle 2.0. Inevitably, in the implementation, some details may have changed. If in doubt, read the code. (Then come back and correct this page ;-))

The implementation of this specification is being tracked at MDL-14589.


Objectives

The goals of these changes are to:

  • allow files to be stored within Moodle, as part of the content (as we do now).
  • use a consistent and simple approach for all file handling throughout Moodle.
  • give modules control over which users can access a file, using capabilities and other local rules.
  • make it easy to determine which parts of Moodle use which files, to simplify operations like backup and restore.
  • track where files originally came from.
  • avoid redundant storage, when the same file is used twice.
  • fully support Unicode file names, irrespective of the capabilities of the underlying file system.


Overview

The File API is a set of core interfaces to allow the rest of Moodle to:

  1. store files, and
  2. display files to users.

It applies only to files that are part of the Moodle site's content. It is not used for internal files, such as those in the following subdirectories of dataroot: temp, lang, cache, environment, filter, search, sessions, upgradelogs, ...

The API can be subdivided into the following parts:

Serving files
Lets users accessing a Moodle site get the files (file.php, draftfile.php, pluginfile.php, userfile.php, etc.)
  • Serve the files on request
  • with appropriate security checks
File API internals
Stores the files on disc, with metadata in associated database tables.
  • Content-addressed storage.
File management API
Allows code to manipulate the stored files (lib/filelib.php)
  • find information about stored files.
  • print links to files.
  • move/rename/copy/delete/etc.
  • keep a file synchronised with an external repository.
File management user interface
Provides the user-facing interface (lib/form/file.php, filemanager.php, filepicker.php and files/index.php, draftfiles.php)
  • Form elements allowing users to select a file using the Repository API, and have it stored within Moodle.
  • UI for users to manage their files, replacing the old course files UI


Serving files

Deals with serving of files: the browser requests a file and Moodle sends it back. Several scripts do this; they are described below. It is important to set up slasharguments on the server (file.php/some/thing/xxx.jpg); content that relies on relative links (SCORM, uploaded HTML pages, etc.) cannot work without it.

file.php

Serves course files.

Implements basic file access. Ideally only images and files linked from course sections should be there. No XSS protection is required: we expect JavaScript, swf, etc. there, and there is no way to make it "secure". Access control is no longer critical if we move most of the files into modules.

The file name and parameter structure is critical for backwards compatibility of existing course content.

/file.php/courseid/dir/dir/filename.ext

Internally the files would be stored in array('contextid'=>$coursecontextid, 'filearea'=>'coursefiles', 'itemid'=>0)

pluginfile.php

(aka modfile.php) Sends module, block, question files.

  • modules decide about access control
  • optional XSS protection - student-submitted files must not be served with normal headers; we have to force download instead. Ideally there should be a second wwwroot for serving untrusted files
  • only internal links to selected areas are supported - you can link images in summary area, but not the assignment submissions
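The "force download" rule in the list above can be sketched as a choice of response headers. This is a minimal Python illustration, not Moodle code: the function name and the `trusted` flag are assumptions made for the example, and real code would also have to escape the file name.

```python
def download_headers(filename, mimetype, trusted):
    """Build response headers for a served file (illustrative helper).

    Trusted files may be rendered inline with their real MIME type.
    Untrusted files (for example student submissions) are always sent
    as a generic download so the browser never renders them.
    """
    if trusted:
        return {
            "Content-Type": mimetype,
            "Content-Disposition": 'inline; filename="%s"' % filename,
        }
    return {
        "Content-Type": "application/octet-stream",
        "Content-Disposition": 'attachment; filename="%s"' % filename,
        # Ask the browser not to guess a renderable type from the content.
        "X-Content-Type-Options": "nosniff",
    }
```

With this shape, a module that owns a file area only has to decide whether the area is trusted; the header logic stays in one place.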

Absolute file links need to be rewritten if HTML editing is allowed in the module. The links are stored internally as relative links. Before editing or display, the internal link representation is converted to absolute links using a simple str_replace(), for example @@thipluginlink/summary@@/image.jpg --> /pluginfile.php/assignmentcontextid/intro/image.jpg. The links are converted back to the internal form before saving.
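The round trip between stored and displayed links might look like the following Python sketch. The placeholder token, the function names and the numeric context id are assumptions chosen for illustration; the specification only says that a plain string replacement is used in both directions.

```python
PLACEHOLDER = "@@PLUGINFILE@@"  # hypothetical token, chosen for this example


def _base_url(contextid, area):
    # The trailing slash avoids matching a longer area name by accident.
    return "/pluginfile.php/%d/%s/" % (contextid, area)


def to_absolute(html, contextid, area):
    """Convert stored (internal) links to absolute URLs before display or editing."""
    return html.replace(PLACEHOLDER + "/", _base_url(contextid, area))


def to_internal(html, contextid, area):
    """Convert absolute URLs back to the internal form before saving."""
    return html.replace(_base_url(contextid, area), PLACEHOLDER + "/")


stored = '<img src="@@PLUGINFILE@@/image.jpg">'
shown = to_absolute(stored, 42, "intro")
```

Because the stored text never contains the context id, the same content can be restored into a different context without rewriting it.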

Can the distinct file areas supported by one plugin be declared somehow, in order to add some information about them? For example, I think it could be interesting to declare:
  • assignment_summary:
    • relpath='intro'
    • userdata=false
    • anotherproperty=anothervalue
  • assignment_submission:
    • relpath='submission/@@USERID@@'
    • userdata=false
    • anotherproperty=anothervalue
  • and so on...
And then, when the editor "receives" one "assignment_summary" areaname, it knows what to show, and so on? That info could also be useful in backup & restore, to know whether some file areas have to be processed or not (userdata=false), or when reconstructing the links (str_replace() above). It would also give a well-defined list of file areas per module, instead of coding them in a free way (prone to errors). Eloy Lafuente (stronk7) 16:35, 28 June 2008 (CDT)
Something like this will be part of the file management API; hardcoding this in file storage would make it less flexible imo. Petr Škoda (škoďák)
Yup, yup. Storage doesn't know anything but get/put files (nothing else). It's part of management, absolutely. Eloy Lafuente (stronk7) 11:21, 29 June 2008 (CDT)
/pluginfile.php/contextid/areaname/arbitrary/params/or/dirs/filename.ext

pluginfile.php detects the type of plugin from the context table, fetches basic info (like $course or $cm if appropriate) and calls a plugin function (or later a method) which does the access control and finally sends the file to the user. areaname separates files by type and divides the context into several subtrees - for example summary files (images used in module intros), post attachments, etc.
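As a rough illustration of the URL layout above, the following Python sketch splits a request path into the parts the dispatcher needs. The names `PluginFileRequest` and `parse_pluginfile_path` are hypothetical, and the sketch assumes a numeric context id; the real dispatch and access control live in the plugin callbacks.

```python
from collections import namedtuple

PluginFileRequest = namedtuple(
    "PluginFileRequest", "contextid areaname args filename"
)


def parse_pluginfile_path(path):
    """Split /pluginfile.php/contextid/areaname/.../filename.ext into fields."""
    parts = [p for p in path.split("/") if p]
    if parts and parts[0] == "pluginfile.php":
        parts = parts[1:]
    if len(parts) < 3:
        raise ValueError("expected contextid/areaname/.../filename")
    if not parts[0].isdigit():
        raise ValueError("contextid must be numeric")
    # Everything between the area name and the file name is plugin-specific.
    return PluginFileRequest(int(parts[0]), parts[1], parts[2:-1], parts[-1])


request = parse_pluginfile_path("/pluginfile.php/15/submission/7/attachmentname.ext")
```

Here `request.args` carries the arbitrary parameters (a submission id in this case) that only the owning plugin knows how to interpret.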

Assignment example

/pluginfile.php/assignmentcontextid/intro/someimage.jpg
/pluginfile.php/assignmentcontextid/submission/submissionid/attachmentname.ext
/pluginfile.php/assignmentcontextid/extra/allsubmissionfiles.zip
Uhm... all those files together? What's going to differentiate the "submission" path in the example above from the "summary" path? Is it supposed that the editor or the file manager won't allow, for example, picking up a file from the "submission" area to be used in the summary of an assignment, and that only the "summary" area will be shown? That means multiple file managers per context, and it goes against the clean "one file manager per context" agreed below. Eloy Lafuente (stronk7) 21:28, 26 June 2008 (CDT)
Yes Eloy, the different areas (summary, submission, etc.) have different uses and different access control. There are two types of file manager - the two pane file manager, which lists all contexts+areas the user may access, and a minimalistic manager in the html editor, which shows only a subset of areas from the current plugin (because you can not link anything else).

scorm example

/pluginfile.php/scormcontextid/intro/someimage.jpg
/pluginfile.php/scormcontextid/content/revisionnumber/dir/somescormfile.js

The revision counter is incremented when any file changes in order to prevent caching problems. The lifetime should be adjustable in module settings.

quiz example

pluginfile.php/quizcontextid/intro/niceimage.jpg
pluginfile.php/quizcontextid/report/type/export.ods

questions example

pluginfile.php/SYSCONTEXTID/question/questionid/file.jpg

blog example

Blog entries or notes in general do not have context id (because they live in system context, SYSCONTEXTID below is the id of system context). The note attachments are always served with XSS protection on, ideally we should use separate wwwroot for this. Access control can be hardcoded.

/pluginfile.php/SYSCONTEXTID/blog/blogentryid/attachmentname.ext

Internally stored in array('contextid'=>SYSCONTEXTID, 'filearea'=>'blog', 'itemid'=>$blogentryid)

backup example

It would be nice to have some special protection of backup files - new capabilities for backup file download, upload. Backups contain a lot of personal info, we could block restoring of backups from other sites too.

/pluginfile.php/coursecontextid/backup/backupfile.zip

Internally stored in array('contextid'=>$coursecontextid, 'filearea'=>'backup', 'itemid'=>0)

userfile.php

Personal file storage, intended as an online storage of work in progress like assignments before the submission.

  • read/write own files only for now
  • option to share with others later
  • personal "websites" will not be supported (security)
/userfile.php/userid/dir/dir/filename.ext

rssfile.php

Replaces rss/file.php, which is kept only for backwards compatibility. RSS files should not require sessions/cookies; URLs should contain some sort of security token/key. Internally the files may be stored in the database or together with other files. Performance improvements - we should support both ETag (cool) and Last-Modified (more widely used): when a request carries a matching If-None-Match or If-Modified-Since header, we respond with 304 Not Modified.

/rssfile.php/contextid/any/parameters/module/wants/rss.xml
/rssfile.php/SYSCONTEXTID/blog/userid/rss.xml

Again modules and plugins decide what gets sent to user.
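The ETag / Last-Modified handling proposed for rssfile.php can be sketched as follows. This is a simplified Python illustration with hypothetical names: it compares entity tags exactly (no weak-validator handling) and gives If-None-Match precedence over If-Modified-Since.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def response_status(request_headers, etag, last_modified):
    """Return 304 when the client's cached copy is still current, else 200.

    request_headers: dict of header name -> value
    etag:            current entity tag, including the quotes
    last_modified:   timezone-aware datetime of the last change
    """
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        return 304 if (etag in candidates or "*" in candidates) else 200

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
            return 304 if last_modified <= since else 200
        except (TypeError, ValueError):
            # Unparseable or timezone-less date: fall back to a full response.
            return 200
    return 200


feed_changed = datetime(2009, 2, 2, 7, 47, tzinfo=timezone.utc)
```

A 304 response carries no body, so a feed reader that polls frequently costs little more than a header exchange.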

Temporary files

Temporary files are usually used during the lifetime of one script only. Uses:

  • exports
  • imports
  • zipping/unzipping
  • processing by executable files (latex, mimetex)

Ideally these files should never use utf-8 names (which is a major problem for zipping at the moment). The proposed new sha1 based file storage is not suitable for them, for both performance and technical reasons.

Legacy file storage and serving

These will continue to use the good old separate directories in $CFG->dataroot.

file serving and storage:

  1. user avatars - user/pix.php
  2. group avatars - user/pixgroup.php
  3. tex, algebra - filter/tex/* and filter/algebra/*
  4. rss cache (?full rss rewrite soon?) - backwards compatibility only rss/file.php

only storage:

  1. sessions


File API internals

File storage on disk

Files are stored in $CFG->dataroot (also known as moodledata) in the filedir subfolder.

Files are stored according to the SHA1 hash of their content. This means each file with particular contents is stored once, irrespective of how many times it is included in different places, even if it is referred to by different names. (This idea comes from the git version control system.) To relate a file on disc to a user-comprehensible path or filename, you need to use the file database tables. See the next section.

Suppose a file has SHA1 hash 081371cb102fa559e81993fddc230c79205232ce. Then it will be stored on disc as moodledata/filedir/08/13/71/081371cb102fa559e81993fddc230c79205232ce.

If you were wondering, in PHP, SHA1 hashes can be computed with either the sha1 or sha1_file functions.
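The mapping from content to storage path described above can be sketched in a few lines. This is a Python illustration rather than the PHP used by Moodle; the function names are assumptions, and `hashlib.sha1` stands in for PHP's sha1.

```python
import hashlib
import posixpath


def content_hash(data):
    """SHA1 of the file contents, as a 40-character hex string."""
    return hashlib.sha1(data).hexdigest()


def pool_path(contenthash, filedir="moodledata/filedir"):
    """Path of a stored file: three two-character levels, then the full hash."""
    return posixpath.join(
        filedir,
        contenthash[0:2],
        contenthash[2:4],
        contenthash[4:6],
        contenthash,
    )


example = pool_path("081371cb102fa559e81993fddc230c79205232ce")
```

Splitting on the leading hash characters keeps any single directory from accumulating a very large number of entries.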

The information in this section should be considered completely internal to the file API. Other parts of the Moodle code should manipulate files using the higher level functions of the file API. For example, they should refer to files by file id in the files table, not by the SHA1 hash.

Files database tables

Table: files

This table contains one entry for each usage of a file. Enough information is kept here so that the file can be fully identified and retrieved again if necessary.

If, for example, the same image is used in a user's profile, and a forum post, then there will be two rows in this table, one for each use of the file, and Moodle will treat the two as separate files, even though the file is only stored once on disc.
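The "two rows, one stored copy" behaviour can be illustrated with a small in-memory model. This Python sketch is not the real storage layer: the class name, the dictionary standing in for the file pool and the example ids are all assumptions.

```python
import hashlib


class FilePool:
    """Toy model: one blob per distinct content, one row per usage."""

    def __init__(self):
        self.blobs = {}  # contenthash -> bytes (stands in for files on disc)
        self.rows = []   # one dict per usage (stands in for the files table)

    def add_file(self, contextid, filearea, itemid, filename, data):
        contenthash = hashlib.sha1(data).hexdigest()
        # Content is written only if this hash has not been seen before.
        self.blobs.setdefault(contenthash, data)
        row = {
            "id": len(self.rows) + 1,
            "contenthash": contenthash,
            "contextid": contextid,
            "filearea": filearea,
            "itemid": itemid,
            "filename": filename,
            "filesize": len(data),
        }
        self.rows.append(row)
        return row


pool = FilePool()
image = b"\x89PNG fake image bytes"
profile = pool.add_file(30, "profile", 0, "me.png", image)
post = pool.add_file(57, "post", 12, "photo.png", image)
```

After both calls there are two rows with different ids and names but the same contenthash, and a single blob in the pool.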

Field | Type | Default | Info
id | int(10) | auto-incrementing | The unique ID for this file.
contenthash | varchar(40) | | The sha1 hash of content.
pathnamehash | varchar(40) | | The sha1 hash of contextid+filearea+itemid+filepath+filename - prevents file duplicates and allows fast lookup
contextid | int(10) | | The context id defined in context table - identifies the instance of plugin owning the file.
filearea | varchar(50) | | Like "submissions", "intro" and "content" (images and swf linked from summaries), etc.; "blogs" and "userfiles" are special cases that live at the system context.
itemid | int(10) | | Some plugin specific item id (e.g. forum post, blog entry, assignment submission, or user id for user files)
filepath | text | | Relative path to file from module content root, useful in Scorm and Resource mod - most of the mods do not need this
filename | varchar(255) | | The full Unicode name of this file (case sensitive)
filesize | int(10) | | Size of file in bytes
mimetype | varchar(100) | NULL | Type of file
userid | int(10) | NULL | Optional - general user id field - meaning depending on plugin
timecreated | int(10) | | The time this file was created
timemodified | int(10) | | The last time the file was modified

Indexes:

  • non-unique index on (contextid, filearea, itemid)
  • non-unique index on (contenthash)
  • unique index on (pathnamehash).
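A minimal sketch of how pathnamehash could be derived from the identifying fields, written in Python for illustration. The specification only says the hash covers contextid+filearea+itemid+filepath+filename; joining the fields with "/" separators, and filepath starting and ending with "/", are assumptions made here so that neighbouring fields cannot run together.

```python
import hashlib


def pathname_hash(contextid: int, filearea: str, itemid: int,
                  filepath: str, filename: str) -> str:
    """Hash the full logical location of a file.

    Assumption: fields are joined with "/" and filepath both starts
    and ends with "/", e.g. "/" or "/images/".
    """
    full = f"/{contextid}/{filearea}/{itemid}{filepath}{filename}"
    return hashlib.sha1(full.encode("utf-8")).hexdigest()
```

Since the same location always produces the same hash, the unique index on pathnamehash rejects a second row for the same location, and a lookup by location becomes a single indexed comparison.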

The plugin type does not need to be specified because it can be derived from the context. Items like blog that do not have their own context will use their own file area inside a suitable context (in this case, the user context).

Entries with filename = '.' represent directories. Directory entries like this are created automatically when a file is added within them.
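The automatic creation of directory entries can be sketched like this. It is a Python illustration with hypothetical names (parent_directories, ensure_directories), a set of tuples standing in for the files table, and the assumption that filepath always starts and ends with "/".

```python
from typing import List, Set, Tuple

Row = Tuple[int, str, int, str, str]  # contextid, filearea, itemid, filepath, filename


def parent_directories(filepath: str) -> List[str]:
    """List every directory from the file area root down to filepath."""
    if not (filepath.startswith("/") and filepath.endswith("/")):
        raise ValueError("filepath must start and end with '/'")
    dirs = ["/"]
    current = "/"
    for part in filepath.split("/"):
        if part:
            current += part + "/"
            dirs.append(current)
    return dirs


def ensure_directories(rows: Set[Row], contextid: int, filearea: str,
                       itemid: int, filepath: str) -> None:
    """Add a '.' row for each directory that does not have one yet."""
    for directory in parent_directories(filepath):
        rows.add((contextid, filearea, itemid, directory, "."))
```

Adding a file at /scorm/images/ would therefore make sure that '.' rows exist for /, /scorm/ and /scorm/images/.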

Note: 'files' plural is used even though that goes against the coding guidelines, because 'file' is a reserved word.

Table: files_metadata

This table contains extra metadata about files. Repositories could provide this, or it could be manually edited in the local copy.

| Field | Type | Default | Info |
|---|---|---|---|
| id | int(10) | auto-incrementing | |
| fileid | int(10) | | Foreign key, references files.id |
| name | varchar(255) | | The name of the metadata field |
| value | text | | The value of this metadata field |

Note: this is not implemented yet.

Table: files_sync

This table contains information on how to synchronise data with repositories. Data would be synchronised from cron.php or on demand from file manager. The sync would be one way only (repository -> local file).

| Field | Type | Default | Info |
|---|---|---|---|
| id | int(10) | auto-incrementing | |
| fileid | int(10) | | Foreign key, references files.id |
| repositoryid | int(10) | | The repository instance this is associated with, see Development:Repository_API |
| updates | int(10) | | Specifies the update schedule (0 = none, 1 = on demand, other = some period in seconds) |
| repositorypath | text | | The full path to the original file on the repository |
| timeimportfirst | int(10) | | The first time this file was imported into Moodle |
| timeimportlast | int(10) | | The most recent time that this file was imported into Moodle |

Note: this is not implemented yet. It may end up being implemented within the Repository API instead. That is, this table may end up being called repository_sync.

Table: files_cleanup

This table contains candidates for deletion from the file pool. Files are not deleted immediately because there may be multiple references to the same file. Therefore, it is better for performance to simply add a row to this table, and later, on cron, clean the files off disc if they are really no longer used. Also, batch deletion on cron makes it easier to avoid concurrency issues when one user deletes what was the last reference to the file as another user adds a new reference. (The cron cleanup code should be doing appropriate locking.)

| Field | Type | Default | Info |
|---|---|---|---|
| id | int(10) | auto-incrementing | |
| contenthash | varchar(40) | | |

Implementation of basic operations

This is just an overview. See the external API description below for how to do this from client code.

Storing a file

  1. Calculate the SHA1 hash of the file contents.
  2. Check if a file with this SHA1 hash already exists on disc. If not, store the file there.
  3. Remove this SHA1 hash from the list of deleted files (the files_cleanup table), if present.
  4. Add the record for this file to the files table.
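The four storing steps can be sketched as a toy content-addressed pool. This is a Python illustration, not the Moodle implementation: the FilePool class is hypothetical, a list and a set stand in for the files and files_cleanup tables, and the directory layout follows the example path given earlier on this page.

```python
import hashlib
import os


class FilePool:
    """Toy file pool; in-memory containers stand in for database tables."""

    def __init__(self, filedir):
        self.filedir = filedir
        self.files = []        # stand-in for the files table
        self.cleanup = set()   # stand-in for the files_cleanup table

    def path_for(self, contenthash):
        return os.path.join(self.filedir, contenthash[0:2],
                            contenthash[2:4], contenthash[4:6], contenthash)

    def store(self, data, contextid, filearea, itemid, filepath, filename):
        # 1. Calculate the SHA1 hash of the file contents.
        contenthash = hashlib.sha1(data).hexdigest()
        # 2. Write the contents only if they are not in the pool already.
        path = self.path_for(contenthash)
        if not os.path.exists(path):
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "wb") as handle:
                handle.write(data)
        # 3. The contents are in use again, so cancel any pending cleanup.
        self.cleanup.discard(contenthash)
        # 4. Record this usage of the file.
        record = {"contenthash": contenthash, "contextid": contextid,
                  "filearea": filearea, "itemid": itemid,
                  "filepath": filepath, "filename": filename,
                  "filesize": len(data)}
        self.files.append(record)
        return record
```

Storing the same bytes twice under different names yields two records but only one file on disc.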

Reading a file

  1. Fetch the record (which includes the SHA1 hash) for the file you want from the files table.
  2. Retrieve the contents using the SHA1 hash.

Deleting a file

  1. Store the SHA1 hash of the file being deleted in the files_cleanup table.
  2. Delete the record from the files table.
  3. Later, admin/cron.php will actually delete the file from disc if it is no longer used (with proper table locking to prevent race conditions when adding/deleting files simultaneously).
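The deferred deletion described above can be sketched with plain dictionaries. This is a Python illustration with hypothetical names: files maps file id to a record, pool maps content hash to stored bytes, and cleanup stands in for the files_cleanup table. Locking is left out of the sketch entirely.

```python
def delete_file(files, cleanup, file_id):
    """Queue the contents for cleanup, then drop the files record."""
    record = files[file_id]
    cleanup.add(record["contenthash"])   # step 1: remember the candidate
    del files[file_id]                   # step 2: remove the reference


def cron_cleanup(files, cleanup, pool):
    """Remove queued contents that no files record refers to any more."""
    in_use = {record["contenthash"] for record in files.values()}
    removed = []
    for contenthash in sorted(cleanup):
        if contenthash not in in_use:
            pool.pop(contenthash, None)
            removed.append(contenthash)
    cleanup.clear()
    return removed
```

Deleting one of two references leaves the contents in the pool after the next cleanup run; only when the last reference is gone does the cleanup run remove them.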


File management API

These are the ways that other parts of the code manipulate files.


File management user-interface

This section describes the following:

  1. file manager
  2. integration with html editor
  3. interactions with repos

File manager

Single pane file manager is hard to implement without drag & drop which is notoriously problematic in web based applications. I propose to implement a two pane commander-style file manager. Two pane manager allows you to easily copy/move files between two different contexts (ex: courses).

File manager must not interact directly with filesystem API, instead each module should return traversable tree of files and directories with both real and localised names (localised names are needed for dirs like backupdata).

Originally there was a single file tree for each course. We need to fully separate each module/block from the course files and there might be also independent file areas in modules (ex: module introduction, content files, submissions, post attachments). File area may be defined as a small tree where we can use relative paths. These file areas are hanging from the branches of the context tree (this needs a picture).

Integration with htmleditor

Html editor should be able to browse only relevant files - for example when editing resource introduction only images from the file area of that resource should be available; when editing html resource page only the content area images should be listed.

There are several problems here:

  1. when adding a new resource its context does not exist yet, so we will have to create some table to handle temporary file storage while new items are being added; not easy but should be solvable - maybe we could abuse the course context id or store the files temporarily in some special user file area
  2. we can not use absolute address relinking for pluginfile.php links; instead we can use the absolute links only when editing, and convert them to something like @@thispluginfile@@/2112/112/image.jpg before storage. The local links would be converted back to full absolute links before display or editing. Not all file areas will support this (ex: linking to assignment submission does not make sense because nobody else may access it anyway). This would allow us to implement image preview in html editor.
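The link conversion in the second point could look roughly like this. It is a Python sketch under stated assumptions: the function names are hypothetical, the base URL of pluginfile.php is passed in by the caller, and a plain string replacement is used where real code would likely parse the HTML.

```python
PLACEHOLDER = "@@thispluginfile@@"


def links_for_storage(html: str, pluginfile_url: str) -> str:
    """Replace absolute pluginfile.php links with the placeholder."""
    return html.replace(pluginfile_url, PLACEHOLDER)


def links_for_display(html: str, pluginfile_url: str) -> str:
    """Expand the placeholder back into absolute links."""
    return html.replace(PLACEHOLDER, pluginfile_url)
```

Stored content then contains no site address, so the same text still resolves correctly if the site URL changes or the content is restored elsewhere.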

Html editor should contain a simplified single pane file manager with basic operations only - select file area, browse file area, upload file/copy user file/use repo file, delete. The editor will communicate with modules and core through an ajax call to some script specified by the module embedding the editor. The callback script would use different logic to construct the tree of files than the File manager; it needs to know only about files that other people viewing the resulting html may access.

Interactions with repos

Repositories may serve as a replacement for file uploading. They may be also used to synchronise files between courses. The repo option should be available whenever there is a file upload field, sometimes with extra "keep synchronised" option (this would not make sense for stuff like assignment submissions).


Backwards compatibility

Content backwards compatibility

This means existing courses should not lose images, flash, etc., though some new features (like resource sharing - if implemented) may not work with existing data that still uses files from the course files area.

There might be a breakage of links due to special characters stripping in uploaded files which will not match the links in uploaded html files any more. This should not be very common I hope.

Code backwards compatibility

Other Moodle code (for example plugins) will have to be converted to the new APIs. See Development:Using_the_file_API for guidance.

It is not possible to provide backwards-compatibility here. For example, the old $CFG->dataroot/$courseid/ will no longer exist, and there is no way to emulate that, so we won't try.


Upgrade and migration

When a site is upgraded to Moodle 2.0, all the files in moodledata will have to be migrated. This is going to be a pain, like DML/DDL was :-(

The upgrade process should be interruptible (like the Unicode upgrade was) so it can be stopped/restarted any time.

Migration of content

  • resources - move files to new resource content file area; can be done automatically for pdf, image resources; definitely not accurate for uploaded web pages
  • questions - image file moved to new area, image tag appended to questions
  • moddata files - the easiest part, just move to new storage
  • coursefiles - there might be many outdated files :-( :-(
  • rss feeds links in readers - will be broken, the new security related code would break it anyway

Moving files to files table and file pool

The migration process must be interruptible because it might take a very long time. Because files are moved out of the old location as they are migrated, restarting would be straightforward: whatever is still in the old location has not been migrated yet.

Proposed stages:

  1. migration of all course files except moddata - finish marked by some $CFG->files_migrated=true; - this step breaks the old file manager and html editor integration
  2. migration of blog attachments
  3. migration of question files
  4. migration of moddata files - each module is responsible to copy data from converted coursefiles or directly from moddata which is not converted automatically

Some people use symbolic links in coursefiles - we must make sure that those will be copied to new storage in both places, though they can not be linked any more - anybody wanting to have content synced will need to move the files to some repository and set up the sync again.

Talked about a double task here, when migrating course files to module areas:
  1. Parse html files to detect all the dependencies and move them together.
  2. Fallback in pluginfile.php so, if something isn't found in module filearea, search for it in course filearea, copying it and finally, serving it.
We also talked about the possibility of adding a new setting to resource in order to define whether it should work against the old coursefiles or the new self-contained file areas. Migrated resources will point to old coursefiles while new ones will enforce self-contained file areas.
It seems that only resource files will be really complex (because they allow arbitrary HTML inclusion). The rest (labels, intros...) do not, and should be easier to parse.
Eloy Lafuente (stronk7) 19:00, 29 June 2008 (CDT)

Other parts of Moodle that need to be modified

Backup/restore

File handling in backups needs to be fully rewritten - list of files in xml + pool of sha1 named files with contents. This solves the utf-8 trouble here, yay!!

Antivirus scanning

Issues that need to be resolved

Unicode support in zip format

This has now been solved by using the built in zip in PHP 5.2.8.

Zip format is an old standard for compressing files. It was created long before Unicode existed, and Unicode support was only recently added. There are several ways used for encoding of non-ASCII characters in path names, but unfortunately it is not very standardised. Most Windows packers use DOS encoding.

Client software:

  • Windows built-in compression - bundled with Windows, non-standard DOS encoding only
  • WinZip - shareware, Unicode option (since v11.2)
  • TotalCommander - shareware, single byte(DOS) encoding only
  • 7-Zip - free, Unicode or DOS encoding depending on characters used in file name (since v4.58beta)
  • Info-ZIP - free, uses some weird character set conversions

PHP extraction:

  • Info-ZIP binary execution - no Unicode support at all, mangles character sets in file names (depends on OS, see docs), files must be copied to temp directory before compression and after extraction
  • PclZip PHP library - reads single byte encoded names only, random problems and higher memory usage.
  • Zip PHP extension - reads single byte encoded names only, 64bit operating system can not open/create archives with more than 500 files (depends on sum of lengths of all filenames and directories, to be fixed in PHP 5.3 and external PECL library, no PHP 5.2.x backport planned!), adding of files is limited by number of free file handles (around 1000 - depends on OS and other PHP code, workaround is to close and reopen archive)

Large file support: PHP running under 32bit operating systems does not support files >2GB (do not expect fix before PHP 6). This might be a potential problem for larger backups.

Tar Alternative:

  • tar with gzip compression - easy to implement in PHP + zlib extension (PclTar, Tar from PEAR or custom code)
  • no problem with unicode in *nix, Windows again expects DOS encoding :-(
  • seems suitable for backup/restore - yay!

Roadmap:

  1. add zip processing class that fully hides the underlying library
  2. use single byte encoding "garbage in/garbage out" approach for encoding of files in zip archives; add new 'zipencoding' string into lang packs (ex: cp852 DOS charset for Czech locale) and use it during extraction, we might support true unicode later when PHP Zip extension does that
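The "garbage in/garbage out" approach in the second step amounts to treating archive file names as raw bytes and interpreting them with the charset named by the language pack. A minimal Python sketch, with hypothetical function names and cp852 taken from the Czech example above:

```python
def decode_zip_name(raw: bytes, zipencoding: str = "cp852") -> str:
    """Interpret a raw archive file name using the locale's DOS charset."""
    return raw.decode(zipencoding, errors="replace")


def encode_zip_name(name: str, zipencoding: str = "cp852") -> bytes:
    """Encode a Unicode file name for writing into an archive.

    Characters the charset cannot represent are replaced with '?'.
    """
    return name.encode(zipencoding, errors="replace")
```

A name written with one charset and read back with another comes out mangled, which is why the charset has to match the packer that created the archive.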


Possible future ideas

Files maintenance report

This would do deep validation of the files on disc. It would:

  • Report when there is a row in the files table, but the corresponding file is missing from the file system.
  • Report files that still exist on disc, even though no references remain.
  • Report orphaned files.
  • Report total disc space usage by context (as much as is possible with content-addressable file storage).
  • Verify that the SHA1 hash of the contents of each file matches the file name.

In addition, it might offer options to fix these problems, where possible.
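The core checks of such a report can be sketched in a few lines of Python. The function name and the input shapes are hypothetical: referenced is the set of contenthash values found in the files table, and pool maps each stored file name to its contents.

```python
import hashlib


def maintenance_report(referenced, pool):
    """Compare the files table against the contents of the file pool."""
    stored = set(pool)
    return {
        # Rows whose contents are missing from disc.
        "missing": sorted(set(referenced) - stored),
        # Contents on disc that no row refers to.
        "unreferenced": sorted(stored - set(referenced)),
        # Contents whose SHA1 hash does not match the file name.
        "corrupt": sorted(name for name, data in pool.items()
                          if hashlib.sha1(data).hexdigest() != name),
    }
```

A real report would stream file contents from disc instead of holding them in memory, and would skip hashes that are still queued in files_cleanup.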

Support quotas per user, course, etc.

People want this.

If we have implemented the above report, it would then just be a matter of adding the interface hooks to stop people uploading more files once the quota has been reached.

(We could also divide the file size by number of instances that are using it, this might be considered more accurate in some scenarios.)
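The two ways of counting usage can be compared with a small Python sketch. The names are hypothetical, and rows is assumed to be a list of (userid, contenthash, filesize) tuples taken from the files table.

```python
from collections import Counter, defaultdict


def usage_by_user(rows, shared=False):
    """Sum file sizes per user.

    With shared=True the size of each file is divided by the number
    of rows that reference the same contents.
    """
    references = Counter(contenthash for _, contenthash, _ in rows)
    usage = defaultdict(float)
    for userid, contenthash, filesize in rows:
        if shared:
            usage[userid] += filesize / references[contenthash]
        else:
            usage[userid] += filesize
    return dict(usage)


def quota_reached(usage, userid, quota):
    """Return True once the user's usage has reached the quota."""
    return usage.get(userid, 0) >= quota
```

With two users each referencing the same 100-byte file, plain counting charges each user 100 bytes, while the shared variant charges each 50.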


See also