File API internals: Difference between revisions

From MoodleDocs
(Replaced by "Skodak's rants" from talk page)
Line 1: Line 1:
This page outlines the current thinking about implementing file storage and access in Moodle 2.0.  It's a SPECIFICATION UNDER CONSTRUCTION!
A lot of this has been brought over from the discussion about the [[Repository API]] which is intimately connected. 


The page is open for everyone so everyone can help correct mistakes and help with the evolution of this document.  However, if you have questions, problems to report or major changes to suggest please add them to the [[Development_talk:File_API|page comments]], or start a discussion in the [http://moodle.org/mod/forum/view.php?id=1807 Repositories forum].  We'll endeavour to merge all such suggestions into the main spec before we start development.
Line 10: Line 8:
# Allow files to be added directly into Moodle (as we do now)
# Remember where files came from
# Give modules control over the access to files
# Give modules control over the access to files using capabilities and other local rules
# Allow content to be used in multiple Moodle contexts securely and simply via capabilities
# Consistent and simple approach for ALL file handling throughout Moodle


Line 18: Line 15:


The File API is a core set of interfaces that all Moodle code will use to:
# copy files into Moodle
# store files within Moodle
# display files to Moodle users
Line 24: Line 20:
It applies only to "user" files.  It will NOT apply to local files and caches created by Moodle such as these directories in dataroot: temp, lang, cache, environment, filter, rss, search, sessions, upgradelogs etc


==Use cases==
The API will be split into several independent parts:
# File serving API
## file.php
## pluginfile.php
## userfile.php
## rssfile.php
# File storage API
## optional access control
## optional repo sync
# File management API
## File browsing
## File linking (editor integration)
## Upload from repository
 
==File serving API==
Deals with serving of files - the browser requests a file, Moodle sends it back. We have three main files. It is important to set up slasharguments on the server (file.php/some/thing/xxx.jpg); any content that relies on relative links cannot work without it (SCORM, uploaded HTML pages, etc.).
 
===file.php===
Serves course files.
 
Implements basic file access. Ideally only images and files linked from course sections should be there. No XSS protection is required - we expect JavaScript, SWF, etc. there, and there is no way to make it "secure". The access control is not critical any more if we move most of the files into modules.
 
The file name and parameter structure is critical for backwards compatibility of existing course content.
 
/file.php/courseid/dir/dir/filename.ext
 
Internally the files would be stored in <code>array('contextid'=>$coursecontextid, 'filearea'=>'content', 'itemid'=>0)</code>
 
===pluginfile.php===
(aka modfile.php)
Sends module, block, question files.
* modules decide about access control
* optional XSS protection - student submitted files must not be served with normal headers, we have to force download instead; ideally there should be second wwwroot for serving of untrusted files
* only internal links to selected areas are supported - you can link images in summary area, but not the assignment submissions


Coming soon
Absolute file links need to be rewritten if HTML editing is allowed in the module. The links are stored internally as relative links. Before editing or display, the internal link representation is converted to absolute links using a simple str_replace(): @@thispluginlink/summary@@/image.jpg --> /pluginfile.php/assignmentcontextid/summary/image.jpg. It is converted back to internal links before saving.


::Can the distinct file areas supported by one plugin be declared somehow in order add some information about them? For example, I think it can be interesting to declare:
::* assignment_summary:
::** relpath='summary'
::** userdata=false
::** anotherproperty=anothervalue
::* assignment_submission:
::** relpath='submission/@@USERID@@'
::** userdata=false
::** anotherproperty=anothervalue
::* and so on...
::And then, when the editor "receives" one "assignment_summary" areaname, it knows what to show and so on? That info could also be useful in backup & restore, to know whether some fileareas have to be processed or not (userdata=false), or when reconstructing the links (str_replace() above). It would also give us a well-defined list of fileareas per module, instead of coding them in a free way (prone to errors). [[User:Eloy Lafuente (stronk7)|Eloy Lafuente (stronk7)]] 16:35, 28 June 2008 (CDT)


==General Architecture==
::Something like this will be part of file management API, hardcoding this in file storage would make it less flexible imo [[User:Skodak|Skodak]]


All file-handling areas in Moodle (eg adding a new resource, adding attachments to a forum post, uploading assignments) will be rewritten to talk to the standard API class methods in a standard way.
::Yup, yup. Storage doesn't know anything but get/put files (nothing else). It's part of management, absolutely. [[User:Eloy Lafuente (stronk7)|Eloy Lafuente (stronk7)]] 11:21, 29 June 2008 (CDT)


When adding a file, the interface will allow a choice from the (active) [[Repository_API]] repository plugins, each of which is a subclass of the core Repository class. The default plugin is "local" which shows local files already in Moodle (and allows files to be added from desktop there, much like current filemanager which it replaces).
/pluginfile.php/contextid/areaname/arbitrary/params/or/dirs/filename.ext


As is usual in Moodle, there will be admin settings to disable/enable certain repository plugins as standard, as well as user settings so that users can add their own personal repositories to the standard list (eg [http://briefcase.yahoo.com Yahoo Briefcase] or [http://docs.google.com Google Docs]) and to select their default repository.
pluginfile.php detects the type of plugin from context table, fetches basic info (like $course or $cm if appropriate) and calls plugin function (or later method) which does the access control and finally sends the file to user. ''areaname'' separates files by type and divides the context into several subtrees - for example ''summary'' files (images used in module intros), post attachments, etc.
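
The URL layout above can be illustrated with a small parser. This is only a sketch in Python (not Moodle code): the names <code>PluginFileRequest</code> and <code>parse_pluginfile_path</code> are hypothetical, and the rejection of "." and ".." segments is an assumption added for safety, not something the spec states.

```python
from typing import List, NamedTuple


class PluginFileRequest(NamedTuple):
    """Hypothetical container for the pieces of a pluginfile.php path."""
    contextid: int
    areaname: str
    args: List[str]   # arbitrary params or dirs, interpreted by the plugin
    filename: str


def parse_pluginfile_path(path: str) -> PluginFileRequest:
    """Split the slasharguments part: /contextid/areaname/.../filename.ext"""
    parts = [p for p in path.split("/") if p]
    if len(parts) < 3:
        raise ValueError("expected contextid/areaname/.../filename")
    if not parts[0].isdigit():
        raise ValueError("contextid must be numeric")
    # Assumption: directory traversal segments are never legitimate here.
    if any(p in (".", "..") for p in parts):
        raise ValueError("relative path segments are not allowed")
    return PluginFileRequest(int(parts[0]), parts[1], parts[2:-1], parts[-1])
```

The plugin owning the context would then receive <code>areaname</code> and <code>args</code> and decide on access control itself, as described above.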


Once a file has been selected the file will almost always be copied into Moodle there and then.  However there will also be options to:
====assignment example====
* only return the URL to the file if it's desired to keep it external (but this does present security and integrity risks), or
* refresh the local file copy regularly and automatically
* refresh the file manually from the File manager interface


All files in Moodle will be listed in a table (see below) allowing us to store various metadata about each file.  The file contents will not be in the database (though we could easily offer that option if we want to), they will be on disk with a name related to the id rather than the "human" name (this avoids a lot of OS Unicode problems).
/pluginfile.php/assignmentcontextid/summary/someimage.jpg
/pluginfile.php/assignmentcontextid/submission/submissionid/attachmentname.ext
  /pluginfile.php/assignmentcontextid/extra/allsubmissionfiles.zip


The module that is responsible for initiating the file will be remembered as the "owner" of that file, and will be responsible for access to that file afterwards, either by publishing the permissions required to access the file or by providing a callback function that can be used. For example, the assignment plugin may, after allowing a student to select a file to be submitted, add permissions so that people who have grade permissions in that assignment can read it.  Or it may choose to provide a function that can do a more detailed check based on dates and so on.
::Uhm... all those files together? What's going to differentiate the "submission" path in the example above from the "summary" path? Is it supposed that the editor or the filemanager won't allow, for example, picking a file from the "submission" area to be used in the summary of an assignment, and that only the "summary" area will be shown? That means multiple file managers per context, which goes against the clean "one file manager per context" agreed below. [[User:Eloy Lafuente (stronk7)|Eloy Lafuente (stronk7)]] 21:28, 26 June 2008 (CDT)
::Yes Eloy, the different areas (summary, submission, etc.) have different uses and different access control. There are two types of file manager - the two pane file manager which lists all contexts+areas the user may access, and a minimalistic manager in the html editor which shows only a subset of areas from the current plugin (because you cannot link anything else).


All files will be served via a single control script in Moodle, located at $CFG->fileroot.  This could be the same as $CFG->wwwroot by default, but it will be recommended (for security and avoiding XSS) that Moodle admins set up a second DNS name pointing to this script, eg the main site could be at http://moodle.domain.edu but files would be served via http://moodlefiles.domain.edu/file.php.  (We'll have to set session cookies on both domains and keep them in sync somehow).
====scorm example====


The file.php will serve files using slasharguments almost as now.  See the section below on serving files for the details of this.
/pluginfile.php/scormcontextid/summary/someimage.jpg
  /pluginfile.php/scormcontextid/content/revisionnumber/dir/somescormfile.js


==Local Files==
The revision counter is incremented when any file changes in order to prevent caching problems. The lifetime should be adjustable in module settings.


In general, all external files will be copied locally and stored in Moodle.  All existing files in the Moodle dataroot course areas will be moved into this new system during the upgrade.
====quiz example====


The files will not be stored as they have been in the past.  The new file system is "flat" with each file stored as a name calculated from the content (see below). The full name, path and other metadata will now be stored in tables.
pluginfile.php/quizcontextid/summary/niceimage.jpg
  pluginfile.php/quizcontextid/report/type/export.ods


The name of the file will be the SHA1 hash calculated from the content of the file using the PHP function '''sha1_file()'''.  This results in names like:  231e2dc421be4fcd0172e5afceea3970e2f3d940.jpg
====questions example====


To avoid running out of operating system file nodes we'll use a directory structure with perhaps three levels (a very conservative max of 1000 nodes per directory allows a billion files to be stored):
pluginfile.php/SYSCONTEXTID/question/questionid/file.jpg


dataroot
====blog example====
    /files
Blog entries or notes in general do not have a context id (because they live in the system context; SYSCONTEXTID below is the id of the system context).
      /0
The note attachments are always served with XSS protection on, ideally we should use separate wwwroot for this. Access control can be hardcoded.
      /1
      /2
        /0
        /1
        /2
        /3
            /0
            /1
              /231e2dc421be4fcd0172e5afceea3970e2f3d940.jpg
      /3
      /4
      /5
      /6
      /7
      ...
      /e
      /f


We should probably keep the mime-derived file extensions (eg .jpg) to help people who might be browsing the files directly for some reason.
/pluginfile.php/SYSCONTEXTID/blog/blogentryid/attachmentname.ext


The big advantage of using this scheme is that if two or more files have the same content, or the same file is used in different contexts then there will only be one copy of the actual data.  A simple clean-up function in cron could find and delete file data that no longer have any references to them in the file table (or it could be part of the file API that deletes files).
Internally stored in <code>array('contextid'=>SYSCONTEXTID, 'filearea'=>'blog', 'itemid'=>$blogentryid)</code>


==File serving==
====backup example====
It would be nice to have some special protection of backup files - new capabilities for backup file download, upload. Backups contain a lot of personal info, we could block restoring of backups from other sites too.


There will be at least three different scripts serving files (for full security).
/pluginfile.php/coursecontextid/backup/backupfile.zip


===Course files: file.php===
Internally stored in <code>array('contextid'=>$coursecontextid, 'filearea'=>'backup', 'itemid'=>0)</code>


This script is for general course files and backward compatibility for old modules.
===userfile.php===
Personal file storage, intended as an online storage of work in progress like assignments before the submission.
* read/write own files only for now
* option to share with others later
* personal "websites" will not be supported (security)


# File gets a URL like: file.php/'''courseid'''/dir/dir/file.jpg
  /userfile.php/userid/dir/dir/filename.ext
# File uses path and courseid to get the record from the '''file''' table for all information about this file.
# File uses fileid to get the current ACL from the '''file_access''' table.
# Each line of the ACL is checked, if any of them are true then access is given.
## Use has_capability with context and capability to check permissions
# If access is allowed then serve the file.


===Module files: modfile.php===
===rssfile.php===
Replaces rss/file.php, which is kept only for backwards compatibility.
RSS files should not require sessions/cookies; URLs should contain some sort of security token/key.
Internally the files may be stored in the database or together with other files.
Performance improvements - we should support both ETag (cool) and Last-Modified (more used); when we receive If-None-Match/If-Modified-Since and the feed has not changed => 304.
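
The ETag/Last-Modified handling mentioned above could look roughly like the following. This is a minimal Python sketch of conditional GET handling, not Moodle code; the function names <code>http_date</code> and <code>conditional_status</code> and the plain dict of request headers are assumptions made for illustration.

```python
from email.utils import formatdate, parsedate_to_datetime


def http_date(mtime: int) -> str:
    """Format a Unix timestamp as an HTTP date for the Last-Modified header."""
    return formatdate(mtime, usegmt=True)


def conditional_status(request_headers: dict, etag: str, mtime: int) -> int:
    """Return 304 if the client's cached copy is still valid, else 200."""
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # When an ETag is supplied it takes precedence over the date check.
        candidates = [t.strip() for t in if_none_match.split(",")]
        return 304 if etag in candidates or "*" in candidates else 200
    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since).timestamp()
        except (TypeError, ValueError):
            return 200  # unparseable date: send the full response
        return 304 if mtime <= since else 200
    return 200
```

A 304 response carries no body, so an RSS reader polling an unchanged feed costs almost nothing.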


This script is a new one giving modules "ownership" over files and complete control (if required) over their access.  Modules will generally provide a callback function to determine access to a file.
/rssfile.php/contextid/any/parameters/module/wants/rss.xml
  /rssfile.php/SYSCONTEXTID/blog/userid/rss.xml


# File gets a URL like:  modfile.php/'''contextid'''/dir/dir/file.jpg
Again modules and plugins decide what gets sent to user.
# File uses path and contextid to get the record from the '''file''' table for all information about this file.
# File uses fileid to get the current ACL from the '''file_access''' table.
# Each line of the ACL is checked, if any of them are true then access is given.
## If the capability is "function/accessfunction" then file.php looks for a function called '''accessfunction''' in the module's lib.php to return true/false.
## Otherwise use has_capability with context and capability to check permissions
# If access is allowed then serve the file.
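
The ACL check in the steps above can be sketched as follows. This is an illustrative Python model only: <code>can_access</code>, the <code>has_capability</code> stand-in and the <code>module_functions</code> lookup (standing in for functions found in a module's lib.php) are all hypothetical names, and ACL rows are modelled as (contextid, capability) pairs.

```python
from typing import Callable, Dict, List, Tuple

# One ACL line: (contextid, capability), mirroring a row of file_access.
AclEntry = Tuple[int, str]


def can_access(acl: List[AclEntry],
               has_capability: Callable[[str, int], bool],
               module_functions: Dict[str, Callable[[], bool]]) -> bool:
    """Grant access as soon as any single ACL line evaluates to true."""
    for contextid, capability in acl:
        if capability.startswith("function/"):
            # "function/accessfunction" delegates the decision to the module.
            callback = module_functions.get(capability[len("function/"):])
            if callback is not None and callback():
                return True
        elif has_capability(capability, contextid):
            return True
    return False
```

An empty ACL denies access in this sketch; the rule that users can always see their own files would need a separate check before this one.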


This should work for all kinds of modules (not just activity modules)... we need to make that efficient.
===Temporary files===
Temporary files are usually used during the lifetime of one script only.
uses:
* exports
* imports
* zipping/unzipping
* processing by executable files (latex, mimetex)


===User files: userfile.php===
Ideally these files should never use utf-8 names (which is a major problem for zipping at the moment).
The proposed new sha1 based file storage is not suitable here, for both performance and technical reasons.


This script is a new one for users to share personal files (eg could be images embedded in HTML).  The only security is public/private.
===Legacy file storage and serving===
Going to use good-old separate directories in $CFG->dataroot.


# File gets a URL like:  userfile.php/'''userid'''/dir/dir/file.jpg
file serving and storage:
# File uses path and userid to get the record from the '''file''' table for all information about this file.
# user avatars - user/pix.php
# If the file is "private" then require current user to match userid. Otherwise the file is "public" and no check is performed.
# group avatars - user/pixgroup.php
# Serve the file
# tex, algebra - filter/tex/* and filter/algebra/*
# rss cache (?full rss rewrite soon?) - backwards compatibility only rss/file.php


From the interface point of view, this is a user repository of files for casual use, governed by quotas.
only storage:
#sessions


# User goes to the "Files" area (similar to current Moodle, but for all users)
==File storage API==
# User can upload/download/rename/move/delete their own set of files.
Modules in general work only with local Moodle files. One of the major reasons is performance when accessing external repository files. It will be possible to use repositories instead of file uploading and also to keep local files synced with an external repository.
# Each of these files can either be marked PRIVATE or PUBLIC (to everyone).
# A second tab on that page allows everyone to browse all the public files from everyone else.
# A user's public files can also be listed near their profile.
# The listings are smart about showing various media


==Tables==
File contents are stored in moodledata/filepool indexed using SHA1 hashes instead of file names; file names, relative paths and other metadata will be stored in file(_xxx) database tables. This should be fully abstracted so that modules do not actually know where the files are located. When storing files the content is sent as string or file handle, when reading content it is returned as file handle.


=== file ===
===files table===


This table contains one entry for every file.  Enough information is kept here so that the file can be fully identified and retrieved again if necessary.
note: plural used because file is a reserved word


{| border="1" cellpadding="2" cellspacing="0"
{| border="1" cellpadding="2" cellspacing="0"
Line 151: Line 183:


|-
|-
|filename
|sha1hash
|varchar
|varchar(40)
|
|The sha1 hash of content.
 
|-
|'''contextid'''
|int(10)
|
|The context id defined in context table - identifies the instance of plugin owning the file.
 
|-
|filearea
|varchar(50)
|
|
|The full Unicode name of this file
|Like "submissions", "intro" and "content" (images and swf linked from summaries), etc.; "blogs" and "userfiles" are special cases that live at the system context.


|-
|-
|mimetype
|itemid
|varchar
|int(10)
|
|
|Full mimetype of the file (or should we rely on extension?)
|Some plugin specific item id (eg. forum post, blog entry or assignment submission or user id for user files)


|-
|-
|'''userid'''
|filepath
|int(10) 
|text
|  
|
|The user id of the person who created this entry
|relative path to file from module content root, useful in Scorm and Resource mod - most of the mods do not need this


|-
|-
|'''modulename'''
|filename
|varchar(255)
|varchar(255)
|  
|
|The module that is the "owner" of this file (eg "moodle" or "mod/assignment" or "blocks/html")
|The full Unicode name of this file (case sensitive)


|-
|-
|'''moduleinstance'''
|filesize
|int(10)
|int(10)
|  
|
|The instance of the module that is the "owner" of this file (eg assignment id 5)
|size of file - bytes


|-
|-
|'''modulecallback'''
|mimetype
|boolean
|varchar(100)
|  
|NULL
|If true then file.php will use a class in modulename/modfile.php to determine access
|type of file


|-
|-
|'''originalfileid'''  
|'''userid'''
|int(10)   
|int(10)   
|  
|NULL
|The id of the file which this is a copy of.  If this is set, then all changes to the parent will be copied to this entry too.
|Optional - general user id field - meaning depending on plugin


|-
|-
|'''repositoryid'''
|timecreated
|int(10)
|int(10)
|
|
|The repository instance this is associated with, see [[Repository_API]]
|The time this file was created


|-
|-
|updates
|timemodified
|int(10)
|int(10)
|
|
|Specifies the update schedule (0 = none, 1 = on demand, other = some period in seconds)
|The last time the file was modified
|}
 
index on "contextid, filearea, itemid" and "sha1hash"
 
Plugin type is not specified because it is derived from contextid, items like blog that do not have own context will use own filearea usually from systemcontextid.
 
::Perhaps we could also hash filepath and filename and index by them, to avoid some text limitations on the DB side (length limits of indexes, not indexable, complex retrieval...). [[User:Eloy Lafuente (stronk7)|Eloy Lafuente (stronk7)]] 11:54, 29 June 2008 (CDT)
 
::Also, perhaps we should store finally the plugin type there to save some queries per request, using it to drive to the correct file handling of each plugin. [[User:Eloy Lafuente (stronk7)|Eloy Lafuente (stronk7)]] 18:54, 29 June 2008 (CDT)
 
=== files_cleanup table ===
 
This table contains candidates for deletion from the file pool. Files are not deleted immediately; cron uses the files_cleanup table, verifies the file is not used any more and deletes it from the pool. The reasons for cron clean-up are performance and prevention of collisions - there could be a problem with concurrent uploads and deletes, so we will probably need to add some table based locking during the clean-up.
 
We might add an extra script that does deep validation of the pool area - report missing files, report orphaned files, content not matching the sha1 filename, etc. - this would be very time consuming.
 
{| border="1" cellpadding="2" cellspacing="0"
|'''Field'''
|'''Type'''
|'''Default'''
|'''Info'''


|-
|-
|cachetime
|'''id'''
|int(10)
|int(10)
|
|
|Specifies how long this file can be cached by browsers
|autoincrementing


|-
|-
|moodlepath
|sha1hash
|text
|varchar(40)
|
|  
|The virtual path to the file locally (so we can still have apparent subdirectories etc)
 
|}
 
=== files_metadata table ===
 
This table contains extra metadata about files.  Repositories could provide this, or it could be manually edited in the local copy.
 
{| border="1" cellpadding="2" cellspacing="0"
|'''Field'''
|'''Type'''
|'''Default'''
|'''Info'''


|-
|-
|repositorypath
|'''id'''
|text
|int(10) 
|
|
|The full path to the original file on the repository
|autoincrementing


|-
|-
|timeimportfirst
|'''fileid'''
|int(10)
|int(10)
|
|Id of file.
|-
|'''name'''
|varchar(255)
|
|
|The first time this file was imported into Moodle
|The name of extra metadata


|-
|-
|timeimportlast
|value
|int(10)
|text
|
|
|The most recent time that this file was imported into Moodle
|Value
 
|}
 
===files_acl table===
 
This table describes an optional ACL for a file. This is not required in the majority of cases: modules usually hardcode the file access logic, and course files should not be used much any more.
 
{| border="1" cellpadding="2" cellspacing="0"
|'''Field'''
|'''Type'''
|'''Default'''
|'''Info'''


|-
|-
|timecreated
|'''id'''
|int(10)
|int(10)
|
|
|The time this file was created (if known), otherwise same as time imported
|autoincrementing


|-
|-
|timemodified
|'''fileid'''
|int(10) 
|
|The file we are defining access for
 
|-
|'''contextid'''
|int(10)
|int(10)
|
|
|The last time the file was modified
|The context where this file is being published


|-
|-
|timeaccessed
|'''capability'''
|int(10)
|text
|
|
|The last time this file was accessed for any reason
|The capability that is required to see this file.
|}
|}


=== file_access ===
====acl notes====
* this is missing some concept similar to '''user/group/others''', for example in case of user files typical user can not assign permissions or view them - this becomes useless there
* it is more important to synchronise the availability of file link and the file itself - having link pointing to inaccessible file or file which is accessible when not wanted are both problems
* browser/proxy caching works against us here - "secret" files should not be cached


This table describes the ACL for each file, so that checks can easily be made on whether someone can see this file or not. Note there can be multiple entries per file.  Users can ALWAYS see their own files, so there are no entries here for that.
===files_sync table===
 
This table contains information on how to synchronise data with repositories. Data would be synchronised from cron.php or on demand from the file manager. The sync would be one way only (repository-->local file).


{| border="1" cellpadding="2" cellspacing="0"
{| border="1" cellpadding="2" cellspacing="0"
Line 271: Line 376:
|-
|-
|'''fileid'''  
|'''fileid'''  
|int(10)
|int(10)
|  
|  
|The file we are defining access for
|Id of file.
 
|-
|'''repositoryid'''
|int(10)
|
|The repository instance this is associated with, see [[Repository_API]]


|-
|-
|'''contextid'''
|updates
|int(10)
|int(10)
|
|
|The context where this file is being published
|Specifies the update schedule (0 = none, 1 = on demand, other = some period in seconds)


|-
|-
|'''capability'''
|repositorypath
|text
|text
|
|
|The capability that is required to see this file.
|The full path to the original file on the repository
 
|-
|timeimportfirst
|int(10)
|
|The first time this file was imported into Moodle
 
|-
|timeimportlast
|int(10)
|
|The most recent time that this file was imported into Moodle
|}
|}
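
The meaning of the ''updates'' field (0 = none, 1 = on demand, other = period in seconds) can be illustrated with a small decision function. This is a hedged Python sketch, not Moodle code: <code>is_sync_due</code> and its <code>on_demand</code> flag are hypothetical, and allowing a manual refresh for periodically synced files is an assumption.

```python
SYNC_NONE = 0        # never refresh from the repository
SYNC_ON_DEMAND = 1   # refresh only when requested from the file manager


def is_sync_due(updates: int, timeimportlast: int, now: int,
                on_demand: bool = False) -> bool:
    """Decide whether a file should be re-imported from its repository.

    updates        -- value of the 'updates' column
    timeimportlast -- Unix time of the most recent import
    now            -- current Unix time (cron run or manual request)
    on_demand      -- True when the user asked for a refresh explicitly
    """
    if updates == SYNC_NONE:
        return False
    if updates == SYNC_ON_DEMAND:
        return on_demand
    # Any other value is a period in seconds between automatic refreshes.
    # Assumption: a manual request also refreshes periodically synced files.
    return on_demand or (now - timeimportlast) >= updates
```

Cron would call this with <code>on_demand=False</code> for every row and update ''timeimportlast'' after each successful import.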


==Class methods==
===File content storage===
Originally the file storage hierarchy contained a lot of metadata including userids, entry ids, filenames, etc. The file content will now be stored separately from file metadata. It must support utf8 on all platforms.
 
File storing:
# calculate SHA1 hash of content
# check if file with SHA1 name exists, if not add the file to file pool
# remove SHA1 from list of deleted files if found there
# store file in ''file'' table, use SHA1 as file pool identifier


===File class===
File reading:
#fetch file record from 'file' table - probably using file id or combination of contextid+instanceid
#fetch content of file


This class implements the display and management of files from local storage, with full access checking.  Some of the functions are for single files, while some are optimised for bulk display and searching (eg in the personal files interface).
File deleting:
#delete record from ''file'' table, remember file SHA1
#store the deleted SHA1 in deleted files table, do not remove the physical file yet
#wait for cron cleanup script to actually delete the file named SHA1 (proper table locking needed to prevent race conditions when adding/deleting files)
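
The storing, reading and deleting steps above can be modelled in a few lines. This is an in-memory Python sketch under stated assumptions, not the planned implementation: the class <code>FileStorage</code> and its method names are hypothetical, dictionaries stand in for the database tables and the file pool, and no locking is shown.

```python
import hashlib


class FileStorage:
    """Toy model: 'pool' holds content by SHA1, 'files' holds metadata rows."""

    def __init__(self):
        self.pool = {}        # sha1 -> content bytes (the file pool)
        self.files = {}       # file id -> metadata record (the files table)
        self.cleanup = set()  # sha1 values that are candidates for deletion
        self._next_id = 1

    def store(self, contextid, filearea, itemid, filename, content: bytes) -> int:
        sha1 = hashlib.sha1(content).hexdigest()
        if sha1 not in self.pool:          # identical content is stored once
            self.pool[sha1] = content
        self.cleanup.discard(sha1)         # content is in use again
        fileid = self._next_id
        self._next_id += 1
        self.files[fileid] = {"contextid": contextid, "filearea": filearea,
                              "itemid": itemid, "filename": filename,
                              "sha1hash": sha1}
        return fileid

    def read(self, fileid) -> bytes:
        return self.pool[self.files[fileid]["sha1hash"]]

    def delete(self, fileid) -> None:
        record = self.files.pop(fileid)
        self.cleanup.add(record["sha1hash"])   # physical file stays for now

    def cron_cleanup(self) -> None:
        in_use = {r["sha1hash"] for r in self.files.values()}
        for sha1 in list(self.cleanup):
            if sha1 not in in_use:             # verify before deleting
                self.pool.pop(sha1, None)
            self.cleanup.discard(sha1)
```

Note that deleting one of two records sharing the same content leaves the pooled content in place, because the clean-up step re-checks references before removing anything.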


====add_file====
====File pool details====
Located in $CFG->dataroot/filepool/. All files cannot be stored in one directory due to OS limitations, so the pool uses 3 levels of subdirectories taken from the start of the sha1 hash (in the example below, the first three pairs of characters). It is unlikely that there will be thousands of files whose content hashes share the same leading characters.


====delete_file====
This type of storage saves a lot of disk space when storing multiple copies of the same large file. It can also help substantially when synchronising data with external repositories. Another benefit is we can detect inconsistencies in file content.


====modify_file====
File read performance is similar to previous code, file write performance will be slower - due to hashing and extra database access.


...
dataroot
    /filepool
      /00
      /01
      ...
      /'''23'''
        /00
        /01
        ...
        /'''1e'''
            /00
            /01
            ...
            /'''2d'''
              /231e2dc421be4fcd0172e5afceea3970e2f3d940.jpg
      ...
      /fe
      /ff
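
The directory layout drawn above can be computed directly from the hash. This Python sketch follows the diagram (three levels, each named after one pair of hex characters); the function names <code>pool_path_for_hash</code> and <code>pool_path</code> are hypothetical, and keeping a file extension on the pooled name is taken from the example, not a settled decision.

```python
import hashlib
from pathlib import PurePosixPath

HEX_DIGITS = set("0123456789abcdef")


def pool_path_for_hash(dataroot: str, sha1: str, extension: str = "") -> PurePosixPath:
    """Map a 40-character lowercase SHA1 hex digest to its place in the pool."""
    if len(sha1) != 40 or not set(sha1) <= HEX_DIGITS:
        raise ValueError("expected a 40-character lowercase hex SHA1")
    # Three directory levels, as in the diagram: 23/1e/2d/231e2d...
    return PurePosixPath(dataroot, "filepool",
                         sha1[0:2], sha1[2:4], sha1[4:6], sha1 + extension)


def pool_path(dataroot: str, content: bytes, extension: str = "") -> PurePosixPath:
    """Hash the content and return where it would live in the pool."""
    return pool_path_for_hash(dataroot, hashlib.sha1(content).hexdigest(), extension)
```

With 256 entries per level, three levels give about 16.7 million leaf directories, so each one stays small even for very large sites.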
 
==File management API==
 
This section describes the following:
#file manager
#integration with html editor
#interactions with repos
 
===File manager===
Single pane file manager is hard to implement without drag & drop which is notoriously problematic in web based applications. I propose to implement a two pane commander-style file manager. Two pane manager allows you to easily copy/move files between two different contexts (ex: courses).
 
File manager must not interact directly with filesystem API, instead each module should return traversable tree of files and directories with both real and localised names (localised names are needed for dirs like backupdata).
 
Originally there was a single file tree for each course. We need to fully separate each module/block from the course files and there might be also independent file areas in modules (ex: module introduction, content files, submissions, post attachments). File area may be defined as a small tree where we can use relative paths. These file areas are hanging from the branches of the context tree (this needs a picture).
 
===Integration with htmleditor===
Html editor should be able to browse only relevant files - for example when editing a resource introduction only images from the intro file area of that resource should be available, when editing an html resource page only the content area images should be listed.
 
There are several problems here:
# when adding a new resource its context does not exist yet. We will have to create some table to handle temporary file storage when adding new stuff; not easy, but it should be solvable - maybe we could abuse the course context id or store it temporarily in some special user file area
# we cannot use absolute address relinking for pluginfile.php links. Instead we can use the absolute links only when editing, and convert them to something like @@thispluginfile/intro@@/2112/112/image.jpg before storage. The local links would be converted back to full absolute links before display or editing. Not all file areas will support this (ex: linking to assignment submission does not make sense because nobody else may access it anyway). This would allow us to implement image preview in html editor.
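
The link conversion in the second point amounts to a plain string replacement in both directions. A minimal Python sketch, assuming the placeholder has the form shown above; the function names <code>to_absolute</code> and <code>to_internal</code> and the example wwwroot are hypothetical.

```python
PLACEHOLDER = "@@thispluginfile/{area}@@"


def to_absolute(html: str, wwwroot: str, contextid: int, area: str) -> str:
    """Expand stored internal links before display or editing."""
    base = "{0}/pluginfile.php/{1}/{2}".format(wwwroot, contextid, area)
    return html.replace(PLACEHOLDER.format(area=area), base)


def to_internal(html: str, wwwroot: str, contextid: int, area: str) -> str:
    """Collapse absolute links back to the internal form before storage."""
    base = "{0}/pluginfile.php/{1}/{2}".format(wwwroot, contextid, area)
    return html.replace(base, PLACEHOLDER.format(area=area))
```

Because the stored text never contains the context id or the site URL, the same content keeps working after a restore into a different context or a change of wwwroot.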
 
Html editor should contain a simplified single pane file manager with basic operations only - select file area, browse file area, upload file/copy user file/use repo file, delete. The editor will communicate with modules and core through an ajax call to some script specified by the module embedding the editor. The callback script would use different logic to construct the tree of files than the File manager: it needs to know only about files that other people viewing the resulting html may access.
 
===Interactions with repos===
Repositories may serve as a replacement for file uploading. They may be also used to synchronise files between courses. The repo option should be available whenever there is a file upload field, sometimes with extra "keep synchronised" option (this would not make sense for stuff like assignment submissions).
 
==Upgrade, migration and backwards compatibility==
It is going to be a pain again like DML/DDL ;-)
 
===Code backwards compatibility===
0% backwards compatibility related to file storage. New objects will be mandatory to use. Old $CFG->dataroot/$courseid/ will be empty, $CFG->dataroot/blog/ too, etc.
 
===Content backwards compatibility===
Means existing courses should not lose images, flash, etc. However, some new features (like resource sharing - if implemented) may not work with existing data that still uses files from the course files area.
 
There might be a breakage of links due to special characters stripping in uploaded files which will not match the links in uploaded html files any more. This should not be very common I hope.
 
===Migration of content===
* resources - move files to new resource content file area; can be done automatically for pdf, image resources; definitely not accurate for uploaded web pages
* questions - image file moved to new area, image tag appended to questions
* moddata files - the easiest part, just move to new storage
* coursefiles - there might be a lot of outdated files :-( :-(
* rss feeds links in readers - will be broken, the new security related code would break it anyway
 
===Moving files to files table and file pool===
The migration process must be interruptible because it might take a very long time. Because the files would be moved from the old location, restarting would be straightforward: whatever is still in the old location has not been migrated yet.
Proposed stages:
#migration of all course files except moddata - finish marked by some $CFG->files_migrated=true; - this step breaks the old file manager and html editor integration
#migration of blog attachments
#migration of question files
#migration of moddata files - each module is responsible to copy data from converted coursefiles or directly from moddata which is not converted automatically
 
Some people use symbolic links in coursefiles - we must make sure that those will be copied to new storage in both places, though they cannot be linked any more - anybody wanting to have content synced will need to move the files to some repository and set up the sync again.
 
::Talked about a double task here, when migrating course files to module areas:
::# Parse html files to detect all the dependencies and move them together.
::# Fallback in pluginfile.php so, if something isn't found in module filearea, search for it in course filearea, copying it and finally, serving it.
 
:: We also talked about the possibility of adding a new setting to resource in order to define whether it should work against old coursefiles or new autocontained file areas. Migrated resources will point to old coursefiles while new ones will enforce autocontained file areas.

:: It seems that only resource files will be really complex (because they allow arbitrary HTML inclusion). The rest (labels, intros...) don't and should be easier to parse.
 
::[[User:Eloy Lafuente (stronk7)|Eloy Lafuente (stronk7)]] 19:00, 29 June 2008 (CDT)
 
==Backup/restore changes==
File handling in backups needs to be fully rewritten - list of files in xml + pool of sha1 named files with contents. This solves the utf-8 trouble here, yay!!
 
==Quotas==
File size will be stored in files table, we can use simple queries to find out how much space is used, however this may not be accurate because the sha1 hash based storage eliminates duplicate files.
*total course files - find out all contexts used in course, query files table with contextid IN ($listofcontexts)
*module files - find module context and calculate space per file area
*user files quota - inside the personal area only, counting all attachments in all mods might take a while
We could also divide the file size by number of instances that are using it, this might be considered more accurate in some scenarios.
 
==Other==
*antivirus scanning + upload manager rewrite/integration with forms lib
*zip compression and extraction
 
==Major problems==
List of hard-to-solve problems
 
===unicode zip support===
Unicode chars in zip files uploaded by teachers - unfortunately there is no 100% solution that will work for everybody because most zip programs do not support unicode; it is usually ''garbage in/garbage out'', which works in some cases only.
 
Latest WinZIP 11.2 and Total Commander seem to support some very limited form of utf-8 encodings of file names. I managed to create a zip file in Windows (Czech locale) and extract them in linux with native PHP zip functions, the only step I needed to add was conversion cp852(DOS charset for Czech locale) -> UTF-8. The native windows zipping did not work for me though, because it does some different borking of charsets.
 
In any case it seems likely that native PHP support in PHP 5.2.x should be better than current pclzip or infozip binary.
 
===empty directories===
Hmm, thinking a bit more about Justin's comment I realised there is no support for empty directories in this proposal. This will require either new table or some hack in files table - maybe we could add files with "." as name and just skip them when iterating directory content.
 
===file overwriting===
Concept of file overwriting does not exist anymore here, the path+filename are not enforced to be unique - we cannot create a unique index because sloppy mssql does not allow indexes larger than 900 bytes :-( We will have to emulate it somehow and deal with collisions if found.
 
== Some little comments to be considered (to avoid forgetting them) ==
 
* each context will have its own "file manager"
* separate "file manager context" files (FMF) and "internal context" (ICF) files (current modedit files, submissions, attachments...)
* /pluginfile.php/SYSCONTEXTID/{blog|question} and so... will have own FMF too? Or only ICF ?
* Way to copy between contexts
* Links = -1 for them
* Deletion strategy (locks, quarantine status...)
* include support for quotas per user, per course, etc
* upgrade process should be interruptable (like the unicode upgrade) so it can be stopped/restarted any time
 
==Justin's thinking out loud==
 
I'm actually working on implementing this along with extending an existing Alfresco integration to work together with the whole File / Repository system and I wanted to get some of my comments and thoughts in here for feedback.  Go easy on me.  =)
 
So far I've only got one that I'd like to solicit some feedback on (BTW, if this would be better suited to a forum discussion, let me know):
 
===Not storing the full ''filepath'' with each entry in the '''file''' table===
*For browsing a directory structure, determining things like child directories or a parent directory given a filepath requires a lot of extraneous coding in PHP.  I think it might be better served to create a new '''file_directory''' table, storing only a directory name, and reference to a parent directory record.  The benefits here are that we're storing a lot of duplicate text field values in the '''file''' table and browsing through the file picker for local files doesn't require a lot of PHP overhead to calculate links to parent / child directories.
*Given that file permissions are no longer calculated using structured file paths, using the complete, full, path to a given file would most likely never be needed.
 
*The '''repositorypath''' field in the '''repository_sync''' table still makes sense, though.
 
*The '''file_directory''' table:
 
{| border="1" cellpadding="2" cellspacing="0"
|'''Field'''
|'''Type'''
|'''Default'''
|'''Info'''
 
|-
|'''id'''
|int(10) 
|
|autoincrementing
 
|-
|'''parent'''
|int(10)
|
|ID of directory that this record is a child of.
 
|-
|'''directoryname'''
|varchar(255)
|
|The actual name of this directory.
|}
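
A minimal sketch of how a full path could still be rebuilt from the proposed '''file_directory''' table when it is needed, by following the parent references up to the root. It is written in Python purely for illustration; the function name, the in-memory dictionary standing in for the table, and the convention that a root directory has no parent (NULL) are assumptions, not part of the proposal.

```python
def directory_path(directories, dir_id):
    """Rebuild '/a/b/c/' for dir_id by walking parent references.

    `directories` maps id -> (parent_id, directoryname) and stands in for
    the proposed file_directory table. A parent of None is assumed to mean
    "top-level directory of the file area".
    """
    names = []
    seen = set()
    current = dir_id
    while current is not None:
        if current in seen:
            # Guard against corrupt data where a directory is its own ancestor.
            raise ValueError("cycle in file_directory parents")
        seen.add(current)
        parent, name = directories[current]
        names.append(name)
        current = parent
    return "/" + "/".join(reversed(names)) + "/"


# Hypothetical rows: content/ -> images/ -> icons/
rows = {1: (None, "content"), 2: (1, "images"), 3: (2, "icons")}
print(directory_path(rows, 3))  # /content/images/icons/
```

Each lookup costs one step per directory level, which is the trade-off against storing the full filepath with every file record.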


==Areas in Moodle that need re-writing==
skodak: filepath is stored in files table - its root is the corresponding filearea, the file manager will use the context tree to find all plugins/courses and ask them to return the list of areas with all those small branches inside it


==See also==

Revision as of 05:38, 3 July 2008

This page outlines the current thinking about implementing file storage and access in Moodle 2.0. It's a SPECIFICATION UNDER CONSTRUCTION!

The page is open for everyone so everyone can help correct mistakes and help with the evolution of this document. However, if you have questions, problems to report or major changes to suggest please add them to the page comments, or start a discussion in the Repositories forum. We'll endeavour to merge all such suggestions into the main spec before we start development.


Objectives

  1. Allow files to be added directly into Moodle (as we do now)
  2. Remember where files came from
  3. Give modules control over the access to files using capabilities and other local rules
  4. Consistent and simple approach for ALL file handling throughout Moodle


Overview

The File API is a core set of interfaces that all Moodle code will use to:

  1. store files within Moodle
  2. display files to Moodle users

It applies only to "user" files. It will NOT apply to local files and caches created by Moodle such as these directories in dataroot: temp, lang, cache, environment, filter, rss, search, sessions, upgradelogs etc

The API will be split into several independent parts:

  1. File serving API
    1. file.php
    2. pluginfile.php
    3. userfile.php
    4. rssfile.php
  2. File storage API
    1. optional access control
    2. optional repo sync
  3. File management API
    1. File browsing
    2. File linking (editor integration)
    3. Upload from repository

File serving API

Deals with serving of files - browser requests file, Moodle sends it back. We have four main scripts, described below. It is important to set up slasharguments on the server (file.php/some/thing/xxx.jpg); any content that relies on relative links cannot work without it (scorm, uploaded html pages, etc.).

file.php

Serves course files.

Implements basic file access. Ideally only images and files linked from course sections should be there. No XSS protection is required - we expect javascript, sw, etc. there, so there is no way to make it "secure". The access control is not critical any more if we move most of the files into modules.

The file name and parameter structure is critical for backwards compatibility of existing course content.

/file.php/courseid/dir/dir/filename.ext

Internally the files would be stored in array('contextid'=>$coursecontextid, 'filearea'=>'content', 'itemid'=>0)

pluginfile.php

(aka modfile.php) Sends module, block, question files.

  • modules decide about access control
  • optional XSS protection - student submitted files must not be served with normal headers, we have to force download instead; ideally there should be second wwwroot for serving of untrusted files
  • only internal links to selected areas are supported - you can link images in summary area, but not the assignment submissions

Absolute file links need to be rewritten if html editing is allowed in the module. The links are stored internally as relative links. Before editing or display the internal link representation is converted to absolute links using a simple str_replace(): @@thispluginlink/summary@@/image.jpg --> /pluginfile.php/assignmentcontextid/summary/image.jpg. It is converted back to internal links before saving.
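
A minimal sketch of this two-way rewriting, written in Python for illustration only (the real code would use PHP's str_replace()). The function names and the exact spelling of the placeholder token are assumptions; the spec itself has not settled on one.

```python
def _token(area):
    # Hypothetical internal placeholder for "this plugin's file area".
    return f"@@thispluginfile/{area}@@/"


def to_absolute(html, context_id, area, wwwroot=""):
    """Internal placeholder links -> absolute pluginfile.php links (display/edit)."""
    base = f"{wwwroot}/pluginfile.php/{context_id}/{area}/"
    return html.replace(_token(area), base)


def to_internal(html, context_id, area, wwwroot=""):
    """Absolute pluginfile.php links -> internal placeholder links (before saving)."""
    base = f"{wwwroot}/pluginfile.php/{context_id}/{area}/"
    return html.replace(base, _token(area))


stored = '<img src="@@thispluginfile/summary@@/image.jpg">'
shown = to_absolute(stored, 42, "summary")
print(shown)  # <img src="/pluginfile.php/42/summary/image.jpg">
```

Because the stored form carries no context id, the same text can be copied or restored into another context and still resolve to that context's own files.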

Can the distinct file areas supported by one plugin be declared somehow in order add some information about them? For example, I think it can be interesting to declare:
  • assignment_summary:
    • relpath='summary'
    • userdata=false
    • anotherproperty=anothervalue
  • assignment_submission:
    • relpath='submission/@@USERID@@'
    • userdata=false
    • anotherproperty=anothervalue
  • and so on...
And then, when the editor "receives" one "assignment_summary" areaname, it knows what to show and so on? Also that info could be useful to know, in backup & restore, whether some fileareas have to be processed or not (userdata=false). Or also, when reconstructing the links (str_replace() above). And it will give us a well defined list of fileareas per module, instead of coding them in a free way (prone to errors). Eloy Lafuente (stronk7) 16:35, 28 June 2008 (CDT)
Something like this will be part of file management API, hardcoding this in file storage would make it less flexible imo Petr Škoda (škoďák)
Yup, yup. Storage doesn't know anything but get/put files (nothing else). It's part of management, absolutely. Eloy Lafuente (stronk7) 11:21, 29 June 2008 (CDT)
/pluginfile.php/contextid/areaname/arbitrary/params/or/dirs/filename.ext

pluginfile.php detects the type of plugin from context table, fetches basic info (like $course or $cm if appropriate) and calls plugin function (or later method) which does the access control and finally sends the file to user. areaname separates files by type and divides the context into several subtrees - for example summary files (images used in module intros), post attachments, etc.
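
As a rough illustration of the first step, the slash arguments can be split into context id, area name, optional extra parameters and file name before the plugin is asked about access control. This is a Python sketch under assumptions: the function name and return shape are hypothetical, and the real script would be PHP.

```python
from urllib.parse import unquote


def parse_pluginfile_path(path_info):
    """Split '/contextid/areaname/any/params/filename.ext' into its parts.

    `path_info` is assumed to be the part of the URL after '/pluginfile.php'.
    Returns (context_id, area_name, extra_params, filename).
    """
    parts = [unquote(p) for p in path_info.strip("/").split("/")]
    if len(parts) < 3:
        raise ValueError("expected at least contextid/areaname/filename")
    context_id = int(parts[0])  # raises ValueError for a non-numeric context id
    area_name = parts[1]
    return context_id, area_name, parts[2:-1], parts[-1]


print(parse_pluginfile_path("/15/submission/7/essay.odt"))
# (15, 'submission', ['7'], 'essay.odt')
```

The plugin owning context 15 would then receive the area name and the remaining parameters and decide whether the current user may see the file.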

assignment example

/pluginfile.php/assignmentcontextid/summary/someimage.jpg
/pluginfile.php/assignmentcontextid/submission/submissionid/attachmentname.ext
/pluginfile.php/assignmentcontextid/extra/allsubmissionfiles.zip
Uhm... all those files together? What's going to differentiate the "submission" path in the example above from the "summary" path? Is it supposed that the editor, or the filemanager, won't allow, for example, picking up one file from the "submission" area to be used in the summary of one assignment, and only the "summary" area will be shown? That means multiple file managers per context and it's against the clean "one file manager per context" agreed below Eloy Lafuente (stronk7) 21:28, 26 June 2008 (CDT)
Yes Eloy, the different areas (summary, submission) etc. have different uses, different access control. There are two types of file manager - the two pane file manager which lists all contexts+areas user may access, and minimalistic manager in html editor which shows only subset of areas from current plugin (because you can not link anything else).

scorm example

/pluginfile.php/scormcontextid/summary/someimage.jpg
/pluginfile.php/scormcontextid/content/revisionnumber/dir/somescormfile.js

The revision counter is incremented when any file changes in order to prevent caching problems. The lifetime should be adjustable in module settings.

quiz example

pluginfile.php/quizcontextid/summary/niceimage.jpg
pluginfile.php/quizcontextid/report/type/export.ods

questions example

pluginfile.php/SYSCONTEXTID/question/questionid/file.jpg

blog example

Blog entries or notes in general do not have their own context id (they live in the system context; SYSCONTEXTID below is the id of the system context). The note attachments are always served with XSS protection on, ideally we should use separate wwwroot for this. Access control can be hardcoded.

/pluginfile.php/SYSCONTEXTID/blog/blogentryid/attachmentname.ext

Internally stored in array('contextid'=>SYSCONTEXTID, 'filearea'=>'blog', 'itemid'=>$blogentryid)

backup example

It would be nice to have some special protection of backup files - new capabilities for backup file download, upload. Backups contain a lot of personal info, we could block restoring of backups from other sites too.

/pluginfile.php/coursecontextid/backup/backupfile.zip

Internally stored in array('contextid'=>$coursecontextid, 'filearea'=>'backup', 'itemid'=>0)

userfile.php

Personal file storage, intended as an online storage of work in progress like assignments before the submission.

  • read/write own files only for now
  • option to share with others later
  • personal "websites" will not be supported (security)
/userfile.php/userid/dir/dir/filename.ext

rssfile.php

Replaces rss/file.php, which is kept only for backwards compatibility. RSS files should not require sessions/cookies; urls should contain some sort of security token/key. Internally the files may be stored in the database or together with other files. Performance improvements - we should support both Etag (cool) and Last-Modified (more used); when we receive If-None-Match/If-Modified-Since and the feed has not changed => 304

/rssfile.php/contextid/any/parameters/module/wants/rss.xml
/rssfile.php/SYSCONTEXTID/blog/userid/rss.xml

Again modules and plugins decide what gets sent to user.
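
The Etag / Last-Modified handling mentioned above could look roughly like the following. This is a Python sketch for illustration only: the function name, the choice of the content's sha1 as the ETag, and the dictionary of request headers are assumptions, and weak validators (W/...) are not handled.

```python
import hashlib
from email.utils import formatdate, parsedate_to_datetime


def conditional_response(content, modified_ts, request_headers):
    """Return (status, headers, body); 304 with an empty body if unchanged."""
    etag = '"' + hashlib.sha1(content).hexdigest() + '"'
    headers = {
        "ETag": etag,
        "Last-Modified": formatdate(modified_ts, usegmt=True),
    }
    if_none_match = request_headers.get("If-None-Match")
    if_modified_since = request_headers.get("If-Modified-Since")

    not_modified = False
    if if_none_match is not None:
        # If-None-Match takes precedence over If-Modified-Since.
        candidates = [t.strip() for t in if_none_match.split(",")]
        not_modified = "*" in candidates or etag in candidates
    elif if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since).timestamp()
            not_modified = int(modified_ts) <= since
        except (TypeError, ValueError):
            not_modified = False  # unparsable date: send the full feed

    if not_modified:
        return 304, headers, b""
    return 200, headers, content


feed = b"<rss/>"
status, hdrs, _ = conditional_response(feed, 1214900000, {})
again = conditional_response(feed, 1214900000, {"If-None-Match": hdrs["ETag"]})
print(status, again[0])  # 200 304
```

Neither check needs a session or cookie, which fits the requirement that feed readers authenticate only through a token in the URL.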

Temporary files

Temporary files are usually used during the lifetime of one script only. Typical uses:

  • exports
  • imports
  • zipping/unzipping
  • processing by executable files (latex, mimetex)

Ideally these files should never use utf-8 (which is a major problem for zipping at the moment). Proposed new sha1 based file storage is not suitable both for performance and technical reasons.

Legacy file storage and serving

Going to use good-old separate directories in $CFG->dataroot.

file serving and storage:

  1. user avatars - user/pix.php
  2. group avatars - user/pixgroup.php
  3. tex, algebra - filter/tex/* and filter/algebra/*
  4. rss cache (?full rss rewrite soon?) - backwards compatibility only rss/file.php

only storage:

  1. sessions

File storage API

Modules in general work only with local Moodle files. One of the major reasons is performance when accessing external repository files. It will be possible to use repositories instead of file uploading and also to keep local files synced with an external repository.

File contents are stored in moodledata/filepool indexed using SHA1 hashes instead of file names; file names, relative paths and other metadata will be stored in file(_xxx) database tables. This should be fully abstracted so that modules do not actually know where the files are located. When storing files the content is sent as string or file handle, when reading content it is returned as file handle.

files table

This table contains one entry for every file. Enough information is kept here so that the file can be fully identified and retrieved again if necessary.

note: plural used because file is a reserved word

{| border="1" cellpadding="2" cellspacing="0"
| '''Field''' || '''Type''' || '''Default''' || '''Info'''
|-
| '''id''' || int(10) || || autoincrementing
|-
| '''sha1hash''' || varchar(40) || || The sha1 hash of content.
|-
| '''contextid''' || int(10) || || The context id defined in context table - identifies the instance of plugin owning the file.
|-
| '''filearea''' || varchar(50) || || Like "submissions", "intro" and "content" (images and swf linked from summaries), etc.; "blogs" and "userfiles" are special case that live at the system context.
|-
| '''itemid''' || int(10) || || Some plugin specific item id (eg. forum post, blog entry or assignment submission or user id for user files)
|-
| '''filepath''' || text || || relative path to file from module content root, useful in Scorm and Resource mod - most of the mods do not need this
|-
| '''filename''' || varchar(255) || || The full Unicode name of this file (case sensitive)
|-
| '''filesize''' || int(10) || || size of file - bytes
|-
| '''mimetype''' || varchar(100) || NULL || type of file
|-
| '''userid''' || int(10) || NULL || Optional - general user id field - meaning depending on plugin
|-
| '''timecreated''' || int(10) || || The time this file was created
|-
| '''timemodified''' || int(10) || || The last time the file was modified
|}

index on "contextid, filearea, itemid" and "sha1hash"

Plugin type is not specified because it is derived from contextid, items like blog that do not have own context will use own filearea usually from systemcontextid.

Perhaps we could also hash filepath and filename and index by them, to avoid some text limitations on the DB side (length limits of indexes, not indexable, complex retrieval...). Eloy Lafuente (stronk7) 11:54, 29 June 2008 (CDT)
Also, perhaps we should store finally the plugin type there to save some queries per request, using it to drive to the correct file handling of each plugin. Eloy Lafuente (stronk7) 18:54, 29 June 2008 (CDT)

files_cleanup table

This table contains candidates for deletion from the file pool. Files are not deleted immediately; cron uses the files_cleanup table, verifies the file is not used any more and deletes it from the pool. Reasons for cron clean-up are performance and prevention of collisions - there could be a problem with concurrent uploads and deletes, so we will probably need to add some table based locking during the clean up.

We might add an extra script that does deep validation of the pool area - report missing files, report orphaned files, content not matching the sha1 filename, etc. - this would be very time consuming.

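{| border="1" cellpadding="2" cellspacing="0"
| '''Field''' || '''Type''' || '''Default''' || '''Info'''
|-
| '''id''' || int(10) || || autoincrementing
|-
| '''sha1hash''' || varchar(40) || ||
|}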

files_metadata table

This table contains extra metadata about files. Repositories could provide this, or it could be manually edited in the local copy.

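{| border="1" cellpadding="2" cellspacing="0"
| '''Field''' || '''Type''' || '''Default''' || '''Info'''
|-
| '''id''' || int(10) || || autoincrementing
|-
| '''fileid''' || int(10) || || Id of file.
|-
| '''name''' || varchar(255) || || The name of extra metadata
|-
| '''value''' || text || || Value
|}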

files_acl table

This table describes optional ACL for file. This is not required in majority of cases, modules usually hardcode the file access logic, course files should not be used much any more.

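{| border="1" cellpadding="2" cellspacing="0"
| '''Field''' || '''Type''' || '''Default''' || '''Info'''
|-
| '''id''' || int(10) || || autoincrementing
|-
| '''fileid''' || int(10) || || The file we are defining access for
|-
| '''contextid''' || int(10) || || The context where this file is being published
|-
| '''capability''' || text || || The capability that is required to see this file.
|}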

acl notes

  • this is missing some concept similar to user/group/others, for example in case of user files typical user can not assign permissions or view them - this becomes useless there
  • it is more important to synchronise the availability of file link and the file itself - having link pointing to inaccessible file or file which is accessible when not wanted are both problems
  • browser/proxy caching works against us here - "secret" files should not be cached

files_sync table

This table contains information how to synchronise data with repositories. Data would be synchronised from cron.php or on demand from file manager. The sync would be one way only (repository-->local file).

{| border="1" cellpadding="2" cellspacing="0"
| '''Field''' || '''Type''' || '''Default''' || '''Info'''
|-
| '''id''' || int(10) || || autoincrementing
|-
| '''fileid''' || int(10) || || Id of file.
|-
| '''repositoryid''' || int(10) || || The repository instance this is associated with, see Repository_API
|-
| '''updates''' || int(10) || || Specifies the update schedule (0 = none, 1 = on demand, other = some period in seconds)
|-
| '''repositorypath''' || text || || The full path to the original file on the repository
|-
| '''timeimportfirst''' || int(10) || || The first time this file was imported into Moodle
|-
| '''timeimportlast''' || int(10) || || The most recent time that this file was imported into Moodle
|}

File content storage

Originally the file storage hierarchy contained a lot of metadata including userids, entry ids, filenames, etc. The file content will now be stored separately from file metadata. It must support utf8 on all platforms.

File storing:

  1. calculate SHA1 hash of content
  2. check if file with SHA1 name exists, if not add the file to file pool
  3. remove SHA1 from list of deleted files if found there
  4. store file in file table, use SHA1 as file pool identifier

File reading:

  1. fetch file record from 'file' table - probably using file id or combination of contextid+instanceid
  2. fetch content of file

File deleting:

  1. delete record from file table, remember file SHA1
  2. store the deleted SHA1 in deleted files table, do not remove the physical file yet
  3. wait for cron cleanup script to actually delete the file named SHA1 (proper table locking needed to prevent race conditions when adding/deleting files)
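
The storing, reading and deleting steps above can be modelled with a small in-memory sketch. It is Python for illustration only; the class and attribute names are hypothetical, dictionaries stand in for the database tables and the pool directory, and no locking is shown, so it does not address the race conditions mentioned above.

```python
import hashlib


class FilePoolSketch:
    """Toy model of sha1-addressed content plus deferred clean-up."""

    def __init__(self):
        self.pool = {}        # sha1 -> content, stands in for the file pool
        self.files = {}       # id -> record, stands in for the files table
        self.cleanup = set()  # sha1 candidates, stands in for files_cleanup
        self._next_id = 1

    def store(self, contextid, filearea, itemid, filename, content):
        sha1 = hashlib.sha1(content).hexdigest()
        if sha1 not in self.pool:          # identical content is stored once
            self.pool[sha1] = content
        self.cleanup.discard(sha1)         # content is in use again
        file_id = self._next_id
        self._next_id += 1
        self.files[file_id] = {
            "sha1hash": sha1, "contextid": contextid, "filearea": filearea,
            "itemid": itemid, "filename": filename, "filesize": len(content),
        }
        return file_id

    def read(self, file_id):
        return self.pool[self.files[file_id]["sha1hash"]]

    def delete(self, file_id):
        record = self.files.pop(file_id)
        self.cleanup.add(record["sha1hash"])  # physical removal is deferred

    def cron_cleanup(self):
        in_use = {r["sha1hash"] for r in self.files.values()}
        for sha1 in list(self.cleanup):
            if sha1 not in in_use:            # verify nobody references it
                del self.pool[sha1]
            self.cleanup.discard(sha1)


pool = FilePoolSketch()
a = pool.store(10, "intro", 0, "a.txt", b"same bytes")
b = pool.store(20, "intro", 0, "b.txt", b"same bytes")
pool.delete(a)
pool.cron_cleanup()
print(pool.read(b))  # b'same bytes'
```

The clean-up step only removes content whose hash no longer appears in any file record, which is why deleting one of two identical files leaves the other readable.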

File pool details

Located in $CFG->dataroot/filepool/. All files cannot be stored in one directory due to OS limitations, so the pool uses 3 directory levels named after the first three pairs of characters of the sha1 hash (see the tree below). It is unlikely that there will be thousands of files sharing the same leading characters in the sha1 hash of their content.

This type of storage saves a lot of disk space when storing multiple copies of the same large file. It can also help substantially when synchronising data with external repositories. Another benefit is we can detect inconsistencies in file content.

File read performance is similar to previous code, file write performance will be slower - due to hashing and extra database access.

dataroot 
   /filepool
      /00
      /01
      ...
      /23
        /00
        /01
        ...
        /1e
           /00
           /01
           ...
           /2d
              /231e2dc421be4fcd0172e5afceea3970e2f3d940.jpg
      ...
      /fe
      /ff
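
A short sketch of how a pool location could be derived from a hash, following the two-character directory names in the tree above. It is Python for illustration only; the function name is hypothetical and the file extension shown in the tree is not modelled.

```python
import string
from pathlib import PurePosixPath


def pool_path(dataroot, sha1hash):
    """Return dataroot/filepool/xx/yy/zz/<sha1hash> for a 40-char hex hash."""
    sha1hash = sha1hash.lower()
    if len(sha1hash) != 40 or any(c not in string.hexdigits for c in sha1hash):
        raise ValueError("not a sha1 hex digest")
    return PurePosixPath(
        dataroot, "filepool",
        sha1hash[0:2], sha1hash[2:4], sha1hash[4:6],
        sha1hash,
    )


print(pool_path("/moodledata", "231e2dc421be4fcd0172e5afceea3970e2f3d940"))
# /moodledata/filepool/23/1e/2d/231e2dc421be4fcd0172e5afceea3970e2f3d940
```

With two hex characters per level each directory has at most 256 subdirectories, so no single directory grows very large even for big sites.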

File management API

This section describes following:

  1. file manager
  2. integration with html editor
  3. interactions with repos

File manager

Single pane file manager is hard to implement without drag & drop which is notoriously problematic in web based applications. I propose to implement a two pane commander-style file manager. Two pane manager allows you to easily copy/move files between two different contexts (ex: courses).

File manager must not interact directly with filesystem API, instead each module should return traversable tree of files and directories with both real and localised names (localised names are needed for dirs like backupdata).

Originally there was a single file tree for each course. We need to fully separate each module/block from the course files and there might be also independent file areas in modules (ex: module introduction, content files, submissions, post attachments). File area may be defined as a small tree where we can use relative paths. These file areas are hanging from the branches of the context tree (this needs a picture).

Integration with htmleditor

Html editor should be able to browse only relevant files - for example when editing a resource introduction only images from the intro file area of that resource should be available; when editing an html resource page only the content area images should be listed.

There are several problems here:

  1. when adding new resource its context does not exist yet, we will have to create some table to handle temporary file storage for adding of new stuff, not easy but should be solvable - maybe we could abuse the course context id or store it temporarily in some special user file area
  2. we cannot use absolute address relinking for pluginfile.php links; instead we can use absolute links only while editing and convert them to something like @@thispluginfile/intro@@/2112/112/image.jpg before storage. The local links would be converted back to full absolute links before display or editing. Not all file areas will support this (ex: linking to an assignment submission does not make sense because nobody else may access it anyway). This would allow us to implement image preview in the html editor.

Html editor should contain a simplified single pane file manager with basic operations only - select file area, browse file area, upload file/copy user file/use repo file, delete. The editor will communicate with modules and core through an ajax call to some script specified by the module embedding the editor. The callback script would use different logic to construct the tree of files than the File manager: it needs to know only about files that other people viewing the resulting html may access.

Interactions with repos

Repositories may serve as a replacement for file uploading. They may be also used to synchronise files between courses. The repo option should be available whenever there is a file upload field, sometimes with extra "keep synchronised" option (this would not make sense for stuff like assignment submissions).

Upgrade, migration and backwards compatibility

It is going to be a pain again like DML/DDL ;-)

Code backwards compatibility

0% backwards compatibility related to file storage. New objects will be mandatory to use. Old $CFG->dataroot/$courseid/ will be empty, $CFG->dataroot/blog/ too, etc.

Content backwards compatibility

Means existing courses should not lose images, flash, etc. However, some new features (like resource sharing - if implemented) may not work with existing data that still uses files from the course files area.

There might be a breakage of links due to special characters stripping in uploaded files which will not match the links in uploaded html files any more. This should not be very common I hope.

Migration of content

  • resources - move files to new resource content file area; can be done automatically for pdf, image resources; definitely not accurate for uploaded web pages
  • questions - image file moved to new area, image tag appended to questions
  • moddata files - the easiest part, just move to new storage
  • coursefiles - there might be a lot of outdated files :-( :-(
  • rss feeds links in readers - will be broken, the new security related code would break it anyway

Moving files to files table and file pool

The migration process must be interruptible because it might take a very long time. Because the files would be moved from the old location, restarting would be straightforward: whatever is still in the old location has not been migrated yet. Proposed stages:

  1. migration of all course files except moddata - finish marked by some $CFG->files_migrated=true; - this step breaks the old file manager and html editor integration
  2. migration of blog attachments
  3. migration of question files
  4. migration of moddata files - each module is responsible to copy data from converted coursefiles or directly from moddata which is not converted automatically

Some people use symbolic links in coursefiles - we must make sure that those will be copied to new storage in both places, though they cannot be linked any more - anybody wanting to have content synced will need to move the files to some repository and set up the sync again.

Talked about a double task here, when migrating course files to module areas:
  1. Parse html files to detect all the dependencies and move them together.
  2. Fallback in pluginfile.php so, if something isn't found in module filearea, search for it in course filearea, copying it and finally, serving it.
We also talked about the possibility of adding a new setting to resource in order to define whether it should work against old coursefiles or new autocontained file areas. Migrated resources will point to old coursefiles while new ones will enforce autocontained file areas.
It seems that only resource files will be really complex (because they allow arbitrary HTML inclusion). The rest (labels, intros...) don't and should be easier to parse.
Eloy Lafuente (stronk7) 19:00, 29 June 2008 (CDT)

Backup/restore changes

File handling in backups needs to be fully rewritten - list of files in xml + pool of sha1 named files with contents. This solves the utf-8 trouble here, yay!!

Quotas

File size will be stored in files table, we can use simple queries to find out how much space is used, however this may not be accurate because the sha1 hash based storage eliminates duplicate files.

  • total course files - find out all contexts used in course, query files table with contextid IN ($listofcontexts)
  • module files - find module context and calculate space per file area
  • user files quota - inside the personal area only, counting all attachments in all mods might take a while

We could also divide the file size by number of instances that are using it, this might be considered more accurate in some scenarios.
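
The two ways of counting can be compared with a small sketch. It is Python for illustration only; the function name and the list of dictionaries standing in for rows of the files table are assumptions.

```python
from collections import Counter


def space_used(files, context_ids, share_duplicates=False):
    """Sum filesize for records in the given contexts.

    With share_duplicates=True each record is charged filesize divided by
    the number of records (site-wide) that reference the same sha1hash.
    """
    references = Counter(f["sha1hash"] for f in files)
    total = 0.0
    for f in files:
        if f["contextid"] in context_ids:
            if share_duplicates:
                total += f["filesize"] / references[f["sha1hash"]]
            else:
                total += f["filesize"]
    return total


# Hypothetical records: the 900 byte file is used in two contexts.
records = [
    {"contextid": 10, "sha1hash": "aa", "filesize": 900},
    {"contextid": 20, "sha1hash": "aa", "filesize": 900},
    {"contextid": 10, "sha1hash": "bb", "filesize": 100},
]
print(space_used(records, {10}))                         # 1000.0
print(space_used(records, {10}, share_duplicates=True))  # 550.0
```

The first figure is what a simple SUM(filesize) query would report; the second charges each context only its share of the single physical copy.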

Other

  • antivirus scanning + upload manager rewrite/integration with forms lib
  • zip compression and extraction

Major problems

List of hard-to-solve problems

unicode zip support

Unicode chars in zip files uploaded by teachers - unfortunately there is no 100% solution that will work for everybody because most zip programs do not support unicode; it is usually garbage in/garbage out, which works in some cases only.

Latest WinZIP 11.2 and Total Commander seem to support some very limited form of utf-8 encodings of file names. I managed to create a zip file in Windows (Czech locale) and extract them in linux with native PHP zip functions, the only step I needed to add was conversion cp852(DOS charset for Czech locale) -> UTF-8. The native windows zipping did not work for me though, because it does some different borking of charsets.

In any case it seems likely that native PHP support in PHP 5.2.x should be better than current pclzip or infozip binary.
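
The cp852 -> UTF-8 step described above can be illustrated outside PHP. The sketch below uses Python, whose zipfile module decodes entry names as cp437 when the archive does not set the UTF-8 flag (general purpose bit 11); the function name and the default charset are assumptions, and the correct DOS charset still has to be guessed from the uploader's locale.

```python
def decode_entry_name(filename, flag_bits, source_charset="cp852"):
    """Recover an entry name that zipfile decoded as cp437 by default.

    `filename` and `flag_bits` are the attributes of a zipfile.ZipInfo.
    If bit 11 is set the name was stored as UTF-8 and is returned as-is.
    """
    if flag_bits & 0x800:
        return filename
    raw = filename.encode("cp437")  # undo the default decoding
    return raw.decode(source_charset)


# Simulate what zipfile would hand us for a name stored in cp852.
garbled = "žluťoučký.txt".encode("cp852").decode("cp437")
print(decode_entry_name(garbled, 0))  # žluťoučký.txt
```

For a real archive this would be applied to each item of zipfile.ZipFile(...).infolist(); a wrong guess of source_charset simply produces different garbage, which is the "garbage in/garbage out" problem noted above.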

empty directories

Hmm, thinking a bit more about Justin's comment I realised there is no support for empty directories in this proposal. This will require either new table or some hack in files table - maybe we could add files with "." as name and just skip them when iterating directory content.

file overwriting

Concept of file overwriting does not exist anymore here, the path+filename are not enforced to be unique - we cannot create a unique index because sloppy mssql does not allow indexes larger than 900 bytes :-( We will have to emulate it somehow and deal with collisions if found.

Some little comments to be considered (to avoid forgetting them)

  • each context will have its own "file manager"
  • separate "file manager context" files (FMF) and "internal context" files (ICF) (current modedit files, submissions, attachments...)
  • /pluginfile.php/SYSCONTEXTID/{blog|question} and so on... will they have their own FMF too? Or only ICF?
  • Way to copy between contexts
  • Links = -1 for them
  • Deletion strategy (locks, quarantine status...)
  • include support for quotas per user, per course, etc.
  • upgrade process should be interruptible (like the unicode upgrade) so it can be stopped/restarted at any time

Justin's thinking out loud

I'm actually working on implementing this along with extending an existing Alfresco integration to work together with the whole File / Repository system and I wanted to get some of my comments and thoughts in here for feedback. Go easy on me. =)

So far I've only got one that I'd like to solicit some feedback on (BTW, if this would be better suited to a forum discussion, let me know):

Not storing the full filepath with each entry in the file table

  • For browsing a directory structure, determining things like child directories or a parent directory from a filepath requires a lot of extra coding in PHP. I think we might be better served by creating a new file_directory table that stores only a directory name and a reference to a parent directory record. The benefits are that we no longer store a lot of duplicate text field values in the file table, and that browsing through the file picker for local files doesn't require a lot of PHP overhead to calculate links to parent / child directories.
  • Given that file permissions are no longer calculated using structured file paths, using the complete, full, path to a given file would most likely never be needed.
  • The repositorypath field in the repository_sync table still makes sense, though.
  • The file_directory table:
Field | Type | Default | Info
id | int(10) | | autoincrementing
parent | int(10) | | ID of directory that this record is a child of.
directoryname | varchar(255) | | The actual name of this directory.
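To show how browsing would work with such a table, here is a small Python sketch. The sample data, the use of an empty parent (None) for a top-level directory, and the function names are assumptions made for the example.

```python
# Hypothetical file_directory rows: id -> (parent, directoryname).
# A parent of None marks a top-level directory (an assumption).
DIRECTORIES = {
    1: (None, "course"),
    2: (1, "week1"),
    3: (2, "images"),
    4: (1, "week2"),
}


def children(dirs, dir_id):
    """IDs of the directories directly below dir_id."""
    return sorted(i for i, (parent, _) in dirs.items() if parent == dir_id)


def full_path(dirs, dir_id):
    """Rebuild the full path by following parent references upwards."""
    parts = []
    seen = set()
    while dir_id is not None:
        if dir_id in seen:
            raise ValueError("cycle in directory parents")
        seen.add(dir_id)
        parent, name = dirs[dir_id]
        parts.append(name)
        dir_id = parent
    return "/" + "/".join(reversed(parts)) + "/"
```

Finding children is a single lookup on parent, while rebuilding a full path costs one step per directory level; that is the trade-off against storing the complete filepath with every file.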

skodak: filepath is stored in the files table - its root is the corresponding file area. The file manager will use the context tree to find all plugins/courses and ask them to return the list of areas with all those small branches inside them.

See also