Student projects/Feed aggregation library
Note: This page outlines ideas for the Feed aggregation library project. It's a specification under construction! If you have any comments or suggestions, please add them to the page comments.
- 1 Status
- 2 Executive Summary
- 3 Steps
- 4 Glossary
- 5 Database structures
- 6 API for communication with modules/blocks
- 7 RSS block
- 8 Criteria for PHP RSS parsing libraries
- 9 Localization Issues
- 10 Security Issues
- 11 Ideas for the future
- 12 See also
Status
This is a draft spec, written as part of the Google Summer of Code submission of Chris Zubak-Skees (chriszs [at] gmail.com). It is preliminary and partial, and is based on the "Consuming RSS feeds" idea listed on Student projects. I welcome any and all feedback.
Executive Summary
RSS/Atom feeds are becoming an important technology on the web, so it's crucial that Moodle has good support for consuming these feeds for a variety of different uses. At present we generate RSS feeds for use by other applications, but only consume feeds in the RSS feeds block. It would be useful to have a core library which takes care of aggregating feeds (and the issues around it) and provides them in a simple format for plugins and other core parts of Moodle to use.
This project will involve creating a feed aggregation library which:
- Will consume most current syndication formats (RSS 0.90/1.0/2.0, Atom, etc.)
- Will work behind a variety of web proxies
- Will aggregate periodically and use a caching mechanism (so it doesn't query the remote server on every request)
- Will most likely be based on an existing PHP framework
As a proof of concept, it will be necessary to refactor the RSS block to use it.
To test this library, it will be necessary to develop a large test corpus of valid and mildly invalid RSS feeds, drawn from popular websites and content management tools and supplemented by feeds manufactured by the tester.
Steps
- Further develop the spec, get feedback, and feel out the implementation
- Gather a large test corpus of in-the-wild and manufactured RSS feeds
- Evaluate open source PHP RSS parsing libraries against defined criteria
- Write and test caching, normalization, retrieval, and API code (what needs to be written depends on the capabilities of the chosen library, or on whether any library meets the project's needs at all)
- Refactor and test the RSS block to use the new API
Glossary
|Term||Definition|
|RSS feed||A list of links in a machine-readable format. Often used to syndicate itemized and chronological content, such as blog posts. RSS refers to a particular technology, but we use it interchangeably with Atom here.|
|Atom||Another feed specification with characteristics similar to RSS feeds.|
Database structures
Note: Database specifics are preliminary.
Stores a list of requested feed URLs and some associated information.
|Field||Type||Description|
|feedurl||varchar(255)||The URL at which to fetch the feed|
|normalizedfeedurl||varchar(255)||A URL stripped of some specifics, used to match against requested URLs (see get_feed())|
|timefetched||int(10)||The time this feed was last fetched. Used for caching and potential pruning|
|feedtitle||varchar(255)||The name of this feed as retrieved when last fetched|
|siteurl||varchar(255)||The URL of the site attached to this feed as retrieved when last fetched|
|feeddescription||text||The description of the feed as retrieved when last fetched|
Stores a list of requested feed items.
|Field||Type||Description|
|feedurlsid||int(10)||Ties the item to the feed|
|itemurl||varchar(255)||The URL as retrieved in the item|
|itemtime||int(10)||The time the item itself reports or, if it reports none, the time it was first fetched (which requires keeping track of individual feed items)|
|itemtitle||varchar(255)||The name of this item as retrieved when last fetched|
|itemguid||varchar(255)||The unique id of the item as retrieved when last fetched|
|itemdescription||text||The description of the item as retrieved when last fetched|
|itemposition||int(10)||The position of the item in the RSS feed channel's list|
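To make the two structures above concrete, here is a rough sketch using Python's built-in sqlite3 module. It is illustrative only: the real tables would be defined through Moodle's own PHP database layer, and the table names feed_urls and feed_items, the id columns, and the items_for_feed() helper are assumptions, since the spec does not name them.

```python
import sqlite3

# Hypothetical table names and id columns; the spec only lists the fields.
SCHEMA = """
CREATE TABLE feed_urls (
    id                INTEGER PRIMARY KEY,
    feedurl           VARCHAR(255) NOT NULL,
    normalizedfeedurl VARCHAR(255) NOT NULL,
    timefetched       INTEGER NOT NULL DEFAULT 0,
    feedtitle         VARCHAR(255),
    siteurl           VARCHAR(255),
    feeddescription   TEXT
);
CREATE TABLE feed_items (
    id              INTEGER PRIMARY KEY,
    feedurlsid      INTEGER NOT NULL REFERENCES feed_urls(id),
    itemurl         VARCHAR(255),
    itemtime        INTEGER,
    itemtitle       VARCHAR(255),
    itemguid        VARCHAR(255),
    itemdescription TEXT,
    itemposition    INTEGER
);
"""


def create_feed_tables(conn):
    """Create both tables on an open sqlite3 connection."""
    conn.executescript(SCHEMA)


def items_for_feed(conn, feed_id):
    """Return (itemtitle, itemurl) pairs for one feed, in feed order."""
    rows = conn.execute(
        "SELECT itemtitle, itemurl FROM feed_items "
        "WHERE feedurlsid = ? ORDER BY itemposition",
        (feed_id,),
    )
    return rows.fetchall()
```

The feedurlsid column is what ties each item row back to its feed row, and itemposition preserves the order in which the feed listed its items.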
API for communication with modules/blocks
Note: API specifics are preliminary.
find_feed( string url )
Returns a data structure with a list of RSS feeds found at the given URL. If the page itself is an RSS feed, returns just that feed. If the page is HTML, attempts to auto-discover RSS feeds advertised in the page header (typically through <link rel="alternate"> elements) or linked in the body.
A further possibility (and one that may be beyond the immediate scope) is to use some sort of search mechanism (such as Google Blog Search) to retrieve a list of RSS feeds that match a given search term. This might be fragile, because it would depend on the data provider maintaining consistency.
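The HTML auto-discovery step of find_feed() could look roughly like the following. This is a sketch in Python rather than Moodle's PHP, it covers only <link rel="alternate"> elements (not feeds linked in the body), and the names FeedLinkParser and discover_feeds() are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types commonly used to advertise RSS and Atom feeds.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects feed URLs advertised via <link rel="alternate"> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        if "alternate" in rels and attr.get("type", "").lower() in FEED_TYPES:
            href = attr.get("href", "")
            if href:
                # Resolve relative links against the page URL.
                self.found.append(urljoin(self.base_url, href))


def discover_feeds(html, base_url):
    """Return the feed URLs advertised in an HTML page, in document order."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.found
```

A full find_feed() would first check whether the fetched document is itself a feed and only fall back to this step for HTML pages.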
get_feed( string url )
Returns a data structure representing an RSS feed. The URL should be normalized before lookup (e.g. http://www.example.com/feed.xml and http://example.com/feed.xml are probably the same feed), but not overly aggressively. Results should be cached transparently and centrally, so that subsequent calls within some time period do not refetch the feed. The data structure should be consistent, abstracting away details of how individual feeds are organized where possible.
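The normalization and caching behaviour described for get_feed() might be sketched as follows. This is Python for illustration, not the eventual PHP implementation. The normalization rules (lower-case the host, drop a leading "www." and the fragment), the in-memory dictionary standing in for the database tables, and the names normalize_feed_url() and FeedCache are all assumptions.

```python
import time
from urllib.parse import urlsplit


def normalize_feed_url(url):
    """Conservative normalization: lower-case scheme and host, drop a
    leading "www." and any fragment; path and query stay untouched."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if parts.port:
        host = f"{host}:{parts.port}"
    key = f"{parts.scheme.lower()}://{host}{parts.path or '/'}"
    if parts.query:
        key += "?" + parts.query
    return key


class FeedCache:
    """Central cache keyed on the normalized URL (in memory for this sketch)."""

    def __init__(self, fetch, max_age=3600, clock=time.time):
        self._fetch = fetch      # hypothetical callable(url) -> parsed feed
        self._max_age = max_age  # seconds a cached feed stays fresh
        self._clock = clock
        self._entries = {}       # normalized URL -> (timefetched, feed)

    def get_feed(self, url):
        key = normalize_feed_url(url)
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now - cached[0] < self._max_age:
            return cached[1]     # fresh enough: no remote request
        feed = self._fetch(url)
        self._entries[key] = (now, feed)
        return feed
```

Keeping the path and query untouched is one way to stay on the "not overly aggressive" side: two URLs only share a cache entry when they differ in case of the host, a "www." prefix, or a fragment.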
get_feeds( array urls )
Returns a data structure representing multiple RSS feeds, with their items merged into one stream. Items are ordered by date-time information (either provided by the feed or recorded when fetched) and by the order in which they were provided. Should use get_feed() internally.
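One possible reading of the merge step in get_feeds() is sketched below in Python. It assumes each feed has already been fetched (via get_feed() in the real library) and is represented as a list of item dictionaries carrying an itemtime Unix timestamp. The function name merge_feeds() and the newest-first ordering are assumptions, as the spec does not fix either.

```python
def merge_feeds(feeds):
    """Merge items from several feeds into one newest-first list.

    `feeds` is a list of feeds in caller-provided order; each feed is a
    list of item dicts with an "itemtime" key (Unix timestamp). Ties on
    time are broken by feed order, then by position within the feed.
    """
    tagged = []
    for feed_index, items in enumerate(feeds):
        for position, item in enumerate(items):
            tagged.append((-item["itemtime"], feed_index, position, item))
    # Sort on the first three fields only, so item dicts are never compared.
    tagged.sort(key=lambda entry: entry[:3])
    return [entry[3] for entry in tagged]
```

Using the provided order as a tie-breaker keeps the result stable when several items share a timestamp, which is common when itemtime falls back to the fetch time.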
RSS block
The RSS block needs to be refactored to use the new public API. If possible, parts of the existing database structure should be maintained for backwards compatibility. The block will need to call find_feed() when a user adds an RSS feed and get_feed() on every subsequent display. It may also need minimal additions or changes to its preferences UI to accommodate the result of find_feed().
Criteria for PHP RSS parsing libraries
- Must have a GPL-compatible license
- Must be well-written and maintainable
- Must parse a wide variety of RSS/Atom feeds
- Should work in a wide variety of environments
- Should be well documented
- May also fetch feeds
- May do data normalization
- May cache feeds
- May auto-discover feeds
Localization Issues
- Character encoding and special characters need to be dealt with in a sane way
- International URLs need to be dealt with in a sane way
Security Issues
- Make sure parsed content is sanitized
- Make sure API input is sanitized
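As a minimal illustration of the first point, the sketch below escapes item text and rejects links with unexpected schemes before an item is stored or displayed. It is written in Python for brevity, is deliberately strict (all markup is escaped rather than filtered), and is no substitute for Moodle's own cleaning functions. The names safe_text(), safe_link() and sanitize_item() are hypothetical.

```python
from html import escape
from urllib.parse import urlsplit

# Assumption: only plain web links are acceptable in feed items.
ALLOWED_SCHEMES = {"http", "https"}


def safe_text(value):
    """Escape HTML special characters so the text cannot inject markup."""
    return escape(value or "", quote=True)


def safe_link(url):
    """Return the URL if it uses an allowed scheme, else an empty string."""
    url = (url or "").strip()
    if urlsplit(url).scheme.lower() in ALLOWED_SCHEMES:
        return url
    return ""


def sanitize_item(item):
    """Return a cleaned copy of the title and link of a parsed item."""
    return {
        "itemtitle": safe_text(item.get("itemtitle")),
        "itemurl": safe_link(item.get("itemurl")),
    }
```

A link accepted by safe_link() is only checked for its scheme. It would still need to be escaped when written into an HTML attribute.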
Ideas for the future
- Create new and compelling ways to view RSS feeds
- Create a centrally managed subscription mechanism for users/courses/plugins
See also
- Student projects
- Using Moodle: "Request for comments: Feed aggregation library spec" forum discussion
- CONTRIB-504 (http://tracker.moodle.org/browse/CONTRIB-504)