Student projects/Feed aggregation library

Note: This page outlines ideas for the Feed aggregation library project. It's a specification under construction! If you have any comments or suggestions, please add them to the page comments.


Status

This is a draft spec written as part of the Google Summer of Code submission of Chris Zubak-Skees (chriszs [at] gmail.com). It is preliminary and partial. The spec is based on the "Consuming RSS feeds" idea listed on Student projects. I welcome any and all feedback.

Executive Summary

RSS/Atom feeds are becoming an important technology on the web, so it's crucial that Moodle has good support for consuming these feeds for a variety of uses. At present we generate RSS feeds for use by other applications, but consume feeds only in the RSS feeds block. It would be useful to have a core library that takes care of aggregating feeds (and the issues around aggregation) and provides them in a simple format for plugins and other core parts of Moodle to use.

This project will involve creating a feed aggregation library which:

  • Will consume most current syndication formats (RSS 0.90/1.0/2.0, Atom, etc.)
  • Will work behind a variety of web proxies
  • Will aggregate periodically and cache the results (so the remote server is not queried on every request)
  • Will most likely be based on an existing PHP framework

As a proof of concept, it will be necessary to refactor the RSS block to use it.

To test this library, it will be necessary to develop a large test corpus of valid and mildly invalid RSS feeds, some gathered from popular websites and content management tools and some manufactured by the tester.

Steps

  • Further develop spec, get feedback, feel out implementation
  • Gather large test corpus of in-the-wild and manufactured RSS feeds
  • Evaluate open source PHP RSS parsing libraries on defined criteria
  • Write and test caching, normalization, retrieval, and API code (how much needs to be written depends on the capabilities of the chosen library, or on whether any library meets our needs at all)
  • Refactor and test RSS block to use new API

Glossary

Term | Definition
RSS feed | A list of links in a machine readable format. Often used to syndicate itemized and chronological content, such as blog posts. RSS refers to a particular technology, but we use it interchangeably with Atom here.
Atom | Another feed specification with characteristics similar to RSS feeds.

Database structures

Note: Database specifics are preliminary.

feed_urls

Stores a list of requested feed URLs and some associated information.

Field | Type | Default | Info
id | int(10) | | autoincrementing
feedurl | varchar(255) | | The URL at which to fetch the feed
normalizedfeedurl | varchar(255) | | A URL stripped of some specifics, used to match against requested URLs (see get_feed())
timefetched | int(10) | | The time this feed was last fetched. Used for caching and potential pruning
feedtitle | varchar(255) | | The name of this feed as retrieved when last fetched
siteurl | varchar(255) | | The URL of the site attached to this feed as retrieved when last fetched
feeddescription | text | | The description of the feed as retrieved when last fetched

feed_items

Stores a list of requested feed items.

Field | Type | Default | Info
id | int(10) | | autoincrementing
feedurlsid | int(10) | | Ties the item to the feed
itemurl | varchar(255) | | The URL as retrieved in the item
itemtime | int(10) | | The time this item indicates or when it was first fetched (requires keeping track of individual feed items)
itemtitle | varchar(255) | | The name of this item as retrieved when last fetched
itemguid | varchar(255) | | The unique id of the element as retrieved when last fetched
itemdescription | text | | The description of the item as retrieved when last fetched
itemposition | int(10) | | The position of the item in the RSS feed channel's list
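
To make the two draft tables concrete, here is a minimal sketch that creates them in an in-memory SQLite database. Python and SQLite are used purely for illustration; they are not the proposed implementation. The column names come from the tables above, while the SQLite types, the NOT NULL constraints, the index, and the helper names create_schema() and items_for_feed() are assumptions.

```python
import sqlite3

# Illustrative SQLite rendering of the two draft tables. Column names follow
# the spec; the types, constraints, and index are assumptions.
SCHEMA = """
CREATE TABLE feed_urls (
    id                INTEGER PRIMARY KEY AUTOINCREMENT,
    feedurl           VARCHAR(255) NOT NULL,
    normalizedfeedurl VARCHAR(255) NOT NULL,
    timefetched       INTEGER NOT NULL DEFAULT 0,
    feedtitle         VARCHAR(255),
    siteurl           VARCHAR(255),
    feeddescription   TEXT
);
CREATE TABLE feed_items (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    feedurlsid      INTEGER NOT NULL REFERENCES feed_urls(id),
    itemurl         VARCHAR(255),
    itemtime        INTEGER,
    itemtitle       VARCHAR(255),
    itemguid        VARCHAR(255),
    itemdescription TEXT,
    itemposition    INTEGER
);
-- Lookups by normalized URL are expected to be frequent (see get_feed()).
CREATE INDEX feed_urls_normalized ON feed_urls (normalizedfeedurl);
"""


def create_schema(conn):
    """Create both tables on an open SQLite connection (hypothetical helper)."""
    conn.executescript(SCHEMA)


def items_for_feed(conn, feed_id):
    """Return (itemposition, itemtitle) rows for one feed, in channel order."""
    rows = conn.execute(
        "SELECT itemposition, itemtitle FROM feed_items "
        "WHERE feedurlsid = ? ORDER BY itemposition",
        (feed_id,),
    )
    return rows.fetchall()
```

Storing itemposition separately from itemtime lets a block show items in the order the feed published them, even when a feed supplies no usable dates.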

API for communication with modules/blocks

Note: API specifics are preliminary.

find_feed( string url )

Returns a data structure listing the RSS feeds found at the given URL. If the URL itself points to an RSS feed, returns just that feed. If it points to an HTML page, attempts to auto-discover RSS feeds from the <link> elements in the page header or from links in the body.

A further possibility (and one that may be beyond the immediate scope) is to use some sort of search mechanism (such as Google Blog Search) to retrieve a list of RSS feeds that match a given search term. This might be fragile, because it would depend on the data provider maintaining consistency.
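
The header auto-discovery step of find_feed() could look roughly like the sketch below. It is written in Python for illustration only and covers just one case: <link rel="alternate"> elements whose type is an RSS or Atom media type. Feeds linked in the page body and the search-based idea are not handled. The names FeedLinkParser and discover_feeds() and the shape of the returned list are assumptions, not part of the spec.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Media types conventionally used to advertise feeds in <link> elements.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects <link rel="alternate"> feed references from an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values can be None for bare attributes; treat as empty.
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        if "alternate" in rels and attr.get("type", "").lower() in FEED_TYPES:
            href = attr.get("href", "")
            if href:
                self.feeds.append({
                    # Resolve relative hrefs against the page URL.
                    "url": urljoin(self.base_url, href),
                    "title": attr.get("title", ""),
                })


def discover_feeds(html, base_url):
    """Return a list of {"url", "title"} dicts for feeds advertised in html."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds
```

Because a page may advertise several feeds, the sketch returns all of them in document order and leaves the choice to the caller, which matches the idea of letting the RSS block present the result of find_feed() to the user.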

get_feed( string url )

Returns a data structure representing an RSS feed. The URL should be normalized before lookup (e.g. http://www.example.com/feed.xml and http://example.com/feed.xml are probably the same feed), but not overly aggressively. Results should be cached transparently and centrally, so that subsequent calls within some time period do not refetch the feed. The data structure should be consistent across feed formats, abstracting away details of the feed organization where possible.
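
The normalization and caching behaviour described for get_feed() can be sketched as follows, in Python and for illustration only. The normalization rules shown (lowercase scheme and host, drop a leading "www.", default ports, and fragments, keep path and query) are one conservative guess at "not overly aggressively". The in-memory FeedCache class, its max_age parameter, and the injected fetch callable are hypothetical; the real library would presumably store timefetched in the feed_urls table.

```python
import time
from urllib.parse import urlsplit, urlunsplit


def normalize_feed_url(url):
    """Conservatively normalize a feed URL for cache matching.

    Lowercases scheme and host, drops a leading "www.", default ports,
    fragments, and any user:password part. Path and query are kept
    untouched, since they often select different feeds.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port and parts.port != default_port:
        host = f"{host}:{parts.port}"
    return urlunsplit((scheme, host, parts.path or "/", parts.query, ""))


class FeedCache:
    """In-memory stand-in for the central cache (an assumption of this sketch)."""

    def __init__(self, fetch, max_age=3600, clock=time.time):
        self._fetch = fetch        # hypothetical callable: url -> parsed feed
        self._max_age = max_age    # seconds a cached feed stays fresh
        self._clock = clock        # injectable so expiry can be tested
        self._entries = {}         # normalized URL -> (timefetched, feed)

    def get_feed(self, url):
        key = normalize_feed_url(url)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._max_age:
            return entry[1]        # fresh enough: no remote request
        feed = self._fetch(url)
        self._entries[key] = (now, feed)
        return feed
```

Keying the cache on the normalized URL means two blocks that request the "www" and non-"www" forms of the same feed share one fetch, while URLs that differ in their query string are still treated as separate feeds.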

get_feeds( array urls )

Returns a data structure representing multiple RSS feeds with items merged into one stream based on either provided or fetched date-time information and provided order. Should use get_feed() internally.
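
One possible reading of the merge rule is sketched below in Python, for illustration only: items are sorted newest first by their time, and ties are broken by the order in which the feeds were provided and then by each item's position within its feed. The function name merge_feeds() and the dict shape with "items", "time", and "title" keys are hypothetical stand-ins for whatever get_feed() ends up returning.

```python
def merge_feeds(feeds):
    """Merge items from several feeds into one newest-first stream.

    Each feed is a dict with an "items" list; each item is a dict with a
    "time" (Unix timestamp) and a "title". Ties on time fall back to the
    order of the feeds as provided, then to the item's position in its feed.
    """
    tagged = []
    for feed_order, feed in enumerate(feeds):
        for position, item in enumerate(feed["items"]):
            # Negate the time so an ascending sort yields newest first.
            tagged.append((-item["time"], feed_order, position, item))
    # Sort on the first three fields only; the item dicts are never compared.
    tagged.sort(key=lambda entry: entry[:3])
    return [item for _, _, _, item in tagged]
```

The explicit tie-breakers keep the output stable when several items share a timestamp, which is likely for feeds without dates, whose items would all be stamped with the time they were first fetched.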

RSS block

The RSS block needs to be refactored to use the new public API. If possible, parts of the existing database structure should be maintained for backwards compatibility. The block will need to call find_feed() when a user adds an RSS feed and get_feed() on every subsequent display. It may also need minimal additions or changes to its preferences UI to accommodate the result of find_feed().

Criteria for PHP RSS parsing libraries

  • Must have a GPL-compatible license
  • Must be well-written and maintainable
  • Must parse a wide variety of RSS/Atom feeds
  • Should work in a wide variety of environments
  • Should be well documented
  • May also fetch feed
  • May do data normalization
  • May cache feeds
  • May auto-discover feeds

Localization Issues

  • Character encoding and special characters need to be dealt with in a sane way
  • International URLs need to be dealt with in a sane way

Security Issues

  • Make sure parsed content is sanitized
  • Make sure API input is sanitized
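
As a rough illustration of the first point, the Python sketch below escapes feed-provided text before it is displayed and rejects item URLs whose scheme is not on a small allowlist. It is a sketch under stated assumptions, not the proposed implementation: the helper names safe_text() and safe_url() and the choice of allowed schemes are hypothetical, and escaping everything discards any legitimate HTML formatting in item descriptions, which a real implementation might prefer to filter instead.

```python
from html import escape
from urllib.parse import urlsplit

# Assumption: only web links are acceptable in feed items.
ALLOWED_SCHEMES = {"http", "https"}


def safe_text(value):
    """Escape feed-provided text so it renders as text, not as markup."""
    return escape(value, quote=True)


def safe_url(value):
    """Return the URL if its scheme is allowed, otherwise an empty string."""
    url = value.strip()
    try:
        scheme = urlsplit(url).scheme.lower()
    except ValueError:
        # urlsplit rejects some malformed URLs; treat them as unsafe.
        return ""
    return url if scheme in ALLOWED_SCHEMES else ""
```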

Ideas for the future

  • Create new and compelling ways to view RSS feeds
  • Create a centrally managed subscription mechanism for users/courses/plugins
