Arguments in favour of PostgreSQL
Martin Langhoff argues in favour of PostgreSQL (source: Moodle over webct and LNLS at Athabasca University? forum posting)
Reasons to go with Postgres
There are several reasons to go with Postgres, I'll try to make a brief outline. We run a variety of RDBMSs at Catalyst, and have a lot of in-house experience with them: Oracle, Postgres, MySQL and Progress, plus a few others. We also have experience with replicated databases, clustering and other tricks -- which we use for the backend of the .nz root domain servers as well as a few other mission-critical systems.
On the performance side, Postgres requires a bit more up-front configuration than MySQL. A well tuned Postgres is pretty close on SELECT performance to MySQL with small databases. With large tables MySQL has some bad performance problems, and Postgres performs much better.
Write performance is also an issue with MySQL -- with a lot of traffic, it has serious problems with concurrent writes. Under heavy load, Postgres performs much better.
But to tell you the truth, the real reason for choosing Postgres is reliability. We maintain a lot of databases, and Postgres is rock-solid reliable and has a focus on ACID-correctness: when it returns from a commit, the data is safely on disk and won't be lost -- barring actual disk problems, which we offset using RAID-1.
No matter how hard we try, MySQL databases with a lot of usage have recurring index corruption issues. If you look at the startup scripts for MySQL on most Linux distributions, they check for data corruption on every startup -- this is to mask the fact that it is a frequent occurrence.
And while this is passable with small installations where the data isn't mission critical, you have to consider how much you can trust such approch. And with large datasets, runing isamchk/myisamchk can take hours -- we cannot afford that.
The clustering solution for MySQL is being touted a lot, and I think it is a red herring. My main concern about is that it writes "asynchronously" -- that is, there is no guarantee that your data is on-disk safely. It'll get to the disk sometime. It'll get to the slaves... sometime. Hmmm.
Given that the MySQL cluster uses async writes, splitting read/writes between the master and the slaves breaks down in cases where we write some data, and read it back in immediately (or soon after). And this does happen in quite a few places.
And you also have to consider the performance boost of using async writes: if you tell a standalone Postgres or MySQL to use async writes, it'll scale much better (should be able to handle up to 3-4 times more simultaneous writes). Once you do that, the performance advantage of the MySQL cluster mostly vanishes. It still has semi-hot takeover in case the master goes down, but Postgres can do that using Slony, and with better guarantees of consistency of the data in the slave.
In a nutshell, MySQL isn't normally very solid when it comes to ensuring my data is safely stored on-the-disk, even if it theoretically guarantees that it's been saved. And MySQL Cluster says up-front that there isn't a guarantee any more. Riiiiiight wink
Michael is talking about having UPSs. We have a car-sized UPS and a container-sized on-site generator that auto-starts. And yet, I wouldn't depend on that for my DB consistency on a large Installation. So many things other than power can (and do) go amiss. If a process has a problem storing the data, the right thing is to tell that back to the user. With async writes, you end up with a queue of data that hasn't been stored yet, but you already told the user it was.
That's not what a database is supposed to do.
I am currently exploring some techniques similar to those being used in livejournal and slashdot. We should be able to increase Moodle scalability by cutting down on DB load by about 50%. This is happening slowly in the gaps between more urgent projects. Feel free to ping Richard or me if you're interested in that track.