stonebraker.postgres_storage
Mike Stonebraker, Berkeley, 1987.
Summary.
The design of the Postgres Storage System was based on
4 goals:
- instantaneous recovery from crashes
- archival functionality
- take advantage of parallelism
- concurrency control via locking
The resulting storage system is update only. To enable an update-only
system, timestamping is used. Timestamping also enables historical
queries. The entire system is constructed as ``a collection of
asynchronous processes''. Archiving is achieved using an asynchronous
process referred to as the ``vacuum cleaner''.
Storage System. Interactions are identified with a transaction id (TID) and
a command id (CID) for the command within that transaction. A
TID and CID together form an interaction id (IID).
There is a small log tail (which would ideally be kept in NVRAM) that
has 2 bits per transaction to indicate the xact's state: in progress,
aborted, or committed. The body of the log is kept on disk and needs
only 1 bit per xact (aborted or committed), since every xact recorded
there has finished.
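A minimal sketch of this two-level status log (class and method names are mine, not from the paper; real storage would be bit arrays, not dicts):

```python
# Transaction states. The tail needs 2 bits per xact to represent all
# three; the body needs only 1 bit since its xacts have all finished.
IN_PROGRESS, ABORTED, COMMITTED = 0, 1, 2

class StatusLog:
    def __init__(self):
        self.tail = {}   # xact id -> 2-bit state (ideally in NVRAM)
        self.body = {}   # xact id -> 1 bit: True = committed (on disk)

    def begin(self, xid):
        self.tail[xid] = IN_PROGRESS

    def finish(self, xid, committed):
        self.tail[xid] = COMMITTED if committed else ABORTED

    def migrate(self, xid):
        """Move a finished xact's status from the tail to the disk body."""
        state = self.tail.pop(xid)
        assert state != IN_PROGRESS
        self.body[xid] = (state == COMMITTED)

    def status(self, xid):
        if xid in self.tail:
            return self.tail[xid]
        return COMMITTED if self.body.get(xid) else ABORTED
```

Looking up a record's fate is then a read of at most two bits, with no log replay on recovery.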
Records have these 8 pieces of information in addition to the
data:
- OID. Unique object identifier.
- XMIN. Transaction that created this record.
- TMIN. Time that this record became valid.
- CMIN. Command that created this record.
- XMAX. Transaction that invalidated this record.
- TMAX. Time that this record was invalidated.
- CMAX. Command that invalidated this record.
- PTR. Pointer to delta record (see below).
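The field list above can be sketched as a record header (field names follow the paper; the Python layout itself is my assumption):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Record:
    # The eight metadata fields stored with each record, plus the data.
    oid: int                        # unique object identifier
    xmin: int                       # xact that created this record
    cmin: int                       # command that created this record
    data: dict = field(default_factory=dict)
    tmin: Optional[int] = None      # time this record became valid
    xmax: Optional[int] = None      # xact that invalidated this record
    cmax: Optional[int] = None      # command that invalidated this record
    tmax: Optional[int] = None      # time this record was invalidated
    ptr: Optional["Record"] = None  # pointer to the delta record chain
```

A freshly created record has only OID, XMIN, and CMIN set; the invalidation fields stay empty until an update.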
When a record is updated, the old record is not deleted. Rather, the
new record is stored as a delta record and is chained to the formerly
valid record via its PTR field. Delta records are so named because
they contain only the changes from the previous version.
The first record in a chain is called the anchorpoint. Before a
transaction is committed, any modifications that it made must be
written to stable storage, and the page of the log that contains
this xact's bit must be flushed. Since the log is so small (1-2 bits
per xact), the hope is to actually keep its tail in NVRAM.
When a record is created, the OID is assigned and the XMIN & CMIN
fields are filled in from the current IID. When a record is updated
(which invalidates it), the XMAX & CMAX fields are set to the current
IID and the PTR field in the old record is set to point to the new
delta record.
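The create/update protocol above can be sketched with plain dicts (a hypothetical sketch: a real implementation would store only the changed fields in the delta, not a full record):

```python
def create(oid, iid, data):
    # A new record carries only OID, XMIN, and CMIN; the invalidation
    # fields stay empty, and TMIN is filled in later from commit times.
    xid, cid = iid
    return {"oid": oid, "xmin": xid, "cmin": cid, "tmin": None,
            "xmax": None, "cmax": None, "tmax": None,
            "ptr": None, "data": data}

def update(old, iid, changes):
    xid, cid = iid
    # Invalidate the old version in place...
    old["xmax"], old["cmax"] = xid, cid
    # ...and chain a delta record holding the new version to its PTR.
    delta = create(old["oid"], iid, changes)
    old["ptr"] = delta
    return delta
```

Note that nothing is ever overwritten destructively: the old version keeps its data, gaining only the XMAX/CMAX/PTR markings.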
When doing a search, every chain must be followed. This is assumed
to not be a bottleneck as anchorpoints and delta records should be
on the same page. A TIME relation is maintained that maps the xacts
to their commit times. The relation is really an array of timestamps
where TIME[X] = the commit time of xact X.
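Given the TIME array, the visibility test for a record at time T can be sketched like this (function and field names are mine; the record dict matches the metadata fields described above):

```python
def visible_at(rec, t, TIME):
    # TIME maps xact id -> commit time (absent/None if never committed).
    created = TIME.get(rec["xmin"])
    if created is None or created > t:
        return False                 # creator uncommitted, or committed after t
    if rec["xmax"] is None:
        return True                  # never invalidated
    invalidated = TIME.get(rec["xmax"])
    # Visible if the invalidating xact aborted or committed after t.
    return invalidated is None or invalidated > t
```

This is the check a historical query runs on each version in a chain; ``now'' queries are just the special case t = current time.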
With timestamping comes the ability to do historical queries. The
POSTQUEL language provides ways for the user to specify a time
interval. R-trees can be constructed on a relation using time
intervals as a key. Historical queries require no locking because the
historical data lives on immutable WORM media (see archiving).
Archiving. Records that are non-valid can be vacuumed
to a WORM drive (other media is okay) asynchronously. Partial chains
can be vacuumed. When vacuuming, anchorpoints will move
around. Modifying the indices is avoided using a layer of
indirection: indices point to line arrays on each page which, in
turn, point to anchorpoints.
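The indirection can be sketched as follows (a hypothetical sketch; names are mine):

```python
class Page:
    # Each page carries a line array mapping slot number -> the current
    # location of a record on that page.
    def __init__(self):
        self.line_array = {}        # slot -> record

def index_lookup(index, key, pages):
    # An index entry is a (page, slot) pair, never a record address, so
    # the vacuum cleaner can move an anchorpoint by rewriting a single
    # line-array entry without touching any index.
    page_no, slot = index[key]
    return pages[page_no].line_array[slot]
```

The cost is one extra indirection per lookup, paid to make vacuuming index-free.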
Archiving is done in 3 steps: (1) write record to archive, (2) write new
anchorpoint, (3) reclaim space from old anchorpoint. A crash can occur
after step 1 or 2 and no data will be lost; there may, however, be
duplicate data around.
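The three-step protocol can be sketched like this (a hypothetical model: `disk` maps location to a version chain, oldest first, and only the newest version stays on magnetic disk):

```python
def vacuum(old_loc, new_loc, disk, archive):
    chain = disk[old_loc]
    archive.extend(chain[:-1])     # 1. write dead records to the archive
    disk[new_loc] = chain[-1:]     # 2. write the new anchorpoint
    del disk[old_loc]              # 3. reclaim the old anchorpoint's space
    # A crash after step 1 or 2 leaves duplicates (dead records exist on
    # both disk and archive) but loses no data, because reclamation of
    # the old copy is always the final step.
```

The ordering is what matters: data is copied before any old copy is destroyed.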
Archiving has 3 modes:
- no archive: TMIN & TMAX are never filled in. Only ``now'' queries
can be performed on these relations.
- light archive: TMIN & TMAX are never filled in. The TIME relation
is consulted when an historical query is run. (Why don't they just
fill these in like with heavy archiving? Maybe because that turns the
read into a write, so you have to do an additional I/O?) Only
tolerable if each archived record is accessed about 2-3 times ever.
- heavy archive: The TIME relation is consulted when the first historical
query is run but then TMIN & TMAX are filled in with those values.
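Heavy-archive reads can be sketched like this (function name is mine): the first historical access pays the TIME-relation lookup, then caches the commit times so later reads skip it:

```python
def timestamps(rec, TIME):
    # Fill TMIN/TMAX from the TIME relation on first access (this turns
    # the first read into a write); subsequent calls find them cached.
    if rec["tmin"] is None:
        rec["tmin"] = TIME[rec["xmin"]]
    if rec["tmax"] is None and rec["xmax"] is not None:
        rec["tmax"] = TIME[rec["xmax"]]
    return rec["tmin"], rec["tmax"]
```

Light archive is the same logic minus the caching assignments, which is why it only pays off when each record is read a handful of times.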
Indices (and relations) are expected to grow very large, so they may
span both media. In other words, part of the index can reside in the
archive. All pages on the archive must point to archived records, and
their parents must be archived as well. Three policies are given for
moving index pages to the archive:
- Vacuum the leaf page that has the smallest value of TMAX. Move parents
in when all their children have been archived.
- Move the leaf page that is most full and oldest.
- Use batch mode: Move the entire index over to the archive. Then
build a new index on disk as time passes. I don't understand how this
can work unless you assume that all the records in the indexed
relation are historical when the batching is done.
Hardware novelties. Wanted to use NVRAM for the log tail and
WORM drives for archiving. Also wanted to build the system to
take advantage of parallelism -- uses a lot of asynchronous processes.
Performance. A rough performance analysis shows that the Postgres
storage manager is comparable to WAL (in terms of I/O) as long as the
log can reside in NVRAM. If not, Postgres is worse: it has 1 more
write, and 8 more of its writes are synchronous.
Perhaps it's due to my lack of real-life database exposure, but I got
the impression there was a lot of handwaving. For example, for light
archive situations, the author says that the overhead of retrieving
data from the TIME relation will only be tolerable for a very small
number of accesses: ``(about 2-3)''.
There was also a question in my mind about the assumption of
non-volatile main memory. Again, perhaps this is due to my lack of
real exposure to real database systems, but we certainly don't see
this around the workstations in Soda.