stonebraker.postgres_storage
Mike Stonebraker, Berkeley, 1987.
Summary.
The design of the Postgres Storage System was based on
4 goals:
- instantaneous recovery from crashes
- archival functionality
- take advantage of parallelism
- concurrency control via locking
The resulting storage system is update only. To enable an update-only
system, timestamping is used. Timestamping also enables historical
queries. The entire system is constructed as ``a collection of
asynchronous processes''. Archiving is achieved using an asynchronous
process referred to as the ``vacuum cleaner''.
Storage System. Interactions are identified with a transaction id (TID) and
a command id (CID) for the command within that transaction. A
TID and CID together form an interaction id (IID).
There is a small log tail (which would ideally be kept in NVRAM) that
has 2 bits per transaction to indicate the xact's state: in progress,
aborted, or committed. The body of the log is kept on disk and needs
only 1 bit per xact (aborted or committed), since every xact recorded
there has finished.
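A minimal sketch of this two-level status log (class and method names are mine, not from the paper; real storage would be bit arrays, not dicts):

```python
# Transaction states. The tail needs 2 bits per xact to represent all
# three; the body needs only 1 bit since its xacts have all finished.
IN_PROGRESS, ABORTED, COMMITTED = 0, 1, 2

class StatusLog:
    def __init__(self):
        self.tail = {}   # xact id -> 2-bit state (ideally in NVRAM)
        self.body = {}   # xact id -> 1 bit: True = committed (on disk)

    def begin(self, xid):
        self.tail[xid] = IN_PROGRESS

    def finish(self, xid, committed):
        self.tail[xid] = COMMITTED if committed else ABORTED

    def migrate(self, xid):
        """Move a finished xact's status from the tail to the disk body."""
        state = self.tail.pop(xid)
        assert state != IN_PROGRESS
        self.body[xid] = (state == COMMITTED)

    def status(self, xid):
        if xid in self.tail:
            return self.tail[xid]
        return COMMITTED if self.body.get(xid) else ABORTED
```

Looking up a record's fate is then a read of at most two bits, with no log replay on recovery.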
Records have these 8 pieces of information in addition to the
data:
- OID. Unique object identifier.
- XMIN. Transaction that created this record.
- TMIN. Time that this record became valid.
- CMIN. Command that created this record.
- XMAX. Transaction that invalidated this record.
- TMAX. Time that this record was invalidated.
- CMAX. Command that invalidated this record.
- PTR. Pointer to delta record (see below).
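The field list above can be sketched as a record header (field names follow the paper; the Python layout itself is my assumption):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Record:
    # The eight metadata fields stored with each record, plus the data.
    oid: int                        # unique object identifier
    xmin: int                       # xact that created this record
    cmin: int                       # command that created this record
    data: dict = field(default_factory=dict)
    tmin: Optional[int] = None      # time this record became valid
    xmax: Optional[int] = None      # xact that invalidated this record
    cmax: Optional[int] = None      # command that invalidated this record
    tmax: Optional[int] = None      # time this record was invalidated
    ptr: Optional["Record"] = None  # pointer to the delta record chain
```

A freshly created record has only OID, XMIN, and CMIN set; the invalidation fields stay empty until an update.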
When a record is updated, the old record is not deleted. Rather, the
new record is stored as a delta record and is chained to the formerly
valid record via its PTR field. Delta records are so named because
they contain only the changes from the previous version.
The first record in a chain is called the anchorpoint. Before a
transaction is committed, any modifications that it made must be
written to stable storage, and the page of the log that contains
this xact's bit must be flushed. Since the log is so small (1-2 bits
per xact), the hope is to actually keep its tail in NVRAM.
When a record is created, the OID is assigned and the XMIN & CMIN
fields are filled in from the current IID. When a record is updated
(which invalidates it), the XMAX & CMAX fields are set to the current
IID and the PTR field in the old record is set to point to the new
delta record.
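The create/update protocol above can be sketched with plain dicts (a hypothetical sketch: a real implementation would store only the changed fields in the delta, not a full record):

```python
def create(oid, iid, data):
    # A new record carries only OID, XMIN, and CMIN; the invalidation
    # fields stay empty, and TMIN is filled in later from commit times.
    xid, cid = iid
    return {"oid": oid, "xmin": xid, "cmin": cid, "tmin": None,
            "xmax": None, "cmax": None, "tmax": None,
            "ptr": None, "data": data}

def update(old, iid, changes):
    xid, cid = iid
    # Invalidate the old version in place...
    old["xmax"], old["cmax"] = xid, cid
    # ...and chain a delta record holding the new version to its PTR.
    delta = create(old["oid"], iid, changes)
    old["ptr"] = delta
    return delta
```

Note that nothing is ever overwritten destructively: the old version keeps its data, gaining only the XMAX/CMAX/PTR markings.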
When doing a search, every chain must be followed. This is assumed
to not be a bottleneck as anchorpoints and delta records should be
on the same page. A TIME relation is maintained that maps the xacts
to their commit times. The relation is really an array of timestamps
where TIME[X] = the commit time of xact X.
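Given the TIME array, the visibility test for a record at time T can be sketched like this (function and field names are mine; the record dict matches the metadata fields described above):

```python
def visible_at(rec, t, TIME):
    # TIME maps xact id -> commit time (absent/None if never committed).
    created = TIME.get(rec["xmin"])
    if created is None or created > t:
        return False                 # creator uncommitted, or committed after t
    if rec["xmax"] is None:
        return True                  # never invalidated
    invalidated = TIME.get(rec["xmax"])
    # Visible if the invalidating xact aborted or committed after t.
    return invalidated is None or invalidated > t
```

This is the check a historical query runs on each version in a chain; ``now'' queries are just the special case t = current time.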
With timestamping comes the ability to do historical queries. The
POSTQUEL language provides ways for the user to specify a time
interval. R-trees can be constructed on a relation using time
intervals as a key. Historical queries require no locking because the
historical data lives on immutable WORM media (see archiving).
Archiving. Records that are non-valid can be vacuumed
to a WORM drive (other media is okay) asynchronously. Partial chains
can be vacuumed. When vacuuming, anchorpoints will move
around. Modifying the indices is avoided using a layer of
indirection: indices point to line arrays on each page which, in
turn, point to anchorpoints.
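The indirection can be sketched as follows (a hypothetical sketch; names are mine):

```python
class Page:
    # Each page carries a line array mapping slot number -> the current
    # location of a record on that page.
    def __init__(self):
        self.line_array = {}        # slot -> record

def index_lookup(index, key, pages):
    # An index entry is a (page, slot) pair, never a record address, so
    # the vacuum cleaner can move an anchorpoint by rewriting a single
    # line-array entry without touching any index.
    page_no, slot = index[key]
    return pages[page_no].line_array[slot]
```

The cost is one extra indirection per lookup, paid to make vacuuming index-free.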
Archiving is done in 3 steps: (1) write record to archive, (2) write new
anchorpoint, (3) reclaim space from old anchorpoint. A crash can occur
after step 1 or 2 and no data will be lost; there may, however, be
duplicate data around.
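The three-step protocol can be sketched like this (a hypothetical model: `disk` maps location to a version chain, oldest first, and only the newest version stays on magnetic disk):

```python
def vacuum(old_loc, new_loc, disk, archive):
    chain = disk[old_loc]
    archive.extend(chain[:-1])     # 1. write dead records to the archive
    disk[new_loc] = chain[-1:]     # 2. write the new anchorpoint
    del disk[old_loc]              # 3. reclaim the old anchorpoint's space
    # A crash after step 1 or 2 leaves duplicates (dead records exist on
    # both disk and archive) but loses no data, because reclamation of
    # the old copy is always the final step.
```

The ordering is what matters: data is copied before any old copy is destroyed.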
Archiving has 3 modes:
- no archive: TMIN & TMAX are never filled in. Only ``now'' queries
can be performed on these relations.
- light archive: TMIN & TMAX are never filled in. The TIME relation
is consulted when an historical query is run. (Why don't they just
fill these in like with heavy archiving? Maybe because that turns the
read into a write, so you have to do an additional I/O?) Only
tolerable if each archived record is accessed about 2-3 times ever.
- heavy archive: The TIME relation is consulted when the first historical
query is run but then TMIN & TMAX are filled in with those values.
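Heavy-archive reads can be sketched like this (function name is mine): the first historical access pays the TIME-relation lookup, then caches the commit times so later reads skip it:

```python
def timestamps(rec, TIME):
    # Fill TMIN/TMAX from the TIME relation on first access (this turns
    # the first read into a write); subsequent calls find them cached.
    if rec["tmin"] is None:
        rec["tmin"] = TIME[rec["xmin"]]
    if rec["tmax"] is None and rec["xmax"] is not None:
        rec["tmax"] = TIME[rec["xmax"]]
    return rec["tmin"], rec["tmax"]
```

Light archive is the same logic minus the caching assignments, which is why it only pays off when each record is read a handful of times.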
Indices (and relations) are expected to grow very large, so they may
span both media. In other words, part of the index can reside in the
archive. All pages on the archive must point to archived records, and
their parents must be archived as well. Three policies are given for
moving index pages to the archive:
- Vacuum the leaf page that has the smallest value of TMAX. Move parents
in when all their children have been archived.
- Move the leaf page that is most full and oldest.
- Use batch mode: Move the entire index over to the archive. Then
build a new index on disk as time passes. I don't understand how this
can work unless you assume that all the records in the indexed
relation are historical when the batching is done.
Hardware novelties. Wanted to use NVRAM for the log tail and
WORM drives for archiving. Also wanted to build the system to
take advantage of parallelism -- uses a lot of asynchronous processes.
Performance. A rough performance analysis shows that the Postgres
storage manager is comparable to WAL (in terms of I/O) as long as the
log can reside in NVRAM. If not, Postgres is worse: it has 1 more
write, and 8 more of its writes are synchronous.
Perhaps it's due to my lack of real-life database exposure, but I got
the impression there was a lot of handwaving. For example, for light
archive situations, the author says that the overhead of retrieving
data from the TIME relation will only be tolerable for a very small
number of accesses: ``(about 2-3)''.
There was also a question in my mind about the assumption of
non-volatile main memory. Again, perhaps this is due to my lack of
real exposure to real database systems, but we certainly don't see
this around the workstations in Soda.