=head1 The Design of the Postgres Storage System

Mike Stonebraker, Berkeley, 1987.

B<Summary.> The design of the Postgres Storage System was based on 4 goals:

=over 4

=item *

instantaneous recovery from crashes

=item *

archival functionality

=item *

take advantage of parallelism

=item *

concurrency control via locking

=back

The resulting storage system is no-overwrite: updates never modify data in
place. To enable such a system, timestamping is used. Timestamping also
enables historical queries. The entire system is constructed as "a collection
of asynchronous processes". Archiving is achieved using an asynchronous
process referred to as the "vacuum cleaner".

=head1 More Detail

B<Identifiers.> Interactions are identified with a transaction id (TID) and a
command id (CID) for the command within that transaction. A TID and CID
together form an interaction id (IID). There is a small log (which would
ideally be kept in NVRAM) that has 2 bits per transaction to indicate the
xact's mode: in progress, aborted or committed. The body of the log is kept
on disk and is shortened to 1 bit per xact: aborted or committed. Since the
log is only one bit per xact, the hope is to actually keep it in NVRAM.

Records have these 8 pieces of information in addition to the data:

=over 4

=item 1

B<OID>. Unique object identifier.

=item 2

B<XMIN>. Transaction that created this record.

=item 3

B<TMIN>. Time that this record became valid.

=item 4

B<CMIN>. Command that created this record.

=item 5

B<XMAX>. Transaction that invalidated this record.

=item 6

B<TMAX>. Time that this record was invalidated.

=item 7

B<CMAX>. Command that invalidated this record.

=item 8

B<PTR>. Pointer to delta record (see below).

=back

When a record is updated, the old record is not deleted. Rather, the new
record is stored as a I<delta record> and is chained off the formerly valid
record's B<PTR> field. Delta records are so named because they only contain
the changes from the last record. The first record in a chain is called the
I<anchor point>.

When a record is created, only the OID, XMIN & CMIN fields are filled in, the
latter two from the current IID. When a record is updated (which invalidates
it), the XMAX & CMAX fields are set to the current IID and the PTR field in
the old record is set to point to the new delta record. When doing a search,
every chain must be followed. This is assumed not to be a bottleneck, since
the vacuum cleaner keeps chains short by archiving invalidated records.
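To make the chain rules concrete, here is a minimal C sketch. The type names,
the NO_XACT sentinel and the in-memory layout are assumptions for
illustration, not the actual Postgres on-disk format. It presumes TMIN & TMAX
have been filled in (heavy-archive style), and it treats each chain entry as
a full record, whereas in the paper a delta holds only the changed fields and
must be merged forward from the anchor point.

    #include <stdint.h>
    #include <stddef.h>

    typedef uint32_t Oid;    /* object identifier                  */
    typedef uint32_t Xid;    /* transaction identifier             */
    typedef uint16_t Cid;    /* command identifier within the xact */
    typedef uint32_t Stamp;  /* commit timestamp                   */

    #define NO_XACT ((Xid)0) /* sentinel: XMAX not yet filled in   */

    /* Per-record header, following the eight fields listed above. */
    typedef struct Record {
        Oid   oid;           /* 1: unique object identifier        */
        Xid   xmin;          /* 2: xact that created the record    */
        Stamp tmin;          /* 3: time the record became valid    */
        Cid   cmin;          /* 4: command that created the record */
        Xid   xmax;          /* 5: xact that invalidated it        */
        Stamp tmax;          /* 6: time it was invalidated         */
        Cid   cmax;          /* 7: command that invalidated it     */
        struct Record *ptr;  /* 8: pointer to next delta record    */
        /* ... record data would follow ... */
    } Record;

    /* Walk a chain from its anchor point and return the version that
     * was valid at time t, or NULL if none was. */
    Record *version_at(Record *anchor, Stamp t)
    {
        for (Record *r = anchor; r != NULL; r = r->ptr) {
            int created     = (r->tmin <= t);
            int invalidated = (r->xmax != NO_XACT && r->tmax <= t);
            if (created && !invalidated)
                return r;
        }
        return NULL;
    }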
B<Timestamps.> A TIME relation is maintained that maps xacts to their commit
times. The relation is really an array of timestamps where TIME[x] = the
commit time of xact x. With timestamping comes the ability to do historical
queries. The POSTQUEL language provides ways for the user to specify a time
interval. R-trees can be constructed on a relation using time intervals as a
key. Historical queries require no locking because the data they read is on
WORM media (see archiving).

B<Archiving.> Records that are non-valid can be I<vacuumed> to a WORM drive
(other media is okay) asynchronously. Partial chains can be vacuumed. When
vacuuming, anchor points will move around. Having to modify the indices is
avoided using a layer of indirection: indices point to I<line tables> on each
page which, in turn, point to anchor points. Archiving is done in 3 steps:
(1) write record to archive, (2) write new anchor point, (3) reclaim space
from old anchor point (see the sketch at the end of these notes). A crash can
occur after steps 1 or 2 and no data will be lost. There may be duplicate
data around, however.

Archiving has 3 modes:

=over 4

=item *

no archive: TMIN & TMAX are never filled in. Only "now" queries can be
performed on these relations.

=item *

light archive: TMIN & TMAX are never filled in. The TIME relation is
consulted when an historical query is run. (Why don't they just fill this in
like with heavy archiving? Maybe because that turns the read into a write, so
you have to do an additional IO?) Only tolerable if archived records are
accessed 2-3 times ever.

=item *

heavy archive: The TIME relation is consulted when the first historical query
is run, but then TMIN & TMAX are filled in with those values.

=back

Indices (and relations) are expected to grow very large, so they may be
I<partially archived>. In other words, part of the index can reside in the
archives. All pages on the archive must point to archived records, and a
page's children must be archived before the page itself can be. Three
policies are given for moving index pages to the archive:

=over 4

=item *

Vacuum the leaf page that has the smallest value of TMAX. Move parents in
when all their children have been archived.

=item *

Move the leaf page that is fullest and oldest.

=item *

Use I<batch> mode: move the entire index over to the archive, then build a
new index on disk as time passes. I don't understand how this can work unless
you assume that I<all> the records in the indexed relation are historical
when the batching is done.

=back

B<Hardware.> The design wanted to use NVRAM for the log tail and WORM drives
for archiving. It also wanted to take advantage of parallelism -- the system
uses a lot of asynchronous processes.

B<Performance.> A rough performance analysis shows that the Postgres storage
manager is equivalent to WAL performance (in terms of I/O) I<if non-volatile
main memory is available>. If not, Postgres is worse in that it has 1 more
write, but it has 8 more I<asynchronous> writes.

=head1 Notes

Perhaps it is due to my lack of real-life database exposure, but I got the
impression there was a lot of handwaving. For example, for light archive
situations, the author says that the overhead of retrieving data from the
TIME relation will only be tolerable if records are accessed a very small
number of times: "(about 2-3)".

There was also a question in my mind about the assumption of non-volatile
main memory. Again, perhaps this is due to my lack of real exposure to real
database systems, but we certainly don't see this around the workstations in
Soda.
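Returning to the three-step archiving protocol above, here is a minimal C
sketch, reusing the Record type from the earlier sketch. All three helper
functions are hypothetical stand-ins for the vacuum cleaner's actual
operations; only the step ordering comes from the paper.

    /* Hypothetical primitives; the signatures are assumptions.        */
    void archive_write(const Record *r);  /* copy record to the archive */
    void anchor_write(Record *r);         /* install the new anchor point,
                                             updating the page's line table */
    void anchor_reclaim(Record *r);       /* reclaim old anchor's space */

    /* Move one record to the archive.  The ordering is what makes the
     * protocol crash-safe: a crash after step 1 or 2 loses no data; at
     * worst a duplicate copy survives until a later pass reclaims it. */
    void vacuum_record(Record *old_anchor)
    {
        archive_write(old_anchor);   /* step 1 */
        anchor_write(old_anchor);    /* step 2 */
        anchor_reclaim(old_anchor);  /* step 3 */
    }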