Everybody that deals with databases for a long time knows this problem: contempts, which are sometimes confused with deadlocks. The cause of a comtempt is perfectly known: several clientes trying to update the same data at the same time, forcing them to enter a queue and slowing things down. Contempts are not deadlocks since transactions victimized by contempts will eventually complete, while true deadlocks will stall the offending transactions forever, unless one or more of them is withdrawn/cancelled.
The occurence of cntemptions is mitigated or exacerbated by database locking mechanisms. Data locking is essential to guarantee data integrity -- it is certainly much worse to let two transactions mix together, changing the same data, and we would end up with Frankenstein records of database.
In my old good times as a Clipper developer, contemption occured mostly around the index files, not around data files themselves. Clipper implements row-level locking for writes, which is fast and tends to generate few contemption. But index files were a problem because they had complex internal structures and demanded whole file locking when updated (which also blocks readers). Some developers employed third-party database drivers in order to try to improve escalability of index system.
Surprisingly enough, a countless number of other more products, all of them much more pretentious than Clipper, also had serious contemption problems. Among the ones I had to endure personally, I remember ZIM and early versions of MS SQL. They were certainly worse than Clipper because they employed page-level locking instead of row-level locking.
The first database that I met personally which truly solved the problem was Borland's Interbase, today known as Firebird, a quite popular database among Delphi developers. Though not incredibly fast, it works well and keeps the developer safe from contemption issues. (I guess this was a very strong selling point for Clipper developers that migrated to Delphi, since Clipper guys like me would have a tough time getting used to the contemption issues which were commonplace in MS SQL).
Well, Interbase avoids contemption, deadlocks and guarantees ACID transactions altogether using a single mechanism: multiple generations of data (Multiple Version Concurrency Control -- MVCC). The idea is: never update or remove a row on disk. If you want to update, append a new row with the same primary key, which shadows the old one. If you want to remove, add a new row that labels the primary key as deleted, which both shadows and hides the old row.
Old rows are never updated, so it is not necessary to implement a data locking mechanism -- and contemption magically disappears.
It is so simple that it sounds silly, but it works perfectly well. The scheme has two relatively small defects: eats up more disk, and may be a bit slower. On top of it, each row must have two extra, "hidden" columns: a creation timestamp and some kind of "deleted" flag.
A decisive advantage of MVCC is that database files can be (in theory) copied "hot", that is, with database up and running. The worst that may happen is to get a corrupted version of the most recent row, but all rows before it will be consistent. Compare this to a typical database which can never be backed up "hot", since any file page can be changed at any time -- copying the "hot" file will probably render a corrupted copy, since each part of the file is a snapshot of a different moment.
It seems that all major database implement MVCCC now. In MySQL case, InnoDB and Falcon drivers are MVCC, even though InnoDB implements updates using row-level locking which may still cause contemption.
Another very welcome feature that is easy to implement under MVCC, is database snapshotting -- taking a 'picture' of the database situation at a certain moment, and being able to query this 'picture' later. I can't understand why Firebird does not offer snapshots out of the box, since MVCC makes it very easy to implement.
Snapshots are far more useful than they seem. I have forgotten how may times I heard user complaints because e.g. someone changed a client's address and the truck ended up delivering cargo at the wrong place. Most commercial ERP systems keep old rows in an explicit fashion, doing it by themselves since commercial databases don't do that for themselves.
For hacking value, let's see how would it be to implement MVCC over a database that is not MVCC by itself. As we said before, we create tables with two extra, hidden columns:
CREATE TABLE t ( code INT NOT NULL, name VARCHAR, _timestamp TIMESTAMP NOT NULL, _deleted INT NOT NULL DEFAULT 0, PRIMARY KEY (code, _timestamp) );
The typical primary key would be "code" but we will have several versions of the same row on table, so the primary key must compound with the timestamp. The code+timestamp tuple is guaranteed to be unique, provided that timestamp has sufficient precision.
In order to add rows, we may not need to specify the "hidden" columns since most databases offer reasonable default values for them. In fact, a monotonically increasing integer would be a better timestamp than a system time-based timestamp... but we are just playing, after all!
We can add as many versions of the same row (i.e. with the same code) as we can. In order to "delete" a row, we simply add a new row, with _deleted column equal to 1. Now, when we query the table, we probably want just the latest versions of every code, which we filter with the following SQL code:
select * from t t1 where t1._timestamp = (select max(_timestamp) from t where t.code = t1.code) and _deleted = 0;
Not incredibly efficient, but it works. Frankly, I could not avoid using subselects, which typically degrade performance. If someone can unroll this query for me, I'd be thankful.
Anyway, the SQL command above will do what we want: fetch the latest versions of every table row, leaving out rows marked as deleted. In the other hand, if we want to get a snapshot of the same table at May 01, 2007, 4:30pm, we could do it with the following SQL:
select * from t t1 where t1._timestamp = (select max(_timestamp) from t where t.code = t1.code and t._timestamp <= "2007-05-01 16:30" ) and _deleted = 0;
Let's suppose that our beloved SQL database has no support to transactions. One key feature of a transaction if isolation: it cannot see "new" data, that is, data that was inserted or updated after the beginning of transaction. That's exactly what we achieved with the SQL command above: it is just a matter of using the transaction start timestamp instead of May 1997. We have just achieved the "I" of ACID.
The "A" (atomicity) and "D" (durability) of ACID cannot be guaranteed by MVCC alone, it needs some help from the filesystem. The "C" (consistent) guarantee is just a matter of creating SQL foreign keys in strategic places.
An additional protection, outside ACID, supplied by every MVCC database, is to prevent that concurrent transactions update the same row. Of course it would never happen on table level since every update creates a new row, but even this must me avoided, since it is certainly not desirable that one transacton's work "shadows" completely the other. Worse yet, we would not be able to predict which transaction would "win" -- leading to uncertainty and race conditions.
Full-fledged MVCC database detect such potential conflicts, leave the oldest transaction to finish, and rollback the others (and signal the respective clients that their transactions were aborted). But all this work goes to waste if the application does not handle the abortion correctly. I have seen many applications just try the same transaction again upon abort, defeating completely the database conflict detection's purpose.
Being optimistic and assuming that applications will do the right thing, how could we simulate the conflict detection in our homebrew, SQL trick-based MVCC, As far as I could see, the only way would be to keep all updates in memory, and verify if any updated rows were changed on disk before transaction commit.
Leaving updates in RAM before commit seems to be a half-baked and insecure solution, but it is exactly what MySQL's Falcon does. Falcon's main architect was also Firebird's. Firebird is actually more careful and keeps all ongoing transactions on disk, in a way that they can be recovered should a power loss occurs. The problem with this neat feature is that it makes Firebird slower, and I haven't seen any application that makes use of this feature correctly; all apps just discard uncomitted transactions.
A funny thing about current relational databases is that they handle snapshots as advanced, almost sacred, features. Maybe everybody is underusing MVCC.
For example, when we print an invoice, we must keep all data printed on it, like the client's address at that time, because the client's address may change later but we must be able to recall the address at the time of invoice emission. Most system just dump all printed data on a invoice table, which consumes a lot of disk space.
If the database offered versioning and snapshots in a way that applications could easily manipulate them, we could do better. It would be just a matter of storing the invoice timestamp. Even if the most recent client's address is changed many times, we can always recall which was the address at the epoch of invoice, directly from clients table, avoiding completely the duplicate storage of printed invoice data.
Using the SQL tricks we demonstrated above, we could do things that way right now, even without a "true" snapshotting support at database.
Now, the real fun is to implement a MVCC database from scratch. Being honest, it is not that difficult; as we pointed before, the real headache is to keep the indexes, since they are B-tree structures and need to be updated, not only appended. (If you are really interested on a novel MVCC implementation, you should take a look on ZODB/ZEO, which does MVCC, snapshots, ACID and so on. ZODB handles indexes as an auxiliary thing, separate from data, thus keeping the headache restricted to the index engines.)
First of all, we must define the database file layout. The easiest layout to work with, which was the dBase/Clipper one, is the fixed-width row with an opaque header for each row which allows to rebuild the file in case of corruption:
file preamble {
row size
last written timestamp
Message Authentication Code (MAC)
}
row (many per file) {
magic preamble
column 1
column 2
...
column n
creation timestamp (_timestamp)
marked as deleted? (_deleted)
row MAC
}
One thing that dBase lacked was the MAC, so it could not detect whether a row was corrupt, though it could detect half-written rows since the row size was fixed.
As the database is started, the file must be checked, which means check its size and the last row's MAC. If it is wrong, or worse, the last row's size is smaller than expected, it is corrupt and must be discarded. It is a trick analog to the one we employed in this article about ACID transactions using common filesystem files.
A transaction will often involve more than one table, so will involve more than one file. Data written on a single table is not enough to determine whether the transaction was fully comitted. We must have a "master file" which contains the following structure:
last transaction {
timestamp
MAC
}
As we commit a transaction, this master file must be the last one to be updated. This guarantees that, if a row newer than the last timestamp of master file exists, that row must be discarded since it is result of a half-written transaction.
Indeed, this master file must be written in two or more copies, since this file will be rewritten all the time and the chance of corruption is big; having more copies (and overwriting only one at a time) guarantees that we will always have at least one "good" copy.
Our master file above is entirely overwritten every transaction. If, for some reason, we would like to have a list of committed transactions for future audit, it is just a matter of appending transaction log rows at the end of master files instead of rewriting the file from zero.
With this incredibly simple structures, our transactions will be almost ACID and our database will be MVCC. As a final touch, the durability of transactions are achieved just writing the files in synchronous mode.
Since we are creating a database from scratch just for fun, let's elaborate on details. The timestamp that I employed at SQL tricks at the beginning of the article is actually bad, since MySQL timestamps have 1 second resolution. A lot of rows will end up having the same timestamp, which is not good.
The perfect timestamp is 100% unambiguous and unique, that is, it is employed at most once. An alternative timestamp that works well is a monotonically increasing counter (0, 1, 2, 3, ...) which is incremented for each transaction. As long as we prevent the same transaction to update the same row twice, that timestamp would be enough to MVCC, and the master file can hold the last synthetic timestamp.
So, now we are left with a small big problem: indexes. We can't ignore them becaus, as shown at the beginning of the article, we absolutely need fast queries to find the latest versions of each primary key, and primary keys themselves demand indexes to work well.
The index problem is made worse by the multitude of options that we have. Keep the table in order? Bitmap indexes? Hashes? B+ trees? Well, we can argue that B+ trees are the workhorse of most databases, and they will work reasonably well in all conditions.
For the ones lucky enough never having to implement a B+ tree, there goes a brief explanation. The basic structure written in disk, for an order-4 tree, is:
tree node {
record 1
record 2
record 3
number of filled records
branch 0
branch 1
branch 2
branch 3
}
This structure has a fixed size. Normally, the row is not entirely written inside the tree; only the index value and a pointer to the row is kept on the record; the rest of the data is stored in the table file, as usual.
As the index file grows, we can add more nodes to the index file in the same fashion as we do for MVCC files. The problem is that, in index case, we can not write and forget; sometimes we need to update old nodes, as we'll see.
Assuming an empty index, with only one node on disk. At the first, second and third row insertions, we just occupy the records on the node, so the same one will be updated on site three times. Say goodbye to the convenience of redo logs.
+-----+-----+-------+ |Alice| Bob |Charlie| +-----+-----+-------+
At this point, if we need to search by the primary key, it is just a matter of doing a sequential search on the node records. Actual B+ trees will have far bigger orders, because sequential searches are more efficient than tree searches up to a certain size.
When we need to insert the fourth row, this node has no more room. Now we want to insert "Alma", which lies between "Alice" and "Bob". The middle record (Bob) will be "promoted" to a higher, new node. The remaining records are split in two nodes (one that already exists, and a new one). So, at this point, we have three nodes.
+------+--------+--------+ | Bob | * | * | +------+--------+--------+ +-----+-+-+ +-------+-+-+ |Alice|*|*| |Charlie|*|*| +-----+-+-+ +-------+-+-+
Ok, but all we wanted in the first place was to insert "Alma", which is just below Bob in alphabetic order. So:
+------+--------+--------+ | Bob | * | * | +------+--------+--------+ +-----+----+-+ +-------+-+-+ |Alice|Alma|*| |Charlie|*|*| +-----+----+-+ +-------+-+-+
Now, the problem is that we have a loose collection of nodes; the computes still does not know the hierarchy between them, which is necessary to conduct a search. Enter the branches.
The "Bob" node, which is in a higher leve, is the one that must point to the lower level nodes. So, the high node's branch 0 is pointed to "Alice/Alma" node, and branch 1 is pointed to "Charlie" node. Why?
Because the branches of a node guard a relationship with the records of the same node. Branch 0 points to elements "below" record (in this case, Alma is below Bob in alphabetic order). Branch 1 points to elements above record 1 and below record 2. And so on. By following the trails, a computer can conduct a search now.
+------+--------+--------+
| Bob | * | * |
+------+--------+--------+
| |
| |
| |
| |
| |
| |
| |
+-----+----+-+ +-------+-+-+
|Alice|Alma|*| |Charlie|*|*|
+-----+----+-+ +-------+-+-+
| | | | | | | |
x x x x x x x x
If then we came to add Anton and Barbara, they would fit in the same node as Alice and Alma. As we would try to insert Barbara, that node would be fully taken. The medium record of Alice/Alma/Anton is Alma, which is promoted and moved to the higher node, besides Bob. A new node is created and Anton is moved to there. Finally, Barbara would be added to the same node as Anton, since it is below Bob.
Now the root node is Alma and Bob. We need to rearrange the branches. The Alice node still hangs in branch 0 since it is below Alma. Anton/Barbara node is connected to root's branch 1 since they are below Bob. Charlie's node is reconnected to root's branch 2 since Bob was moved to record 2, and Charlie is above Bob. Final arrangement looks like this:
+-------+------+--------+
| Alma | Bob | * |
+-------+------+--------+
| | | |
| | | x
| | |
| | |
| | +-------
| | \
| | \
+-----+-----+-+ +------+-------+-+ +-------+-+-+
|Alice| * |*| |Anton |Barbara|*| |Charlie|*|*|
+-----+-----+-+ +------+-------+-+ +-------+-+-+
| | | | | | | | | | | |
x x x x x x x x x x x x
Index maintenance in "radical" MVCC system as the one we are proposing has one advantage: rows are never updated or removed, they are only appended. This simplifies the index implementation quite a bit, since the tree will hever have to be shrunk.
So, it seems perfectly possible to have several threads querying and/or updating the B+ tree at the same time. It is noteworthy that data queries always run the tree from top to bottom, while insertions always affect the tree from bottom up.
If we had only one kind of thread operating on tree (only queries or only insertions) they could live together trivially. In case of insertions, the node being updated should be locked, and other insertion threads would wait for unlocking in a queue. Locking could be implememented by a simple mutex that lives in memory -- no to use file-based locking.
Both insertions and queries must lock the node being manipulated/read, and the lock must block readers too, since the reader could read corrupt data if crossed with the writer. (But, since reading is in general very fast and atomic operation, and if the node is made small, maybe it is possible to abolish read locking).
If a query thread has already crossed with a insertion thread and is reading the nodes more to the bottom, there is no more risk of clashing, unless the query crosses yet another insertion thread.
If the query thread stops because a node is locked for updates, it cannot simply wait for unlock and continue the query from that point, because it may well happen that a record was moved to the parent node (which the query thread has already queried).
A query thread must "observe" what happened with the insertion, and react accordingly. If the locked node had spare records, insertion was finished right there, and the query thread can continue on that node. But, if the node was full, it was split and two, and a record was moved to the parent node -- and the query must climb back to the parent node, wait for unlock, and requery the node.
Since the parent node can be full as well, everything may happen again. In worst case, the query may be pushed back all the way to the top node and will have to query the whole tree again. And, if the table has a high rate of insertions, it may also happen that all query threads get stuck on the top of the tree, never being able to finish. A simple timeout scheme may be implemented to slow down insertions, should the queries are not completed in a reasonable time. Using high-order nodes (with a lot of records per node) also helps to reduce the number of node splits and hence the chance of query threads having to climb nodes back.