Wednesday, December 14, 2005

Free Space Management

Despite my previous resolve, I have started work on the Tuple Manager module this month. I'll talk about some of the issues and challenges in this module.

The Tuple Manager module provides a low-level interface for managing persistence of table rows. It is low-level in the sense that this module has no knowledge of what is contained in a table row. I use the term tuple instead of table row, but even this is not the right term, as a tuple means a collection of attributes.

To the Tuple Manager, tuples are just blobs of data that can span multiple pages. When a tuple is inserted for the first time, it is assigned a unique Location, which is really an abstraction of the ROWID concept in other databases. The Tuple Manager module implements the Location interface, but other modules do not need to know anything about the internal structure of these objects.

Like BTree indexes, tuples are stored in containers. A container that is specialized for storing tuples is called a Tuple Container. By design, only one type of tuple may be stored in a particular container. In a higher level module, a tuple can be mapped to a table row, and the container to a table.

I have kept the interface of Tuple Manager generic by ensuring that it knows very little about tuples. Unfortunately, this means that tuple updates cannot be handled efficiently, specially with regards to logging, as the contents of the both before and after images of the tuple must be logged. A possible optimisation would be to use some form of binary diff algorithm to generate a change vector, and store the change vector only in the log record. I will initially implement the less efficient logging method, and later on, when time permits, implement the optimisation.

Both BTrees and Tuple Containers need free space management. By free space management we mean the process of identifying pages where new data can go. In the case of BTrees, SimpleDBM uses space map pages that use one bit per page. This is okay, because in a BTree, a page is either allocated or not. My paper examines some of the space management issues in BTrees in more detail.

In case of Tuple Containers, I am using space map pages that use two bits to store the space information for a single page. This means that we can track the following states: full (3), two-thirds full (2), one-third full (1), and empty (0). Initially pages start out empty, but when they are used, their status changes as required.

There are a couple of issues related to space management that merit discussion. Unfortunately, there are very few papers that go into all the details. I have found the following papers very useful:

[1] C.Mohan and D.Haderle. Algorithms for Flexible Space Management in Transaction Systems Supporting Fine-Granularity Locking. In Proceedings of the International Conference on Extending Database Technology, March 1994.

[2] Mark L.McAuliffe, Michael J. Carey and Marvin H. Solomon. Towards Effective and Efficient Free Space Management. ACM SIGMOD Record. Proceedings of the 1996 ACM SIGMOD international conference on Management of Data, June 1996.

The first issue is how to handle deletes. There are two options.

The first option is to delete a tuple physically and update space map page to reflect the change. However, this poses the problem that if the transaction aborts and the tuple needs to be restored, then the restore will fail if some other transaction uses up the released space in the meantime. To prevent this, some extra information needs to be stored in the page to indicate that although the tuple has been deleted, its space is reserved, and cannot be used by any other transaction. One possible approach is to store the transaction id and the amount of space reserved using the space previously occupied by the tuple. If the tuple occupied more than one page, then space must be reserved on all affected pages, since otherwise, when the tuple is to be restored, the pages may no longer have space to hold the tuple data. If logical undo is implemented, then it is possible to avoid reserving space in pages other than the first page, because a logical undo will allow new pages to be commissioned if necessary to accomodate the restored tuple. Since the tuple's unique id (Location) is bound to the first page, this page must always have space available for the tuple to be restored.

If as suggested above, the space map information is updated as soon as the tuple is deleted, then other transactions looking for free space may end up visiting the pages that have been affected by the tuple delete. However, those transactions may discover when they access the page, that space is not actually available. As a solution to this problem, the space map page update could be deferred until the tuple delete is known to have been committed. However, this would be inefficient, as the transaction that performs the delete will have to visit all pages affected by deleted tuples at commit time, and update the free space map information for these pages.

The space map update could also be delegated to the next transaction that needs the space. The problem with this is that if the page remains marked as fully allocated, then no other transaction will visit that page unless the tuples on the page need to be updated. There is the risk that the tuple space will never be reclaimed.

The problem of unnecessary visits to a page containing reserved space can be avoided by techniques described in [2]. This involves maintaining a cache of recently used pages and avoiding scanning of the free space map as long as there is candidate page available in the cache. When a page is affected by a tuple delete, it is added to the cache provided that its total free space, including the space reserved for deleted tuple, is greater than the fullest page in the cache. If a transaction visits such a page and is unable to use the reserved space, it removes the page from the cache.

In summary then, the preferred option appears to be to update the space map information as soon as the tuple is deleted.

Physically deleting tuples affects the amount of logging that must be performed. Since the tuple's data is removed, the log must contain the entire contents of the deleted tuple. Similarly, when undoing the delete, the Compensation log record must again contain the full content of the tuple. Thus the tuple data gets logged twice potentially, once when the delete occurs, and again if the transaction aborts.

This brings us to the second option, which is to use logical deletes. In this solution, the tuple remains as is, but is marked as deleted. No space reservation is needed, as the tuple still exists. The space map information is updated as before, that is, at the time of tuple being deleted. Using logical deletes makes undo of such deletes a simple matter of resetting the deleted flag. Logging overhead is substantially reduced.

With logical deletes, however, none of the space can be released prior to the transaction commit. In contrast, with physical deletes, if logical undo is implemented, at least some of the space can be immediately released.

Whether logical or physical deletes are used, in both cases, we still have the issue of how to inform other transactions that the tuple space is still needed. In both cases, the solution is the same. The Lock Manager can be used to ascertain whether the deleted tuple is still locked. If not, then the transaction can infer that tuple delete has been committed. The Lock Manager solution works even if the ID of the transaction that deleted the tuple is unknown, as it relies upon the tuple's Location only. If each tuple is tagged with the ID of the last transaction that updated the tuple, then it would be possible to directly query the transaction table for the status of the transaction. However in this case, the system would have to maintain the status of all transactions, even those that have committed or aborted.

SimpleDBM maintains the status of only active transactions, and also does not tag tuples with the IDs of transactions. Hence, it is appropriate to use the Lock Manager solution in SimpleDBM.

I mentioned before that there were two issues related to Space Management that I wanted to discuss. The second issue is related to logging of free space map pages. I am still working on this issue, and have not reached any conclusions yet. I will blog on this subject once I have completed work on the Tuple Manager. In the meantime, if you are interested in the logging issues, I suggest reading [1].

Monday, December 05, 2005

December Priorities

The priority for December is to complete the documentation and create sample programs. Both of these are intended to help new developers to get started with SimpleDBM.

So far, I have created a BTreeDemo sample - here is a screenshot. The demo program allows a user to interactively manipulate a BTree index. The program creates two threads and allows user to switch from one to the other. This is a good way to simulate locking and concurrency issues.

I am working on a Transaction Manager demo, which will be based upon the One Bit Resource Manager described in section 10.3.7.2 of the classic book Transaction Processing: Concepts and Techniques, by Jim Gray and Andreas Reuter.

The developer's guide is coming along slowly. The latest version can always be found here.

The other theme of this month is to carry on with more BTree test cases. I am reluctant to start work on anything else until I am completely satisfied with the testing.

I mentioned in my previous post that I would need to enhance the BTree module to allow generic multiple attribute index keys. This is now done, and a sample implementation of multiple attribute keys is available as part of the BTreeDemo program.