Monday, July 09, 2012

Performance optimisations in typesystem

The new typesystem in v2 will support only the following types:
  • bool 
  • int 
  • long
  • double
  • utf8 string
  • varbyte (raw)
The older version had types like BigInteger, BigDecimal, Date, VarChar.
The new typesystem attempts to stay closer to native types, both for performance reasons and so that ports to C-like languages are easier.

Another major change is the way row data is serialised. The older typesystem did not store the metadata associated with each column, so a separate data dictionary was required to deserialise the data. The new typesystem encodes the types in the serialised format, so a row can be reconstructed from the serialised data without reference to a data dictionary.

What is unchanged is the status byte per column. Previously this stored only the kind of value in the column, i.e., Null, PlusInfinity, MinusInfinity or Value. In the new version I am hoping to expand the use of the status byte to encode 3 things: the value type, a 3-bit field whose meaning depends on the data type, and the data type itself:
  • value type - only Null or Value, taking 1 bit
  • If int, long or double, the number of bytes used to store the value - encoded in 3 bits
  • If bool, the bool value encoded in 1 bit (overlaid on the 3 bits above, 2 bits unused)
  • If utf-8 string or varbyte, 1 bit to encode whether the data is zero length or not (overlaid on the 3 bits above, 2 bits unused)
  • The remaining 4 bits encode the type of the data.
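
To make the bit budget concrete, here is a minimal Java sketch of how such a status byte could be packed and unpacked. The post does not fix bit positions or numeric type codes, so the layout (type in the top 4 bits, the shared 3-bit field next, the Null flag in the lowest bit), the TYPE_* constants, the class name StatusByte, and the choice to store the byte count minus one (so that 1 to 8 bytes fit in 3 bits) are all assumptions made for illustration.

```java
/** Sketch of a per-column status byte; layout and names are assumptions. */
final class StatusByte {
    // Hypothetical type codes; the post does not assign numeric values.
    static final int TYPE_BOOL = 0, TYPE_INT = 1, TYPE_LONG = 2,
            TYPE_DOUBLE = 3, TYPE_STRING = 4, TYPE_VARBYTE = 5;

    // Assumed layout, most significant bit first: tttt sss n
    private static final int NULL_BIT = 0x01;   // n: 1 = Null, 0 = Value
    private static final int AUX_SHIFT = 1;     // sss: shared 3-bit field
    private static final int AUX_MASK = 0x07;
    private static final int TYPE_SHIFT = 4;    // tttt: data type
    private static final int TYPE_MASK = 0x0F;

    static byte pack(int type, int aux, boolean isNull) {
        if ((type & ~TYPE_MASK) != 0 || (aux & ~AUX_MASK) != 0) {
            throw new IllegalArgumentException("field out of range");
        }
        return (byte) ((type << TYPE_SHIFT) | (aux << AUX_SHIFT) | (isNull ? NULL_BIT : 0));
    }

    static byte forNull(int type) {
        return pack(type, 0, true);
    }

    // int, long, double: 1..8 payload bytes, stored as byteCount - 1 (assumption).
    static byte forNumber(int type, int byteCount) {
        if (byteCount < 1 || byteCount > 8) {
            throw new IllegalArgumentException("byteCount must be 1..8");
        }
        return pack(type, byteCount - 1, false);
    }

    // bool: the value itself lives in the lowest bit of the shared field.
    static byte forBool(boolean value) {
        return pack(TYPE_BOOL, value ? 1 : 0, false);
    }

    // utf-8 string or varbyte: lowest bit of the shared field marks non-empty data.
    static byte forVariable(int type, boolean empty) {
        return pack(type, empty ? 0 : 1, false);
    }

    static boolean isNull(byte status) { return (status & NULL_BIT) != 0; }
    static int type(byte status)       { return (status >> TYPE_SHIFT) & TYPE_MASK; }
    static int aux(byte status)        { return (status >> AUX_SHIFT) & AUX_MASK; }
    static int byteCount(byte status)  { return aux(status) + 1; }
    static boolean boolValue(byte status) { return (aux(status) & 1) != 0; }
    static boolean isEmpty(byte status)   { return (aux(status) & 1) == 0; }
}
```

With this layout a reader can decide from a single byte whether any payload follows at all: a Null column, a bool, or a zero-length string needs no further bytes.
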
One of the performance killers in the old version is the complete deserialization of data whenever a row is read into memory. This is costly because the overhead of parsing certain types such as String, BigInteger, or BigDecimal is huge. The new version will try to avoid parsing the data whenever possible.

We all know these days that immutable objects are good for multi-threaded applications as they allow us to share data without synchronisation. The old typesystem relies heavily on immutable objects, which is one reason for parsing the row immediately upon deserialization. The problem with lazy parsing is that state must be maintained, and fields initialised upon first access - this makes the row itself mutable, and hence not thread-safe.

The solution I am adopting for this is to create separate types. The Row type is immutable, but cannot directly access the bytestream of serialised data. A new type called RowReader is designed to mirror the access methods of Row, but to do this over the bytestream. This type is not thread-safe - the caller must ensure that it is not shared between threads in an unprotected manner. We shall also have a RowBuilder for constructing Row objects incrementally; the RowBuilder is also not thread-safe, and must not be shared across threads in an unprotected manner.
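
The split between the three types might look roughly like the Java sketch below. Only the names Row, RowReader and RowBuilder come from the design above; every method name, the RowCodec helper, and the toy wire format (a column-count byte, then per column a tag byte 'I' or 'S' followed by a 4-byte int or a 2-byte length plus UTF-8 bytes) are assumptions chosen to keep the example short, and are not the real serialised format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Immutable once built, so it can be shared between threads freely. */
final class Row {
    private final Object[] values;
    Row(Object[] values) { this.values = values.clone(); }
    int getInt(int col) { return (Integer) values[col]; }
    String getString(int col) { return (String) values[col]; }
}

/** Mutable accumulator; not thread-safe. */
final class RowBuilder {
    private final List<Object> values = new ArrayList<>();
    RowBuilder addInt(int v) { values.add(v); return this; }
    RowBuilder addString(String v) { values.add(v); return this; }
    Row build() { return new Row(values.toArray()); }
}

/** Hypothetical helper that writes the toy format used by this sketch. */
final class RowCodec {
    static byte[] encode(Object... values) {
        ByteBuffer buf = ByteBuffer.allocate(1024); // fixed capacity, sketch only
        buf.put((byte) values.length);
        for (Object v : values) {
            if (v instanceof Integer) {
                buf.put((byte) 'I').putInt((Integer) v);
            } else {
                byte[] utf8 = ((String) v).getBytes(StandardCharsets.UTF_8);
                buf.put((byte) 'S').putShort((short) utf8.length).put(utf8);
            }
        }
        byte[] out = new byte[buf.position()];
        buf.flip();
        buf.get(out);
        return out;
    }
}

/** Mirrors Row's accessors over the serialised bytes; caches offsets, so not thread-safe. */
final class RowReader {
    private final ByteBuffer data;
    private final int[] offsets; // start of each column, filled in lazily
    private int located = 0;     // number of columns whose offset is known
    private int scanPos = 1;     // byte 0 holds the column count

    RowReader(byte[] bytes) {
        this.data = ByteBuffer.wrap(bytes);
        this.offsets = new int[data.get(0)];
    }

    // Walks forward only as far as needed, skipping payloads without parsing them.
    private int locate(int col) {
        while (located <= col) {
            offsets[located++] = scanPos;
            byte tag = data.get(scanPos);
            scanPos += (tag == 'I') ? 5 : 3 + data.getShort(scanPos + 1);
        }
        return offsets[col];
    }

    int getInt(int col) { return data.getInt(locate(col) + 1); }

    String getString(int col) {
        int off = locate(col);
        int len = data.getShort(off + 1);
        return new String(data.array(), off + 3, len, StandardCharsets.UTF_8);
    }

    // Materialises an immutable Row when the data has to be shared.
    Row toRow() {
        RowBuilder b = new RowBuilder();
        for (int i = 0; i < offsets.length; i++) {
            if (data.get(locate(i)) == 'I') b.addInt(getInt(i)); else b.addString(getString(i));
        }
        return b.build();
    }
}
```

In this sketch the mutable state (the cached offsets) is confined to RowReader, so a thread that needs only one or two columns can read them straight from the bytes, while anything handed to another thread goes through toRow() and becomes an immutable Row.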