Wednesday, November 29, 2006

Why another database manager?

A friend asked me recently why I spend my time implementing a DBMS when there are already a number of open source databases. It is a good question, because clearly I am not breaking any new ground here. In fact, I find myself incapable of inventing great new technology. Most of what I am implementing is well-known stuff. I am a software engineer rather than a scientist, going by C.A.R. Hoare's definitions of engineers and scientists. It seems that this project is pure self-indulgence, when I could be spending my time more fruitfully, either contributing to projects like Apache Derby, or working on something more relevant.

I guess that if I am honest with myself, I have to admit that there is an element of self-indulgence here. But there is also some utility. The knowledge I gain in the process benefits me in my work life. Apart from that, I think that my DBMS implementation is better documented and easier to understand than other open source implementations. This is for a couple of reasons:
  1. I use well-established algorithms, which are well documented in computer science literature. I am also putting more effort into documentation than is typical of many open source projects.
  2. The system is decomposed into well-defined modules which are loosely coupled. I find most other implementations far more integrated, and therefore difficult to understand. I have traded off performance and efficiency in favour of ease of understanding.
There are still no books that describe how to build a real DBMS and also show, with real code, how DBMS features such as transactions, locking, recovery, B-trees, etc. work. The only book that comes close is Transaction Processing: Concepts and Techniques, but the sample code in that book is not complete. In some ways, my project provides a sample implementation of many of the techniques described in it.

Finally, there is the question of pride. When I started this project many years ago in C++, I never thought I could do it. I had no training in this field, and no access to people who do this type of stuff at work. Having come this far, it seems a shame to give up.

Tuesday, November 28, 2006

Premature Optimisation

Here is an example of C.A.R. Hoare's statement that premature optimization is the root of all evil.

In implementing the Lock Manager in SimpleDBM, I was concerned about the impact thread synchronisation would have on the scalability of the Lock Manager. The Lock Manager uses a hash table to speed up lookups of the locks. Many implementations of Lock Managers synchronize the entire hash table while manipulating locks. There is an issue within the Apache Derby project discussing the scalability problems posed by such global synchronisation. The SimpleDBM implementation does not use a global lock. Instead it uses a custom hash table, where every hash bucket contains a ReentrantLock. The hash bucket lock is acquired before any traversal or modification is performed on the bucket's chain. I went even further and put a ReentrantLock into each Lock Header object. A Lock Header is required for each active lock in the system. The Lock Header maintains a queue of lock requests (clients) for the lock. Lock Headers are the objects that get put on the hash chains.

The idea was that the hash bucket lock could be released early by obtaining a lock on the Lock Header, once the Lock Header had been found. This would increase concurrency by reducing the duration for which the hash bucket needs to be kept locked.
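Here is a minimal sketch of this structure (the names are illustrative, not the actual SimpleDBM code):

import java.util.concurrent.locks.ReentrantLock;

public class LockHashTable {

    static class Bucket {
        final ReentrantLock lock = new ReentrantLock();
        LockHeader chain;   // head of the linked list of Lock Headers
    }

    private final Bucket[] buckets;

    public LockHashTable(int size) {
        buckets = new Bucket[size];
        for (int i = 0; i < size; i++) {
            buckets[i] = new Bucket();
        }
    }

    LockHeader find(Object lockable) {
        Bucket b = buckets[(lockable.hashCode() & 0x7fffffff) % buckets.length];
        b.lock.lock();
        try {
            for (LockHeader h = b.chain; h != null; h = h.next) {
                if (h.lockable.equals(lockable)) {
                    // Acquiring h.lock here would let us release the
                    // bucket lock early, as described above.
                    return h;
                }
            }
            return null;
        } finally {
            b.lock.unlock();
        }
    }
}

class LockHeader {
    final Object lockable;                          // the locked item
    LockHeader next;                                // hash chain link
    final ReentrantLock lock = new ReentrantLock(); // per-header lock

    LockHeader(Object lockable) {
        this.lockable = lockable;
    }
}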

In my attempt to further reduce locking, I tried to find ways of reducing the time for which the hash bucket lock was held. One such optimisation involved setting a field in the Lock Header to null rather than manipulating the hash bucket chain when releasing a lock. This code worked fine until the day I tried testing in a multi-processor environment (an Intel Core Duo), when I found that one of my test cases started failing. This test case was designed to stress concurrent operations on the same hash bucket. The problem was that in attempting to optimise, I had broken the code. The optimisation was flawed because one section of the code modified the field in the Lock Header while holding the Lock Header lock, while another section searched the hash bucket while holding only the hash bucket lock, and that search inspected the very field being updated.

A lesson from this is that multi-threaded code must be tested on a multi-processor machine.

Anyway, when I hit the bug described above, I decided that it was time to revisit the thread synchronisation code in the Lock Manager. I realised that I was attacking the wrong problem in my tuning efforts, because there were bigger scalability issues to address than the one I was trying to fix.

The first problem was memory usage. I was using too many ReentrantLocks: one per hash bucket, plus one for each Lock Header. In a system with 10,000 concurrent active locks, the total number of ReentrantLocks would be 10,000 plus the number of buckets in the hash table. If I wanted to make the Lock Manager scalable, I needed to reduce the amount of memory used per lock.

The second problem was that the hash table was not dynamic. Its size was fixed at the time the Lock Manager was created. This was likely to cause severe performance and scalability problems, because it is virtually impossible to choose the correct hash table size statically, and a poorly sized table causes more hash collisions, leading to long hash chains and greater contention between threads trying to access the same bucket.

I decided that locking at the level of Lock Headers was not giving me sufficient bang for the buck, whereas if I made the hash table dynamic and used Java's built-in monitors instead of ReentrantLocks to synchronise the buckets, the system's overall efficiency and scalability would be greater than what my current code was achieving. The increase in the number of hash buckets would reduce hash collisions, and thereby lock contention amongst threads. The reduction in the number of ReentrantLocks would reduce memory usage and therefore make the Lock Manager more scalable. The refactored code would also be simpler and easier to maintain.
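After the refactoring, the bucket synchronisation reduces to something like this sketch (reusing the LockHeader from the earlier sketch, minus its per-header ReentrantLock; again, the names are illustrative):

public class DynamicLockTable {

    static class Bucket {
        LockHeader chain;
    }

    // Replaced with a larger array when the table grows.
    private volatile Bucket[] buckets;

    LockHeader find(Object lockable) {
        Bucket[] tab = buckets;    // volatile read: see the latest table
        Bucket b = tab[(lockable.hashCode() & 0x7fffffff) % tab.length];
        synchronized (b) {         // intrinsic monitor, no ReentrantLock
            for (LockHeader h = b.chain; h != null; h = h.next) {
                if (h.lockable.equals(lockable)) {
                    return h;
                }
            }
        }
        return null;
    }

    // When the average chain length crosses a threshold, allocate a
    // larger bucket array and rehash - the dynamic part of the design.
}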

Thursday, November 16, 2006

Lock Manager updates

SimpleDBM's Lock Manager is based upon the algorithms described in Transaction Processing: Concepts and Techniques. Until now, the Lock Manager did not have support for deadlock detection; a simple timeout mechanism was used to abort transactions caught in a deadlock. Yesterday I decided to start implementing a deadlock detector. As I was researching this, I discovered a problem in the way rollback to a savepoint is implemented by the Transaction Manager. When rolling back to a savepoint, any locks acquired after the savepoint are released. However, any locks released after the savepoint are not re-acquired. There are times when locks can be released early; for example, when using cursor stability mode, the lock on the current row is released when the cursor moves to the next row. When rolling back to a savepoint, we need to restore cursors to their status as at the time of the savepoint, and ensure that the cursors reacquire locks on the current row. Since SimpleDBM does not have cursors yet, this is not a problem right now. The difficulty is how to implement this type of interaction between cursors and the Transaction Manager. It seems to me that the Transaction Manager needs to know which cursors are being used by a transaction, and, for each cursor, save the current row in the savepoint. When the transaction rolls back to a savepoint, the saved information can be used to reposition the cursors to the correct rows and reacquire locks. Since I want to avoid this type of close dependency between the Transaction Manager and other modules, it will most likely be necessary to implement the Observer pattern.

Speaking of the Observer pattern, I had to recently add a LockListener interface to the Lock Manager module. This was required to allow me to write better test cases. Designing test cases where multiple threads interact in a certain way is complicated. Earlier I used a simple strategy - threads went to sleep for certain intervals, and the pauses between the thread executions were timed just so that the correct thread interaction could be achieved. Unfortunately, this meant that the test cases often had a number of sleeps in them, which slowed everything down. Also, I dislike the approach because it is a hack. In some of the new and revised test cases, I now use the LockListener facility to identify wait points within the locking subsystem. Thus an event is generated when the lock requester is about to start waiting for a lock. This event is trapped by the test cases to synchronize between multiple concurrent threads of activity.
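To illustrate, a test can now synchronise on the wait event roughly as follows; note that the listener registration and callback names here are invented for the example, not the actual LockListener API:

import java.util.concurrent.CountDownLatch;

final CountDownLatch waitStarted = new CountDownLatch(1);

// Register a listener with the Lock Manager (hypothetical method names).
lockMgr.addLockListener(new LockListener() {
    public void beforeLockWait(Object owner, Object lockable) {
        waitStarted.countDown();    // signal: a requester is about to block
    }
});

// Thread 1 requests a conflicting lock and blocks inside the Lock
// Manager, which fires the event above.

// Thread 2 waits for the event instead of sleeping for a guessed
// interval, and then proceeds deterministically:
waitStarted.await();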

The deadlock detection algorithm I am using is based upon the simple deadlock detector described in the Transaction Processing book alluded to before. However, the code presented in the book requires tight coupling between the Lock Manager and the Transaction Manager, whereas in my implementation, the Lock Manager has no knowledge of the Transaction Manager. Loose coupling makes code more flexible, but there is a cost. Tight coupling can reduce memory consumption quite a bit and improve efficiency, as there is less need for redundant information. For example, in the book, each Transaction contains a pointer to the lock request that is being waited for, and the Lock Manager freely accesses this information. In SimpleDBM, the Lock Manager maintains a separate HashMap where lock waits are recorded, so that the Lock Manager can navigate the lock waits graph without any knowledge of the clients invoking the Lock Manager.
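A sketch of the idea - the detector simply follows the waits-for edges recorded in the map, looking for a cycle (simplified; the real graph must also deal with locks that have multiple holders):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Maps a waiting requester to the requester it is waiting behind.
Map<Object, Object> lockWaits = new HashMap<Object, Object>();

boolean createsDeadlock(Object requester) {
    Set<Object> visited = new HashSet<Object>();
    Object current = requester;
    while (current != null) {
        if (!visited.add(current)) {
            return true;    // revisited a node: cycle, hence deadlock
        }
        current = lockWaits.get(current);   // follow the waits-for edge
    }
    return false;   // reached a requester that is not waiting: no cycle
}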

Saturday, November 04, 2006

New source repository for SimpleDBM

SimpleDBM has a new source code repository. I moved the source code from ObjectWeb's repository to Google Code because I found the ObjectWeb repository somewhat slow and unresponsive. The SimpleDBM web pages are still hosted at ObjectWeb.

In the past few months, SimpleDBM's source code repository has moved from java.net to ObjectWeb to Google Code, which I hope will be its final resting place. Compared with ObjectWeb and java.net, Google Code has the usual Google stamp: a clean, uncluttered interface and relatively fast response. Another thing I like is that as soon as code is checked in, it appears on the web site.

Thursday, November 02, 2006

Package structure

There has been a revision in the package structure within SimpleDBM.

In the earlier structure, the API and the implementation of a module were kept together. Hence, if there was a module named test, the packages would have been org.simpledbm.rss.test for the API, and org.simpledbm.rss.test.impl for the implementation.

The current structure favours separating the whole of the API from the implementation. In the new structure, all APIs are under org.simpledbm.rss.api.<module>, and all implementations are under org.simpledbm.rss.impl.<module>. I decided to separate the API from the implementation in this way for two reasons:
  1. It would be easier to generate JavaDoc for the API alone.
  2. The API could be packaged separately in its own jar file without too much effort.
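To illustrate the new layout, a hypothetical locking module (using "locking" as the module name here) would look like this:

org.simpledbm.rss.api.locking    - the public interfaces of the module
org.simpledbm.rss.impl.locking   - the corresponding implementation classes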

Sunday, October 29, 2006

Unit testing and MVCC in Berkeley DB

After a break of several months, I am about to start working on SimpleDBM again. I am determined not to add any new functionality until existing functionality is thoroughly tested and fully documented.

Currently I am working on improving the unit test cases for the BTree module. This is easily the most complex module in SimpleDBM, and producing comprehensive unit test cases is a significant task in its own right. Anyhow, the effort will be worthwhile as a database system is only useful if it is completely reliable.

It was exciting to learn that Oracle has added support for Multi-Version Concurrency in Berkeley DB. I haven't looked at the implementation in great detail, but a cursory look seems to indicate that the changes are mainly in the Buffer Pool. To support MVCC, the Buffer Pool has been enhanced to hold multiple versions of pages. Readers can obtain a snapshot of the database using older versions of pages, and thus avoid obtaining locks on the rows being read. Unlike Oracle's own implementation, which reconstructs older versions of pages using the undo log, the Berkeley DB implementation appears to be a simple memory-based versioning solution. The downside is that this may not scale to large workloads, since memory consumption will grow substantially if the number of pages being versioned increases significantly.
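To make the idea concrete, here is a toy sketch of what page versioning in a buffer pool might look like - my guess at the general shape, not Berkeley DB's actual design:

import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

public class VersionedBufferPool {

    static class Page { /* page contents omitted */ }

    // Each page id maps to its versions, keyed by the snapshot
    // number of the transaction that created the version.
    private final ConcurrentHashMap<Integer, TreeMap<Long, Page>> pool =
            new ConcurrentHashMap<Integer, TreeMap<Long, Page>>();

    // A reader sees the most recent version created at or before its
    // snapshot, without blocking writers that create newer versions.
    Page getPage(int pageId, long snapshot) {
        TreeMap<Long, Page> versions = pool.get(pageId);
        if (versions == null) {
            return null;
        }
        synchronized (versions) {
            SortedMap<Long, Page> visible = versions.headMap(snapshot + 1);
            return visible.isEmpty() ? null : visible.get(visible.lastKey());
        }
    }
}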

Sunday, June 11, 2006

SimpleDBM moving to ObjectWeb

SimpleDBM is moving to ObjectWeb. The reasons for this move are:
  1. ObjectWeb specializes in building middleware solutions. I hope to build a community of users and developers who are like-minded and interested in middleware technologies.
  2. Java.net is run by Sun Microsystems and has too much focus on Sun; in some ways it is a propaganda tool for them. I wanted a more open environment for SimpleDBM, where every project has the same status.

On a related note, I am migrating version control from CVS to Subversion.

Sunday, May 07, 2006

Creating Oracle DataSource in Glassfish

To setup an Oracle Datasource in Glassfish, follow these steps:

1. Copy the Oracle JDBC drivers to the /glassfish/domains/domain1/lib/ext directory. You may need to change /glassfish/domains/domain1 to match your Glassfish installation and domain name.

2. Start Glassfish.

3. Log in to the admin interface and create a JDBC Connection Pool.
Delete all existing properties, and add the following:
  user - set this to the Oracle userid
  password - set this to the Oracle password
  URL - set this to the JDBC URL, for example jdbc:oracle:thin:@localhost:1521:xe
  xa-driver-does-not-support-non-tx-operations - set this to true
Test the connection pool using ping.

4. Create a JDBC DataSource using the Connection Pool.
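To verify the setup from application code, you can look up the DataSource via JNDI. In this sketch, the JNDI name jdbc/OracleDS is just an example - use whatever name you gave the resource in step 4 (and note that, depending on the context, you may need the java:comp/env/ prefix):

import java.sql.Connection;
import javax.naming.InitialContext;
import javax.sql.DataSource;

InitialContext ctx = new InitialContext();
DataSource ds = (DataSource) ctx.lookup("jdbc/OracleDS");
Connection con = ds.getConnection();
try {
    // use the connection
} finally {
    con.close();
}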

Wednesday, April 05, 2006

Testing

Due to changes in my work life, I have not been able to devote much time to SimpleDBM in recent weeks. Things are slowly getting back to normal, and I am beginning to work on some of the outstanding tasks. The main focus is to provide a usable RSS component, which is feature complete but needs more test cases and more testing.

Although test coverage is a useful measure of the effectiveness of test cases, it does not help you much with issues related to multi-threading. Sometimes, obscure bugs are discovered by chance - for example, I discovered a bug in my BTree test harness when I disabled some diagnostic logging. The small time difference caused by this change led to a different interaction between threads, and exposed the bug.

I want to devote more time to testing the RSS components because they provide core services that are fundamental to the rest of the system. Higher level components need to be able to rely completely on the correct functioning of RSS.

Wednesday, March 29, 2006

Oracle JDeveloper as JSF Editor

After trying out various IDEs for editing Java Server Faces (JSF) files, I have finally settled on Oracle's JDeveloper as the best tool for the purpose. I had assumed that JDeveloper would insist on using ADF, but found to my pleasant surprise that it also allows you to design standard JSF pages in a visual manner. Unlike Sun Java Studio Creator which creates its own backing bean class, JDeveloper gives you full control, and lets you decide whether you want all/some/none of your UI components to be in your backing bean.

The Java IDE marketplace is such that no single tool meets all your needs. I find Eclipse good for general-purpose Java programming but not much use for J2EE stuff. NetBeans doesn't work for me - I have tried to use it several times, but each time I have given up in frustration because I either cannot resolve some weird error or cannot get the IDE to do something simple. For example, when I try to edit a JSF file in NetBeans, it shows code completion for JSP tags, but not for JSF. I don't have the time to work out why.

Thursday, March 09, 2006

Servlets and dependency injection

There is a very useful thread at the Glassfish discussion forum on why it is not a good idea to use dependency injection to obtain references to EJBs or EntityManagers in Servlets.

Remote or Local interface?

EJB 3.0 takes away much of the pain associated with maintaining the Local and Remote interfaces for Session Beans. However, there are still some issues that you need to be aware of.

By design, the EJB specification disallows the same Business interface from being marked as both @Local and @Remote (this restriction was added in a revision to the PFD). The reason for this is that remote interfaces have different semantics to local interfaces, and it is usually inappropriate for the same method to be exposed through a remote interface as well as a local one. The problems are two-fold:

1. Remote interfaces must typically be designed to be coarse-grained.
2. Remote interfaces do not support the pass-by-reference parameter semantics that local interfaces do.

Having separate Remote and Local interfaces forces designers to think about how the interfaces will be used and to ensure that the design is optimised for the use case. It also reduces the chances of errors caused by incorrect semantics, such as clients relying upon the ability to pass parameters by reference.

A drawback to this approach is that it does not allow transparent redeployment of an EJB from a co-located environment to a distributed environment or vice-versa. While such re-deployment has its dangers, there are times when it can prove useful to have such a facility.

Ideally, you should define your Session Beans to have either a remote or a local interface. This will keep the design simple and allow you to implement Business Logic as POJOs.

If you do want to endow the same Session Bean with both remote and local interfaces, then I suggest you use the following example as a model:
import java.util.List;

import javax.ejb.Local;
import javax.ejb.Remote;

public interface CustomerService {

    void createCustomer(District d, Customer c);

    void removeCustomer(Long id);

    List<Customer> findByDistrict(District d);

    @Remote
    public interface IRemote extends CustomerService {
    }

    @Local
    public interface ILocal extends CustomerService {
    }
}
Note that the local and remote interfaces extend a common business interface, and that they are nested within it. I like this model because it reduces clutter, keeps related interfaces together, and eases understanding.

When using dependency injection, you can specify explicitly whether you want the remote or local interface. For example:
@EJB(beanInterface = services.DistrictService.IRemote.class)
public final void setDistrictService(DistrictService districtService) {
    this.districtService = districtService;
}

I also suggest that when specifying dependency injection, always annotate your setters rather than instance variables. This will allow you to use your beans outside the J2EE container, for example in a Spring Framework environment.

EJB 3.0 Packaging - Entities

I am looking at the various EJB 3.0 packaging options and trying to determine best practice. My first recommendation is to create separate projects for your entities and package them in independent jar files. This is important for a number of reasons.

Firstly, you will want to share your entities across multiple projects, and maintaining them independently makes that easier.

Secondly, the teams that define the entities will probably work closely with the people responsible for defining the database Entity Model. It will be easier if these teams can work independently of other teams.

Last but not least, it is not possible to define Entities in a portable manner. The main portability issues are with the way Surrogate Keys are generated using Sequence Generators. Having separate projects for your entities will allow you to create drop-in replacements for different database vendors. So you can have one set of entities for Oracle, another for Apache Derby, and so on. For an example of this, please have a look at the EJB3Demo project.

When you create your application archives, place the Entity jar files in the lib sub-directory within the archive. This will ensure that the entity definitions are available to all your EJBs and Web applications within the application. This article by Sahoo explains the benefits of putting your entity jar files in the lib directory.

Thursday, February 23, 2006

More EJB 3.0

For the past few days I have been working on a number of things. I shall blog about them in more detail when I have some time, but in the meantime, here is a summary:

1. Packaging issues. EJB 3.0 brings hassle-free packaging, however, it is still not easy to understand what the rules are. A discussion on packaging issues can be found here and here.

2. Semantics of merge() and persist() and how they are implemented in Glassfish is being discussed here.

3. Does EJB live up to its promise to bring POJO style programming to J2EE? I previously blogged about the fact that EJB 3.0 Entities are not really POJOs because they contain hidden semantics. See here for a discussion of dependency injection of EJBs and whether it is possible to implement Business Logic as POJOs.

Monday, February 20, 2006

Tuple Manager implementation is feature complete

I recently completed the implementation of Tuple Scans. Unlike Index Scans, which use next-key locking to avoid phantom reads, the Tuple Scan implementation does not protect the gap between one tuple and the next. This means that Tuple Scans cannot provide strictly Serializable behaviour. I wonder if this will be an issue later on.

Wednesday, February 15, 2006

Sharing attributes across Entities

One way to share common attributes across entities is to use inheritance, and put the common attributes in a super class. Another way is to abstract the common attributes into a separate Embeddable class and then embed it within the regular Entity classes. The latter approach has been used for some of the entities in the EJB3Demo project.

The entities Customer, District and Warehouse all contain some very similar columns for storing address information. I created a new class called Address to hold this common data. The Address class is marked as @Embeddable and contains the getters/setters for the attributes. Here is an extract:
@Embeddable
public class Address implements Serializable {

    String street1;
    String city;

    @Column(length = 20)
    public String getCity() {
        return city;
    }

    @Column(length = 20)
    public String getStreet1() {
        return street1;
    }
}
Note that the getters have been annotated using @Column - this is useful for capturing the bits that are shared across the entities.

When you embed the Address object within an Entity, you can fine-tune the column mapping. Example:
@Entity
public class Customer {

    Address address;

    @Embedded
    @AttributeOverrides({
        @AttributeOverride(name = "street1", column = @Column(name = "C_STREET_1")),
        @AttributeOverride(name = "city", column = @Column(name = "C_CITY"))
    })
    public Address getAddress() {
        return address;
    }
}
The getter for the Address object is decorated with the @Embedded annotation. The address object can be shared amongst multiple entities, and each entity can define its own column names, etc.

When you access embedded objects in EJBQL, you use the property name in the Query statement. For example, to access the city field in Customer, you would write:
SELECT c.address.city FROM Customer c
This syntax is also used when you have relationship mappings. For example, a District belongs to a Warehouse, and hence the District object contains an instance of the Warehouse object. To access the fields within Warehouse, you would write:
SELECT d.warehouse.name FROM District d
Although one could use inheritance to model the above, I prefer the Embeddable class option, as it more accurately reflects what is happening here. A Customer is not an instance of Address - a Customer has an Address. In this situation, a Has-A relationship is more natural than an Is-A relationship.

EJB3Demo project

I have started a new project at www.java.net. The goal of this project is to try out various features of Java EE 5.0, such as EJB 3.0, JAX-WS, JAXB, etc., and identify design/code patterns that result in:
  1. Improved portability and reusability of code, in particular, allow Business Logic and Data Access logic to be deployed inside and outside J2EE containers.
  2. Better separation of concerns while exploiting new persistence features of EJB 3.0.

I hope to blog about some of this stuff soon - stay tuned. In the meantime, I would welcome anyone who would like to contribute to the project.

The source code for the project is available for download from https://ejb3demo.dev.java.net/servlets/ProjectDocumentList.

Monday, February 06, 2006

Reusability of Code

I have given up trying to get the TPCC schema to work as defined, i.e., using composite keys. Glassfish at present does not support Id Generators with composite keys, which poses a problem, as it means writing extra code for generating keys. As a workaround, I have modified the schema to use surrogate primary keys.

EJB persistence design favours Entity Relationships based upon surrogate primary keys.

One of the great benefits of EJB 3.0 is that it gets rid of all the boiler plate EJB code, allowing you to code Entity and Session beans as if they were POJOs. This raises the expectation that in this new world, it should be possible to write reusable code that is environment agnostic. By this I mean that I should be able to reuse my J2EE Business Logic classes and the Persistent classes in a J2SE or Servlet Container (Tomcat) environment without having to change any code.

There is an immediate stumbling block to achieving this, however. The unfortunate decision to introduce a new abstraction for Transaction Management in the J2SE environment means that transaction management code is not portable. The first rule for writing reusable code, therefore, is to avoid transaction management code in your classes. I will report on other pitfalls as I progress with my attempts to create reusable code.

Saturday, January 28, 2006

On licensing

SimpleDBM is licensed under GPL V2 or later. I decided to use the GPL because I believe in the values that the GNU movement stands for. It is a pity that so much FUD is generated regarding the GPL, and a greater pity that there is such a proliferation of open source licences. If the GPL were business-unfriendly, Linux would never have been successful.

When GPL V3 finally comes out, I will adopt it for SimpleDBM.

On a side note, I finally managed to get around to implementing a few things that were long on my TODO list:
  1. The Log Manager now automatically deletes older archive log files.
  2. There is a background thread for generating Checkpoints.
  3. Rollback to a Savepoint will discard PostCommitActions that were scheduled after the Savepoint was created. This means that if you create a Savepoint, drop a container, and then rollback to the Savepoint, the drop action will be discarded.

Thursday, January 26, 2006

@GeneratedValue limitations

When I saw that the specification of sequence generation had been separated from the @Id annotation, I thought this meant that sequence generators could be used for any column, not just for surrogate primary keys. This, however, does not appear to be the case, at least in Glassfish: if you try to use a sequence generator on any column that is not defined as a surrogate primary key, you get an Exception. You cannot use sequence generators in composite primary keys either.

Update: I could be wrong here because I am getting inconsistent results. More details are in the bug report I have filed with the Glassfish team.

Monday, January 23, 2006

Priorities for January

This month I am working on finishing the Developer's Guide, and I also plan to update the Javadoc documentation. While updating the documentation, I realized that I should not have used the term BTree when defining the interface for the Index Module. A BTree is an implementation strategy for indexes; it is therefore better to use a more generic term when specifying the interface. I am refactoring the code to correct this.

My priorities for January and February are to:
  1. Complete the documentation.
  2. Augment JUnit test cases.
  3. Tie up loose ends and produce a usable RSS component.

If I complete all this by February, from March onwards I shall start working on building the next layer of the system, i.e., type system, system catalogs, tables and indexes with multiple attributes.

Mapping Relationships in EJB 3.0

Rather than create my own schema for testing EJB 3.0 features, I decided to use the schema used in TPC-C Benchmarks. This approach has the benefit that it forces me to design an object-relational mapping for a pre-existing schema, which is probably what most developers will do in practice.

If you are designing a new schema, then bear in mind that EJB 3.0 favours a design where each table has a surrogate primary key, i.e., a meaningless numeric key that is either auto-generated or generated using a sequence. Using surrogate primary keys makes it easier to map the primary key of the table, as you can simply mark the relevant field using the @Id attribute. It also makes it easier to map relationships, as the EntityManager is able to exploit natural Has-A relationships (Composite design pattern) between objects. For example, in the TPC-C schema, an Order has a one-to-many relationship with OrderLines. This can be mapped as follows:
  1. An Order object should contain a Collection of OrderLine objects.
  2. An OrderLine object should have a reference to the Order that it is related to.

Given below is the relevant code:

@Entity
public class Order {

    Set<OrderLine> orderLines;

    @OneToMany(mappedBy = "order")
    public Set<OrderLine> getOrderLines() {
        return orderLines;
    }
}

@Entity
public class OrderLine {

    Order order;

    @ManyToOne
    @JoinColumn(name = "OL_O_ID", referencedColumnName = "O_ID")
    public Order getOrder() {
        return order;
    }
}

An important point to note here is that the OrderLine Entity does not contain an explicit field for the foreign key column "OL_O_ID" that references the primary key column "O_ID" in Order; the Persistence Engine automatically uses the appropriate field from the Order entity when inserting rows into the OrderLine table.

The situation becomes more complex when tables are not designed to use surrogate primary keys. In this scenario, it is often necessary to use composite primary keys.

I will use the Customer and District entities as examples because both have composite primary keys and there exists a many-to-one relationship between Customer and District. First of all, let's recall that to map a composite primary key, we need to create a primary key class that identifies the fields forming the composite key. This primary key class is then associated with the Entity using the @IdClass annotation. The Entity itself also contains the fields used in the primary key; these are annotated using @Id and @Column. Since I described how this works in my previous post, I will only show the resulting key fields in the Customer entity here:
@Entity
public class Customer {

    int customerId;
    int districtId;
    int warehouseId;

    @Id
    @Column(name = "C_ID", nullable = false)
    public int getCustomerId() {
        return customerId;
    }

    @Id
    @Column(name = "C_D_ID", nullable = false)
    public int getDistrictId() {
        return districtId;
    }

    @Id
    @Column(name = "C_W_ID", nullable = false)
    public int getWarehouseId() {
        return warehouseId;
    }
}

Now in order to map the relationships, we add the following:
@Entity
public class Customer {

    District district;

    @ManyToOne
    @JoinColumns({
        @JoinColumn(name = "C_W_ID", referencedColumnName = "D_W_ID"),
        @JoinColumn(name = "C_D_ID", referencedColumnName = "D_ID")
    })
    public District getDistrict() {
        return district;
    }
}

@Entity
public class District {

    Set<Customer> customers;

    @OneToMany(mappedBy = "district")
    public Set<Customer> getCustomers() {
        return customers;
    }
}

The problem now is that there are two ways of updating the foreign key columns in the Customer entity: they are mapped through the relationship to the District entity, but they are also present as fields in Customer in order to satisfy the requirements for a composite primary key. If you try to execute this code, you will encounter an Exception such as this (in Glassfish):

Multiple writable mappings exist for the field [TPCC.CUSTOMER.C_W_ID]. Only one may be defined as writable, all others must be specified read-only.

Multiple writable mappings exist for the field [TPCC.CUSTOMER.C_D_ID]. Only one may be defined as writable, all others must be specified read-only.

To resolve this problem, you need to modify the Customer Entity definition and ensure that the columns TPCC.CUSTOMER.C_W_ID and TPCC.CUSTOMER.C_D_ID are marked read-only, by setting insertable=false and updatable=false:
@Entity
public class Customer {

    @Id
    @Column(name = "C_D_ID", nullable = false,
            insertable = false, updatable = false)
    public int getDistrictId() {
        return districtId;
    }

    @Id
    @Column(name = "C_W_ID", nullable = false,
            insertable = false, updatable = false)
    public int getWarehouseId() {
        return warehouseId;
    }
}

Thursday, January 19, 2006

EJB 3.0 Entities are not POJOs

Contrary to appearances, EJB 3.0 Entities are not POJOs as they contain hidden data and functionality in order to implement some of the specification requirements. I found this out when I was stepping through some code and out of curiosity, inspected the entities returned by the EntityManager.

When you implement two Entities that are related, for example in a one-to-many relationship, you typically have a Collection in the first entity that contains references to instances of the second entity. For example, an Order entity may contain a Collection of OrderLine entities. EJB 3.0 supports "lazy fetches", which means that the data for the Collection is not fetched until required. To support this, the Collection class contains extra functionality and data - you can see this if you inspect the Entity using a debugger. When you try to access the collection, this hidden functionality is triggered and data is fetched to populate the Collection. This process carries on recursively for all referenced entities.

Lazy fetching is clearly desirable because you do not want data to be fetched unnecessarily if you do not intend to use it. However, the problem is that the Entity is not fully initialized until all the data is fetched. This poses a particular problem if you want to use Entities outside of their "managed" environment: any attempt to access the uninitialized data may fail. Section 3.2.4 of the EJB 3.0 specification (PFD) spells out exactly what you can safely access.
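For example, using the Order and OrderLine entities mentioned above, code along these lines can fail once the entity is detached - the exact failure depends on the persistence provider:

Order order = em.find(Order.class, orderId);
em.close();   // the Persistence Context is gone: order is now detached

// The order lines were never fetched, so touching the lazily loaded
// collection outside the managed environment may fail.
for (OrderLine line : order.getOrderLines()) {
    System.out.println(line);
}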

See the recent thread at the Glassfish Discussion Forums on this issue, and how it affects client access.

One of the questions that will be debated in future is whether EJB Entities should be exposed outside the Data Access layer - for example, in your Business Logic interfaces, or to your clients. Since EJB 3.0 Entities are touted as POJOs, many developers will assume that they can use these objects outside of the Data Access layer. However, such use is fraught with danger, due to the semantics of detached entities.

In my view, the EJB specification should require that detached entities are "cleaned" and made POJOs. This would lead to more predictable behaviour and fewer surprises.

Tuesday, January 17, 2006

Some comments on EJB 3.0 PFD

In a previous post, I complained about the lack of clarity in the specification of the @IdClass annotation. Well, it seems that this lack of clarity extends to anything to do with composite primary keys. Here is what the specification has to say about them:

Composite primary keys typically arise when mapping from legacy databases when the database key is comprised of several columns.

Perhaps this explains why there aren't any real examples to show how Entity relationships ought to be mapped when composite primary keys are involved.

Monday, January 16, 2006

@GeneratedValue annotation support in Glassfish

I reported previously that Glassfish does not yet support the new @GeneratedValue annotation specified in the EJB 3.0 PFD. I downloaded the latest build (b33) of Glassfish today, and while testing my code, found that the new build supports @GeneratedValue. It is a pity that one has to find this out by trial and error; the Glassfish team should put up release notes with each build, covering feature changes since the last build.

Anyway, the old syntax for the @Id annotation allowed attributes for specifying Id Generators. The new method requires the Id Generator to be specified separately using a @GeneratedValue annotation. Here is a comparison between the old approach and the new approach:

Old approach:
@Id(generate = GeneratorType.SEQUENCE, generator = "SEQ_GEN")
public Long id;
New approach:
@Id
@GeneratedValue(strategy = GenerationType.SEQUENCE,
                generator = "SEQ_GEN")
public Long id;
I think the new approach is cleaner. It also enables Id Generators to be used for fields that are not necessarily the primary key.

Overall, the fact that Glassfish is following the draft specification closely is great, as it enables developers to get familiar with the new technology. Thankfully, the specs are now close to the final version, so changes ought to be relatively small.

Talking of changes, I hit another one. The syntax for @TableGenerator has changed: the table is no longer specified using @Table; instead, table and schema are ordinary string attributes. Here's an example of the new syntax:

@TableGenerator(name = "DISTRICT_ID_SEQUENCE", 
table = "IDGENERATOR", schema = "TPCC",
pkColumnName = "SEQ_NAME", valueColumnName = "SEQ_COUNT",
pkColumnValue = "DISTRICT_ID_SEQUENCE",
allocationSize = 10)

Saturday, January 14, 2006

Multiversion Concurrency

Some time ago, I promised to write about some of the techniques used in Oracle. A very early draft of a paper on Multi-Version Concurrency is now available. It discusses the MVCC implementations in PostgreSQL and Oracle.

SimpleDBM deliberately does not implement MVCC, as I wanted to understand traditional implementations before attempting MVCC. Perhaps one day, a different version of SimpleDBM will implement it.

I am very keen to ensure that this paper is accurate in its descriptions of the PostgreSQL and Oracle implementations. I would also like to add descriptions of other DBMSes, such as Firebird and MySQL/InnoDB.

Friday, January 13, 2006

Documentation moving to LaTeX

I am ashamed to say that I just discovered LaTeX. Of course, I knew about TeX but never thought I would use it ... well, I have just converted the SimpleDBM Reference Manual to LaTeX, and I love the results. The output is so much better, and the document looks professional. Here's the link to the PDF output.

Thursday, January 12, 2006

Regarding EJB 3.0 Entity lifecycle

EJB 3.0 Entities exist in one of four states: new, managed, detached, and removed.

The state transitions of an Entity depend upon the type of operations on the Entity and the life-cycles of associated Persistence and Transaction Contexts.

When an entity is first instantiated using the Java new operator, its state is "new".

If you obtain an entity instance via EntityManager.find() or EntityManager.getReference(), or through a Query, its state depends upon whether there is an active Persistence Context or not. If there is one, then the Entity is in "managed" state, otherwise in "detached" state.

When you invoke EntityManager.persist() on a "new", "managed" or "removed" Entity, it becomes "managed". Note that you cannot invoke persist() on a "detached" Entity.

When you invoke EntityManager.merge() on a "new" or "detached" Entity, a "managed" copy of the Entity is created. You cannot invoke merge() on a "removed" Entity. Invoking merge() on a "managed" Entity leaves it in the "managed" state.
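A short example of the merge() behaviour, using the Warehouse entity from other posts; note that the managed copy is a different instance from the detached original:

// "w" is a Warehouse loaded in an earlier, now completed,
// transaction - i.e. it is detached.
Warehouse managed = em.merge(w);

managed.setCity("LONDON");   // tracked: will be written to the database
w.setCity("PARIS");          // not tracked: this change is silently lost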

When you invoke EntityManager.remove() on a "managed" Entity, its new state becomes "removed". Attempts to remove "new" or "removed" entities are ignored, however, it is an error to try and remove a "detached" entity.

If the Persistence Context is TRANSACTION scoped, then all "managed" entities within the Persistence Context become "detached" when the associated transaction ends. If the Persistence Context is scoped as EXTENDED, then the entities remain "managed" even after transaction ends.

In terms of Entity state, "new" and "detached" states are similar in that the Persistence Context has no knowledge of such entities.

The rules governing the lifecycle of Persistence Contexts are complex; I will explore them in another post.

The EntityManager interface provides two mechanisms for persisting an Entity: persist() and merge(). The differences between these two methods are subtle, and in my view, a single method would have been better and less confusing. In terms of usage, I would recommend using merge() at all times.

Tuesday, January 10, 2006

Using @IdClass in Glassfish

Given below is an example of how to use @IdClass in Glassfish. Note that the primary key class must not have any members other than the Id fields, not even serialVersionUID, even though the primary key class is defined as Serializable.

@Entity
@Table(name = "WAREHOUSE", schema = "TPCC")
@IdClass(entity.tpcc.WarehousePK.class)
public class Warehouse {

    String name;
    String city;

    @Id
    @Column(name = "W_NAME", length = 10)
    public String getName() {
        return name;
    }

    @Id
    @Column(name = "W_CITY", length = 20)
    public String getCity() {
        return city;
    }

    public void setCity(String city) {
        this.city = city;
    }

    public void setName(String name) {
        this.name = name;
    }
}

public class WarehousePK implements Serializable {

    String name = "";
    String city = "";

    public WarehousePK() {
    }

    public String getCity() {
        return city;
    }

    public String getName() {
        return name;
    }

    public void setCity(String city) {
        this.city = city;
    }

    public void setName(String name) {
        this.name = name;
    }

    @Override
    public boolean equals(Object arg0) {
        if (arg0 == this) {
            return true;
        }
        else if (arg0 instanceof WarehousePK) {
            return name.equals(((WarehousePK) arg0).name)
                && city.equals(((WarehousePK) arg0).city);
        }
        return false;
    }

    @Override
    public int hashCode() {
        return name.hashCode() ^ city.hashCode();
    }
}
The primary key class is useful when searching for objects by their primary key. Example:

WarehousePK key = new WarehousePK();
key.setName("LONDON WAREHOUSE");
key.setCity("LONDON");
Warehouse w = em.find(Warehouse.class, key);
When creating new entities or updating existing entities, you should access the properties within the entity as normal. Example:

Warehouse w = new Warehouse();
w.setName("LONDON WAREHOUSE");
w.setCity("LONDON");

Problems with EJB 3.0 @IdClass specification

Note that comments below are based upon the Proposed Final Draft specification of EJB 3.0.

I feel that the specification of @IdClass is not at all clear. There are several problems:

1. It is not clear what the class that represents the composite primary key is used for.

Is it just an informative construct designed to tell the Entity Manager which fields of the Entity represent the key?

How and when are clients expected to use this class?

From trial and error and guesswork, I have deduced that the class representing the primary key essentially serves two purposes:

a) To tell the Entity Manager which fields of an Entity correspond to the composite primary key.
b) For use as the second argument to the EntityManager.find() and EntityManager.getReference() methods.

It is also not clearly stated anywhere whether the primary key class needs to be annotated in any manner. Clearly, if it is used as @EmbeddedId then it must be annotated using @Embeddable, but if it is used as @IdClass then no annotation is needed.

2. There aren't any examples illustrating the use of @IdClass. There are a few code snippets that depict @IdClass usage, but these do not adhere to the requirements defined for primary key classes.

Section 9.1.13 shows an entity called Employee, but does not show the EmployeePK class.

Section 9.1.31 (page 190) shows a primary key class, EmpPK, but this class does not adhere to the rules - i.e., it is not Serializable and does not implement hashCode() and equals().

Monday, January 09, 2006

Using bind variables in generated SQLs

I noticed that the generated SQLs in the Glassfish/Toplink implementation of EJB 3.0 persistence use literal values instead of bind variables. Here are a few examples.

Here is the generated SQL when creating a new entity:

INSERT INTO TPCC.WAREHOUSE
(W_CITY, W_STREET_1, W_STREET_2, W_VERSION,
W_TAX, W_STATE, W_YTD, W_NAME, W_ZIP)
VALUES ('LONDON', NULL, NULL, 1, 0, NULL, 0,
'LONDON WAREHOUSE', NULL)
Now, look at the SQL generated for a query:

Code:

Warehouse w = (Warehouse) em.createQuery(
        "SELECT w FROM Warehouse w WHERE w.name = :wname")
    .setParameter("wname", "LONDON WAREHOUSE")
    .getSingleResult();
Generated SQL:

SELECT W_ID, W_CITY, W_STREET_1, W_STREET_2, 
W_VERSION, W_TAX, W_STATE, W_YTD, W_NAME, W_ZIP
FROM TPCC.WAREHOUSE
WHERE (W_NAME = 'LONDON WAREHOUSE')
Finally, have a look at the UPDATE SQL:

UPDATE TPCC.WAREHOUSE 
SET W_STREET_1 = 'Braham Street', W_VERSION = 2
WHERE ((W_ID = 1) AND (W_VERSION = 1))
Clearly, the default implementation always uses literals in SQL statements. I haven't yet found a way to tell Glassfish/Toplink to use bind variables.

Fortunately, since Glassfish and Toplink are both OpenSource, I can probably find out how to do this by looking at the source code.

Why EntityTransaction?

When managing entities outside a J2EE container, you need to obtain an instance of EntityTransaction to manage transactions. Here is an example:

EntityManager em = null;
try {
    em = Persistence.createEntityManagerFactory("em1")
            .createEntityManager();
    System.err.println("Created entity manager");
    Warehouse w = new Warehouse();
    w.setName("LONDON WAREHOUSE");
    w.setCity("LONDON");

    EntityTransaction t = em.getTransaction();

    boolean success = false;
    t.begin();
    try {
        em.persist(w);
        success = true;
    }
    finally {
        if (success) {
            t.commit();
        }
        else {
            t.rollback();
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}
finally {
    if (em != null) {
        em.close();
    }
}

I would like to understand the rationale for introducing a new abstraction for transaction management when the UserTransaction interface already exists. Introducing a different mechanism for out-of-container applications makes it harder to write code that is environment agnostic ... unless you write a wrapper to hide the differences.
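Such a wrapper could be as small as this sketch (the interface and class names are mine, not part of any API):

import javax.persistence.EntityManager;
import javax.persistence.EntityTransaction;

// A minimal abstraction over the two transaction APIs.
public interface TxControl {
    void begin() throws Exception;
    void commit() throws Exception;
    void rollback() throws Exception;
}

// Out-of-container implementation backed by EntityTransaction.
public class LocalTxControl implements TxControl {

    private final EntityTransaction t;

    public LocalTxControl(EntityManager em) {
        t = em.getTransaction();
    }

    public void begin() { t.begin(); }
    public void commit() { t.commit(); }
    public void rollback() { t.rollback(); }
}

// An in-container implementation would delegate the same three
// methods to javax.transaction.UserTransaction.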

Sunday, January 08, 2006

Experiments with EJB 3.0 @Id annotation

I have been messing around with the @Id annotation in EJB 3.0 for generating primary keys.

Firstly, there appears to be a change in the latest version (Proposed Final Draft) of the EJB 3.0 specification which is not supported in Glassfish yet. The latest version requires a separate annotation called @GeneratedValue to define the type of Id generator. I downloaded Eclipse Dali yesterday, and this seems to support the new method.

The previous syntax supported five generator types - NONE, AUTO, SEQUENCE, IDENTITY and TABLE. In the new syntax, NONE is no longer required, as it is implied if there is no @GeneratedValue annotation. The remaining four generator types are supported.

By default, if no generator type is specified, then NONE is implied. This means that the developer must set the Id property or field correctly before attempting to persist the entity.

I have had some trouble coming to terms with the SEQUENCE and IDENTITY generators. Not all databases support both of these, so it seems to me that if you annotate your Entity with either of them, you will end up with non-portable code. My recommendation is to avoid these.

The IDENTITY generator type is meant to be used when the underlying DBMS supports the IDENTITY column type - an autoincrement column maintained by the DBMS itself. For example, you can create a table in Apache Derby with a column defined as "GENERATED ALWAYS AS IDENTITY". Note that systems supporting this also provide a mechanism for retrieving the last generated IDENTITY value. This is not the equivalent of using a trigger to populate the primary key, as is sometimes done in Oracle: with the trigger approach, there is no support for retrieving the last generated value.

The SEQUENCE generator is meant for systems that support sequences; Oracle and PostgreSQL are two that I know of.
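For illustration, a sequence-based declaration would look something like the snippet below, using the pre-PFD @Id syntax that Glassfish currently supports; the sequence name is made up, and the @SequenceGenerator attributes may vary between providers and spec drafts:

@SequenceGenerator(name = "SEQ_GEN", sequenceName = "W_ID_SEQUENCE")
@Id(generate = GeneratorType.SEQUENCE, generator = "SEQ_GEN")
public Long id;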

The TABLE generator, as its name implies, relies upon a special table for generating sequences. This is a portable method, as it does not rely upon native generators. Here is an example annotation that specifies a table generator:

@TableGenerator(name="IDGEN",
table=@Table(name = "SEQUENCE", schema = "DIBYENDU"),
pkColumnName="SEQ_NAME",
valueColumnName="SEQ_COUNT",
pkColumnValue="ID_SEQUENCE",
allocationSize=10)
Finally, there is the AUTO generator type. This is meant to use the best strategy available in the DBMS. Unfortunately, Glassfish seems to insist on using a table generator, at least when I tested this with Apache Derby. It also assumes that you have created a table called SEQUENCE in the default schema for the JDBC Connection ID - there is no way to specify the schema. The expected table structure is not documented as far as I can see, but you can guess it by looking at the SQL logs. You can also use the DDL generation facility in Glassfish to get the basic DDL for creating this table.

Speaking of the DDL generation facility in Glassfish, I think it is not very useful at the moment. The generated DDL loses information - for example, it does not respect any @Column attributes other than name.

Saturday, January 07, 2006

On Checked Exceptions

Here are a couple of my blogs on Checked Exceptions:
  1. Why I favour Checked Exceptions.
  2. Technique for ensuring method signature stability.

I recommend this excellent blog about Java API Design Guidelines, by Eamonn McManus. I agree with most of it, except the bit about using Unchecked Exceptions. Eamonn provides a link to a presentation by Joshua Bloch, which covers the same subject.

Friday, January 06, 2006

More on Exceptions

In a previous post, I blogged about why I favour Checked Exceptions over Unchecked ones. Today, I'll talk about how I am circumventing some of the issues with Checked Exceptions.

The big advantage with Checked Exceptions is that the method signature tells you what Exceptions are likely to be thrown. Great as this is, it is also a liability, because any change in the Exception specification can break client code.

SimpleDBM comprises several modules. In SimpleDBM, the module is the unit of reusability. Each module has an API, represented by a set of interfaces, and one or more implementations. An important objective is to make each module's API stable, so that as the code evolves, other modules are not impacted by changes in the API. This is where Exception specification becomes important.

Let us assume there are two modules, A and B, and also assume that B depends upon A. Now, suppose that some methods in A's API throw an exception called ExceptionA, and some of B's methods throw ExceptionB. Since B's methods call A's methods, when defining B's methods we have the following options:

  1. Allow B's methods to throw ExceptionA.
  2. Catch ExceptionA in B's methods and wrap them in ExceptionB.

The problem with the first approach is that it makes B's API unstable. What if A's methods start throwing a new Exception type?

The problem with the second approach is that ExceptionA has been wrapped and cannot be caught by a client. A client that invokes B may want to handle ExceptionA in a specific manner. With the first option, the client could write the following code, but with the second option, this is not possible:

try {
    // call B's API
    B.someMethod();
}
catch (ExceptionA e) {
    // Catch and handle ExceptionA
}
What we want is to preserve the information that A threw ExceptionA, but still avoid having ExceptionA in B's method signatures.

The solution is to wrap ExceptionA with a specific sub-class of ExceptionB, rather than plain ExceptionB. Let us call this sub-class ExceptionBExceptionA. The trick is that methods in B should only be declared to throw ExceptionB. This is okay because ExceptionBExceptionA is a sub-class of ExceptionB. However, clients can now catch ExceptionBExceptionA and handle that particular exception, while ignoring other instances of ExceptionB.
try {
    // call B's API
    B.someMethod();
}
catch (ExceptionBExceptionA e) {
    // Catch and handle ExceptionBExceptionA
}
Not all exceptions thrown by A need be wrapped in this manner - only those that are specifically useful to the client.

In SimpleDBM, each module defines its own Exception class. Methods of the module can only throw instances of the module Exception. However, where necessary, sub-classes of the Exception are created that represent more specific information, sometimes wrapping Exceptions thrown by other modules.
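In code, the pattern looks like this sketch, with names following the example above (ExceptionA is assumed to be module A's exception class):

// B's module exception: the only type B's API is declared to throw.
public class ExceptionB extends Exception {
    public ExceptionB(String message) {
        super(message);
    }
    public ExceptionB(String message, Throwable cause) {
        super(message, cause);
    }
}

// Specific sub-class that preserves the fact that module A failed.
public class ExceptionBExceptionA extends ExceptionB {
    public ExceptionBExceptionA(ExceptionA cause) {
        super(cause.getMessage(), cause);
    }
}

// Inside B's implementation:
public void someMethod() throws ExceptionB {
    try {
        a.doSomething();    // a call into module A
    }
    catch (ExceptionA e) {
        throw new ExceptionBExceptionA(e);  // wrap, staying within B's API
    }
}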

Thursday, January 05, 2006

Using Glassfish EJB 3.0 Persistence in J2SE

UPDATED 7th May 2006
One of the great things about EJB 3.0 is its support for persistence in J2SE environments. This makes it easy to try out various persistence features without the overhead of developing an application that must be deployed to an application server before it can be tested.

To use EJB persistence in your J2SE program, you need a couple of jar files that come with the Glassfish distribution. These are:

  1. /glassfish/lib/toplink-essentials.jar
  2. /glassfish/lib/toplink-essentials-agent.jar

These jar files are also available as a separate standalone bundle here.

You require the toplink-essentials.jar during development. At runtime, you need to add the following option to your Java command line:

-javaagent:/glassfish/lib/toplink-essentials-agent.jar

This will automatically add the toplink-essentials.jar to your classpath.

There is another preliminary step that you need to be aware of. You need to create a file named persistence.xml in a directory called META-INF. This directory must be in your classpath. Given below is an example of a persistence.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<persistence version="1.0" xmlns="http://java.sun.com/xml/ns/persistence">
<persistence-unit name="em1" transaction-type="RESOURCE_LOCAL">
<provider>oracle.toplink.essentials.ejb.cmp3.EntityManagerFactoryProvider</provider>
<class>entity.Table</class>
<properties>
<property name="toplink.jdbc.driver" value="oracle.jdbc.driver.OracleDriver" />
<property name="toplink.jdbc.url" value="jdbc:oracle:thin:@localhost:1521:sample" />
<property name="toplink.jdbc.user" value="scott" />
<property name="toplink.jdbc.password" value="tiger" />
<property name="toplink.logging.level" value="INFO" />
</properties>
</persistence-unit>
</persistence>
If you set the logging level to FINEST, you can see the SQLs being generated by Toplink.

To access the EJB 3.0 EntityManager in your program, add lines similar to the following:

EntityManager em = Persistence.createEntityManagerFactory("em1")
        .createEntityManager();

Wednesday, January 04, 2006

EJB 3.0

I have recently started exploring EJB 3.0. I am using the latest builds of Glassfish, Sun's OpenSource Application Server. Glassfish contains an OpenSource version of Oracle's Toplink product.

I had stayed away from EJB so far, as I was not convinced that the benefits of EJB outweighed the complexities it introduced into code. I had been looking at alternative lightweight frameworks, such as the Spring Framework, as a substitute for EJB. EJB 3.0, however, has converted me. In my view, once production implementations of EJB 3.0 are available, there will be fewer occasions to use alternative frameworks.

As I learn more about EJB 3.0 and Glassfish, I will post my findings here. Stay tuned.

Hello

This is where I intend to blog about programming in general, and Java programming in particular.