Saturday, July 28, 2007

Why is validation so hard?

Here I am, trying to validate my persistent business objects before committing a transaction...  since I am using Hibernate... that means they are POJOs...
Hibernate has its validation framework, which allows validation using Java 5 @Annotations... it is a nice idea... but I don't feel that comfortable validating that way... Annotation based validation is fine for simple validation (not null, min/max size, etc.) but is not that good for more complex stuff (validation formulas, stuff like "you can't buy that unless you have money in your account" or "a car has to have 4 wheels or it cannot change its status to 'ready to run'").
The problem IMHO with the Hibernate Validator is that it is triggered on "PreInsert" or "PreUpdate"... and those events are triggered each time a "Flush" is called (automatically or manually). But Flush is called with 2 different purposes: when called explicitly, it often means "put this into the database", and when called automatically it often means "put the changes into the database so that I can make queries without risk of inconsistencies"... it doesn't mean "the transaction is committed" (although a lot of people use it with that intention)... Now... what if I want to validate only "just before the transaction is committed", not "on flush"? (That can happen if I want to do complex validation that requires querying the database about its state, taking into consideration the modifications that my uncommitted POJOs will produce when flushed into the database.)

I believe that the main problem with the POJO nature of Hibernate persistence is that POJOs do not know they are persistent, and therefore do not know that they need to leave the database in a valid state after being flushed into it... I think Hibernate is missing a mechanism that does a kind of "fake commit": apply all the changes to the database, then call a validation API that checks that all the applied changes are valid, and only after that is verified, allow the transaction to really commit (and if validation fails, it should never commit, it should rollback).

In other words, it should be possible to validate stuff "before commit", not "before flushing", and it should be possible to flush 1 or more times before commit without having the validation triggered, since we might need to perform operations with the data flushed into the database, and only if those operations give a valid result, commit...

The problem... I think, is that Hibernate's event system doesn't cover "OnBeforeCommit", and even worse... it lacks a mechanism to inform this OnBeforeCommit of which objects were inserted, updated or deleted. (In fact, Hibernate knows that internally, but it doesn't expose an API to retrieve that information... and therefore transactions are blind to the changes that were flushed before the commit, and that makes it really hard to call the validation algorithms of just those entities that were modified inside that particular transaction.)
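Hibernate doesn't offer this, but the shape of the mechanism I'm asking for can be sketched in plain Java (every name here is invented for illustration): flush() only accumulates the change set, and validation runs exactly once, at commit time, over every object that any flush touched.

```java
import java.util.*;

// A plain-Java sketch of the wanted mechanism (all names invented for
// illustration): flush() only records which objects changed, and validation
// runs once, at commit time, over exactly that change set.
interface Validator<T> {
    List<String> validate(T entity); // empty list means "valid"
}

class TrackingUnitOfWork {
    private final Set<Object> changed = new LinkedHashSet<Object>();
    private final List<Validator<Object>> validators = new ArrayList<Validator<Object>>();

    void addValidator(Validator<Object> v) { validators.add(v); }

    // May be called many times per transaction; it never validates.
    void flush(Object... dirtyEntities) {
        changed.addAll(Arrays.asList(dirtyEntities));
    }

    // The single validation point: sees the whole change set, and refuses to
    // commit (forcing a rollback) if any changed entity is invalid.
    void commit() {
        List<String> errors = new ArrayList<String>();
        for (Object entity : changed) {
            for (Validator<Object> v : validators) {
                errors.addAll(v.validate(entity));
            }
        }
        if (!errors.isEmpty()) throw new IllegalStateException("rollback: " + errors);
        changed.clear(); // committed
    }
}
```

Note that because commit() knows the change set, it only has to validate the entities modified inside that transaction, which is exactly what the missing Hibernate API would enable.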

Monday, July 16, 2007

NoResultException is a really stupid idea!

When I saw Query.getSingleResult() I thought, "yes, great idea, I always have to add a utility method like that..."
But then, I met NoResultException... what a great way to screw a great idea!
Why not just return null??!!! getSingleResult should return 1 element, or null if it can not find stuff!
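The workaround is easy enough; this is a self-contained sketch of the wrapper I end up writing (NoResultException and Query are minimal stand-ins here so the example runs alone; in a real project they would be the javax.persistence types):

```java
// Stand-ins so this sketch runs alone; in a real project these would be the
// javax.persistence types.
class NoResultException extends RuntimeException {}

interface Query {
    Object getSingleResult();
}

class Queries {
    // What I wish getSingleResult() did: return the element, or null when
    // nothing matches, because an empty result is a normal outcome, not an error.
    static Object singleResultOrNull(Query q) {
        try {
            return q.getSingleResult();
        } catch (NoResultException e) {
            return null;
        }
    }
}
```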

Saturday, July 14, 2007

Unit testing Relational Queries (SQL or HQL or JQL or LINQ...)

Most applications I have built have something to do with a database (I remember that while I was in college I used to think that was not exciting stuff, I used to dream about doing neural networks, logic programming in Prolog, etc.) but then I met WebObjects and its Object Relational Mapper (Enterprise Objects Framework) and I got really excited about object oriented data programming... but I always had a problem: to test my object oriented queries I had to "manually" translate them into SQL and test them against the database, and only after they gave me what I thought were correct results would I write them using EOF Qualifiers...
Then I met Hibernate's HQL and realized it is much more powerful than EOF Qualifiers, but I still had to translate it to SQL to test it. I know I can get the SQL that is generated from the HQL from the debug console, and paste it into my favorite SQL editor... but even then, if I found a mistake, a lot of times it was easier to tweak it in SQL and then manually translate it back to HQL.
Currently, there are some extensions for Eclipse (Hibernate Tools) that make this more "direct", but what if I don't like Eclipse (or don't want to, or can't, use it)? It would be great if someone could, for example, make a plugin for SquirrelSQL, but until then... what options do I have?

Then I learned about unit testing... and the answer came to my mind immediately: I just had to write a unit test for each of my queries. That worked fine... in the beginning... until I started having queries that returned thousands (or millions) of objects, and it wasn't such a good idea to output them to the debug console... and I had another problem... how should I write the "asserts" of a query? And how can I do it so that it doesn't make my test so slow that it becomes unusable? (I can, of course, check the results just by viewing them, but my brain is not that good at telling whether those 10,000 rows really match the idea I had when I wrote that HQL.)

So, I started to look at "what do I do" to check if an SQL query is correct. Let's say, for example, that I write this:

select count(*) from Address,Employee where Address.Id= Employee.AddressId and Employee.Id = 3

(Translated to English: how many addresses does the Employee with Id = 3 have?)

Now... how do I test that? Well, I could add an assert after getting the result in Java (or C#) like this:
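Something along these lines (the jdbcCount helper and the expected value are assumptions, and the helper is stubbed out so the sketch runs alone; in the real test it would execute the SQL over JDBC against the test database):

```java
// Hypothetical test: run the count query, assert the value you believe the
// test data holds. jdbcCount is a stub standing in for a JDBC call.
class AddressCountTest {
    static int jdbcCount(String sql) {
        return 2; // stub: pretend the test database answered 2
    }

    public static void main(String[] args) {
        int count = jdbcCount("select count(*) from Address, Employee "
                + "where Address.Id = Employee.AddressId and Employee.Id = 3");
        // The assert encodes what I believe the test data contains:
        if (count != 2) throw new AssertionError("expected 2 addresses, got " + count);
    }
}
```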


But then, what happens if someone deletes the row with Id = 3 from the Employee table? That means my test will fail... Or what if someone deletes all the addresses of that employee? And what if I want to test that if there are no addresses for an employee, the answer should be zero?...

That is a lot of work just to test if that simple query is right... and I think that work could be done automatically:

Take a look at the SQL, it could be decomposed into:

select count(*) from (select * from Address,Employee where Address.Id= Employee.AddressId and Employee.Id = 3) as EmployeeAddresses

And then we could say: let's automatically check the case when the resulting set is empty, and the case when it is not empty. To check the case when the result is not empty, we need an Employee with Addresses, so we generate:

Select * from Employee where exists(select * from Address where Address.EmployeeId = Employee.Id)

And take the Id of the first employee we get... that Id should give us a non-empty set when used in the original SQL statement we are trying to test... After that, we automatically generate:

Select * from Employee where not exists(select * from Address where Address.EmployeeId = Employee.Id)

And take the Id of the first employee we get... that Id should give us an empty set when used in the original SQL statement we are trying to test...
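To make the idea concrete, here is the whole check running over in-memory lists instead of tables (all data and names are illustrative; in a real harness each of the three methods would be an independent SQL statement against the test database):

```java
import java.util.*;

// A runnable toy of the "inverse query" check. The lists stand in for the
// Employee and Address tables, and each method stands in for one of the three
// SQL statements; in a real harness each would run independently over JDBC.
class InverseQueryCheck {
    static List<Integer> employees = Arrays.asList(1, 2, 3);
    // Address rows as (AddressId, EmployeeId); employee 2 has no addresses.
    static int[][] addresses = { { 10, 1 }, { 11, 1 }, { 12, 3 } };

    // The query under test: how many addresses does this employee have?
    static int countAddresses(int employeeId) {
        int count = 0;
        for (int[] a : addresses) if (a[1] == employeeId) count++;
        return count;
    }

    // Inverse #1: the "exists" query, some employee that has addresses.
    static Integer employeeWithAddresses() {
        for (int e : employees) if (countAddresses(e) > 0) return e;
        return null;
    }

    // Inverse #2: the "not exists" query, some employee with no addresses.
    static Integer employeeWithoutAddresses() {
        for (int e : employees) if (countAddresses(e) == 0) return e;
        return null;
    }

    public static void main(String[] args) {
        // Feeding each inverse back into the original query must give a
        // non-empty / empty result respectively, or the query is suspect.
        if (countAddresses(employeeWithAddresses()) == 0) throw new AssertionError("exists case failed");
        if (countAddresses(employeeWithoutAddresses()) != 0) throw new AssertionError("not-exists case failed");
        System.out.println("inverse checks passed");
    }
}
```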

I call these queries "inverses" of the original one. It's like when one is testing a multiplication: to see if 2 x 3 = 6, just do 6/3 = 2 and 6/2 = 3; if 2 and 3 match the operands of the multiplication, your multiplication is right. The same thing goes for SQL, one just has to find the way to "invert" it. If I could automate this inversion, the automatically generated queries would help me by telling me things that might not be immediately obvious when I look at the original query, and that would help me check if my original query is right... it would be some kind of "invariants" that would help me better understand my queries... or maybe I could even write the invariants first, and then create a query and see if it matches my invariants...

Mmmm.... maybe besides using a select there is another way to "invert" a query to test if it is right: using the actual inverse operation of selecting, that is, "inserting". Could I derive from:

select count(*) from Address,Employee where Address.Id= Employee.AddressId and Employee.Id = 3

Something like (In pseudocode):

Insert Employee;
Store Employee.Id
Run select count(*) from Address,Employee where Address.Id= Employee.AddressId and Employee.Id = EmployeeId
Assert("The answer should be zero")
Insert Address related to Employee
Run select count(*) from Address,Employee where Address.Id= Employee.AddressId and Employee.Id = EmployeeId
Assert("The answer should be one")
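A runnable version of that pseudocode over in-memory "tables" (schema and names are illustrative; the real version would issue the inserts and the count query through JDBC against an empty test database):

```java
import java.util.*;

// The insert-then-count pseudocode, runnable over in-memory lists that stand
// in for the Employee and Address tables.
class InsertThenCountTest {
    static List<Integer> employees = new ArrayList<Integer>();
    static List<int[]> addresses = new ArrayList<int[]>(); // rows: (AddressId, EmployeeId)
    static int nextId = 1;

    static int insertEmployee() {
        int id = nextId++;
        employees.add(id);
        return id;
    }

    static void insertAddress(int employeeId) {
        addresses.add(new int[] { nextId++, employeeId });
    }

    // The query under test: how many addresses does this employee have?
    static int countAddresses(int employeeId) {
        int count = 0;
        for (int[] a : addresses) if (a[1] == employeeId) count++;
        return count;
    }

    public static void main(String[] args) {
        int empId = insertEmployee();
        if (countAddresses(empId) != 0) throw new AssertionError("the answer should be zero");
        insertAddress(empId);
        if (countAddresses(empId) != 1) throw new AssertionError("the answer should be one");
        System.out.println("insert-based checks passed");
    }
}
```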

This has the advantage that I don't need a database with data already in it, but it has the disadvantage that it takes a lot of time to write a unit test like this in Java, because to insert an employee, it might be necessary to:

  • Avoid breaking validation rules not related to this particular test. For example, an Employee must be related to a Department, but if the Department table is empty, then I have to create a Department or I will not be able to insert an Employee.
  • Avoid conflicts with validation rules directly related to this particular test. For example, what if I have a Hibernate interceptor that won't let me insert an Employee without 1 or more Addresses?
The main problem here, I believe, is that if I insert a row leaving a not null column empty, most databases won't wait until I try to commit the transaction to say "integrity violation" and roll back my changes... therefore it is impossible to write partial data just for the test that I have in front of me. But... could I automatically generate consistent inserts using as a source just the integrity rules at the database level... and the select that I want to test?
I think it can be done... the question is...
What is the algorithm to generate the inserts needed to satisfy an SQL select statement?
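I don't have the full algorithm, but one piece seems clear: the generated inserts must respect foreign-key order (a Department before the Employee that references it, and so on), which is a topological sort over the FK graph. A sketch with a hardcoded, illustrative schema:

```java
import java.util.*;

// Order the generated inserts by foreign-key dependencies using a depth-first
// topological sort. The schema (Department <- Employee <- Address) is
// hardcoded for illustration.
class InsertOrder {
    static Map<String, List<String>> fk = new LinkedHashMap<String, List<String>>();
    static {
        fk.put("Department", Collections.<String>emptyList());
        fk.put("Employee", Arrays.asList("Department")); // Employee references Department
        fk.put("Address", Arrays.asList("Employee"));    // Address references Employee
    }

    // Visit a table's FK targets before the table itself.
    static List<String> insertOrder() {
        List<String> order = new ArrayList<String>();
        Set<String> done = new HashSet<String>();
        for (String table : fk.keySet()) visit(table, done, order);
        return order;
    }

    static void visit(String table, Set<String> done, List<String> order) {
        if (!done.add(table)) return;
        for (String dep : fk.get(table)) visit(dep, done, order);
        order.add(table);
    }

    public static void main(String[] args) {
        System.out.println(insertOrder()); // prints [Department, Employee, Address]
    }
}
```

A real tool would read the FK graph from java.sql.DatabaseMetaData.getImportedKeys instead of hardcoding it; generating column values that satisfy the select under test is the part that is still an open question.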

Thursday, July 05, 2007

The perfect infrastructure (framework?) for data systems

The perfect infrastructure (framework?) for data systems:

  • Has an object relational query language (something like JQL or LINQ) 
  • Has a database with versioning (like Subversion) so you can always consult the database as it was in a particular moment in time transparently
  • Supports transactions... and distributed transactions. (like Spring)
  • Has a framework to exchange graphs of objects with a remote client; objects can be manipulated freely on the client, filtered and queried without hitting the database unnecessarily, and are transparently loaded into the client without the n+1 problem. (like a hybrid between Hibernate, Apple's EOF & Carrierwave)
  • Supports "client only transactions" and nested client only transactions (like the Editing Context in WebObject's JavaClient applications) so that it is possible to make rollbacks without hitting the database, and it is even possible to make partial rollbacks... and have savepoint functionality, without going all the way to the database (unless you want to do so)
  • Client objects, server objects and database elements are kept in perfect sync automatically, but it is possible to add logic to a particular tier of the system without too much hassle.
  • Has a validation framework that makes it really easy to write efficient validation code following DRY, and that validates data on the client, on the application server and on the database.
  • Validation code, combined with the versioning capabilities of the infrastructure, allows saving information partially, as easily as writing part of a paper, validating only as you complete the information, with multiple integrity levels
  • It is possible to disconnect the client from the server, and it will be able to save your changes until the connection is established again
  • The applications built with this perfect infrastructure update themselves automatically.
  • With a very simple configuration tweak, it is possible to download the application "sliced" in pages, or as a complete bundle. This capability is so well integrated that the final user can choose the installation method, and the programmer doesn't even care about this feature.
  • The developer only needs to specify the requirements semi-formally in a language (like Amalgam) and he will receive a running application that adapts dynamically to the specification (unless he chooses to "freeze" a particular feature of the application, in which case the default procedural code for that feature is automatically generated, and the developer can customize it as he wishes... or decide to un-freeze it).
  • Can be coded in any language compatible with a virtual machine that runs anywhere, or can be compiled to a specific platform.
  • Allows for easy report design... by the developer, or the user.
  • It is opensource (or sharedsource), so that in the extremely unlikely case of needing another feature, or finding a bug, it can be easily fixed by the developer
  • It is freely (as in beer) downloadable from the Internet. (or has a reasonable price)
  • It is fully documented, with lots of examples, going from very simple examples for beginners, to really complex real world applications with best practices for experts
  • Includes the source code, with unit tests with 100% coverage of the code
  • Supports design by contract coding (from the database up to the client side).

You know what the funny (or sad) part of all this is? I have met frameworks that do 1, 2, or even 3 or more of these features... but none that does them all... Will I ever see such a thing? Is it even possible to build it?

Tuesday, July 03, 2007

RIAs: Faulting & Uniquing (or Merging?) (Granite, Ajax)

Today I realized that lazy loading support in Granite Data Services is in its infancy... it is more like "partial loading" (it will load everything that is already initialized, and the non-initialized stuff will remain "unloaded" forever).

I am thinking this leads to a pattern like this:
  1. I need to work with persons, so I fetch a list of them from a remote service.
  2. I choose to work with the person with id "3";
  3. present the contacts of person "3". (Here is the tricky stuff: all the contacts that I load have a reference to person "3". What do I do about that? Do I re-fetch it, creating a different object and breaking uniquing, or do I look for a way to prevent that "same but different object" problem in my application?)
I guess that we will need something like Faulting & Uniquing, and a Client Side EditingContext (or Client Side EntityManager)... to control data on the client side... (our own idea of an LDS DataStore?)

But... until Granite has that... what could we do as a first step? It would be nice if we could "merge" a recently obtained object with one we fetched before... something like ADO.NET's DataSet... (I can not believe I am writing that I miss the DataSet)
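A minimal sketch of that "merge" as a first step, assuming a client-side identity map keyed by entity id (everything here is invented for illustration): freshly fetched state gets copied into the instance the UI already holds, so person "3" stays one single object.

```java
import java.util.*;

// Client-side uniquing sketch: an identity map keyed by entity id, with a
// merge() that folds freshly fetched state into the already-known instance.
// All names are invented for illustration.
class Person {
    final int id;
    String name;
    Person(int id, String name) { this.id = id; this.name = name; }
}

class ClientEntityStore {
    private final Map<Integer, Person> byId = new HashMap<Integer, Person>();

    // Returns the canonical instance for this id, merging the new state in.
    Person merge(Person fetched) {
        Person known = byId.get(fetched.id);
        if (known == null) {
            byId.put(fetched.id, fetched);
            return fetched;
        }
        known.name = fetched.name; // a real store would copy field by field
        return known;
    }
}
```

With that in place, every fetch result would be passed through merge() before reaching the UI, so references to person "3" never split into "same but different" objects.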

I have been thinking... a fully "AJAX" traditional JavaScript based application would have the same problems if it had a complex enough UI... but I haven't heard of anything like it; it seems that most AJAX application developers build applications so simple that they don't even care about having to write and re-write client side data manipulation code... (or... maybe those applications don't even have enough client side behavior to need it?)

I guess that until Granite has its own "Data Management", the way to handle data will be... to imitate the practices of traditional AJAX applications?

(Mmmm, now with Google Gears... will we see JavaScript based frameworks for automatic handling of DTOs and ORM start to appear everywhere? Perhaps this will revive the interest in something like SDO?)