database Database Handling empty WHERE IN clauses in DBALs

I recently had a look at a project making heavy use of IN clauses. I'm not advocating such an usage, as IN clauses can be pretty expensive, but there's one special case that came to my concern. This case is the most trivial one in the world, where the set we compare our value to is empty.

Doctrine is translating empty ->whereIn() clauses to exactly nothing, resulting in having the pseudo request SELECT * FROM user WHERE login in (); into SELECT * FROM user;. In my opinion, it's the exact opposite of what you expect.

<?php
$q = Doctrine_Core::getTable('Fruit')
        ->where('color = ?', $color)
        ->whereIn('country', $countries)
        ->execute();

This little snippet is getting a collection of fruits of a given color, and coming from a list of countries. Actually, if the country list is empty, it will return the whole list of fruits of $color, completely skipping the country part.

I wondered a bit why this was, and talked some guys about it with who we agreed it is a little weird. x IN () is invalid SQL, but deleting the clause to avoid the error is not a solution.

After a little chat about it with Jonathan, I decided to look some other ORMs around, written in various languages. I already knew how Propel do handle the problem, but I wanted to see other languages doing it. I opened SQLAlchemy and django.db in Python and RubyOnRails / ARel in Ruby, to see what solution they did adopt.

The fun stuff is that they all used different solutions, so I thought it was an interesting point to share.

Understanding the problem

The problem is not as easy at it may seem.

The x IN y statement can be nested in a complex expression, an it can also be in a subquery used in a more complex query.

For a simple SELECT * FROM user WHERE user.id IN (), you can be pretty certain that the query won't return any result, and you should be able to programmatically skip the query, as the logic of "Give the users which ids are within this empty set" is of course easilly simplified to "Give me no users please".

But for complex queries, you must replace this invalid SQL part by something equivalent, to allow its usage within a superquery, or other complex statement. This is what the majority of the ORMs out there are doing, except for django which also has an elegant solution.

The quick and dirty hack - Propel 1.2/1.3

I know it was done this way some years ago and did not want to investigate what is done in recent version of Propel. If some of you have the code examples, I'll be glad to add Propel 1.5/1.6 examples.

Propel does one of the simplest stuff. It replaces the a IN () part by 1 <> 1. Of course this will be false, and even if it's "hackish", it works.

It's important to note the result of 1<>1 in SQL, which is 0 (or maybe false, depending on implementations).

mysql> SELECT 1<>1;
+------+
| 1<>1 |
+------+
|    0 |
+------+
1 row in set (0.00 sec)

SQLAlchemy's solution

SQLAlchemy, a very popular and very powerfull python ORM, is doing yet another hack of the same kind.

What is great is that even if the part handling this case is hard to find, the code itselfs consist about only of comments, so anybody can get it easily.

if len(args) == 0:

    # Special case handling for empty IN's, behave like
    # comparison against zero row selectable.  We use != to
    # build the contradiction as it handles NULL values
    # appropriately, i.e. "not (x IN ())" should not return NULL
    # values for x.

    util.warn('The IN-predicate on "%s" was invoked with an '
              'empty sequence. This results in a '
              'contradiction, which nonetheless can be '
              'expensive to evaluate.  Consider alternative '
              'strategies for improved performance.' % self)
    return self != self

The a IN () statement part is translated to a <> a. What it does change from the other hacks, is that the result value will be different wether a is defined or not.

mysql> SELECT 42 <> 42;
+----------+
| 42 <> 42 |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)

mysql> SELECT null <> 42;
+------------+
| null <> 42 |
+------------+
|       NULL |
+------------+
1 row in set (0.00 sec)

It looks interesting, but I'm wondering if it's an accurate mimic of the desired behaviour. We'll discuss this in the last chapter of this article.

The pythonic way - django.db

Django is the one that makes the less magic possible happen.

It's easy, it does not translate the code into anything, it remains the invalid x IN (). As the RDBMS won't be happy with this, it'll raise an exection. Same behavior for SQL and ORM, nice and damn fuckin simple.

return ('%s IN (%s)' % (field_sql, ', '.join(repeat('%s', len(params)))),

And the latest Ruby on Rails development does the same

Since I started to write this article, this commit changed the way it's handled in RoR, and it's now doing the same stuff as Django.

It tried to do clever stuff before, and that was just "looking" clever, so I'm pretty glad they changed this. If you're curious, you can read the new and old code.

As you may know, RubyOnRails has a relational algebra library named ARel, that is able to parse a ruby query into an abstract syntax tree, and then compile it into any query language, most probably some kind of SQL, but you can write visitors to produce about anything.

module Arel
  module Visitors
    class ToSql < Arel::Visitors::Visitor

      # ... TRUNCATED ...

      def visit_Arel_Nodes_In o
        "#{visit o.left} IN (#{visit o.right})"
      end

      # ... TRUNCATED ...

    end
  end
end


The (true, false, unknown) algebra

To understand all ORMs' solutions, we need to understand what NULL means.

What makes difference between everything here is that relational database engines uses a ternary logic algebra instead of the regular boole's algebra to evaluate logic statements.

There are true values, false values, and unknown values. Logic tables are so enhanced to use those "new" logical values.

For example, true AND unknown is unknown, because the evaluation of the right operand is required to know the expression's value. On the other hand, false AND unknown is false, because false and anything is false. No need to know the value of the right operand to know this.

Looking at the following truth table will give you a good insight of this logic.

ABA OR BA AND BNOT A
TrueTrueTrueTrueFalse
TrueUnknownTrueUnknownFalse
TrueFalseTrueFalseFalse
UnknownTrueTrueUnknownUnknown
UnknownUnknownUnknownUnknownUnknown
UnknownFalseUnknownFalseUnknown
FalseTrueTrueFalseTrue
FalseUnknownUnknownFalseTrue
FalseFalseFalseFalseTrue

Source: wikipedia, "ternary logic" article.

Conclusion

The only conclusion I can come to is that amongst the ORMs I looked at, Django and RoR are obviously correct. Funny right? The only ones who are correct are the ones which did not try to do anything clever or complicated.

Strangely, the Propel hack seems to be correct too, but I'd like to have some comments from database experts out there, because I'm still not 100% certain about two points.

  • Is my ( Unknown IN () ) == false assertion correct? I'm assuming that if your question is "Is this value present in this set, which is empty ?", the answer is "Definately not", and not "I don't know", whether or not you know the value of "this". The doubt is in "can this unknown be nothing at all"?

  • Does the concept of "empty tuple" makes an sense at all, from a relational algebra point of view?

Share the love!

Liked this article? Please consider sharing it on your favorite network, it really helps me a lot!

You can also add your valuable insights by commenting below.