I recently had a look at a project making heavy use of IN clauses. I'm not advocating such an usage, as IN clauses can be pretty expensive, but there's one special case that came to my concern. This case is the most trivial one in the world, where the set we compare our value to is empty.
Doctrine is translating empty ->whereIn()
clauses to exactly nothing,
resulting in having the pseudo request SELECT * FROM user WHERE login in ();
into SELECT * FROM user;
. In my opinion, it's the exact opposite of what you
expect.
<?php
$q = Doctrine_Core::getTable('Fruit')
->where('color = ?', $color)
->whereIn('country', $countries)
->execute();
This little snippet is getting a collection of fruits of a given color, and
coming from a list of countries. Actually, if the country list is empty, it
will return the whole list of fruits of $color
, completely skipping the
country part.
I wondered a bit why this was, and talked some guys about it with who we agreed
it is a little weird. x IN ()
is invalid SQL, but deleting the clause to
avoid the error is not a solution.
After a little chat about it with Jonathan, I decided to look some other ORMs around, written in various languages. I already knew how Propel do handle the problem, but I wanted to see other languages doing it. I opened SQLAlchemy and django.db in Python and RubyOnRails / ARel in Ruby, to see what solution they did adopt.
The fun stuff is that they all used different solutions, so I thought it was an interesting point to share.
Understanding the problem
The problem is not as easy at it may seem.
The x IN y
statement can be nested in a complex expression, an it can also be
in a subquery used in a more complex query.
For a simple SELECT * FROM user WHERE user.id IN ()
, you can be pretty
certain that the query won't return any result, and you should be able to
programmatically skip the query, as the logic of "Give the users which ids are
within this empty set" is of course easilly simplified to "Give me no users
please".
But for complex queries, you must replace this invalid SQL part by something equivalent, to allow its usage within a superquery, or other complex statement. This is what the majority of the ORMs out there are doing, except for django which also has an elegant solution.
The quick and dirty hack - Propel 1.2/1.3
Propel does one of the simplest stuff. It replaces the a IN ()
part by 1 <>
1
. Of course this will be false, and even if it's "hackish", it works.
It's important to note the result of 1<>1
in SQL, which is 0
(or maybe false
,
depending on implementations).
mysql> SELECT 1<>1;
+------+
| 1<>1 |
+------+
| 0 |
+------+
1 row in set (0.00 sec)
SQLAlchemy's solution
SQLAlchemy, a very popular and very powerfull python ORM, is doing yet another hack of the same kind.
What is great is that even if the part handling this case is hard to find, the code itselfs consist about only of comments, so anybody can get it easily.
if len(args) == 0:
# Special case handling for empty IN's, behave like
# comparison against zero row selectable. We use != to
# build the contradiction as it handles NULL values
# appropriately, i.e. "not (x IN ())" should not return NULL
# values for x.
util.warn('The IN-predicate on "%s" was invoked with an '
'empty sequence. This results in a '
'contradiction, which nonetheless can be '
'expensive to evaluate. Consider alternative '
'strategies for improved performance.' % self)
return self != self
The a IN ()
statement part is translated to a <> a
. What it does change
from the other hacks, is that the result value will be different wether a
is
defined or not.
mysql> SELECT 42 <> 42;
+----------+
| 42 <> 42 |
+----------+
| 0 |
+----------+
1 row in set (0.00 sec)
mysql> SELECT null <> 42;
+------------+
| null <> 42 |
+------------+
| NULL |
+------------+
1 row in set (0.00 sec)
It looks interesting, but I'm wondering if it's an accurate mimic of the desired behaviour. We'll discuss this in the last chapter of this article.
The pythonic way - django.db
Django is the one that makes the less magic possible happen.
It's easy, it does not translate the code into anything, it remains the invalid
x IN ()
. As the RDBMS won't be happy with this, it'll raise an exection. Same
behavior for SQL and ORM, nice and damn fuckin simple.
return ('%s IN (%s)' % (field_sql, ', '.join(repeat('%s', len(params)))),
And the latest Ruby on Rails development does the same
Since I started to write this article, this commit changed the way it's handled in RoR, and it's now doing the same stuff as Django.
It tried to do clever stuff before, and that was just "looking" clever, so I'm pretty glad they changed this. If you're curious, you can read the new and old code.
As you may know, RubyOnRails has a relational algebra library named ARel, that is able to parse a ruby query into an abstract syntax tree, and then compile it into any query language, most probably some kind of SQL, but you can write visitors to produce about anything.
module Arel
module Visitors
class ToSql < Arel::Visitors::Visitor
# ... TRUNCATED ...
def visit_Arel_Nodes_In o
"#{visit o.left} IN (#{visit o.right})"
end
# ... TRUNCATED ...
end
end
end
See also
The (true, false, unknown) algebra
To understand all ORMs' solutions, we need to understand what NULL
means.
What makes difference between everything here is that relational database engines uses a ternary logic algebra instead of the regular boole's algebra to evaluate logic statements.
There are true values, false values, and unknown values. Logic tables are so enhanced to use those "new" logical values.
For example, true AND unknown
is unknown
, because the evaluation of the
right operand is required to know the expression's value. On the other hand,
false AND unknown
is false
, because false and anything
is false. No need
to know the value of the right operand to know this.
Looking at the following truth table will give you a good insight of this logic.
A | B | A OR B | A AND B | NOT A |
---|---|---|---|---|
True | True | True | True | False |
True | Unknown | True | Unknown | False |
True | False | True | False | False |
Unknown | True | True | Unknown | Unknown |
Unknown | Unknown | Unknown | Unknown | Unknown |
Unknown | False | Unknown | False | Unknown |
False | True | True | False | True |
False | Unknown | Unknown | False | True |
False | False | False | False | True |
Source: wikipedia, "ternary logic" article.
Conclusion
The only conclusion I can come to is that amongst the ORMs I looked at, Django and RoR are obviously correct. Funny right? The only ones who are correct are the ones which did not try to do anything clever or complicated.
Strangely, the Propel hack seems to be correct too, but I'd like to have some comments from database experts out there, because I'm still not 100% certain about two points.
-
Is my
( Unknown IN () ) == false
assertion correct? I'm assuming that if your question is "Is this value present in this set, which is empty ?", the answer is "Definately not", and not "I don't know", whether or not you know the value of "this". The doubt is in "can this unknown be nothing at all"? -
Does the concept of "empty tuple" makes an sense at all, from a relational algebra point of view?