Well, well, well. Not quite. At the risk of sounding pedantic, perhaps a bit
of (applied) theory would be welcome.
However, when you look closely, you very often realize that the returned
data comes from only some of the tables listed in the FROM clause,
typically something in the guise of:
This is a very simple example of a pattern you meet over and over. Isn't
this a very fine query, which does just what it asks?
In my view, this is a very bad query; it's logically flawed, which isn't very surprising, and its execution plan, execution time, and various statistics prove that
it also performs badly. Poor
performance is more often than not the direct consequence of poor logic. The
good news is that it's possible to improve the speed of such a slow query
significantly. (As anybody who has ever successfully tuned a SQL query can
testify, this is an area where significant means an order of magnitude
or two, not 20 percent).
Where is the logical flaw? As stated above, this is a typical query in which
all the data comes from a single table: CUSTOMERS.
Interestingly, it returns a CUSTOMER_ID. There's no prize for
guessing that this is likely to be the primary key, uniquely identifying each
row in CUSTOMERS. However, we need a DISTINCT; one
customer may have ordered several times in the period of interest. In other
words, we are fixing with DISTINCT a problem of our own
design.
This is a "Penelope" type of query. Penelope, as told in Homer's
Odyssey, was the wife of Ulysses, king of Ithaca. Hoping for the return of
her husband from Troy, she kept suitors at bay by pretending that she would
consider them once she completed her weaving--which she kept going
by undoing at night what she had done during the day.
Like Penelope, we are doing a lot of work by executing a regular join and
bringing back many rows, only to undo a large part of it in order to retrieve the
resulting set. (The "best" examples of Penelope queries are usually
met with queries involving complex views, however.)
An Improved Query
In this case, the ORDERS table doesn't belong to the
FROM clause. It belongs to the WHERE clause, because
it appears only for the purpose of filtering on the ORDERED_DATE
column. Since it belongs to the WHERE clause, it's better as a
subquery--either correlated, which means that it refers to the current
row of the main query and therefore fires each time the check must be
performed, or uncorrelated, which means that it can execute independently of
the main query.
In our simple example, a correlated subquery would be:
select a.CUSTOMER_ID, a.CUSTOMER_NAME
from CUSTOMERS a
where a.ZIP_CODE in ...
and exists (select NULL
from ORDERS b
where b.CUSTOMER_ID = a.CUSTOMER_ID
and b.ORDERED_DATE >= ...)
order by a.CUSTOMER_NAME
while an uncorrelated subquery would be:
select a.CUSTOMER_ID, a.CUSTOMER_NAME
from CUSTOMERS a
where a.ZIP_CODE in ...
and a.CUSTOMER_ID in (select b.CUSTOMER_ID
from ORDERS b
where b.ORDERED_DATE >= ...)
order by a.CUSTOMER_NAME
It may seem surprising, but on a very simple example such
as this one, the RDBMS optimizer--that part of the code in
charge of finding the most efficient way to execute the query--may
sometimes perfectly decide to transform a subquery back into a join, for sound reasons that are beyond the scope of this article.
However, when it does this, the optimizer does it on purpose. It should be
obvious that the really complicated case involves many tables. Then
the complexity of singling out each possible rewriting increases exponentially. As the optimizing phase is part of the overall response time, what the
optimizer tries as a fraction of what is possible falls dramatically when
possibilities widen.
The result is that the more complicated the query, the higher the risk that
the optimizer will take the wrong decision. That is when writing queries in a
sensible way becomes all-important.
Correlated or Uncorrelated?
Because we have several ways to write the filtering subquery, we of course have to consider whether one is significantly better than the other.
Over the past 20 years, many SQL tuning books and articles have advocated the use of NOT EXISTS over NOT IN. (Interestingly,
many such elements of popular technical culture have roots in what used to be
true facts long ago but which are not necessarily true anymore.) True or not, this assertion has become in the mind of many a SQL developer "Use
EXISTS, not IN." In some cases, this is justifiable,
but it's not a good rule or even good practice.
As its name implies, a correlated subquery (the inner query)
depends on an outer query that fires it. How often it fires depends
on the circumstances.
The defining characteristic of a correlated subquery is that it refers to at
least one value from the current row of the outer query--CUSTOMER_ID in our example above. If we have no other filtering
condition, we will have to evaluate the correlated subquery for each row we
inspect during the execution of the outer query. In that case, it will
probably cost much more than the JOIN and DISTINCT,
as a sophisticated optimizer will probably notice. If there are other, simpler
filtering conditions to apply prior to this correlated subquery, we can fire it
only for the candidate rows that have successfully passed all the other
conditions. If those other conditions are highly selective, we may need to
fire the subquery for only a handful of rows more than in our final resulting set,
applying an efficient search criterion (once again, the
CUSTOMER_ID) each time.
The uncorrelated subquery, on the other hand, doesn't refer to the outer
query. As a result, it executes only once. The important question is how many
rows this subquery will return. If it's a (relatively) small number, it is
likely that the engine will use an index (on
CUSTOMERS(CUSTOMER_ID) in our example). It is interesting to point
out that in the correlated subquery case, it would have been
ORDERS(CUSTOMER_ID). If, however, the subquery returns a really
huge number of rows, the engine will use some kind of hash or merge join. This
may prove much more efficient than a correlated subquery if the final, global
result set is large, since a correlated subquery executes at least as
many times as the number of rows returned, and possibly much more if the other
criteria are not very selective.
To summarize, both correlated and uncorrelated subqueries may each in turn
prove to be more efficient. The former wins when the final result set is
relatively small and the other criteria are efficient. The latter
works best when dealing with large tables from which you expect many rows. When
in doubt, test with both.
Conclusion
In short, when trying to correct a query that returns duplicate rows,
a quick fix of adding a DISTINCT is definitely the solution to
avoid. Unfortunately, it is widely practiced. Searching the code for
DISTINCT with joins is a way to improve code performance
easily.
Remember that the FROM clause of a query should refer to only two types of tables:
- those from which you want values returned
- those allowing to join two or more tables in the above
category
Any table that does not fall in either of those two categories should
appear in a subquery. For instance, given a join such as:
...
FROM T1, T2, T3, T4, T5, T6
WHERE ...
...
the various links between the various tables may allow us to draw
a diagram such as:
T1 - T2 - T3 -T6
|
T4 - T5
Assume that the data returned comes from T1 and
T4. Those two tables have their place in the FROM
clause, obviously. So does T2, since it allows the joining T1 to
T4. The "end of link" tables, however--T3,
T6, and T5--belong to subqueries.
These simple rules produce effective results when applied
intelligently.
Return to the www.oreilly.de